@m0o0scar
Created July 26, 2024 07:41
[github] stanford-oval/storm

Source

Python / 5.5K lines of code. An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.

URL: https://github.com/stanford-oval/storm

Conversation

β–² Explain the major function of the project and how it works by referencing the most important and relevant code snippets.

The project, named STORM, is a system that automatically generates Wikipedia-like articles from scratch using Internet search. It assists users in the pre-writing stage of article creation by conducting research, collecting references, and generating an outline.

STORM works in two stages:

1. Pre-writing Stage:

  • Perspective-Guided Question Asking: The system identifies different perspectives on the topic by analyzing existing Wikipedia articles on similar topics. This helps generate more comprehensive and insightful questions.
  • Simulated Conversation: STORM simulates a conversation between a Wikipedia writer and a topic expert, grounded in internet sources. This allows the system to update its understanding of the topic and ask follow-up questions.
  • Information Gathering: Based on the questions generated, STORM conducts internet searches using a chosen search engine (e.g., You.com, Bing) and collects relevant information.

2. Writing Stage:

  • Outline Generation: The system uses the collected information and the generated questions to create a hierarchical outline for the article.
  • Article Generation: STORM populates the outline with the collected information, generating a full-length article with citations.
  • Article Polishing: The system refines the generated article by adding a summary (lead) section and optionally removing duplicate content.
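At a high level, the two stages compose into a simple pipeline: research produces an information table, and writing turns it into an article. The sketch below is only an illustration of that flow; the function names (`research_topic`, `write_article`) and data shapes are hypothetical simplifications, not STORM's actual API:

```python
# Hypothetical sketch of STORM's two-stage flow (names simplified for illustration).

def research_topic(topic: str) -> dict:
    """Pre-writing stage: discover perspectives, simulate conversations, gather info."""
    personas = [f"{topic} historian", f"{topic} practitioner"]  # perspective discovery (stubbed)
    conversations = {p: [f"Q/A about {topic} from the {p} view"] for p in personas}
    return {"topic": topic, "conversations": conversations}

def write_article(info: dict) -> str:
    """Writing stage: build an outline, populate sections, polish with a lead section."""
    outline = ["# summary"] + [f"## {persona}" for persona in info["conversations"]]
    return "\n".join(outline)

article = write_article(research_topic("Artificial Intelligence"))
```

Each stage in the real system is a separate module, which is what makes the pipeline configurable (e.g. swapping the search engine or the LLM per stage).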

Here are some relevant code snippets that illustrate the key functionalities:

Pre-writing Stage:

  • Perspective-Guided Question Asking:

    # knowledge_storm/storm_wiki/modules/knowledge_curation.py
    def _get_considered_personas(self, topic: str, max_num_persona) -> List[str]:
        return self.persona_generator.generate_persona(topic=topic, max_num_persona=max_num_persona)

    This snippet shows how the system generates different perspectives (personas) for the topic.

  • Simulated Conversation:

    # knowledge_storm/storm_wiki/modules/knowledge_curation.py
    def _run_conversation(self, conv_simulator, topic, ground_truth_url, considered_personas,
                            callback_handler: BaseCallbackHandler) -> List[Tuple[str, List[DialogueTurn]]]:
        # ...
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            future_to_persona = {executor.submit(run_conv, persona): persona for persona in considered_personas}
            # ...
            for future in as_completed(future_to_persona):
                persona = future_to_persona[future]
                conv = future.result()
                conversations.append((persona, ArticleTextProcessing.clean_up_citation(conv).dlg_history))
        return conversations

    This snippet demonstrates how the system simulates conversations with different personas to gather information.
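The per-persona fan-out in `_run_conversation` follows the standard `ThreadPoolExecutor` map-and-collect pattern. Here is a self-contained sketch of that same pattern with a stubbed `run_conv`, so it can be run outside the repository:

```python
import concurrent.futures

def run_conv(persona: str) -> str:
    # Stub standing in for one simulated writer/expert conversation.
    return f"conversation held from the '{persona}' perspective"

personas = ["historian", "engineer", "ethicist"]
conversations = []
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Submit one conversation per persona, then collect results as they finish.
    future_to_persona = {executor.submit(run_conv, p): p for p in personas}
    for future in concurrent.futures.as_completed(future_to_persona):
        persona = future_to_persona[future]
        conversations.append((persona, future.result()))
```

Because results arrive via `as_completed`, the order of `conversations` reflects completion order, not submission order.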

Writing Stage:

  • Outline Generation:

    # knowledge_storm/storm_wiki/modules/outline_generation.py
    def generate_outline(self,
                         topic: str,
                         information_table: StormInformationTable,
                         old_outline: Optional[StormArticle] = None,
                         callback_handler: BaseCallbackHandler = None,
                         return_draft_outline=False) -> Union[StormArticle, Tuple[StormArticle, StormArticle]]:
        # ...
        result = self.write_outline(topic=topic, dlg_history=concatenated_dialogue_turns,
                                    callback_handler=callback_handler)
        article_with_outline_only = StormArticle.from_outline_str(topic=topic, outline_str=result.outline)
        # ...
        return article_with_outline_only, article_with_draft_outline_only

    This snippet shows how the system generates an outline for the article based on the collected information and conversation history.
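`StormArticle.from_outline_str` builds the article skeleton from a markdown-style outline string. The following is a hypothetical sketch of that kind of parsing (not STORM's actual implementation): it maps each `#`-prefixed heading to its nesting depth.

```python
def parse_outline(outline_str: str) -> dict:
    """Map each heading title to its nesting depth (number of leading '#')."""
    tree = {}
    for line in outline_str.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            depth = len(stripped) - len(stripped.lstrip("#"))
            title = stripped.lstrip("# ").strip()
            tree[title] = depth
    return tree

outline = parse_outline("# History\n## Early work\n# Applications")
```

A real implementation would build a proper tree of section nodes, but the depth map is enough to show how a flat outline string encodes hierarchy.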

  • Article Generation:

    # knowledge_storm/storm_wiki/modules/article_generation.py
    def generate_article(self,
                         topic: str,
                         information_table: StormInformationTable,
                         article_with_outline: StormArticle,
                         callback_handler: BaseCallbackHandler = None) -> StormArticle:
        # ...
        for section_output_dict in section_output_dict_collection:
            article.update_section(parent_section_name=topic,
                                    current_section_content=section_output_dict["section_content"],
                                    current_section_info_list=section_output_dict["collected_info"])
        article.post_processing()
        return article

    This snippet illustrates how the system generates the full-length article by populating the outline with the collected information.

  • Article Polishing:

    # knowledge_storm/storm_wiki/modules/article_polish.py
    def polish_article(self,
                         topic: str,
                         draft_article: StormArticle,
                         remove_duplicate: bool = False) -> StormArticle:
        # ...
        polish_result = self.polish_page(topic=topic, draft_page=article_text, polish_whole_page=remove_duplicate)
        lead_section = f"# summary\n{polish_result.lead_section}"
        polished_article = '\n\n'.join([lead_section, polish_result.page])
        # ...
        return polished_article

    This snippet shows how the system refines the generated article by adding a summary section and optionally removing duplicate content.
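The string assembly at the end of `polish_article` can be seen in isolation. Below is a minimal sketch, with a stubbed stand-in for the `polish_page` result, of how the generated lead section is prepended under a `# summary` heading:

```python
class PolishResult:
    # Stub mimicking the two fields polish_article reads from polish_page's output.
    def __init__(self, lead_section: str, page: str):
        self.lead_section = lead_section
        self.page = page

polish_result = PolishResult(
    lead_section="STORM generates Wikipedia-like articles automatically.",
    page="# History\nEarly research focused on symbolic methods.",
)

lead_section = f"# summary\n{polish_result.lead_section}"
polished_article = "\n\n".join([lead_section, polish_result.page])
```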

Overall, STORM is a modular system that leverages large language models (LLMs) and internet search to automate the research and pre-writing process for Wikipedia-like articles. It aims to provide a valuable tool for users who want to create informative and well-structured articles.

β–² Tell me more about the "Simulated Conversation", who are the participants and what are the input & output of this conversation?

The "Simulated Conversation" in STORM is a key part of the pre-writing stage. It involves two participants:

  1. Wikipedia Writer: This is a simulated persona that represents a Wikipedia editor who is writing an article on the given topic. The Wikipedia Writer is designed to ask questions that are relevant to the topic and that would be helpful for a Wikipedia editor to know.
  2. Topic Expert: This is a simulated persona that represents an expert on the topic. The Topic Expert is designed to answer the Wikipedia Writer's questions using information retrieved from the internet.

Here's a breakdown of the input and output of this conversation:

Input:

  • Topic: The topic of the article being written.
  • Persona: The Wikipedia Writer's persona, which might be a specific role or perspective related to the topic. This helps to guide the questions asked.
  • Dialogue History: The previous turns of the conversation, which helps the Wikipedia Writer to ask more relevant and insightful questions.

Output:

  • Dialogue History: A record of the conversation, including the questions asked by the Wikipedia Writer and the answers provided by the Topic Expert.
  • Search Results: A list of information retrieved from the internet by the Topic Expert in response to the Wikipedia Writer's questions. This information is used to generate the article outline and content.
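Each turn of the conversation bundles the writer's question, the expert's answer, and the expert's search trace into a single record. A dataclass sketch of that record (the field names follow the repository's `DialogueTurn`; the sketch itself is illustrative, not the repository's definition):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueTurn:
    # One writer/expert exchange, plus the expert's search trace.
    user_utterance: str                                  # question from the Wikipedia Writer
    agent_utterance: str                                 # answer from the Topic Expert
    search_queries: List[str] = field(default_factory=list)
    search_results: List[str] = field(default_factory=list)

turn = DialogueTurn(
    user_utterance="What are the main types of artificial intelligence?",
    agent_utterance="Narrow AI, general AI, and super AI.",
    search_queries=["types of artificial intelligence"],
)
```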

Example:

Let's say the topic is "Artificial Intelligence". The Wikipedia Writer might have the persona of a "Computer Science Student". The conversation might go like this:

Wikipedia Writer: What are the main types of artificial intelligence?

Topic Expert: Artificial intelligence can be broadly categorized into three main types: narrow AI, general AI, and super AI. Narrow AI is designed to perform specific tasks, such as playing chess or recognizing faces. General AI refers to machines that can perform any intellectual task that a human can. Super AI refers to machines that surpass human intelligence in all aspects.

Wikipedia Writer: Can you tell me more about the history of artificial intelligence?

Topic Expert: The field of artificial intelligence has its roots in the 1950s, with the Dartmouth Summer Research Project on Artificial Intelligence being considered a landmark event. Early research focused on developing programs that could play games, translate languages, and solve problems.

This conversation would continue until the Wikipedia Writer has gathered enough information to start writing the article.

Code Snippet:

# knowledge_storm/storm_wiki/modules/knowledge_curation.py
class ConvSimulator(dspy.Module):
    # ...
    def forward(self, topic: str, persona: str, ground_truth_url: str, callback_handler: BaseCallbackHandler):
        # ...
        for _ in range(self.max_turn):
            user_utterance = self.wiki_writer(topic=topic, persona=persona, dialogue_turns=dlg_history).question
            # ...
            expert_output = self.topic_expert(topic=topic, question=user_utterance, ground_truth_url=ground_truth_url)
            dlg_turn = DialogueTurn(
                agent_utterance=expert_output.answer,
                user_utterance=user_utterance,
                search_queries=expert_output.queries,
                search_results=expert_output.searched_results
            )
            dlg_history.append(dlg_turn)
            callback_handler.on_dialogue_turn_end(dlg_turn=dlg_turn)
        return dspy.Prediction(dlg_history=dlg_history)

This snippet shows how the ConvSimulator module simulates the conversation between the Wikipedia Writer and the Topic Expert. The wiki_writer and topic_expert modules handle the question asking and answering, respectively. The dlg_history variable stores the conversation history, and the search_results are collected from the internet searches.

The simulated conversation is a key part of STORM's ability to generate comprehensive and informative articles. By simulating a dialogue between a Wikipedia writer and an expert, the system can gather a wide range of information and perspectives on the topic, which is then used to generate a well-structured and informative article.
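Stripped of dspy and real retrieval, the `forward` loop reduces to alternating calls to a question module and an answer module for a fixed number of turns. A minimal sketch with stubbed modules (both stubs are hypothetical; the real modules call an LLM and a search engine):

```python
def wiki_writer(topic: str, dialogue_turns: list) -> str:
    # Stub: ask a numbered follow-up question based on history length.
    return f"Question {len(dialogue_turns) + 1} about {topic}?"

def topic_expert(topic: str, question: str) -> str:
    # Stub: answer without real retrieval.
    return f"Answer to '{question}'"

def simulate_conversation(topic: str, max_turn: int = 3) -> list:
    dlg_history = []
    for _ in range(max_turn):
        question = wiki_writer(topic, dlg_history)
        answer = topic_expert(topic, question)
        dlg_history.append((question, answer))
    return dlg_history

history = simulate_conversation("Artificial Intelligence")
```

The key design point survives the simplification: the writer sees the accumulated `dlg_history` when forming each question, which is what lets later questions build on earlier answers.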
