Skip to content

Instantly share code, notes, and snippets.

@cpursley
Last active June 28, 2024 13:53
Show Gist options
  • Save cpursley/b4af2ff3b56c912f659bd5300e422790 to your computer and use it in GitHub Desktop.
Save cpursley/b4af2ff3b56c912f659bd5300e422790 to your computer and use it in GitHub Desktop.
AI Web Search
# You will need to install https://github.com/cpursley/html2markdown
defmodule Webpage do
@moduledoc false
defstruct [:url, :title, :description, :summary, :page_age]
end
defmodule WebSearch do
@moduledoc """
Web search summarization chain
"""
alias LangChain.Chains.LLMChain
alias LangChain.Message
defstruct [:query, :summary, :sources]
@doc """
Summarizes the search results for a given query and returns a structured summary of the web pages and their sources.
TODO:
- Shorten-up prompt instructions
- More robust relevance and ranking
- Make max_token_length configurable
- Don't trim long context, recursively split it into sections
- Consider allowing PDF web result parsing
- Extract out to Langchain Chain with config that allows different llm(s) to be passed in
"""
def summarize(search_query, limit \\ 8, timeout \\ 10_000) do
with %{"web" => %{"results" => results}} <- Brave.search(search_query, limit),
summarized_search_results when is_list(summarized_search_results) <-
summarize_search_results(search_query, results, timeout),
summary when is_binary(summary) <- webpages_summarizer(search_query, summarized_search_results),
sources when is_list(sources) <- map_sources(summarized_search_results) do
{
:ok,
%WebSearch{
query: search_query,
summary: summary,
sources: sources
}
}
else
{:error, error} ->
{:error, error}
_ ->
{:error, "Failed to summarize search results"}
end
end
def summarize_webpage(search_query, %{"url" => url} = result) when is_binary(url) do
with content when is_binary(content) <- request_content(url),
parsed_content when is_binary(parsed_content) <- preprocess_webpage_content(content),
summary when is_binary(summary) <- webpage_summarizer(search_query, parsed_content) do
%Webpage{
url: url,
title: result["title"],
description: Html2Markdown.convert(result["description"]),
summary: summary,
page_age: cast_page_age(result["page_age"])
}
else
{:error, error} ->
{:error, error}
_ ->
{:error, "Failed to summarize webpage"}
end
end
defp summarize_search_results(search_query, results, timeout) do
results
|> Enum.map(fn result -> Map.take(result, ["title", "description", "url", "page_age"]) end)
|> Task.async_stream(&summarize_webpage(search_query, &1), timeout: timeout, on_timeout: :kill_task)
|> Enum.filter(fn
{:ok, %Webpage{} = result} -> result
_ -> false
end)
|> Enum.map(fn {:ok, result} -> result end)
end
defp webpages_summarizer(search_query, results) do
llm = Models.llama_v3_8b_instruct()
system_message = """
You are a helpful web search results summarizer. Your task is to deliver a concise and accurate response to a user's query, drawing from the provided search result summaries.
Please combine the following web page summaries into a single, comprehensive summary. The individual summaries have been generated by an LLM and cover different aspects or sections of a larger topic.
Before combining the summaries, consider the following:
- Assess the relevance of each individual summary to the original user query.
- Give higher preference to summaries with a newer page age when the topic of the user's query is time-sensitive.
- Filter out summaries that are not strongly relevant to the user's query or do not contribute significantly to answering the question.
- Ignore any summaries that are empty, have no content, or contain only whitespace characters.
When creating the combined summary, consider the following:
- Identify common themes, topics, or ideas across the relevant individual summaries.
- Organize the information in a logical and coherent manner, ensuring a smooth flow between the different sections.
- Synthesize the key points and main takeaways from each relevant summary, while avoiding repetition or redundancy.
- Maintain the accuracy and integrity of the information presented in the original summaries.
- Use clear and concise language to convey the combined summary effectively using an unbiased and journalistic tone.
- If there are any contradictory or conflicting points across the summaries, try to reconcile them or present them objectively.
- Don't use phrases like "here is", "this article", "this webpage", "the page", "the content" or other hedging language.
- Focus on providing information that directly addresses the user's query and helps answer their question comprehensively.
Combining data:
- If you encounter similar or overlapping data lists or tables across multiple summaries, merge them into a single, comprehensive list or table.
- Identify the common fields or properties present in the overlapping data lists.
- Merge the data from the overlapping lists and table, ensuring that each unique entry is represented only once in the combined list or table.
- If there are conflicting values for the same entry across different lists, use your best judgment to determine the most accurate or relevant value to include in the combined list.
Formatting:
- Use appropriate headings, subheadings, or bullet points to organize the information.
- If the data lends itself well to a tabular format (e.g., comparisons, lists with multiple properties), consider presenting it in a markdown table.
- If a table is not suitable, use other appropriate markdown formatting such as lists, code blocks, or blockquotes to present the information effectively.
- Do not trim or remove any relevant data from the tables or lists and don't use placeholders
- Do not list your sources and never write URLs or links!
Please provide your response in well structured markdown format. But don't mention "markdown" in your response.
"""
user_message = """
User query: #{search_query}
## Individual web page summaries:
#{map_summaries(results)}
"""
messages = [
Message.new_system!(system_message),
Message.new_user!(user_message)
]
run_chain(llm, messages)
end
defp webpage_summarizer(search_query, content) do
llm = Models.phi_3_mini_128k_instruct()
system_message = """
You are a helpful web page data extractor and summarizer.
Please analyze the following web page content and extract the key meanings into a summary without losing any important information and extract the structured data without modifying its format.
Before summarizing the content, consider the following:
- Assess the relevance of the content to the original user query.
- Filter out content that are not strongly relevant to the user's query or do not contribute significantly to answering the question.
- Ignore any content that is empty, or contain only whitespace characters.
Summary:
- Identify the main themes, topics, or ideas discussed in the content.
- Recognize important facts, figures, or examples that support the main points.
- Capture any essential context or background information necessary for understanding the content.
- Avoid repetition and eliminate any redundant or less critical information.
- Organize the summary by grouping related meanings together under relevant headings or sections.
- Don't return any promotional or irrelevant information.
- Use clear and concise language to convey the content effectively using an unbiased and journalistic tone.
- Don't use phrases like "this article", "this webpage", "the page", "the content" or other hedging language.
- Focus on providing information that directly addresses the user's query and helps answer their question comprehensively.
Data:
- Identify and extract tables, lists, code snippets, or any other formatted data present in the content.
- Maintain the original structure and formatting of the extracted data.
- Ensure that no information is lost or altered during the extraction process.
- If there are multiple instances of structured data, extract each instance separately.
Please provide your response in well structured markdown format. But don't mention "markdown" in your response.
"""
user_message = """
User query: #{search_query}
## Web page content to summarize:
```html
#{content}
```
"""
messages = [
Message.new_system!(system_message),
Message.new_user!(user_message)
]
run_chain(llm, messages)
end
defp run_chain(llm, messages) do
%{llm: llm, verbose: false}
|> LLMChain.new!()
|> LLMChain.add_messages(messages)
|> LLMChain.run(mode: :while_needs_response)
|> case do
{:ok, _chain, %{content: content}} ->
content
error ->
error
end
end
defp map_sources(summarized_webpages) do
Enum.map(summarized_webpages, fn summarized_webpage ->
%{
url: summarized_webpage.url,
title: summarized_webpage.title
}
end)
end
defp map_summaries(results) do
# Llama 3 estimated token length (with some wiggle-room): (string length) / 4
max_token_length = 7_200 * 4
results
|> Enum.with_index()
|> Enum.map_join("\n", fn {result, index} ->
"""
### Web Page #{index + 1}:
Title: #{result.title}
Description: #{result.description}
Summary: #{result.summary}
Page Age: #{calculate_page_age(result.page_age)}
"""
end)
|> maybe_trim_to_context_limit(max_token_length)
end
defp request_content(url) do
case URI.new(url) do
{:ok, %URI{scheme: "https", path: path}} ->
unless is_pdf_uri?(path) do
url
|> fetch_content()
|> Html2Markdown.convert()
end
_ ->
nil
end
end
defp is_pdf_uri?(path), do: Path.extname(path) == ".pdf"
defp fetch_content(url) do
case Req.get(url) do
{:ok, %Req.Response{status: 200, body: content}} -> content
{:ok, request} -> {:error, request}
{:error, error} -> {:error, error}
end
end
defp preprocess_webpage_content(content) do
# Phi 3 estimated token length (with some wiggle-room): (string length) / 4
max_token_length = 85_000 * 4
maybe_trim_to_context_limit(content, max_token_length)
end
defp maybe_trim_to_context_limit(content, max_token_length)
when is_binary(content) and byte_size(content) <= max_token_length do
content
end
defp maybe_trim_to_context_limit(content, max_token_length)
when is_binary(content) and byte_size(content) >= max_token_length do
content
|> String.slice(0, max_token_length)
|> String.trim()
end
defp cast_page_age(date_string) when is_binary(date_string) do
case NaiveDateTime.from_iso8601(date_string) do
{:ok, parsed_date} ->
parsed_date
{:error, _error} ->
nil
end
end
defp cast_page_age(_date_string), do: nil
defp calculate_page_age(nil), do: "Unknown age"
defp calculate_page_age(%NaiveDateTime{} = page_age) do
total_days =
NaiveDateTime.utc_now()
|> NaiveDateTime.diff(page_age, :second)
|> div(86_400)
cond do
total_days < 60 ->
"#{total_days} " <> Inflex.inflect("day", total_days)
total_days < 365 ->
months = div(total_days, 30)
"#{months} " <> Inflex.inflect("month", months)
true ->
years = div(total_days, 365)
"#{years} " <> Inflex.inflect("year", years)
end
end
end
defmodule Requests.Brave do
@moduledoc """
Web search using Brave
Docs: https://api.search.brave.com/app/documentation/web-search/get-started
"""
@brave_search_url "https://api.search.brave.com/res/v1/web/search"
@brave_api_key Application.compile_env(:your_app, :brave_api_key)
@headers [
{"Accept", "application/json"},
{"Accept-Encoding", "gzip"},
{"X-Subscription-Token", @brave_api_key}
]
def search(query, count \\ 20, result_filter \\ "query, web") do
params = %{q: query, result_filter: result_filter, count: count}
case Req.get(@brave_search_url, headers: @headers, params: params) do
{:ok, %Req.Response{status: 200, body: body}} ->
body
{:ok, %Req.Response{} = response} ->
{:error, response}
{:error, reason} ->
{:error, reason}
end
end
defp get(url, params) do
Req.get(url, headers: @headers, params: params)
end
end
@cpursley
Copy link
Author

cpursley commented Jun 16, 2024

Example:

iex> WebSearch.summarize("What is cheese")

%WebSearch{
  query: "What is cheese",
  summary: "**What is Cheese?**\n\nCheese is a dairy product produced in a range of flavors, textures, and forms by coagulation of the milk protein casein. It is made from the milk of cows, buffalo, goats, or sheep, and is valued for its portability, long shelf life, and high content of fat, protein, calcium, and phosphorus. There are over a thousand types of cheese, each with distinct styles, textures, and flavors influenced by factors such as milk origin, processing, and aging.\n\n**History and Production**\n\nCheese has a rich history dating back over 4,000 years, with the process of cheesemaking evolving significantly over time. The basic ingredients of milk and coagulant are used to create a variety of cheeses, including fresh, soft-ripened, semi-soft, semi-hard, and processed cheeses. The aging process is crucial for developing flavor, texture, and aroma, and can vary from a few days to many years.\n\n**Ingredients and Nutritional Value**\n\nCheese is made from milk, salt, beneficial bacteria, and rennet. It is a significant source of high-quality protein, calcium, phosphorus, niacin, and vitamins B12 and A. Cheese also contains saturated fat, which can contribute to weight gain and heart disease if overconsumed.\n\n**Types of Cheese**\n\nThere are hundreds of cheese varieties made from different animals' milk, including blue, hard, pasta filata, processed, semihard, semisoft, soft, and fresh cheeses. Each type of cheese has its own unique characteristics, flavor, and texture.\n\n**Health Benefits and Risks**\n\nCheese can be a healthy addition to a balanced diet, providing essential nutrients and health benefits such as reducing the risk of type 2 diabetes, preventing cavities, and possibly aiding in weight management and cognitive health. However, it can also be high in saturated fat, sodium, and calories, which may pose risks such as high blood pressure, cardiovascular disease, and type 2 diabetes.\n\n**How to Choose and Store Cheese**\n\nWhen selecting cheese, it is important to consider factors such as milk origin, processing, and aging. Natural cheeses are usually low in lactose and suitable for those with lactose intolerance. Cheese can be stored in a variety of ways, including refrigeration, freezing, and aging.\n\n**Conclusion**\n\nCheese is a complex and multifaceted food that has been a part of human culture for thousands of years. With its rich history, diverse types, and nutritional value, cheese can be a healthy and enjoyable addition to a balanced diet. However, it is important to choose cheese wisely and consider factors such as milk origin, processing, and aging to ensure a safe and healthy consumption.",
  sources: [
    %{title: "Cheese - Wikipedia", url: "https://en.wikipedia.org/wiki/Cheese"},
    %{
      title: "What Is Cheese? Health Benefits, Risks, Types, Top Sellers, More",
      url: "https://www.everydayhealth.com/diet-nutrition/diet/cheese-health-benefits-risks-types-top-sellers-more/"
    },
    %{
      title: "How is Cheese Made? | What is Cheese Made Of? | The Process",
      url: "https://sclydeweaver.com/blog/how-is-cheese-made/"
    },
    %{
      title: "What Is Cheese? Ingredients In Cheese | U.S. Dairy",
      url: "https://www.usdairy.com/news-articles/what-is-cheese"
    },
    %{
      title: "Cheese | Description, Types, & Production | Britannica",
      url: "https://www.britannica.com/topic/cheese"
    },
    %{
      title: "Cheese.com - World's Greatest Cheese Resource",
      url: "https://www.cheese.com/"
    },
    %{
      title: "Cheese: Types, health benefits, and risks",
      url: "https://www.medicalnewstoday.com/articles/299147"
    }
  ]
}

Formatted Summary:

What is Cheese?

Cheese is a dairy product produced in a range of flavors, textures, and forms by coagulation of the milk protein casein. It is made from the milk of cows, buffalo, goats, or sheep, and is valued for its portability, long shelf life, and high content of fat, protein, calcium, and phosphorus. There are over a thousand types of cheese, each with distinct styles, textures, and flavors influenced by factors such as milk origin, processing, and aging.

History and Production

Cheese has a rich history dating back over 4,000 years, with the process of cheesemaking evolving significantly over time. The basic ingredients of milk and coagulant are used to create a variety of cheeses, including fresh, soft-ripened, semi-soft, semi-hard, and processed cheeses. The aging process is crucial for developing flavor, texture, and aroma, and can vary from a few days to many years.

Ingredients and Nutritional Value

Cheese is made from milk, salt, beneficial bacteria, and rennet. It is a significant source of high-quality protein, calcium, phosphorus, niacin, and vitamins B12 and A. Cheese also contains saturated fat, which can contribute to weight gain and heart disease if overconsumed.

Types of Cheese

There are hundreds of cheese varieties made from different animals' milk, including blue, hard, pasta filata, processed, semihard, semisoft, soft, and fresh cheeses. Each type of cheese has its own unique characteristics, flavor, and texture.

Health Benefits and Risks

Cheese can be a healthy addition to a balanced diet, providing essential nutrients and health benefits such as reducing the risk of type 2 diabetes, preventing cavities, and possibly aiding in weight management and cognitive health. However, it can also be high in saturated fat, sodium, and calories, which may pose risks such as high blood pressure, cardiovascular disease, and type 2 diabetes.

How to Choose and Store Cheese

When selecting cheese, it is important to consider factors such as milk origin, processing, and aging. Natural cheeses are usually low in lactose and suitable for those with lactose intolerance. Cheese can be stored in a variety of ways, including refrigeration, freezing, and aging.

Conclusion

Cheese is a complex and multifaceted food that has been a part of human culture for thousands of years. With its rich history, diverse types, and nutritional value, cheese can be a healthy and enjoyable addition to a balanced diet. However, it is important to choose cheese wisely and consider factors such as milk origin, processing, and aging to ensure a safe and healthy consumption.

Perplexity

Same question on Perplexity: What is cheese?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment