Created
August 31, 2023 00:06
sec_filing_weaviate_tutorial.ipynb
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/virattt/080446be8df07eb47c4fa55a51550e28/sec_filing_weaviate_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "6kJ9r4Gkgufk" | |
}, | |
"source": [ | |
"# Overview\n", | |
"\n", | |
"This notebook teaches you how to chat with an SEC filing using Weaviate.\n", | |
"\n", | |
"In my example, I am using Airbnb's quarterly earnings report (10-Q) from Q2 2023. You can update the URL value of `sec_filing_pdf` below to be any report that you want.\n", | |
"\n", | |
"I hope you find this code useful.\n", | |
"\n", | |
"Please feel free to message me on [Twitter](https://twitter.com/virattt) if you want more tutorials like this.\n", | |
"\n", | |
"Happy learning! :)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Step 0. Install dependencies" | |
], | |
"metadata": { | |
"id": "S2mGQxA958dW" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pip install openai" | |
], | |
"metadata": { | |
"id": "2bY0NapN_z98" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "lEQQJHH9gufm" | |
}, | |
"outputs": [], | |
"source": [ | |
"!pip install -U weaviate-client" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pip install langchain" | |
], | |
"metadata": { | |
"id": "ygccK6lm54VT" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pip install tiktoken" | |
], | |
"metadata": { | |
"id": "K5KyVC5O7Elw" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!pip install pypdf" | |
], | |
"metadata": { | |
"id": "_o1MOUo07GBO" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Step 1. Set up Weaviate" | |
], | |
"metadata": { | |
"id": "bR6Iagsz6EE8" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"In order to use Weaviate, you need to sign up for an account [here](https://weaviate.io/). Once you have created an account, you can create a **free** sandbox cluster. [This](https://weaviate.io/developers/wcs/quickstart) is a great guide on how to set up your Weaviate Cloud Services cluster. Once your free sandbox cluster is created, set your `url` and `api_key` below. Additionally, set your `openai_api_key`, which you can get from [here](https://platform.openai.com/account/api-keys)." | |
], | |
"metadata": { | |
"id": "U17e7uN0_UfP" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "evv-gLj-gufn" | |
}, | |
"outputs": [], | |
"source": [ | |
"import weaviate\n", | |
"import json\n", | |
"import os\n", | |
"\n", | |
"# Weaviate config\n", | |
"weaviate_cluster_url = \"YOUR_SANDBOX_CLUSTER_URL\"\n", | |
"weaviate_api_key = \"YOUR_WEAVIATE_API_KEY\"\n", | |
"weaviate_index_name = \"YOUR_WEAVIATE_INDEX_NAME\"\n", | |
"\n", | |
"# OpenAI config\n", | |
"openai_api_key = \"YOUR_OPENAI_API_KEY\"\n", | |
"\n", | |
"# Create Weaviate Cloud Services client\n", | |
"client = weaviate.Client(\n", | |
"    url=weaviate_cluster_url,  # Replace with your cluster URL\n", | |
"    auth_client_secret=weaviate.AuthApiKey(api_key=weaviate_api_key),  # Replace with your Weaviate instance API key\n", | |
"    additional_headers={\n", | |
"        \"X-OpenAI-Api-Key\": openai_api_key  # Replace with your inference API key\n", | |
"    }\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Step 2. Prepare the data" | |
], | |
"metadata": { | |
"id": "sz639zFf6JoK" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### 2.1 Load and chunk your PDF document" | |
], | |
"metadata": { | |
"id": "RDi3DaCu6u80" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from langchain.document_loaders import PyPDFLoader\n", | |
"\n", | |
"# Load $ABNB's financial report. This may take 1-2 minutes since the PDF is large\n", | |
"sec_filing_pdf = \"https://s26.q4cdn.com/656283129/files/doc_financials/2023/q2/3aec2916-f24a-4a9e-8a59-bdbcabe8c4bb.pdf\"\n", | |
"\n", | |
"# Create your PDF loader\n", | |
"loader = PyPDFLoader(sec_filing_pdf)\n", | |
"\n", | |
"# Load the PDF document\n", | |
"documents = loader.load()" | |
], | |
"metadata": { | |
"id": "rIO5t-j7611h" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n", | |
"\n", | |
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", | |
"\n", | |
"# Chunk the financial report\n", | |
"docs = text_splitter.split_documents(documents)" | |
], | |
"metadata": { | |
"id": "-MN_TirA63fl" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Use the chunks produced by the text splitter, not the unsplit pages\n", | |
"texts = [d.page_content for d in docs]\n", | |
"metadatas = [d.metadata for d in docs]" | |
], | |
"metadata": { | |
"id": "erCkNwMP7MeE" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Inspect the first text chunk to make sure it looks OK\n", | |
"print(texts[0])" | |
], | |
"metadata": { | |
"id": "aLYEiTdo_etl" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### 2.2 Generate vector embeddings using OpenAI" | |
], | |
"metadata": { | |
"id": "iaYSqxiMLUGb" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from langchain.embeddings.openai import OpenAIEmbeddings\n", | |
"\n", | |
"embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)" | |
], | |
"metadata": { | |
"id": "ftucuzAtLXHV" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Generate vector embeddings\n", | |
"embeddings = embedding.embed_documents(texts) if embedding else None\n", | |
"attributes = list(metadatas[0].keys()) if metadatas else None" | |
], | |
"metadata": { | |
"id": "QVZevdc-Md4N" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Check the dimensions of the vector embedding (default is 1536)\n", | |
"len(embeddings[1])" | |
], | |
"metadata": { | |
"id": "H0-4fGw0UF-w" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"schema = {\n", | |
"    \"class\": weaviate_index_name,\n", | |
"    \"vectorizer\": \"text2vec-openai\",  # If set to \"none\", you must always provide vectors yourself. Any other \"text2vec-*\" module also works.\n", | |
"    \"moduleConfig\": {\n", | |
"        \"text2vec-openai\": {},\n", | |
"        \"generative-openai\": {}  # Ensure the `generative-openai` module is used for generative queries\n", | |
"    }\n", | |
"}\n", | |
"\n", | |
"# Create the \"DB schema\" in Weaviate\n", | |
"if not client.schema.exists(weaviate_index_name):\n", | |
"    client.schema.create_class(schema)" | |
], | |
"metadata": { | |
"id": "ehxJdzmNFGYj" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### 2.3 Store data in Weaviate" | |
], | |
"metadata": { | |
"id": "Xpm5HOT3M7LN" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# The ticker of the company that we uploaded the SEC filing for\n", | |
"ticker = \"ABNB\"\n", | |
"\n", | |
"# Batch upload all of your text to Weaviate\n", | |
"with client.batch(batch_size=100) as batch:\n", | |
"    # Iteratively upload each text\n", | |
"    for i, text in enumerate(texts):\n", | |
"        properties = {\n", | |
"            \"text\": text,\n", | |
"            \"ticker\": ticker,\n", | |
"        }\n", | |
"\n", | |
"        # Attach the precomputed OpenAI embedding for this text\n", | |
"        custom_vector = embeddings[i]\n", | |
"        batch.add_data_object(\n", | |
"            properties,\n", | |
"            weaviate_index_name,\n", | |
"            vector=custom_vector\n", | |
"        )" | |
], | |
"metadata": { | |
"id": "XUL0d0QLGWQp" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "iUsF2E6vgufr" | |
}, | |
"source": [ | |
"# Step 3. Query the data from Weaviate" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"query = \"What was ABNB's net income in Q1 2023?\"" | |
], | |
"metadata": { | |
"id": "_hzgrVGdBl_b" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "2QUAAKH2gufs" | |
}, | |
"source": [ | |
"#### Generative search (single prompt)\n", | |
"\n", | |
"Next, let's try a generative search, where search results are processed with a large language model (LLM).\n", | |
"\n", | |
"Here, we use a `single prompt` query and ask the model to answer the question using the text of each retrieved result." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"nearText = {\"concepts\": [query]}\n", | |
"\n", | |
"response = (\n", | |
"    client.query\n", | |
"    .get(weaviate_index_name, [\"text\", \"ticker\"])\n", | |
"    .with_near_text(nearText)\n", | |
"    .with_generate(single_prompt=\"Using {text} please answer the following query: \" + query)\n", | |
"    .with_limit(1)\n", | |
"    .do()\n", | |
")\n", | |
"\n", | |
"# Weaviate upper-cases only the first letter of a class name, whereas str.capitalize() also lower-cases the rest\n", | |
"class_name = weaviate_index_name[0].upper() + weaviate_index_name[1:]\n", | |
"print(response[\"data\"][\"Get\"][class_name][0][\"_additional\"][\"generate\"][\"singleResult\"])" | |
], | |
"metadata": { | |
"id": "JLODXBAFBJzF" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "base", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.12" | |
}, | |
"orig_nbformat": 4, | |
"colab": { | |
"provenance": [], | |
"collapsed_sections": [ | |
"S2mGQxA958dW" | |
], | |
"include_colab_link": true | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |