@virattt
Created August 31, 2023 00:06
sec_filing_weaviate_tutorial.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/virattt/080446be8df07eb47c4fa55a51550e28/sec_filing_weaviate_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6kJ9r4Gkgufk"
},
"source": [
"# Overview\n",
"\n",
"This notebook teaches you how to chat with an SEC filing using Weaviate.\n",
"\n",
"In my example, I am using Airbnb's quarterly earnings report (10-Q) from Q2 2023. You can update the URL value of `sec_filing_pdf` below to be any report that you want.\n",
"\n",
"I hope you find this code useful.\n",
"\n",
"Please feel free to message me on [Twitter](https://twitter.com/virattt) if you want more tutorials like this.\n",
"\n",
"Happy learning! :)"
]
},
{
"cell_type": "markdown",
"source": [
"# Step 0. Install dependencies"
],
"metadata": {
"id": "S2mGQxA958dW"
}
},
{
"cell_type": "code",
"source": [
"pip install openai"
],
"metadata": {
"id": "2bY0NapN_z98"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lEQQJHH9gufm"
},
"outputs": [],
"source": [
"!pip install -U weaviate-client"
]
},
{
"cell_type": "code",
"source": [
"pip install langchain"
],
"metadata": {
"id": "ygccK6lm54VT"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"pip install tiktoken"
],
"metadata": {
"id": "K5KyVC5O7Elw"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"pip install pypdf"
],
"metadata": {
"id": "_o1MOUo07GBO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Step 1. Set up Weaviate"
],
"metadata": {
"id": "bR6Iagsz6EE8"
}
},
{
"cell_type": "markdown",
"source": [
"In order to use Weaviate, you need to sign up for an account [here](https://weaviate.io/). Once you have created an account, you can create a **free** sandbox cluster. [This](https://weaviate.io/developers/wcs/quickstart) is a great guide on how to set up your Weaviate Cloud Services cluster. Once your free sandbox cluster is created, set your `url` and `api_key` below. Additionally, set your `openai_api_key`, which you can get from [here](https://platform.openai.com/account/api-keys)."
],
"metadata": {
"id": "U17e7uN0_UfP"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "evv-gLj-gufn"
},
"outputs": [],
"source": [
"import weaviate\n",
"import json\n",
"import os\n",
"\n",
"# Weaviate config\n",
"weaviate_cluster_url = \"YOUR_SANDBOX_CLUSTER_URL\"\n",
"weaviate_api_key = \"YOUR_WEAVIATE_API_KEY\"\n",
"weaviate_index_name = \"YOUR_WEAVIATE_INDEX_NAME\"\n",
"\n",
"# OpenAI config\n",
"openai_api_key = \"YOUR_OPENAI_API_KEY\"\n",
"\n",
"# Create Weaviate Cloud Services client\n",
"client = weaviate.Client(\n",
" url = weaviate_cluster_url, # Replace with your cluster URL\n",
" auth_client_secret=weaviate.AuthApiKey(api_key=weaviate_api_key), # Replace w/ your Weaviate instance API key\n",
" additional_headers = {\n",
" \"X-OpenAI-Api-Key\": openai_api_key # Replace with your inference API key\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"source": [
"# Step 2. Prepare the data"
],
"metadata": {
"id": "sz639zFf6JoK"
}
},
{
"cell_type": "markdown",
"source": [
"### 2.1. Load and chunk your PDF document"
],
"metadata": {
"id": "RDi3DaCu6u80"
}
},
{
"cell_type": "code",
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
"\n",
"# Load $ABNB's financial report. This may take 1-2 minutes since the PDF is large\n",
"sec_filing_pdf = \"https://s26.q4cdn.com/656283129/files/doc_financials/2023/q2/3aec2916-f24a-4a9e-8a59-bdbcabe8c4bb.pdf\"\n",
"\n",
"# Create your PDF loader\n",
"loader = PyPDFLoader(sec_filing_pdf)\n",
"\n",
"# Load the PDF document\n",
"documents = loader.load()"
],
"metadata": {
"id": "rIO5t-j7611h"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"\n",
"# Chunk the financial report\n",
"docs = text_splitter.split_documents(documents)"
],
"metadata": {
"id": "-MN_TirA63fl"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"texts = [d.page_content for d in documents]\n",
"metadatas = [d.metadata for d in documents]"
],
"metadata": {
"id": "erCkNwMP7MeE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Inspect the first text chunk to make sure it looks OK\n",
"print(texts[0])"
],
"metadata": {
"id": "aLYEiTdo_etl"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### 2.2 Generate vector embeddings using OpenAI"
],
"metadata": {
"id": "iaYSqxiMLUGb"
}
},
{
"cell_type": "code",
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"\n",
"embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)"
],
"metadata": {
"id": "ftucuzAtLXHV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Generate vector embeddings\n",
"embeddings = embedding.embed_documents(texts) if embedding else None\n",
"attributes = list(metadatas[0].keys()) if metadatas else None"
],
"metadata": {
"id": "QVZevdc-Md4N"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Check the dimensions of the vector embedding (default is 1536)\n",
"len(embeddings[1])"
],
"metadata": {
"id": "H0-4fGw0UF-w"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"schema = {\n",
" \"class\": weaviate_index_name,\n",
" \"vectorizer\": \"text2vec-openai\", # If set to \"none\" you must always provide vectors yourself. Could be any other \"text2vec-*\" also.\n",
" \"moduleConfig\": {\n",
" \"text2vec-openai\": {},\n",
" \"generative-openai\": {} # Ensure the `generative-openai` module is used for generative queries\n",
" }\n",
"}\n",
"\n",
"# Create the \"DB schema\" in Weaviate\n",
"if not client.schema.exists(weaviate_index_name):\n",
" client.schema.create_class(schema)"
],
"metadata": {
"id": "ehxJdzmNFGYj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### 2.3 Store data in Weaviate"
],
"metadata": {
"id": "Xpm5HOT3M7LN"
}
},
{
"cell_type": "code",
"source": [
"# The ticker of the company that we uploaded the SEC filing for\n",
"ticker = \"ABNB\"\n",
"\n",
"# Batch upload all of your text to Weaviate\n",
"with client.batch(batch_size=100) as batch:\n",
" # Iteratively upload each text\n",
" for i, text in enumerate(texts):\n",
" properties = {\n",
" \"text\": text,\n",
" \"ticker\": ticker,\n",
" }\n",
"\n",
" custom_vector = embeddings[i]\n",
" client.batch.add_data_object(\n",
" properties,\n",
" weaviate_index_name,\n",
" vector=custom_vector\n",
" )"
],
"metadata": {
"id": "XUL0d0QLGWQp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "iUsF2E6vgufr"
},
"source": [
"# Step 4. Query the data from Weaviate"
]
},
{
"cell_type": "code",
"source": [
"query = f\"What was ABNB's net income in Q1 2023?\""
],
"metadata": {
"id": "_hzgrVGdBl_b"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2QUAAKH2gufs"
},
"source": [
"#### Generative search (single prompt)\n",
"\n",
"Next, let's try a generative search, where search results are processed with a large language model (LLM).\n",
"\n",
"Here, we use a `single prompt` query, and the model to explain each answer in plain terms."
]
},
{
"cell_type": "code",
"source": [
"nearText = {\"concepts\": [query]}\n",
"\n",
"response = (\n",
" client.query\n",
" .get(weaviate_index_name, [\"text\", \"ticker\"])\n",
" .with_near_text(nearText)\n",
" .with_generate(single_prompt=\"Using {text} please answer the following query: \" + query)\n",
" .with_limit(1)\n",
" .do()\n",
")\n",
"\n",
"print(response[\"data\"][\"Get\"][weaviate_index_name.capitalize()][0][\"_additional\"][\"generate\"][\"singleResult\"])"
],
"metadata": {
"id": "JLODXBAFBJzF"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4,
"colab": {
"provenance": [],
"collapsed_sections": [
"S2mGQxA958dW"
],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}
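
The last cell of the notebook reads the generated answer by indexing straight into the nested response, which raises a `KeyError` or `IndexError` when the search returns no hits. Below is a minimal sketch of a more defensive way to pull out the answer. The helper names `graphql_class_key` and `extract_single_result` are hypothetical, the top-level `errors` key is an assumption based on the usual GraphQL response convention, and the `sample` dictionary is a hand-written stand-in that mirrors the shape used in the notebook, not real Weaviate output.

```python
from typing import Any, Optional


def graphql_class_key(index_name: str) -> str:
    """Upper-case only the first character of the class name.

    Assumption: this matches how the class is keyed in the response.
    Note that str.capitalize() would also lower-case the remaining characters.
    """
    return index_name[:1].upper() + index_name[1:]


def extract_single_result(response: dict, index_name: str) -> Optional[str]:
    """Return the generated answer of the top hit, or None if there are no hits."""
    # Assumption: failures are reported under a top-level "errors" key.
    if "errors" in response:
        raise RuntimeError(f"Query failed: {response['errors']}")

    hits: list[dict[str, Any]] = (
        response.get("data", {}).get("Get", {}).get(graphql_class_key(index_name), [])
    )
    if not hits:
        return None

    generate = hits[0].get("_additional", {}).get("generate") or {}
    return generate.get("singleResult")


# Hand-written stand-in for a response (hypothetical data, not real output)
sample = {
    "data": {
        "Get": {
            "SecFilings": [
                {"_additional": {"generate": {"singleResult": "example answer"}}}
            ]
        }
    }
}

print(extract_single_result(sample, "secFilings"))  # prints "example answer"
```

In the notebook, the final `print(...)` could then become `print(extract_single_result(response, weaviate_index_name))`, which prints `None` instead of failing when nothing is retrieved.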