virattt/sec_filing_weaviate_tutorial.ipynb

## sec_filing_weaviate_tutorial.ipynb
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/virattt/080446be8df07eb47c4fa55a51550e28/sec_filing_weaviate_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6kJ9r4Gkgufk"
      },
      "source": [
        "# Overview\n",
        "\n",
        "This notebook teaches you how to chat with an SEC filing using Weaviate.\n",
        "\n",
        "In my example, I am using Airbnb's quarterly earnings report (10-Q) from Q2 2023.  You can update the URL value of `sec_filing_pdf` below to be any report that you want.\n",
        "\n",
        "I hope you find this code useful.\n",
        "\n",
        "Please feel free to message me on [Twitter](https://twitter.com/virattt) if you want more tutorials like this.\n",
        "\n",
        "Happy learning! :)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Step 0.  Install dependencies"
      ],
      "metadata": {
        "id": "S2mGQxA958dW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install openai"
      ],
      "metadata": {
        "id": "2bY0NapN_z98"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lEQQJHH9gufm"
      },
      "outputs": [],
      "source": [
        "!pip install -U weaviate-client"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install langchain"
      ],
      "metadata": {
        "id": "ygccK6lm54VT"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install tiktoken"
      ],
      "metadata": {
        "id": "K5KyVC5O7Elw"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install pypdf"
      ],
      "metadata": {
        "id": "_o1MOUo07GBO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Step 1.  Set up Weaviate"
      ],
      "metadata": {
        "id": "bR6Iagsz6EE8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To use Weaviate, sign up for an account [here](https://weaviate.io/).  Once you have an account, you can create a **free** sandbox cluster; [this quickstart](https://weaviate.io/developers/wcs/quickstart) is a great guide to setting up a Weaviate Cloud Services cluster.  Once the sandbox cluster is created, set `weaviate_cluster_url` and `weaviate_api_key` below, and choose a `weaviate_index_name` (Weaviate capitalizes the first letter of class names, so start it with an uppercase letter).  Additionally, set your `openai_api_key`, which you can get from [here](https://platform.openai.com/account/api-keys)."
      ],
      "metadata": {
        "id": "U17e7uN0_UfP"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "evv-gLj-gufn"
      },
      "outputs": [],
      "source": [
        "import weaviate\n",
        "import json\n",
        "import os\n",
        "\n",
        "# Weaviate config\n",
        "weaviate_cluster_url = \"YOUR_SANDBOX_CLUSTER_URL\"\n",
        "weaviate_api_key = \"YOUR_WEAVIATE_API_KEY\"\n",
        "weaviate_index_name = \"YOUR_WEAVIATE_INDEX_NAME\"\n",
        "\n",
        "# OpenAI config\n",
        "openai_api_key = \"YOUR_OPENAI_API_KEY\"\n",
        "\n",
        "# Create Weaviate Cloud Services client\n",
        "client = weaviate.Client(\n",
        "    url = weaviate_cluster_url,                                           # Replace with your cluster URL\n",
        "    auth_client_secret=weaviate.AuthApiKey(api_key=weaviate_api_key),     # Replace w/ your Weaviate instance API key\n",
        "    additional_headers = {\n",
        "        \"X-OpenAI-Api-Key\": openai_api_key                                # Replace with your inference API key\n",
        "    }\n",
        ")"
      ]
    },
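    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Optional: rather than pasting secrets into the notebook, you could read them from environment variables.  The cell below is a minimal sketch, assuming a hypothetical helper `read_setting` and the environment variable name `OPENAI_API_KEY` (neither is part of the original tutorial).  It only reports whether a key was found and does not change the client created above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "def read_setting(name, placeholder):\n",
        "    \"\"\"Hypothetical helper: prefer an environment variable, else fall back to the placeholder.\"\"\"\n",
        "    return os.environ.get(name, placeholder)\n",
        "\n",
        "# OPENAI_API_KEY is an assumed variable name; use whatever name you exported\n",
        "openai_api_key_from_env = read_setting(\"OPENAI_API_KEY\", \"YOUR_OPENAI_API_KEY\")\n",
        "print(\"OpenAI key found in environment:\", openai_api_key_from_env != \"YOUR_OPENAI_API_KEY\")"
      ]
    },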
    {
      "cell_type": "markdown",
      "source": [
        "# Step 2.  Prepare the data"
      ],
      "metadata": {
        "id": "sz639zFf6JoK"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 2.1 Load and chunk your PDF document"
      ],
      "metadata": {
        "id": "RDi3DaCu6u80"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.document_loaders import PyPDFLoader\n",
        "\n",
        "# Load $ABNB's financial report. This may take 1-2 minutes since the PDF is large\n",
        "sec_filing_pdf = \"https://s26.q4cdn.com/656283129/files/doc_financials/2023/q2/3aec2916-f24a-4a9e-8a59-bdbcabe8c4bb.pdf\"\n",
        "\n",
        "# Create your PDF loader\n",
        "loader = PyPDFLoader(sec_filing_pdf)\n",
        "\n",
        "# Load the PDF document\n",
        "documents = loader.load()"
      ],
      "metadata": {
        "id": "rIO5t-j7611h"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
        "\n",
        "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
        "\n",
        "# Chunk the financial report\n",
        "docs = text_splitter.split_documents(documents)"
      ],
      "metadata": {
        "id": "-MN_TirA63fl"
      },
      "execution_count": null,
      "outputs": []
    },
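    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of fixed-width chunking in plain Python.  It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); the helper name `naive_chunks` is hypothetical and only illustrates what the two parameters control."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def naive_chunks(text, chunk_size=1000, chunk_overlap=0):\n",
        "    \"\"\"Hypothetical helper: slice text into fixed-width windows that share chunk_overlap characters.\"\"\"\n",
        "    if chunk_overlap >= chunk_size:\n",
        "        raise ValueError(\"chunk_overlap must be smaller than chunk_size\")\n",
        "    step = chunk_size - chunk_overlap  # how far the window advances each time\n",
        "    return [text[i:i + chunk_size] for i in range(0, len(text), step)]\n",
        "\n",
        "# Tiny example so the windows are easy to read\n",
        "print(naive_chunks(\"abcdefghij\", chunk_size=4, chunk_overlap=1))"
      ]
    },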
    {
      "cell_type": "code",
      "source": [
        "# Use the chunked documents (`docs`), not the full pages (`documents`)\n",
        "texts = [d.page_content for d in docs]\n",
        "metadatas = [d.metadata for d in docs]"
      ],
      "metadata": {
        "id": "erCkNwMP7MeE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Inspect the first text chunk to make sure it looks OK\n",
        "print(texts[0])"
      ],
      "metadata": {
        "id": "aLYEiTdo_etl"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 2.2 Generate vector embeddings using OpenAI"
      ],
      "metadata": {
        "id": "iaYSqxiMLUGb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.embeddings.openai import OpenAIEmbeddings\n",
        "\n",
        "embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)"
      ],
      "metadata": {
        "id": "ftucuzAtLXHV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Generate vector embeddings\n",
        "embeddings = embedding.embed_documents(texts) if embedding else None\n",
        "attributes = list(metadatas[0].keys()) if metadatas else None"
      ],
      "metadata": {
        "id": "QVZevdc-Md4N"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Check the dimensions of the first vector embedding (default is 1536)\n",
        "len(embeddings[0])"
      ],
      "metadata": {
        "id": "H0-4fGw0UF-w"
      },
      "execution_count": null,
      "outputs": []
    },
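    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Vector search ranks chunks by how close their embeddings are to the query embedding.  Below is a minimal sketch of cosine similarity, a common closeness measure, in plain Python.  `cosine_similarity` is a hypothetical helper written for illustration, and the toy 3-dimensional vectors stand in for real 1536-dimensional embeddings."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "\n",
        "def cosine_similarity(a, b):\n",
        "    \"\"\"Hypothetical helper: cosine of the angle between two equal-length vectors.\"\"\"\n",
        "    dot = sum(x * y for x, y in zip(a, b))\n",
        "    norm_a = math.sqrt(sum(x * x for x in a))\n",
        "    norm_b = math.sqrt(sum(y * y for y in b))\n",
        "    return dot / (norm_a * norm_b)\n",
        "\n",
        "# Toy vectors (made up for illustration, not real embeddings)\n",
        "query_vec = [1.0, 0.0, 1.0]\n",
        "chunk_vecs = {\"chunk_a\": [1.0, 0.0, 1.0], \"chunk_b\": [0.0, 1.0, 0.0]}\n",
        "for name, vec in chunk_vecs.items():\n",
        "    print(name, round(cosine_similarity(query_vec, vec), 3))"
      ]
    },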
    {
      "cell_type": "code",
      "source": [
        "schema = {\n",
        "    \"class\": weaviate_index_name,\n",
        "    \"vectorizer\": \"text2vec-openai\",  # If set to \"none\" you must always provide vectors yourself. Could be any other \"text2vec-*\" also.\n",
        "    \"moduleConfig\": {\n",
        "        \"text2vec-openai\": {},\n",
        "        \"generative-openai\": {}  # Ensure the `generative-openai` module is used for generative queries\n",
        "    }\n",
        "}\n",
        "\n",
        "# Create the \"DB schema\" in Weaviate\n",
        "if not client.schema.exists(weaviate_index_name):\n",
        "    client.schema.create_class(schema)"
      ],
      "metadata": {
        "id": "ehxJdzmNFGYj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 2.3 Store data in Weaviate"
      ],
      "metadata": {
        "id": "Xpm5HOT3M7LN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# The ticker of the company that we uploaded the SEC filing for\n",
        "ticker = \"ABNB\"\n",
        "\n",
        "# Batch upload all of your text to Weaviate\n",
        "with client.batch(batch_size=100) as batch:\n",
        "    # Upload each text chunk together with its precomputed embedding\n",
        "    for text, custom_vector in zip(texts, embeddings):\n",
        "        properties = {\n",
        "            \"text\": text,\n",
        "            \"ticker\": ticker,\n",
        "        }\n",
        "\n",
        "        # Use the `batch` object yielded by the context manager\n",
        "        batch.add_data_object(\n",
        "            properties,\n",
        "            weaviate_index_name,\n",
        "            vector=custom_vector\n",
        "        )"
      ],
      "metadata": {
        "id": "XUL0d0QLGWQp"
      },
      "execution_count": null,
      "outputs": []
    },
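    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `batch_size=100` setting means objects are sent in groups instead of one request per object.  The sketch below shows that grouping idea in plain Python.  `batched` is a hypothetical helper; the Weaviate client does its own batching internally, so this is only an illustration of the concept."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def batched(items, batch_size=100):\n",
        "    \"\"\"Hypothetical helper: yield consecutive slices of at most batch_size items.\"\"\"\n",
        "    for start in range(0, len(items), batch_size):\n",
        "        yield items[start:start + batch_size]\n",
        "\n",
        "# 250 placeholder items split into groups of at most 100\n",
        "print([len(group) for group in batched(list(range(250)), batch_size=100)])"
      ]
    },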
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iUsF2E6vgufr"
      },
      "source": [
        "# Step 3.  Query the data from Weaviate"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "query = \"What was ABNB's net income in Q1 2023?\""
      ],
      "metadata": {
        "id": "_hzgrVGdBl_b"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2QUAAKH2gufs"
      },
      "source": [
        "#### Generative search (single prompt)\n",
        "\n",
        "Next, let's try a generative search, where search results are processed with a large language model (LLM).\n",
        "\n",
        "Here, we use a `single_prompt` query, which asks the model to answer the question using the text of each retrieved chunk."
      ]
    },
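    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough mental model, `single_prompt` is a template: for each retrieved object, the `{text}` placeholder is replaced with that object's `text` property before the prompt is sent to the LLM.  The sketch below imitates that substitution locally.  `fill_prompt` is a hypothetical helper and the sample object is a placeholder, so treat it as an illustration rather than Weaviate's actual implementation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import re\n",
        "\n",
        "def fill_prompt(template, properties):\n",
        "    \"\"\"Hypothetical helper: replace each {name} with the matching property; leave unknown names untouched.\"\"\"\n",
        "    return re.sub(r\"\\{(\\w+)\\}\", lambda m: str(properties.get(m.group(1), m.group(0))), template)\n",
        "\n",
        "template = \"Using {text} please answer the following query: What was ABNB's net income?\"\n",
        "sample_object = {\"text\": \"<retrieved chunk goes here>\", \"ticker\": \"ABNB\"}  # placeholder, not real data\n",
        "print(fill_prompt(template, sample_object))"
      ]
    },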
    {
      "cell_type": "code",
      "source": [
        "nearText = {\"concepts\": [query]}\n",
        "\n",
        "response = (\n",
        "    client.query\n",
        "    .get(weaviate_index_name, [\"text\", \"ticker\"])\n",
        "    .with_near_text(nearText)\n",
        "    .with_generate(single_prompt=\"Using {text} please answer the following query: \" + query)\n",
        "    .with_limit(1)\n",
        "    .do()\n",
        ")\n",
        "\n",
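        "# Weaviate capitalizes only the first letter of a class name; str.capitalize() would also lowercase the rest\n",
        "class_key = weaviate_index_name[0].upper() + weaviate_index_name[1:]\n",
        "print(response[\"data\"][\"Get\"][class_key][0][\"_additional\"][\"generate\"][\"singleResult\"])"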
      ],
      "metadata": {
        "id": "JLODXBAFBJzF"
      },
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "base",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    },
    "orig_nbformat": 4,
    "colab": {
      "provenance": [],
      "collapsed_sections": [
        "S2mGQxA958dW"
      ],
      "include_colab_link": true
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}