@virattt
Last active June 27, 2023 06:29
{
"cells": [
{
"cell_type": "markdown",
"id": "66e9b2da",
"metadata": {},
"source": [
"# Step 1 - Data Ingestion"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6480136a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "markdown",
"id": "b9cb9810",
"metadata": {},
"source": [
"### Step 1.1 - Load Airbnb's annual reports (PDFs)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "729c0176",
"metadata": {},
"outputs": [],
"source": [
"pdfs = [\n",
"    \"pdfs/abnb_10k_2020.pdf\",  # FY 2020\n",
"    \"pdfs/abnb_10k_2021.pdf\",  # FY 2021\n",
"    \"pdfs/abnb_10k_2022.pdf\",  # FY 2022\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b9b4adf7",
"metadata": {},
"outputs": [],
"source": [
"annual_reports = []\n",
"for pdf in pdfs:\n",
"    loader = PyPDFLoader(pdf)\n",
"    # Load the PDF; this returns a list of Document objects, one per page\n",
"    document = loader.load()\n",
"    # Add the loaded pages to our list\n",
"    annual_reports.append(document)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b19a282a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chunked_annual_report length: 916\n",
"chunked_annual_report length: 851\n",
"chunked_annual_report length: 1359\n"
]
}
],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"\n",
"chunked_annual_reports = []\n",
"for annual_report in annual_reports:\n",
"    # Chunk the annual_report\n",
"    texts = text_splitter.split_documents(annual_report)\n",
"    # Add the chunks to chunked_annual_reports, which is a list of lists\n",
"    chunked_annual_reports.append(texts)\n",
"    print(f\"chunked_annual_report length: {len(texts)}\")"
]
},
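{
"cell_type": "markdown",
"id": "sketch-chunking-note",
"metadata": {},
"source": [
"As a rough illustration of what `chunk_size=1000` and `chunk_overlap=0` mean, the sketch below cuts a string into fixed-size character windows. It is a simplified stand-in rather than LangChain's implementation: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph breaks, line breaks, and spaces before falling back to raw character offsets. The helper name `naive_chunks` is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-chunking-code",
"metadata": {},
"outputs": [],
"source": [
"def naive_chunks(text, chunk_size=1000, chunk_overlap=0):\n",
"    # Hypothetical helper: fixed-size character windows with no separator awareness\n",
"    step = chunk_size - chunk_overlap\n",
"    return [text[i:i + chunk_size] for i in range(0, len(text), step)]\n",
"\n",
"sample = \"x\" * 2500\n",
"# Print the length of each chunk produced from a 2,500-character string\n",
"print([len(chunk) for chunk in naive_chunks(sample)])"
]
},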
{
"cell_type": "markdown",
"id": "96be6a10",
"metadata": {},
"source": [
"### Step 1.2 - Upsert annual report vector embeddings to Pinecone"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f3741ec5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Pinecone\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"import pinecone\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "178e2c72",
"metadata": {},
"outputs": [],
"source": [
"OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')\n",
"PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')\n",
"PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "41fd3b34",
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "7422ec4d",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Pinecone\n",
"pinecone.init(\n",
"    api_key=PINECONE_API_KEY,\n",
"    environment=PINECONE_API_ENV\n",
")\n",
"# Placeholder: replace with the name of your Pinecone index\n",
"index_name = \"PINECONE_INDEX_NAME\""
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9bf7a4c5",
"metadata": {},
"outputs": [],
"source": [
"# Upsert annual reports to Pinecone via LangChain.\n",
"# Note: passing only page_content to Pinecone.from_texts() drops each chunk's metadata;\n",
"# there is likely a better way to do this.\n",
"for chunks in chunked_annual_reports:\n",
"    Pinecone.from_texts([chunk.page_content for chunk in chunks], embeddings, index_name=index_name)"
]
},
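{
"cell_type": "markdown",
"id": "sketch-from-documents-note",
"metadata": {},
"source": [
"The upsert cell above notes that there is likely a better way than `Pinecone.from_texts()`. One possible alternative, assuming the installed LangChain version exposes the `from_documents` class method on its vector stores, is to pass the `Document` chunks directly so that each chunk's metadata (such as source file and page) is stored alongside its vector. This is an untested sketch that relies on `chunked_annual_reports`, `embeddings`, and `index_name` from the earlier cells; it writes to the same index, so it would be run instead of the cell above, not in addition to it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-from-documents-code",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: Pinecone.from_documents is available in this LangChain version\n",
"for chunks in chunked_annual_reports:\n",
"    # Pass Document objects so page_content and metadata are both upserted\n",
"    Pinecone.from_documents(chunks, embeddings, index_name=index_name)"
]
},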
{
"cell_type": "markdown",
"id": "77e60a7f",
"metadata": {},
"source": [
"# Step 2 - Data Retrieval"
]
},
{
"cell_type": "markdown",
"id": "13e3dfda",
"metadata": {},
"source": [
"### Step 2.1 - Retrieve the annual report vector embeddings from Pinecone"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "237ada8b",
"metadata": {},
"outputs": [],
"source": [
"vectorstore = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)"
]
},
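{
"cell_type": "markdown",
"id": "sketch-similarity-note",
"metadata": {},
"source": [
"Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The toy example below ranks made-up 3-dimensional vectors by cosine similarity. The vectors, the labels, and the helpers `cosine` and `top_k` are illustrative only; a real index holds much higher-dimensional embeddings, and the metric it uses depends on how the index was created."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-similarity-code",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine(a, b):\n",
"    # Cosine similarity between two equal-length vectors\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"def top_k(query_vec, stored, k=2):\n",
"    # stored is a list of (text, vector) pairs; return the k most similar\n",
"    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)\n",
"    return ranked[:k]\n",
"\n",
"# Made-up vectors standing in for embedded report chunks\n",
"stored = [\n",
"    (\"revenue discussion\", [0.9, 0.1, 0.0]),\n",
"    (\"risk factors\", [0.0, 1.0, 0.2]),\n",
"    (\"cash flow table\", [0.8, 0.0, 0.3]),\n",
"]\n",
"print([text for text, _ in top_k([1.0, 0.0, 0.1], stored)])"
]
},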
{
"cell_type": "markdown",
"id": "5402f982",
"metadata": {},
"source": [
"# Step 3 - Chat Q&A"
]
},
{
"cell_type": "markdown",
"id": "bda4a3af",
"metadata": {},
"source": [
"### Step 3.1 - Ask questions about the annual reports!"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "460e1f5b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.chains.question_answering import load_qa_chain"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "965ec7cb",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)\n",
"chain = load_qa_chain(llm)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "ea6d8ad9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' The most recent annual report from Airbnb is for the year ended December 31, 2022.'"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What year is Airbnb's most recent annual report from?\"\n",
"docs = vectorstore.similarity_search(query, include_metadata=True)\n",
"chain.run(input_documents=docs, question=query)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "c15cbeee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" The overall sentiment of Airbnb's most recent annual report is positive. They reported 4 million hosts, 800 million guest arrivals, and 100,000 cities in almost every country and region across the globe. They also reported that their community support team requires significant time and resources and significant investment in staffing, technology, including automation and machine learning.\""
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What is the overall sentiment of Airbnb's most recent annual report? Provide some numbers.\"\n",
"docs = vectorstore.similarity_search(query)\n",
"chain.run(input_documents=docs, question=query)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "61f8ca7f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" Airbnb's annual report in 2020 showed a decrease in revenue due to the COVID-19 pandemic, but also showed resilience and adaptability as domestic travel quickly rebounded on Airbnb around the world and stays of longer than a few days started increasing.\""
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What is the overall sentiment of Airbnb's annual report in 2020? Provide some numbers.\"\n",
"docs = vectorstore.similarity_search(query)\n",
"chain.run(input_documents=docs, question=query)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "3f33fe17",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" 1. Events beyond Airbnb's control such as the ongoing COVID-19 pandemic, other pandemics and health concerns, restrictions on travel, immigration, trade disputes, economic downturns, and the impact of climate change on travel. 2. Political, social, or economic instability. 3. Competition for hosts and guests.\""
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What are Airbnb's top 3 risk factors? Provide numbers.\"\n",
"docs = vectorstore.similarity_search(query, include_metadata=True)\n",
"chain.run(input_documents=docs, question=query)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "3a95e060",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Free Cash Flow increased from $2.3 billion in 2021 to $3.4 billion in 2022, representing a growth of 47%.'"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"Compute free cash flow growth from 2021 to 2022\"\n",
"docs = vectorstore.similarity_search(query)\n",
"chain.run(input_documents=docs, question=query)"
]
},
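{
"cell_type": "markdown",
"id": "sketch-fcf-check-note",
"metadata": {},
"source": [
"Language models can be unreliable at arithmetic, so it may be worth recomputing a figure like this directly. The check below uses only the two values quoted in the model's answer above (not values read from the filings), and the variable names are illustrative. It works out to roughly 47.8%, which is in line with the model's rounded 47%."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-fcf-check-code",
"metadata": {},
"outputs": [],
"source": [
"fcf_2021 = 2.3  # USD billions, as quoted in the model's answer\n",
"fcf_2022 = 3.4  # USD billions, as quoted in the model's answer\n",
"\n",
"# Year-over-year growth = (new - old) / old\n",
"growth = (fcf_2022 - fcf_2021) / fcf_2021\n",
"print(f\"Free cash flow growth: {growth:.1%}\")"
]
},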
{
"cell_type": "code",
"execution_count": 38,
"id": "3f39da2d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" In 2018, Airbnb's free cash flow was $504.9 million. In 2019, it was $97.3 million. In 2020, it was $(667.1) million.\""
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What was Airbnb's free cash flow in each of the past 3 years? Please make sure that your answer is correct.\"\n",
"docs = vectorstore.similarity_search(query\n",
"chain.run(input_documents=docs, question=query)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "e0aca588",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Airbnb had $7,378 million in cash and $1,987 million in debt on its balance sheet in 2022.'"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"How much cash and debt did Airbnb have on its balance sheet in 2022?\"\n",
"docs = vectorstore.similarity_search(query, include_metadata=True)\n",
"chain.run(input_documents=docs, question=query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36368231",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}