@virattt
Last active January 3, 2024 14:05
{
"cells": [
{
"cell_type": "markdown",
"id": "8371ed35",
"metadata": {},
"source": [
"# Chatbot with Conversational Memory and PDF Data Knowledge\n",
"\n",
"This notebook contains code for creating a Chatbot that:\n",
"- \"Reads\" from your own PDF docs\n",
"- \"Has\" conversational memory\n",
"\n",
"In the example below, the Chatbot reads from Airbnb's past 3 annual reports (PDFs), and I ask it questions about information contained in those reports. The cool thing is that I can ask it questions as if I were talking to a human, because the Chatbot has \"memory\" (a chat history).\n",
"\n",
"I hope you find this code useful. Please follow me on https://twitter.com/virattt for more tutorials like this.\n",
"\n",
"Happy learning! :)"
]
},
{
"cell_type": "markdown",
"id": "66e9b2da",
"metadata": {},
"source": [
"# Step 1 - PDF Document Ingestion"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "6480136a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "markdown",
"id": "b9cb9810",
"metadata": {},
"source": [
"### Step 1.1 - Load annual reports\n",
"\n",
"The PDF paths below are local to my computer. You can create a \"pdfs\" directory next to this notebook on your machine and add whatever PDFs you want to it!"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "729c0176",
"metadata": {},
"outputs": [],
"source": [
"pdfs = [\n",
" \"pdfs/abnb_10k_2020.pdf\", # FY 2020 for Airbnb\n",
" \"pdfs/abnb_10k_2021.pdf\", # FY 2021\n",
" \"pdfs/abnb_10k_2022.pdf\", # FY 2022\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "b9b4adf7",
"metadata": {},
"outputs": [],
"source": [
"annual_reports = []\n",
"for pdf in pdfs:\n",
" loader = PyPDFLoader(pdf)\n",
" # Load the PDF document\n",
" document = loader.load() \n",
" # Add the loaded document to our list\n",
" annual_reports.append(document)"
]
},
{
"cell_type": "markdown",
"id": "09ca93a7",
"metadata": {},
"source": [
"### Step 1.2 - Split the annual reports into smaller pieces\n",
"\n",
"The annual report PDFs are quite large, oftentimes hundreds of pages long. We need to split each document into smaller pieces before storing them in a database."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "b19a282a",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
"\n",
"chunked_annual_reports = []\n",
"for annual_report in annual_reports:\n",
" # Chunk the annual_report\n",
" texts = text_splitter.split_documents(annual_report)\n",
" # Add the chunks to chunked_annual_reports, which is a list of lists\n",
" chunked_annual_reports.append(texts)\n",
" print(f\"chunked_annual_report length: {len(texts)}\")"
]
},
{
"cell_type": "markdown",
"id": "96be6a10",
"metadata": {},
"source": [
"### Step 1.3 - Initialize Pinecone\n",
"\n",
"Pinecone is a vector database that stores the embeddings of your PDF chunks so that you can retrieve them later."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f3741ec5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Pinecone\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from tqdm.autonotebook import tqdm\n",
"import pinecone\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "178e2c72",
"metadata": {},
"outputs": [],
"source": [
"OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')\n",
"PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')\n",
"PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "41fd3b34",
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7422ec4d",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Pinecone\n",
"pinecone.init(\n",
" api_key=PINECONE_API_KEY,\n",
" environment=PINECONE_API_ENV\n",
")\n",
"index_name = \"YOUR_INDEX_NAME\""
]
},
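{
"cell_type": "markdown",
"id": "3f2d81aa",
"metadata": {},
"source": [
"If your index does not exist yet, you can create it first. This cell is my addition - a sketch that assumes the classic `pinecone` client and OpenAI's 1536-dimension `text-embedding-ada-002` embeddings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c4e92bb",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (my addition): create the index if it doesn't exist yet.\n",
"# OpenAI's text-embedding-ada-002 vectors have 1536 dimensions.\n",
"if index_name not in pinecone.list_indexes():\n",
"    pinecone.create_index(index_name, dimension=1536)"
]
},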
{
"cell_type": "markdown",
"id": "a16a8a56",
"metadata": {},
"source": [
"### Step 1.4 - Upsert document chunks to Pinecone\n",
"\n",
"Upload your PDF document chunks to Pinecone. Note: you only need to do this once!"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9bf7a4c5",
"metadata": {},
"outputs": [],
"source": [
"# Upsert annual reports to Pinecone via LangChain.\n",
"# There's likely a better way to do this instead of Pinecone.from_texts()\n",
"for chunks in chunked_annual_reports:\n",
" Pinecone.from_texts([chunk.page_content for chunk in chunks], embeddings, index_name=index_name)"
]
},
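{
"cell_type": "markdown",
"id": "5d1f03cc",
"metadata": {},
"source": [
"As the comment above notes, `Pinecone.from_texts()` is probably not the best approach - it drops each chunk's page/source metadata. Here is a sketch of an alternative (my addition, assuming the same LangChain version) that keeps the `Document` metadata:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e6a14dd",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (my addition): upsert the Document objects directly so that\n",
"# page/source metadata is stored alongside each embedding.\n",
"for chunks in chunked_annual_reports:\n",
"    Pinecone.from_documents(chunks, embeddings, index_name=index_name)"
]
},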
{
"cell_type": "markdown",
"id": "77e60a7f",
"metadata": {},
"source": [
"# Step 2 - Document Retrieval"
]
},
{
"cell_type": "markdown",
"id": "13e3dfda",
"metadata": {},
"source": [
"### Step 2.1 - Retrieve the annual report vector embeddings from Pinecone"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "237ada8b",
"metadata": {},
"outputs": [],
"source": [
"vectorstore = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)"
]
},
{
"cell_type": "markdown",
"id": "39945b10",
"metadata": {},
"source": [
"# Step 3 - Converse with Chatbot\n",
"\n",
"More information about Chat Indexes here: https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "32771598",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import ConversationalRetrievalChain\n",
"from langchain.llms import OpenAI"
]
},
{
"cell_type": "markdown",
"id": "9b0a7e1b",
"metadata": {},
"source": [
"### Step 3.1 - Create a ConversationalRetrievalChain\n",
"\n",
"A ConversationalRetrievalChain is similar to a RetrievalQAChain, except that it also accepts a chat history, which enables follow-up questions."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "45a4fbef",
"metadata": {},
"outputs": [],
"source": [
"# Create the chain\n",
"qa = ConversationalRetrievalChain.from_llm(\n",
" llm=OpenAI(temperature=0), \n",
" retriever=vectorstore.as_retriever(),\n",
" return_source_documents=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "1f52b3a5",
"metadata": {},
"outputs": [],
"source": [
"# Initialize chat history list\n",
"chat_history = []"
]
},
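{
"cell_type": "markdown",
"id": "9b7c25ee",
"metadata": {},
"source": [
"The ask-then-append pattern below repeats for every question, so here is an optional convenience wrapper. This helper is my own addition (the name `ask` is hypothetical, not part of LangChain):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a8d36ff",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper (my addition): ask a question, record the\n",
"# exchange in the chat history, and return the answer.\n",
"def ask(question):\n",
"    result = qa({\"question\": question, \"chat_history\": chat_history})\n",
"    chat_history.append((question, result[\"answer\"]))\n",
"    return result[\"answer\"]"
]
},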
{
"cell_type": "markdown",
"id": "f4fc3abe",
"metadata": {},
"source": [
"### Step 3.2 - Begin conversing!\n",
"\n",
"This is where the fun begins."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "5949e4e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' $1.9 billion'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What was Airbnb's net income in 2022?\"\n",
"result = qa({\"question\": query, \"chat_history\": chat_history})\n",
"result[\"answer\"]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "50a08844",
"metadata": {},
"outputs": [],
"source": [
"# Add the answer to the chat history\n",
"chat_history.append((query, result[\"answer\"]))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "23263ff6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" Airbnb's net income in 2020 was -$4.6 billion.\""
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Notice that the Chatbot knows that \"two years before [that]\" is \"two years before [2022]\"\n",
"query = \"What was it two years before that?\"\n",
"result = qa({\"question\": query, \"chat_history\": chat_history})\n",
"result[\"answer\"]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "463213b1",
"metadata": {},
"outputs": [],
"source": [
"# Add the answer to the chat history\n",
"chat_history.append((query, result[\"answer\"]))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "ed2fdb9e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Airbnb went public on December 14, 2020.'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"When did they IPO?\"\n",
"result = qa({\"question\": query, \"chat_history\": chat_history})\n",
"result[\"answer\"]"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "83a8f7fd",
"metadata": {},
"outputs": [],
"source": [
"# Add the answer to the chat history\n",
"chat_history.append((query, result[\"answer\"]))"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "9a89e322",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' 55,000,000 shares were issued when Airbnb went public.'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"How many shares were issued?\"\n",
"result = qa({\"question\": query, \"chat_history\": chat_history})\n",
"result[\"answer\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90cfdf25",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}