Semi_Structured_RAG_V1.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/esenthil2018/2270330891ff4e78f83783e42c4293f5/semi_structured_rag_v1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "b6d466cc-aa8b-4baf-a80a-fef01921ca8d",
"metadata": {
"id": "b6d466cc-aa8b-4baf-a80a-fef01921ca8d"
},
"source": [
"## Semi-structured RAG\n",
"\n",
"Many documents contain a mixture of content types, including text and tables.\n",
"\n",
"Semi-structured data can be challenging for conventional RAG for at least two reasons:\n",
"\n",
"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search\n",
"\n",
"This cookbook shows how to perform RAG on documents with semi-structured data:\n",
"\n",
"* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with table summaries better suited for retrieval.\n",
"* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
"\n",
"\n",
"## Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5740fc70-c513-4ff4-9d72-cfc098f85fef",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5740fc70-c513-4ff4-9d72-cfc098f85fef",
"outputId": "e951d9b0-e19c-408d-8cb1-9782a5f7a7b0"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (0.109.0)\n",
"Requirement already satisfied: kaleido in /usr/local/lib/python3.10/dist-packages (0.2.1)\n",
"Requirement already satisfied: uvicorn in /usr/local/lib/python3.10/dist-packages (0.25.0)\n",
"Requirement already satisfied: chromadb in /usr/local/lib/python3.10/dist-packages (0.4.22)\n",
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,!=2.0.0,!=2.0.1,!=2.1.0,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from fastapi) (1.10.13)\n",
"Requirement already satisfied: starlette<0.36.0,>=0.35.0 in /usr/local/lib/python3.10/dist-packages (from fastapi) (0.35.1)\n",
"Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from fastapi) (4.9.0)\n",
"Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (8.1.7)\n",
"Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (0.14.0)\n",
"Requirement already satisfied: build>=1.0.3 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.0.3)\n",
"Requirement already satisfied: requests>=2.28 in /usr/local/lib/python3.10/dist-packages (from chromadb) (2.31.0)\n",
"Requirement already satisfied: chroma-hnswlib==0.7.3 in /usr/local/lib/python3.10/dist-packages (from chromadb) (0.7.3)\n",
"Requirement already satisfied: numpy>=1.22.5 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.23.5)\n",
"Requirement already satisfied: posthog>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (3.3.1)\n",
"Requirement already satisfied: pulsar-client>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (3.4.0)\n",
"Requirement already satisfied: onnxruntime>=1.14.1 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.15.1)\n",
"Requirement already satisfied: opentelemetry-api>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.22.0)\n",
"Requirement already satisfied: opentelemetry-exporter-otlp-proto-grpc>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.22.0)\n",
"Requirement already satisfied: opentelemetry-instrumentation-fastapi>=0.41b0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (0.43b0)\n",
"Requirement already satisfied: opentelemetry-sdk>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.22.0)\n",
"Requirement already satisfied: tokenizers>=0.13.2 in /usr/local/lib/python3.10/dist-packages (from chromadb) (0.15.0)\n",
"Requirement already satisfied: pypika>=0.48.9 in /usr/local/lib/python3.10/dist-packages (from chromadb) (0.48.9)\n",
"Requirement already satisfied: tqdm>=4.65.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (4.66.1)\n",
"Requirement already satisfied: overrides>=7.3.1 in /usr/local/lib/python3.10/dist-packages (from chromadb) (7.4.0)\n",
"Requirement already satisfied: importlib-resources in /usr/local/lib/python3.10/dist-packages (from chromadb) (6.1.1)\n",
"Requirement already satisfied: grpcio>=1.58.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (1.60.0)\n",
"Requirement already satisfied: bcrypt>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from chromadb) (4.1.2)\n",
"Requirement already satisfied: typer>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (0.9.0)\n",
"Requirement already satisfied: kubernetes>=28.1.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (29.0.0)\n",
"Requirement already satisfied: tenacity>=8.2.3 in /usr/local/lib/python3.10/dist-packages (from chromadb) (8.2.3)\n",
"Requirement already satisfied: PyYAML>=6.0.0 in /usr/local/lib/python3.10/dist-packages (from chromadb) (6.0.1)\n",
"Requirement already satisfied: mmh3>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from chromadb) (4.1.0)\n",
"Requirement already satisfied: packaging>=19.0 in /usr/local/lib/python3.10/dist-packages (from build>=1.0.3->chromadb) (23.2)\n",
"Requirement already satisfied: pyproject_hooks in /usr/local/lib/python3.10/dist-packages (from build>=1.0.3->chromadb) (1.0.0)\n",
"Requirement already satisfied: tomli>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from build>=1.0.3->chromadb) (2.0.1)\n",
"Requirement already satisfied: certifi>=14.05.14 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (2023.11.17)\n",
"Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (1.16.0)\n",
"Requirement already satisfied: python-dateutil>=2.5.3 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (2.8.2)\n",
"Requirement already satisfied: google-auth>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (2.17.3)\n",
"Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (1.7.0)\n",
"Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (1.3.1)\n",
"Requirement already satisfied: oauthlib>=3.2.2 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (3.2.2)\n",
"Requirement already satisfied: urllib3>=1.24.2 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb) (2.0.7)\n",
"Requirement already satisfied: coloredlogs in /usr/local/lib/python3.10/dist-packages (from onnxruntime>=1.14.1->chromadb) (15.0.1)\n",
"Requirement already satisfied: flatbuffers in /usr/local/lib/python3.10/dist-packages (from onnxruntime>=1.14.1->chromadb) (23.5.26)\n",
"Requirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from onnxruntime>=1.14.1->chromadb) (3.20.3)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime>=1.14.1->chromadb) (1.12)\n",
"Requirement already satisfied: deprecated>=1.2.6 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-api>=1.2.0->chromadb) (1.2.14)\n",
"Requirement already satisfied: importlib-metadata<7.0,>=6.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-api>=1.2.0->chromadb) (6.11.0)\n",
"Requirement already satisfied: backoff<3.0.0,>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-exporter-otlp-proto-grpc>=1.2.0->chromadb) (2.2.1)\n",
"Requirement already satisfied: googleapis-common-protos~=1.52 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-exporter-otlp-proto-grpc>=1.2.0->chromadb) (1.62.0)\n",
"Requirement already satisfied: opentelemetry-exporter-otlp-proto-common==1.22.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-exporter-otlp-proto-grpc>=1.2.0->chromadb) (1.22.0)\n",
"Requirement already satisfied: opentelemetry-proto==1.22.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-exporter-otlp-proto-grpc>=1.2.0->chromadb) (1.22.0)\n",
"Requirement already satisfied: opentelemetry-instrumentation-asgi==0.43b0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (0.43b0)\n",
"Requirement already satisfied: opentelemetry-instrumentation==0.43b0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (0.43b0)\n",
"Requirement already satisfied: opentelemetry-semantic-conventions==0.43b0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (0.43b0)\n",
"Requirement already satisfied: opentelemetry-util-http==0.43b0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (0.43b0)\n",
"Requirement already satisfied: setuptools>=16.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation==0.43b0->opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (67.7.2)\n",
"Requirement already satisfied: wrapt<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation==0.43b0->opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (1.14.1)\n",
"Requirement already satisfied: asgiref~=3.0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-instrumentation-asgi==0.43b0->opentelemetry-instrumentation-fastapi>=0.41b0->chromadb) (3.7.2)\n",
"Requirement already satisfied: monotonic>=1.5 in /usr/local/lib/python3.10/dist-packages (from posthog>=2.4.0->chromadb) (1.6)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.28->chromadb) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.28->chromadb) (3.6)\n",
"Requirement already satisfied: anyio<5,>=3.4.0 in /usr/local/lib/python3.10/dist-packages (from starlette<0.36.0,>=0.35.0->fastapi) (3.7.1)\n",
"Requirement already satisfied: huggingface_hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.13.2->chromadb) (0.20.2)\n",
"Requirement already satisfied: httptools>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (0.6.1)\n",
"Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (1.0.0)\n",
"Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (0.19.0)\n",
"Requirement already satisfied: watchfiles>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (0.21.0)\n",
"Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn) (12.0)\n",
"Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette<0.36.0,>=0.35.0->fastapi) (1.3.0)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette<0.36.0,>=0.35.0->fastapi) (1.2.0)\n",
"Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth>=1.0.1->kubernetes>=28.1.0->chromadb) (5.3.2)\n",
"Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth>=1.0.1->kubernetes>=28.1.0->chromadb) (0.3.0)\n",
"Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth>=1.0.1->kubernetes>=28.1.0->chromadb) (4.9)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface_hub<1.0,>=0.16.4->tokenizers>=0.13.2->chromadb) (3.13.1)\n",
"Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub<1.0,>=0.16.4->tokenizers>=0.13.2->chromadb) (2023.6.0)\n",
"Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata<7.0,>=6.0->opentelemetry-api>=1.2.0->chromadb) (3.17.0)\n",
"Requirement already satisfied: humanfriendly>=9.1 in /usr/local/lib/python3.10/dist-packages (from coloredlogs->onnxruntime>=1.14.1->chromadb) (10.0)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->onnxruntime>=1.14.1->chromadb) (1.3.0)\n",
"Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth>=1.0.1->kubernetes>=28.1.0->chromadb) (0.5.1)\n",
"Requirement already satisfied: langchain_openai in /usr/local/lib/python3.10/dist-packages (0.0.2.post1)\n",
"Requirement already satisfied: cohere in /usr/local/lib/python3.10/dist-packages (4.42)\n",
"Requirement already satisfied: langchain-core<0.2,>=0.1.7 in /usr/local/lib/python3.10/dist-packages (from langchain_openai) (0.1.10)\n",
"Requirement already satisfied: numpy<2,>=1 in /usr/local/lib/python3.10/dist-packages (from langchain_openai) (1.23.5)\n",
"Requirement already satisfied: openai<2.0.0,>=1.6.1 in /usr/local/lib/python3.10/dist-packages (from langchain_openai) (1.7.2)\n",
"Requirement already satisfied: tiktoken<0.6.0,>=0.5.2 in /usr/local/lib/python3.10/dist-packages (from langchain_openai) (0.5.2)\n",
"Requirement already satisfied: aiohttp<4.0,>=3.0 in /usr/local/lib/python3.10/dist-packages (from cohere) (3.9.1)\n",
"Requirement already satisfied: backoff<3.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from cohere) (2.2.1)\n",
"Requirement already satisfied: fastavro<2.0,>=1.8 in /usr/local/lib/python3.10/dist-packages (from cohere) (1.9.3)\n",
"Requirement already satisfied: importlib_metadata<7.0,>=6.0 in /usr/local/lib/python3.10/dist-packages (from cohere) (6.11.0)\n",
"Requirement already satisfied: requests<3.0.0,>=2.25.0 in /usr/local/lib/python3.10/dist-packages (from cohere) (2.31.0)\n",
"Requirement already satisfied: urllib3<3,>=1.26 in /usr/local/lib/python3.10/dist-packages (from cohere) (2.0.7)\n",
"Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0,>=3.0->cohere) (23.2.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0,>=3.0->cohere) (6.0.4)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0,>=3.0->cohere) (1.9.4)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0,>=3.0->cohere) (1.4.1)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0,>=3.0->cohere) (1.3.1)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0,>=3.0->cohere) (4.0.3)\n",
"Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.10/dist-packages (from importlib_metadata<7.0,>=6.0->cohere) (3.17.0)\n",
"Requirement already satisfied: PyYAML>=5.3 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (6.0.1)\n",
"Requirement already satisfied: anyio<5,>=3 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (3.7.1)\n",
"Requirement already satisfied: jsonpatch<2.0,>=1.33 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (1.33)\n",
"Requirement already satisfied: langsmith<0.1.0,>=0.0.63 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (0.0.80)\n",
"Requirement already satisfied: packaging<24.0,>=23.2 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (23.2)\n",
"Requirement already satisfied: pydantic<3,>=1 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (1.10.13)\n",
"Requirement already satisfied: tenacity<9.0.0,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain_openai) (8.2.3)\n",
"Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai<2.0.0,>=1.6.1->langchain_openai) (1.7.0)\n",
"Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from openai<2.0.0,>=1.6.1->langchain_openai) (0.26.0)\n",
"Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai<2.0.0,>=1.6.1->langchain_openai) (1.3.0)\n",
"Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.10/dist-packages (from openai<2.0.0,>=1.6.1->langchain_openai) (4.66.1)\n",
"Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from openai<2.0.0,>=1.6.1->langchain_openai) (4.9.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.25.0->cohere) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.25.0->cohere) (3.6)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.25.0->cohere) (2023.11.17)\n",
"Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken<0.6.0,>=0.5.2->langchain_openai) (2023.6.3)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3->langchain-core<0.2,>=0.1.7->langchain_openai) (1.2.0)\n",
"Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai<2.0.0,>=1.6.1->langchain_openai) (1.0.2)\n",
"Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai<2.0.0,>=1.6.1->langchain_openai) (0.14.0)\n",
"Requirement already satisfied: jsonpointer>=1.9 in /usr/local/lib/python3.10/dist-packages (from jsonpatch<2.0,>=1.33->langchain-core<0.2,>=0.1.7->langchain_openai) (2.4)\n",
"Requirement already satisfied: langchain in /usr/local/lib/python3.10/dist-packages (0.1.0)\n",
"Requirement already satisfied: unstructured[all-docs] in /usr/local/lib/python3.10/dist-packages (0.12.0)\n",
"Requirement already satisfied: pydantic in /usr/local/lib/python3.10/dist-packages (1.10.13)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (4.9.4)\n",
"Requirement already satisfied: langchainhub in /usr/local/lib/python3.10/dist-packages (0.1.14)\n",
"Requirement already satisfied: PyYAML>=5.3 in /usr/local/lib/python3.10/dist-packages (from langchain) (6.0.1)\n",
"Requirement already satisfied: SQLAlchemy<3,>=1.4 in /usr/local/lib/python3.10/dist-packages (from langchain) (2.0.24)\n",
"Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in /usr/local/lib/python3.10/dist-packages (from langchain) (3.9.1)\n",
"Requirement already satisfied: async-timeout<5.0.0,>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from langchain) (4.0.3)\n",
"Requirement already satisfied: dataclasses-json<0.7,>=0.5.7 in /usr/local/lib/python3.10/dist-packages (from langchain) (0.6.3)\n",
"Requirement already satisfied: jsonpatch<2.0,>=1.33 in /usr/local/lib/python3.10/dist-packages (from langchain) (1.33)\n",
"Requirement already satisfied: langchain-community<0.1,>=0.0.9 in /usr/local/lib/python3.10/dist-packages (from langchain) (0.0.12)\n",
"Requirement already satisfied: langchain-core<0.2,>=0.1.7 in /usr/local/lib/python3.10/dist-packages (from langchain) (0.1.10)\n",
"Requirement already satisfied: langsmith<0.1.0,>=0.0.77 in /usr/local/lib/python3.10/dist-packages (from langchain) (0.0.80)\n",
"Requirement already satisfied: numpy<2,>=1 in /usr/local/lib/python3.10/dist-packages (from langchain) (1.23.5)\n",
"Requirement already satisfied: requests<3,>=2 in /usr/local/lib/python3.10/dist-packages (from langchain) (2.31.0)\n",
"Requirement already satisfied: tenacity<9.0.0,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from langchain) (8.2.3)\n",
"Requirement already satisfied: chardet in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (5.2.0)\n",
"Requirement already satisfied: filetype in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.2.0)\n",
"Requirement already satisfied: python-magic in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (0.4.27)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (3.8.1)\n",
"Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (0.9.0)\n",
"Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (4.11.2)\n",
"Requirement already satisfied: emoji in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (2.9.0)\n",
"Requirement already satisfied: python-iso639 in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (2024.1.2)\n",
"Requirement already satisfied: langdetect in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.0.9)\n",
"Requirement already satisfied: rapidfuzz in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (3.6.1)\n",
"Requirement already satisfied: backoff in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (2.2.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (4.9.0)\n",
"Requirement already satisfied: unstructured-client in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (0.15.2)\n",
"Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.14.1)\n",
"Requirement already satisfied: unstructured.pytesseract>=0.3.12 in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (0.3.12)\n",
"Requirement already satisfied: pdfminer.six in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (20221105)\n",
"Requirement already satisfied: markdown in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (3.5.1)\n",
"Requirement already satisfied: python-pptx<=0.6.23 in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (0.6.23)\n",
"Requirement already satisfied: pypandoc in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.12)\n",
"Requirement already satisfied: unstructured-inference==0.7.21 in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (0.7.21)\n",
"Requirement already satisfied: pdf2image in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.17.0)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (3.2.1)\n",
"Requirement already satisfied: pypdf in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (3.17.4)\n",
"Requirement already satisfied: msg-parser in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.2.0)\n",
"Requirement already satisfied: xlrd in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (2.0.1)\n",
"Requirement already satisfied: python-docx in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.1.0)\n",
"Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (3.1.2)\n",
"Requirement already satisfied: onnx in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.15.0)\n",
"Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (1.5.3)\n",
"Requirement already satisfied: pikepdf in /usr/local/lib/python3.10/dist-packages (from unstructured[all-docs]) (8.11.2)\n",
"Requirement already satisfied: layoutparser[layoutmodels,tesseract] in /usr/local/lib/python3.10/dist-packages (from unstructured-inference==0.7.21->unstructured[all-docs]) (0.3.4)\n",
"Requirement already satisfied: python-multipart in /usr/local/lib/python3.10/dist-packages (from unstructured-inference==0.7.21->unstructured[all-docs]) (0.0.6)\n",
"Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from unstructured-inference==0.7.21->unstructured[all-docs]) (0.20.2)\n",
"Requirement already satisfied: opencv-python!=4.7.0.68 in /usr/local/lib/python3.10/dist-packages (from unstructured-inference==0.7.21->unstructured[all-docs]) (4.8.0.76)\n",
"Requirement already satisfied: onnxruntime<1.16 in /usr/local/lib/python3.10/dist-packages (from unstructured-inference==0.7.21->unstructured[all-docs]) (1.15.1)\n",
"Requirement already satisfied: transformers>=4.25.1 in /usr/local/lib/python3.10/dist-packages (from unstructured-inference==0.7.21->unstructured[all-docs]) (4.35.2)\n",
"Requirement already satisfied: types-requests<3.0.0.0,>=2.31.0.2 in /usr/local/lib/python3.10/dist-packages (from langchainhub) (2.31.0.20240106)\n",
"Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (23.2.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (6.0.4)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.9.4)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.4.1)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.3.1)\n",
"Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in /usr/local/lib/python3.10/dist-packages (from dataclasses-json<0.7,>=0.5.7->langchain) (3.20.2)\n",
"Requirement already satisfied: typing-inspect<1,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from dataclasses-json<0.7,>=0.5.7->langchain) (0.9.0)\n",
"Requirement already satisfied: jsonpointer>=1.9 in /usr/local/lib/python3.10/dist-packages (from jsonpatch<2.0,>=1.33->langchain) (2.4)\n",
"Requirement already satisfied: anyio<5,>=3 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain) (3.7.1)\n",
"Requirement already satisfied: packaging<24.0,>=23.2 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.2,>=0.1.7->langchain) (23.2)\n",
"Requirement already satisfied: Pillow>=3.3.2 in /usr/local/lib/python3.10/dist-packages (from python-pptx<=0.6.23->unstructured[all-docs]) (10.2.0)\n",
"Requirement already satisfied: XlsxWriter>=0.5.7 in /usr/local/lib/python3.10/dist-packages (from python-pptx<=0.6.23->unstructured[all-docs]) (3.1.9)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2->langchain) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2->langchain) (3.6)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2->langchain) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2->langchain) (2023.11.17)\n",
"Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy<3,>=1.4->langchain) (3.0.3)\n",
"Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->unstructured[all-docs]) (2.5)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from langdetect->unstructured[all-docs]) (1.16.0)\n",
"Requirement already satisfied: olefile>=0.46 in /usr/local/lib/python3.10/dist-packages (from msg-parser->unstructured[all-docs]) (0.47)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->unstructured[all-docs]) (8.1.7)\n",
"Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->unstructured[all-docs]) (1.3.2)\n",
"Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->unstructured[all-docs]) (2023.6.3)\n",
"Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk->unstructured[all-docs]) (4.66.1)\n",
"Requirement already satisfied: protobuf>=3.20.2 in /usr/local/lib/python3.10/dist-packages (from onnx->unstructured[all-docs]) (3.20.3)\n",
"Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl->unstructured[all-docs]) (1.1.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->unstructured[all-docs]) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->unstructured[all-docs]) (2023.3.post1)\n",
"Requirement already satisfied: cryptography>=36.0.0 in /usr/local/lib/python3.10/dist-packages (from pdfminer.six->unstructured[all-docs]) (41.0.7)\n",
"Requirement already satisfied: Deprecated in /usr/local/lib/python3.10/dist-packages (from pikepdf->unstructured[all-docs]) (1.2.14)\n",
"Requirement already satisfied: jsonpath-python>=1.0.6 in /usr/local/lib/python3.10/dist-packages (from unstructured-client->unstructured[all-docs]) (1.0.6)\n",
"Requirement already satisfied: mypy-extensions>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from unstructured-client->unstructured[all-docs]) (1.0.0)\n",
"Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3->langchain-core<0.2,>=0.1.7->langchain) (1.3.0)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3->langchain-core<0.2,>=0.1.7->langchain) (1.2.0)\n",
"Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=36.0.0->pdfminer.six->unstructured[all-docs]) (1.16.0)\n",
"Requirement already satisfied: coloredlogs in /usr/local/lib/python3.10/dist-packages (from onnxruntime<1.16->unstructured-inference==0.7.21->unstructured[all-docs]) (15.0.1)\n",
"Requirement already satisfied: flatbuffers in /usr/local/lib/python3.10/dist-packages (from onnxruntime<1.16->unstructured-inference==0.7.21->unstructured[all-docs]) (23.5.26)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime<1.16->unstructured-inference==0.7.21->unstructured[all-docs]) (1.12)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers>=4.25.1->unstructured-inference==0.7.21->unstructured[all-docs]) (3.13.1)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.25.1->unstructured-inference==0.7.21->unstructured[all-docs]) (0.15.0)\n",
"Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.25.1->unstructured-inference==0.7.21->unstructured[all-docs]) (0.4.1)\n",
"Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub->unstructured-inference==0.7.21->unstructured[all-docs]) (2023.6.0)\n",
"Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (1.11.4)\n",
"Requirement already satisfied: iopath in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.1.10)\n",
"Requirement already satisfied: pdfplumber in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.10.3)\n",
"Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (2.1.0+cu121)\n",
"Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.16.0+cu121)\n",
"Requirement already satisfied: effdet in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.4.1)\n",
"Requirement already satisfied: pytesseract in /usr/local/lib/python3.10/dist-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.3.10)\n",
"Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six->unstructured[all-docs]) (2.21)\n",
"Requirement already satisfied: humanfriendly>=9.1 in /usr/local/lib/python3.10/dist-packages (from coloredlogs->onnxruntime<1.16->unstructured-inference==0.7.21->unstructured[all-docs]) (10.0)\n",
"Requirement already satisfied: timm>=0.9.2 in /usr/local/lib/python3.10/dist-packages (from effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.9.12)\n",
"Requirement already satisfied: pycocotools>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (2.0.7)\n",
"Requirement already satisfied: omegaconf>=2.0 in /usr/local/lib/python3.10/dist-packages (from effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (2.3.0)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (2.1.0)\n",
"Requirement already satisfied: portalocker in /usr/local/lib/python3.10/dist-packages (from iopath->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (2.8.2)\n",
"Requirement already satisfied: pypdfium2>=4.18.0 in /usr/local/lib/python3.10/dist-packages (from pdfplumber->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (4.26.0)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->onnxruntime<1.16->unstructured-inference==0.7.21->unstructured[all-docs]) (1.3.0)\n",
"Requirement already satisfied: antlr4-python3-runtime==4.9.* in /usr/local/lib/python3.10/dist-packages (from omegaconf>=2.0->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (4.9.3)\n",
"Requirement already satisfied: matplotlib>=2.1.0 in /usr/local/lib/python3.10/dist-packages (from pycocotools>=2.0.2->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (3.7.1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (2.1.3)\n",
"Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools>=2.0.2->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (1.2.0)\n",
"Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools>=2.0.2->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (0.12.1)\n",
"Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools>=2.0.2->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (4.47.0)\n",
"Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools>=2.0.2->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (1.4.5)\n",
"Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools>=2.0.2->effdet->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.21->unstructured[all-docs]) (3.1.1)\n"
]
}
],
"source": [
"!pip install fastapi kaleido uvicorn chromadb\n",
"!pip install langchain_openai cohere\n",
"!pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
]
},
{
"cell_type": "markdown",
"source": [
"[link text](https://)WARNING: The following packages were previously imported in this runtime:\n",
" [PIL]\n",
"You must restart the runtime in order to use newly installed versions.--> Make sure to restart the runtime"
],
"metadata": {
"id": "KB33CbcxM3DE"
},
"id": "KB33CbcxM3DE"
},
{
"cell_type": "markdown",
"id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
"metadata": {
"id": "44349a83-e1dc-4eed-ba75-587f309d8c88"
},
"source": [
"The PDF partitioning used by Unstructured will use:\n",
"\n",
"* `tesseract` for Optical Character Recognition (OCR)\n",
"* `poppler` for PDF rendering and processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7880871-4949-4ea2-aed8-540a09188a41",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f7880871-4949-4ea2-aed8-540a09188a41",
"outputId": "a7eb8881-4ffd-4f2b-cbe0-48757f4bdd09"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/bin/bash: line 1: brew: command not found\n",
"/bin/bash: line 1: brew: command not found\n"
]
}
],
"source": [
"#For Mac\n",
"! brew install tesseract\n",
"! brew install poppler"
]
},
{
"cell_type": "code",
"source": [
"#For Google colab-Linux\n",
"!sudo apt-get install tesseract-ocr\n",
"!sudo apt-get install poppler-utils\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AK4MED4rsFY1",
"outputId": "819f0a4c-f97d-424d-c5b3-6eec8c09ad61"
},
"id": "AK4MED4rsFY1",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Reading package lists... Done\n",
"Building dependency tree... Done\n",
"Reading state information... Done\n",
"tesseract-ocr is already the newest version (4.1.1-2.1build1).\n",
"0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.\n",
"Reading package lists... Done\n",
"Building dependency tree... Done\n",
"Reading state information... Done\n",
"poppler-utils is already the newest version (22.02.0-2ubuntu0.3).\n",
"0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!pip install pytesseract\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "163r_PXcsNMJ",
"outputId": "bf3bba72-4ac2-4a5d-9b8f-d77ff004629f"
},
"id": "163r_PXcsNMJ",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: pytesseract in /usr/local/lib/python3.10/dist-packages (0.3.10)\n",
"Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from pytesseract) (23.2)\n",
"Requirement already satisfied: Pillow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from pytesseract) (10.2.0)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!pip install pdf2image\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "U2QZRrZwsURP",
"outputId": "80cca944-7824-4910-9ccb-cacf3c37c2eb"
},
"id": "U2QZRrZwsURP",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: pdf2image in /usr/local/lib/python3.10/dist-packages (1.17.0)\n",
"Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (from pdf2image) (10.2.0)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "luxfopMRssYT",
"outputId": "84f31443-2d7f-49ed-a506-65c6f3f21299"
},
"id": "luxfopMRssYT",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Mounted at /content/drive\n"
]
}
]
},
{
"cell_type": "markdown",
"id": "7c24efa9-b6f6-4dc2-bfe3-70819ba3ef75",
"metadata": {
"id": "7c24efa9-b6f6-4dc2-bfe3-70819ba3ef75"
},
"source": [
"## Data Loading\n",
"\n",
"### Partition PDF tables and text\n",
"\n",
"Apply to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper.\n",
"\n",
"We use the Unstructured [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf), which segments a PDF document by using a layout model.\n",
"\n",
"This layout model makes it possible to extract elements, such as tables, from pdfs.\n",
"\n",
"We also can use `Unstructured` chunking, which:\n",
"\n",
"* Tries to identify document sections (e.g., Introduction, etc)\n",
"* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62cf502b-407d-4645-a72c-24498fd55130",
"metadata": {
"id": "62cf502b-407d-4645-a72c-24498fd55130"
},
"outputs": [],
"source": [
"path = '/content/drive/MyDrive/LLM/Multi/'\n",
"outputpath = '/content/drive/MyDrive/LLM/Multi/'"
]
},
{
"cell_type": "markdown",
"source": [
"### Imports\n",
"- **`from typing import Any`**: This line imports the `Any` type from Python's `typing` module. `Any` is used to indicate that a variable or return type can be of any type.\n",
"\n",
"- **`from pydantic import BaseModel`**: This imports `BaseModel` from the `pydantic` library. `BaseModel` is used to create classes that automatically handle data validation and parsing. It's a common practice in Python applications for creating data objects with automatic validation checks.\n",
"\n",
"- **`from unstructured.partition.pdf import partition_pdf`**: Here, the `partition_pdf` function is imported from the `unstructured.partition.pdf` module. This function is likely used for partitioning or processing PDF files.\n",
"\n",
"### Processing PDF File\n",
"- **`raw_pdf_elements = partition_pdf(...)`**: This line calls the `partition_pdf` function and stores the result in `raw_pdf_elements`. The `partition_pdf` function is being passed several arguments that determine how the PDF file, referred to as \"LLaMA2.pdf\", will be processed:\n",
" - **Filename**: `filename=path + \"LLaMA2.pdf\"` specifies the path and filename of the PDF to be processed.\n",
" - **Image Extraction**: `extract_images_in_pdf=False` indicates that images embedded in the PDF should not be extracted.\n",
" - **Table Structure Inference**: `infer_table_structure=True` enables the function to use a layout model (possibly YOLOX) to infer the structure of tables within the PDF.\n",
" - **Chunking Strategy**: `chunking_strategy=\"by_title\"` sets the strategy for aggregating text to be based on titles within the document.\n",
" - **Chunking Parameters**: `max_characters=4000, new_after_n_chars=3800, combine_text_under_n_chars=2000` - These parameters control how text chunks are formed. A new chunk is attempted to be created after 3800 characters, chunks are attempted to be kept above 2000 characters, and the maximum number of characters in a chunk is set to 4000.\n",
" - **Image Output Directory**: `image_output_dir_path=path` specifies the directory path where any output images (if extracted) should be stored.\n"
],
"metadata": {
"id": "NEo6rgwnAl0w"
},
"id": "NEo6rgwnAl0w"
},
{
"cell_type": "code",
"execution_count": null,
"id": "3867a654-61ba-4759-9a64-de953a429ced",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187,
"referenced_widgets": [
"2414b346e197486fa3ec261eaafda277",
"c44ad3dfaa1d43b5b88a3443508d37c9",
"2dfe21d7b0d541da86fdceb86303aa4b",
"3d9d125614844a50857210c5c76db292",
"a45028a5ec0347c99f780a83b9c3b1a2",
"6122a72d4b0b4a06a60e03cfba2828e5",
"f539fe338e1c4df4afa61c060e6168c6",
"036c66235c7644f181d021e64da90afd",
"529a898eff1648538816e73d8e39a678",
"632910e4507e4f51b3cfa6c9c877e42d",
"4f3e28c3cdb64e72b3f9bfd5f29a30b6",
"978c6b3be9484f42850fa61702319a53",
"baa7ba9b80bd44f5afcaea7784f44c13",
"1862cdfbd1f94bba881b7bab356d6435",
"cba9ca4311844f3d8c0609382c1cbd04",
"e8de0a5cb25d46e6875f9c3297fa9bcd",
"73d4003bf1a34370a57b9e12b823602b",
"b1f644ccd2274c8abb7517d09ffc0f55",
"12ae4e014abc4e29811b85a432773585",
"24c252c09a0247d78759f47601925606",
"b4151ced4b3044d9b78239e0d8d8f84e",
"0781be1837704abbb819a32954116868",
"57b0d6e3c92a4c18b007691ebf2dfe1d",
"1232eb4459124199b4f89e39a53e2676",
"9b3f9d1288a14edfb427cfded0afce69",
"7d78a77b0006457a89fb3e888200299c",
"35c1ac9c1c0948938dbf245634c32c11",
"864da1504d0847919eaae3936273232d",
"738425dc04df44a6a8eee244560d3f36",
"9bcf29de053145c0936b610b4168b968",
"59252baca46e4262a4583731ccd4c8fb",
"914cfc1097f0469db308542d35b0baa8",
"ea1d06499e3e4690aaba0f867a2bebd9"
]
},
"id": "3867a654-61ba-4759-9a64-de953a429ced",
"outputId": "4361926c-c21d-4a4f-8d31-1e4915d6a070"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"config.json: 0%| | 0.00/1.47k [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2414b346e197486fa3ec261eaafda277"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"model.safetensors: 0%| | 0.00/115M [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "978c6b3be9484f42850fa61702319a53"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"model.safetensors: 0%| | 0.00/46.8M [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "57b0d6e3c92a4c18b007691ebf2dfe1d"
}
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']\n",
"- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
]
}
],
"source": [
"from typing import Any\n",
"\n",
"from pydantic import BaseModel\n",
"from unstructured.partition.pdf import partition_pdf\n",
"\n",
"# Get elements\n",
"raw_pdf_elements = partition_pdf(\n",
" filename=path + \"LLAMA2.pdf\",\n",
" # Unstructured first finds embedded image blocks\n",
" extract_images_in_pdf=False,\n",
" # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
" # Titles are any sub-section of the document\n",
" infer_table_structure=True,\n",
" # Post processing to aggregate text once we have the title\n",
" chunking_strategy=\"by_title\",\n",
" # Chunking params to aggregate text blocks\n",
" # Attempt to create a new chunk 3800 chars\n",
" # Attempt to keep chunks > 2000 chars\n",
" max_characters=4000,\n",
" new_after_n_chars=3800,\n",
" combine_text_under_n_chars=2000,\n",
" image_output_dir_path=path,\n",
")"
]
},
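{
"cell_type": "markdown",
"id": "elemTypeSplitSketch",
"metadata": {
"id": "elemTypeSplitSketch"
},
"source": [
"`partition_pdf` returns a mix of element types (e.g., `CompositeElement` for text chunks and `Table` for tables). Below is a minimal sketch of how these elements can be separated by type before summarization; the `Element` wrapper class and variable names here are illustrative.\n",
"\n",
"```python\n",
"from typing import Any\n",
"\n",
"from pydantic import BaseModel\n",
"\n",
"\n",
"class Element(BaseModel):\n",
"    # Simple typed container for one parsed element\n",
"    type: str\n",
"    text: Any\n",
"\n",
"\n",
"# Route each parsed element into a \"table\" or \"text\" bucket by its class name\n",
"categorized_elements = []\n",
"for element in raw_pdf_elements:\n",
"    if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
"        categorized_elements.append(Element(type=\"table\", text=str(element)))\n",
"    elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
"        categorized_elements.append(Element(type=\"text\", text=str(element)))\n",
"\n",
"table_elements = [e for e in categorized_elements if e.type == \"table\"]\n",
"text_elements = [e for e in categorized_elements if e.type == \"text\"]\n",
"```"
]
},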
{
"cell_type": "code",
"source": [
"raw_pdf_elements"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "gQk-AX4-Rwht",
"outputId": "ec6f48f6-ae1b-4d3e-a0c7-9379b1300d26"
},
"id": "gQk-AX4-Rwht",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<unstructured.documents.elements.CompositeElement at 0x7a45a41eb610>,\n",
" <unstructured.documents.elements.Table at 0x7a45a7b5fac0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a7b5dae0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a7db90f0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a7dbb0d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a7db86d0>,\n",
" <unstructured.documents.elements.Table at 0x7a45a7dbadd0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a7db89a0>,\n",
" <unstructured.documents.elements.Table at 0x7a458f9c0130>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a7db8fa0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c33d0>,\n",
" <unstructured.documents.elements.Table at 0x7a458f9c3940>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c0190>,\n",
" <unstructured.documents.elements.Table at 0x7a458f9c0d60>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c3fd0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c0700>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c3c10>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c2770>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a458f9c0880>,\n",
" <unstructured.documents.elements.Table at 0x7a458f9c2bc0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814cdf0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814d390>,\n",
" <unstructured.documents.elements.Table at 0x7a45b814c3d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814c760>,\n",
" <unstructured.documents.elements.Table at 0x7a45b814d480>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814d090>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814f820>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814e740>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b814fee0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c5480>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c4220>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c5840>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c6b30>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c5b70>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c7010>,\n",
" <unstructured.documents.elements.Table at 0x7a45b814ec50>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45b81c6e60>,\n",
" <unstructured.documents.elements.Table at 0x7a45b81c7040>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157be0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f1156e60>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157f40>,\n",
" <unstructured.documents.elements.Table at 0x7a45f11578b0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154310>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154f10>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1156f50>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154fd0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154af0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157fa0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157dc0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155660>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f11548b0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155210>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f11571f0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f11542e0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f1156d40>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154c40>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155e10>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155990>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155cf0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f11545b0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f11548e0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157610>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155690>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155d50>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157790>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1156b00>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154130>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157af0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157760>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155840>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155e70>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1155240>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1157d90>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1156fb0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1154910>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f1156380>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f11567d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f7fbd180>,\n",
" <unstructured.documents.elements.Table at 0x7a45f7fbd270>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f78cd7e0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f1157160>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a48b3880>,\n",
" <unstructured.documents.elements.Table at 0x7a45f78e7760>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f78e7940>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b85d80>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4a1b9a0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b854b0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b84280>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b86350>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b879d0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b87a30>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b86530>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b85ae0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b85cf0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b87e20>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b857e0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b85030>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b87160>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b85d20>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b84e50>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b84760>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b852d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b868f0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f1267c10>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b86710>,\n",
" <unstructured.documents.elements.Table at 0x7a45f7cf9ba0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b878b0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4b87010>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4b85780>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f138c310>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45bc108c40>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45bc108bb0>,\n",
" <unstructured.documents.elements.Table at 0x7a45bc4490f0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f76c5ff0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f76c7d60>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f76c6050>,\n",
" <unstructured.documents.elements.Table at 0x7a45f76c60b0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f76c6170>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f79bfb20>,\n",
" <unstructured.documents.elements.TableChunk at 0x7a45f79bfe50>,\n",
" <unstructured.documents.elements.TableChunk at 0x7a45f79bfc70>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f79bf3d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f79bf640>,\n",
" <unstructured.documents.elements.Table at 0x7a45f79bfaf0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f79bfcd0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a46fdf9e170>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45bd53e440>,\n",
" <unstructured.documents.elements.Table at 0x7a45bd53e470>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a46f4c29360>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a46f4c2a410>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a46f4c2a020>,\n",
" <unstructured.documents.elements.Table at 0x7a46f4c2a320>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a46f4c298d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45bbfe8eb0>,\n",
" <unstructured.documents.elements.Table at 0x7a46f4c28af0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a46f4c2a1d0>,\n",
" <unstructured.documents.elements.Table at 0x7a45bbfe8df0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45bbfe8e50>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4fc0130>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4fc2d70>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4fc3d00>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4fc3dc0>,\n",
" <unstructured.documents.elements.TableChunk at 0x7a45f4fc3d30>,\n",
" <unstructured.documents.elements.TableChunk at 0x7a45f4fc3cd0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4fc3ac0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4fc0df0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4cf3460>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4cf3160>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4cf0520>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4cf0bb0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4cf2170>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4cf1fc0>,\n",
" <unstructured.documents.elements.Table at 0x7a45f4cf0ac0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4cf2ce0>,\n",
" <unstructured.documents.elements.TableChunk at 0x7a45f4cf1510>,\n",
" <unstructured.documents.elements.TableChunk at 0x7a45f501e1d0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f4cf2bf0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f501d4b0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f501cb20>,\n",
" <unstructured.documents.elements.Table at 0x7a45f501ff40>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f501d4e0>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45f501eb30>,\n",
" <unstructured.documents.elements.Table at 0x7a45f501db70>,\n",
" <unstructured.documents.elements.CompositeElement at 0x7a45a41e8eb0>]"
]
},
"metadata": {},
"execution_count": 27
}
]
},
{
"cell_type": "code",
"source": [
"raw_pdf_elements[8].metadata.text_as_html"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 107
},
"id": "g4K6RtTBSaOD",
"outputId": "cb6fddb3-51b6-41b8-a875-d4a7745cb1ce"
},
"id": "g4K6RtTBSaOD",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'<table><thead><th></th><th></th><th>Time (GPU hours)</th><th>Power Consumption (W)</th><th>Carbon Emitted (tCOzeq)</th></thead><tr><td rowspan=\"4\">L 2 TAMA 2</td><td>7B</td><td>184320</td><td>400</td><td>31.22</td></tr><tr><td></td><td>13B</td><td>368640</td><td>400</td><td>62.44</td></tr><tr><td></td><td>34B</td><td>1038336</td><td>350</td><td>153.90</td></tr><tr><td></td><td>70B</td><td>1720320</td><td>400</td><td>291.42</td></tr><tr><td>Total</td><td></td><td>3311616</td><td></td><td>539.00</td></tr></table>'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 28
}
]
},
{
"cell_type": "markdown",
"source": [
"<table><thead><th></th><th></th><th>Time (GPU hours)</th><th>Power Consumption (W)</th><th>Carbon Emitted (tCOzeq)</th></thead><tr><td rowspan=\"4\">L 2 TAMA 2</td><td>7B</td><td>184320</td><td>400</td><td>31.22</td></tr><tr><td></td><td>13B</td><td>368640</td><td>400</td><td>62.44</td></tr><tr><td></td><td>34B</td><td>1038336</td><td>350</td><td>153.90</td></tr><tr><td></td><td>70B</td><td>1720320</td><td>400</td><td>291.42</td></tr><tr><td>Total</td><td></td><td>3311616</td><td></td><td>539.00</td></tr></table>"
],
"metadata": {
"id": "aMwUFHWSHZfx"
},
"id": "aMwUFHWSHZfx"
},
{
"cell_type": "code",
"source": [
"raw_pdf_elements[-2].metadata.text_as_html"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 178
},
"id": "7Vs2tygSHl7F",
"outputId": "4b90f7b2-fa48-4010-801f-c36bdda40ce7"
},
"id": "7Vs2tygSHl7F",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'<table><thead><th>Model Developers</th><th>Meta AI</th></thead><tr><td>Variations</td><td>Liama 2 comes in a range of parameter sizes—7B, 13B, and 70B—as well as pretrained and fine-tuned variations.</td></tr><tr><td>Input</td><td>Models input text only.</td></tr><tr><td>Output</td><td>Models generate text only.</td></tr><tr><td>Model Architecture</td><td>LLAMA 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforce- ment learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.</td></tr><tr><td>Model Dates</td><td>Liama 2 was trained between January 2023 and July 2023.</td></tr><tr><td>Status</td><td>This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.</td></tr><tr><td>License</td><td>A custom commercial license is available at: ai.meta.com/resources/ models-and-libraries/1lama-downloads/</td></tr><tr><td>Where to send com- ments</td><td>Instructions on how to provide feedback or comments on the model can be found in the model README, or by opening an issue in the GitHub repository (https: //github.com/facebookresearch/1lama/).</td></tr><tr><td colspan=\"2\">Intended Use</td></tr><tr><td>Intended Use Cases</td><td>Lama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.</td></tr><tr><td>Out-of-Scope Uses</td><td>Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for LLaMa 2.</td></tr><tr><td></td><td>Hardware and Software (Section 2.2)</td></tr><tr><td>Training Factors</td><td>We used custom training libraries, Meta’s Research Super Cluster, and produc- tion clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.</td></tr><tr><td>Carbon Footprint</td><td>Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta’s sustainability program.</td></tr><tr><td colspan=\"2\">Training Data (Sections 2.1 and 3)</td></tr><tr><td>Overview</td><td>LiaMa 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.</td></tr><tr><td>Data Freshness</td><td>The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.</td></tr><tr><td colspan=\"2\">Evaluation Results</td></tr><tr><td colspan=\"2\">See evaluations for pretraining (Section 2); fine-tuning (Section 3); and safety (Section 4).</td></tr><tr><td></td><td>Ethical Considerations and Limitations (Section 5.2)</td></tr></table>'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 29
}
]
},
{
"cell_type": "markdown",
"source": [
"<table><thead><th>Model Developers</th><th>Meta AI</th></thead><tr><td>Variations</td><td>Liama 2 comes in a range of parameter sizes—7B, 13B, and 70B—as well as pretrained and fine-tuned variations.</td></tr><tr><td>Input</td><td>Models input text only.</td></tr><tr><td>Output</td><td>Models generate text only.</td></tr><tr><td>Model Architecture</td><td>LLAMA 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforce- ment learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.</td></tr><tr><td>Model Dates</td><td>Liama 2 was trained between January 2023 and July 2023.</td></tr><tr><td>Status</td><td>This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.</td></tr><tr><td>License</td><td>A custom commercial license is available at: ai.meta.com/resources/ models-and-libraries/1lama-downloads/</td></tr><tr><td>Where to send com- ments</td><td>Instructions on how to provide feedback or comments on the model can be found in the model README, or by opening an issue in the GitHub repository (https: //github.com/facebookresearch/1lama/).</td></tr><tr><td colspan=\"2\">Intended Use</td></tr><tr><td>Intended Use Cases</td><td>Lama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.</td></tr><tr><td>Out-of-Scope Uses</td><td>Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for LLaMa 2.</td></tr><tr><td></td><td>Hardware and Software (Section 2.2)</td></tr><tr><td>Training Factors</td><td>We used custom training libraries, Meta’s Research Super Cluster, and produc- tion clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.</td></tr><tr><td>Carbon Footprint</td><td>Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta’s sustainability program.</td></tr><tr><td colspan=\"2\">Training Data (Sections 2.1 and 3)</td></tr><tr><td>Overview</td><td>LiaMa 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.</td></tr><tr><td>Data Freshness</td><td>The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.</td></tr><tr><td colspan=\"2\">Evaluation Results</td></tr><tr><td colspan=\"2\">See evaluations for pretraining (Section 2); fine-tuning (Section 3); and safety (Section 4).</td></tr><tr><td></td><td>Ethical Considerations and Limitations (Section 5.2)</td></tr></table>"
],
"metadata": {
"id": "KHkau5oGHrLC"
},
"id": "KHkau5oGHrLC"
},
{
"cell_type": "code",
"source": [
"raw_pdf_elements[15].metadata"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JHGXMX46S7yd",
"outputId": "a08f56c9-40d5-4743-c8f0-e81cd1d77330"
},
"id": "JHGXMX46S7yd",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<unstructured.documents.elements.ElementMetadata at 0x7a458f9c0fd0>"
]
},
"metadata": {},
"execution_count": 30
}
]
},
{
"cell_type": "code",
"source": [
"text_element = raw_pdf_elements[16]\n",
"\n",
"# Print the type of the element\n",
"print(\"Type of element:\", type(text_element))\n",
"\n",
"\n",
"# If text is stored in a different attribute, try accessing it\n",
"if hasattr(text_element, 'text'):\n",
" print(\"Text content:\", text_element.text)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QNs-RozS_hDK",
"outputId": "19a3e35b-9d22-4ce3-cbba-dad084f86f30"
},
"id": "QNs-RozS_hDK",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Type of element: <class 'unstructured.documents.elements.CompositeElement'>\n",
"Text content: 3.2 Reinforcement Learning with Human Feedback (RLHF)\n",
"\n",
"RLHF is a model training procedure that is applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. We collect data that represents empirically\n",
"\n",
"9\n",
"\n",
"sampled human preferences, whereby human annotators select which of two model outputs they prefer. This human feedback is subsequently used to train a reward model, which learns patterns in the preferences of the human annotators and can then automate preference decisions.\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"RLHF is a model training procedure that is applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. We collect data that represents empirically\n",
"\n",
"9\n",
"\n",
"sampled human preferences, whereby human annotators select which of two model outputs they prefer. This human feedback is subsequently used to train a reward model, which learns patterns in the preferences of the human annotators and can then automate preference decisions.\n"
],
"metadata": {
"id": "8jaUztCAH86c"
},
"id": "8jaUztCAH86c"
},
{
"cell_type": "markdown",
"id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
"metadata": {
"id": "b09cd727-aeab-49af-8a51-0dc377321e7c"
},
"source": [
"We can examine the elements extracted by `partition_pdf`.\n",
"\n",
"`CompositeElement` are aggregated chunks."
]
},
{
"cell_type": "markdown",
"source": [
"### Initialization of `category_counts` Dictionary\n",
"- **`category_counts = {}`**: This line initializes an empty dictionary named `category_counts`. This dictionary will be used to store the counts of each type of element found in the PDF.\n",
"\n",
"### Iterating Over Elements in `raw_pdf_elements`\n",
"- The `for` loop (`for element in raw_pdf_elements:`) iterates over each element in the `raw_pdf_elements` list. These elements were obtained from the previous code snippet where the PDF file was partitioned.\n",
"\n",
"### Determining the Category (Type) of Each Element\n",
"- **`category = str(type(element))`**: For each element in `raw_pdf_elements`, the type of the element is determined using the `type` function and then converted to a string. This type (or category) could be things like text blocks, images, tables, etc., depending on how the PDF was partitioned.\n",
"\n",
"### Counting the Occurrences of Each Category\n",
"- The `if` statement checks if the category (type) of the current element already exists as a key in the `category_counts` dictionary.\n",
" - If the category exists (`if category in category_counts:`), the code increments the count of that category by 1.\n",
" - If the category does not exist in the dictionary (`else:`), it adds the category to the dictionary with an initial count of 1.\n",
"- This process results in `category_counts` holding the total count of each distinct type of element found in the PDF.\n",
"\n",
"### Creating a Set of Unique Categories\n",
"- **`unique_categories = set(category_counts.keys())`**: This line creates a set named `unique_categories`, which contains all the unique keys (categories) from the `category_counts` dictionary. A set is a collection in Python that cannot have duplicate elements, so `unique_categories` will have each category listed exactly once.\n",
"\n",
"### Result - `category_counts`\n",
"- Finally, `category_counts` (not explicitly printed in the code but mentioned as the last line) will contain the count of each type of element found in the PDF. Each key in this dictionary is a unique category, and the value is the number of times that category appears in `raw_pdf_elements`.\n"
],
"metadata": {
"id": "zcURdn9vCJyC"
},
"id": "zcURdn9vCJyC"
},
{
"cell_type": "code",
"execution_count": null,
"id": "628abfc6-4057-434b-b880-d88e3ba44657",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "628abfc6-4057-434b-b880-d88e3ba44657",
"outputId": "a459d056-6dd3-435d-8206-c9b2dc7896e3"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{\"<class 'unstructured.documents.elements.CompositeElement'>\": 114,\n",
" \"<class 'unstructured.documents.elements.Table'>\": 44,\n",
" \"<class 'unstructured.documents.elements.TableChunk'>\": 6}"
]
},
"metadata": {},
"execution_count": 32
}
],
"source": [
"# Create a dictionary to store counts of each type\n",
"category_counts = {}\n",
"\n",
"for element in raw_pdf_elements:\n",
" category = str(type(element))\n",
" if category in category_counts:\n",
" category_counts[category] += 1\n",
" else:\n",
" category_counts[category] = 1\n",
"\n",
"# Unique_categories will have unique elements\n",
"unique_categories = set(category_counts.keys())\n",
"category_counts"
]
},
{
"cell_type": "markdown",
"source": [
"### Defining the `Element` Class\n",
"- **`class Element(BaseModel):`**: This defines a new class called `Element`, which inherits from `BaseModel` provided by `pydantic`. This allows for automatic data validation and type annotations.\n",
" - Inside the class, two attributes are declared: `type` (a string) and `text` (which can be of any type, denoted by `Any`).\n",
"\n",
"### Initializing `categorized_elements` List\n",
"- **`categorized_elements = []`**: This line initializes an empty list named `categorized_elements`, which will store the categorized elements.\n",
"\n",
"### Categorizing Elements from `raw_pdf_elements`\n",
"- The `for` loop iterates over each element in `raw_pdf_elements`.\n",
" - It checks the type of each element (as a string) to determine if it's a table or a text block (or some other composite element):\n",
" - If the element type includes `\"unstructured.documents.elements.Table\"`, it is categorized as a table. An instance of `Element` with `type=\"table\"` and `text=str(element)` is appended to `categorized_elements`.\n",
" - If the element type includes `\"unstructured.documents.elements.CompositeElement\"`, it is categorized as text. An instance of `Element` with `type=\"text\"` and `text=str(element)` is appended to `categorized_elements`.\n",
"\n",
"### Extracting and Counting Table Elements\n",
"- **`table_elements = [e for e in categorized_elements if e.type == \"table\"]`**: This list comprehension creates a new list, `table_elements`, containing only elements whose type is `\"table\"`.\n",
"- **`print(len(table_elements))`**: This prints the number of table elements found in the PDF.\n",
"\n",
"### Extracting and Counting Text Elements\n",
"- **`text_elements = [e for e in categorized_elements if e.type == \"text\"]`**: Similarly, this list comprehension creates a list, `text_elements`, containing only elements whose type is `\"text\"`.\n",
"- **`print(len(text_elements))`**: This prints the number of text elements found in the PDF.\n"
],
"metadata": {
"id": "wl2vviNvCOWC"
},
"id": "wl2vviNvCOWC"
},
{
"cell_type": "code",
"execution_count": null,
"id": "5462f29e-fd59-4e0e-9493-ea3b560e523e",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5462f29e-fd59-4e0e-9493-ea3b560e523e",
"outputId": "71bfcc07-a2bf-4ed1-e9f7-3f1036bcc90d"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"50\n",
"114\n"
]
}
],
"source": [
"class Element(BaseModel):\n",
" type: str\n",
" text: Any\n",
"\n",
"\n",
"# Categorize by type\n",
"categorized_elements = []\n",
"for element in raw_pdf_elements:\n",
" if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
" categorized_elements.append(Element(type=\"table\", text=str(element)))\n",
" elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
" categorized_elements.append(Element(type=\"text\", text=str(element)))\n",
"\n",
"# Tables\n",
"table_elements = [e for e in categorized_elements if e.type == \"table\"]\n",
"print(len(table_elements))\n",
"\n",
"# Text\n",
"text_elements = [e for e in categorized_elements if e.type == \"text\"]\n",
"print(len(text_elements))"
]
},
{
"cell_type": "markdown",
"id": "731b3dfc-7ddf-4a11-9a30-9a79b7c66e16",
"metadata": {
"id": "731b3dfc-7ddf-4a11-9a30-9a79b7c66e16"
},
"source": [
"## Multi-vector retriever\n",
"\n",
"Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text.\n",
"\n",
"With the summary, we will also store the raw table elements.\n",
"\n",
"The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
"\n",
"The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",
"\n",
"### Summaries"
]
},
{
"cell_type": "markdown",
"source": [
"### Import Statements\n",
"- **`from langchain_core.output_parsers import StrOutputParser`**: This imports `StrOutputParser` from `langchain_core.output_parsers`. It's likely used to parse string outputs from language model responses.\n",
"- **`from langchain_core.prompts import ChatPromptTemplate`**: Imports `ChatPromptTemplate` from `langchain_core.prompts`, which is probably used to format or template chat prompts for language models.\n",
"- **`from langchain_openai import ChatOpenAI`**: This imports `ChatOpenAI` from `langchain_openai`. `ChatOpenAI` is likely a wrapper or interface to interact with OpenAI's language models.\n",
"\n",
"### Processing Table Elements\n",
"- **`tables = [i.text for i in table_elements]`**: This line creates a list of texts from `table_elements` (presumably obtained from a previous step where PDF elements were categorized). Each element `i` in `table_elements` is processed to extract its text content.\n",
"- **`table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})`**: This line seems to be using a `summarize_chain` (not defined in the snippet but probably a part of the langchain setup) to create summaries of the texts in `tables`. The `batch` method processes multiple items at once, with `max_concurrency` set to 5, indicating that up to 5 items can be processed simultaneously.\n",
"\n",
"### Processing Text Elements\n",
"- **`texts = [i.text for i in text_elements]`**: Similar to tables, this line creates a list of texts from `text_elements`.\n",
"- **`text_summaries = summarize_chain.batch(texts, {\"max_concurrency\": 5})`**: This applies the same summarization process to the texts.\n",
"\n",
"### Purpose of Summarizations\n",
"- **Improve Retrieval Quality**: The summaries are used to enhance the retrieval process, as described in the multi-vector retriever documentation. Summarizing complex or lengthy text into concise forms can make it easier for retrieval systems to understand and process the content.\n",
"- **Provide Context for Language Models (LLMs)**: The raw tables (and possibly texts) are also passed to language models (like GPT from OpenAI) to provide full context. This enables the LLM to generate more informed and accurate responses.\n",
"\n",
"### Summary\n",
"In summary, this code is part of a larger system that integrates language models with a retrieval system, enhancing the ability to process, summarize, and retrieve information from structured (tables) and unstructured (text) data.\n"
],
"metadata": {
"id": "9BUqSn08Dasy"
},
"id": "9BUqSn08Dasy"
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e275736-3408-4d7a-990e-4362c88e81f8",
"metadata": {
"id": "8e275736-3408-4d7a-990e-4362c88e81f8"
},
"outputs": [],
"source": [
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "code",
"source": [
"import os\n",
"\n",
"# Replace 'your-api-key' with your actual OpenAI API key\n",
"os.environ['OPENAI_API_KEY'] = 'Your OpenAI API Key'\n"
],
"metadata": {
"id": "zcrOjzqF_ZUH"
},
"id": "zcrOjzqF_ZUH",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "37b65677-aeb4-44fd-b06d-4539341ede97",
"metadata": {
"id": "37b65677-aeb4-44fd-b06d-4539341ede97"
},
"source": [
"We create a simple summarize chain for each element.\n",
"\n",
"You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n",
"\n",
"```\n",
"from langchain import hub\n",
"obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b12536a-1303-41ad-9948-4eb5a5f32614",
"metadata": {
"id": "1b12536a-1303-41ad-9948-4eb5a5f32614"
},
"outputs": [],
"source": [
"# Prompt\n",
"prompt_text = \"\"\"You are an assistant tasked with summarizing tables and text. \\\n",
"Give a concise summary of the table or text. Table or text chunk: {element} \"\"\"\n",
"prompt = ChatPromptTemplate.from_template(prompt_text)\n",
"\n",
"# Summary chain\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo\")\n",
"summarize_chain = {\"element\": lambda x: x} | prompt | model | StrOutputParser()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d8b567c-b442-4bf0-b639-04bd89effc62",
"metadata": {
"id": "8d8b567c-b442-4bf0-b639-04bd89effc62"
},
"outputs": [],
"source": [
"# Apply to tables\n",
"tables = [i.text for i in table_elements]\n",
"table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})"
]
},
{
"cell_type": "code",
"source": [
"if table_summaries:\n",
" print(table_summaries[1])\n",
"else:\n",
" print(\"No summaries available\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AtM_brk6CmZK",
"outputId": "42ae5f6e-e283-4569-9a8c-23c42483b95a"
},
"id": "AtM_brk6CmZK",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The table provides information on different versions of Llama, including the training data, parameters, context length, GQA compatibility, number of tokens, and learning rate. Llama 1 and Llama 2 both use publicly available online data, while Llama 2 also incorporates a new mix of data. The versions vary in the number of parameters (ranging from 7B to 70B), context length (2k to 4k), and tokens (1.0T to 2.0T). Additionally, Llama 2 is compatible with GQA and has a lower learning rate compared to Llama 1.\n"
]
}
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e9c176c-3d46-4034-b169-0d7305d42d27",
"metadata": {
"id": "3e9c176c-3d46-4034-b169-0d7305d42d27"
},
"outputs": [],
"source": [
"# Apply to texts\n",
"texts = [i.text for i in text_elements]\n",
"text_summaries = summarize_chain.batch(texts, {\"max_concurrency\": 5})"
]
},
{
"cell_type": "code",
"source": [
"if text_summaries:\n",
" print(text_summaries[1])\n",
"else:\n",
" print(\"No summaries available\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cNwAMfEQC1gN",
"outputId": "91e5df19-d5e0-4d28-aee4-ac97456ccd3d"
},
"id": "cNwAMfEQC1gN",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The table shows the comparison of different models in terms of loss and win rates. The figures provide the results of human evaluation and win-rate percentages for helpfulness and safety between different models. The evaluation results should be interpreted with caution due to potential noise and subjectivity.\n"
]
}
]
},
{
"cell_type": "markdown",
"id": "60524010-754f-4924-ad75-78cb54ca7257",
"metadata": {
"id": "60524010-754f-4924-ad75-78cb54ca7257"
},
"source": [
"### Add to vectorstore\n",
"\n",
"Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries:\n",
"\n",
"* `InMemoryStore` stores the raw text, tables\n",
"* `vectorstore` stores the embedded summaries"
]
},
{
"cell_type": "markdown",
"source": [
"### Import Statements\n",
"- The code begins with importing necessary classes and functions from `langchain` and other modules. This includes `MultiVectorRetriever`, `InMemoryStore`, `Chroma`, `Document`, `OpenAIEmbeddings`, and `uuid`.\n",
"\n",
"### Setting Up Vectorstore and Store\n",
"- **`vectorstore = Chroma(...)`**: This initializes a `vectorstore` using the `Chroma` class. `Chroma` is likely a type of vector store that indexes embeddings. The `collection_name` is set to `\"summaries\"`, and it uses `OpenAIEmbeddings` for generating embeddings of the summaries.\n",
"- **`store = InMemoryStore()`**: An `InMemoryStore` is initialized to act as a storage layer for documents.\n",
"- **`id_key = \"doc_id\"`**: This sets a key name for document IDs.\n",
"\n",
"### Initializing MultiVectorRetriever\n",
"- **`retriever = MultiVectorRetriever(...)`**: This creates an instance of `MultiVectorRetriever`, which is configured to use the previously created `vectorstore` and `store`. The `id_key` is used to identify documents.\n",
"\n",
"### Processing and Adding Text Summaries\n",
"- **`doc_ids = [str(uuid.uuid4()) for _ in texts]`**: Generates unique IDs for each text using `uuid.uuid4()`.\n",
"- The list comprehension creates `Document` objects for each text summary. Each `Document` contains the page content (summary) and metadata (document ID).\n",
"- **`retriever.vectorstore.add_documents(summary_texts)`**: Adds the summary texts to the `vectorstore`. The texts are likely converted into embeddings and stored.\n",
"- **`retriever.docstore.mset(...)`**: Adds the original texts to the document store, mapping them with their respective IDs.\n",
"\n",
"### Processing and Adding Table Summaries\n",
"- **`table_ids = [str(uuid.uuid4()) for _ in tables]`**: Generates unique IDs for each table.\n",
"- Similar to texts, it creates `Document` objects for each table summary and adds them to the `vectorstore`.\n",
"- Adds the original tables to the document store.\n",
"\n",
"### Summary\n",
"In summary, the code is setting up a system where summaries of texts and tables are embedded and indexed in a `vectorstore` for efficient retrieval, while the original texts and tables are stored in an `InMemoryStore`. The `MultiVectorRetriever` is then used to facilitate the retrieval of these documents based on the indexed summaries, allowing for a more efficient and context-aware retrieval process. This setup is particularly useful in scenarios where the original documents are large or complex, and quick access to summarized content is beneficial.\n"
],
"metadata": {
"id": "gUBI49P-ECRl"
},
"id": "gUBI49P-ECRl"
},
{
"cell_type": "code",
"execution_count": null,
"id": "346c3a02-8fea-4f75-a69e-fc9542b99dbc",
"metadata": {
"id": "346c3a02-8fea-4f75-a69e-fc9542b99dbc"
},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
"from langchain.storage import InMemoryStore\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_core.documents import Document\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"# The vectorstore to use to index the child chunks\n",
"vectorstore = Chroma(collection_name=\"summaries\", embedding_function=OpenAIEmbeddings())\n",
"\n",
"# The storage layer for the parent documents\n",
"store = InMemoryStore()\n",
"id_key = \"doc_id\"\n",
"\n",
"# The retriever (empty to start)\n",
"retriever = MultiVectorRetriever(\n",
" vectorstore=vectorstore,\n",
" docstore=store,\n",
" id_key=id_key,\n",
")\n",
"\n",
"# Add texts\n",
"doc_ids = [str(uuid.uuid4()) for _ in texts]\n",
"summary_texts = [\n",
" Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
" for i, s in enumerate(text_summaries)\n",
"]\n",
"retriever.vectorstore.add_documents(summary_texts)\n",
"retriever.docstore.mset(list(zip(doc_ids, texts)))\n",
"\n",
"# Add tables\n",
"table_ids = [str(uuid.uuid4()) for _ in tables]\n",
"summary_tables = [\n",
" Document(page_content=s, metadata={id_key: table_ids[i]})\n",
" for i, s in enumerate(table_summaries)\n",
"]\n",
"retriever.vectorstore.add_documents(summary_tables)\n",
"retriever.docstore.mset(list(zip(table_ids, tables)))"
]
},
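{
"cell_type": "markdown",
"source": [
"Before wiring the retriever into the RAG chain, it can help to sanity-check the two layers directly: the `vectorstore` returns the embedded *summaries*, while the `retriever` resolves the best-matching summary back to the raw text or table held in the `InMemoryStore`. The snippet below is a minimal sketch, assuming the `retriever` and `vectorstore` objects defined above (the query string is just an example):\n",
"\n",
"```python\n",
"query = \"How many tokens was Llama 2 trained on?\"\n",
"\n",
"# The vectorstore holds the embedded summaries\n",
"print(vectorstore.similarity_search(query, k=1)[0].page_content)\n",
"\n",
"# The retriever maps the matching summary back to the raw parent element;\n",
"# because we stored plain strings in the docstore, it returns strings rather than Documents\n",
"docs = retriever.get_relevant_documents(query)\n",
"print(docs[0][:500])\n",
"```"
],
"metadata": {}
},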
{
"cell_type": "markdown",
"id": "1d8bbbd9-009b-4b34-a206-5874a60adbda",
"metadata": {
"id": "1d8bbbd9-009b-4b34-a206-5874a60adbda"
},
"source": [
"## RAG\n",
"\n",
"Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval)."
]
},
{
"cell_type": "markdown",
"source": [
"### Import Statement\n",
"- **`from langchain_core.runnables import RunnablePassthrough`**: Imports the `RunnablePassthrough` class, which is likely used to pass data through the pipeline without modifying it.\n",
"\n",
"### Setting Up the Prompt Template\n",
"- **`template = \"\"\"...{context}...{question}...\"\"\"`**: This string defines a template for prompts that will be sent to the language model. The `{context}` and `{question}` placeholders will be filled with relevant information later in the pipeline.\n",
"- **`prompt = ChatPromptTemplate.from_template(template)`**: Creates a `ChatPromptTemplate` object from the defined template. This object is used to format prompts for the language model.\n",
"\n",
"### Initializing the Language Model\n",
"- **`model = ChatOpenAI(temperature=0, model=\"gpt-4\")`**: Initializes an instance of `ChatOpenAI`, configured to use the GPT-4 model with a temperature of 0. The temperature parameter controls the randomness of the model's responses, with 0 resulting in more deterministic outputs.\n",
"\n",
"### Creating the RAG Pipeline\n",
"- The pipeline is a sequence of operations that will be applied to process a question:\n",
" - **`{\"context\": retriever, \"question\": RunnablePassthrough()}`**: This step sets up the input for the pipeline. The `context` will be retrieved using the `retriever` (defined earlier), and the `question` is passed through as is.\n",
" - **`| prompt`**: The retrieved context and the question are formatted using the prompt template.\n",
" - **`| model`**: The formatted prompt is then sent to the model (GPT-4) for generating an answer.\n",
" - **`| StrOutputParser()`**: The output from the model is parsed into a string format.\n",
"\n",
"### Invoking the Pipeline\n",
"- **`chain.invoke(\"What is the number of training tokens for LLaMA2?\")`**: This line invokes the pipeline with a specific question. The pipeline works as follows:\n",
" - The `retriever` fetches context relevant to the question (\"What is the number of training tokens for LLaMA2?\").\n",
" - The context and the question are formatted into a prompt.\n",
" - This prompt is fed to the GPT-4 model, which generates an answer based on the provided context.\n",
" - The answer is parsed into a string and returned.\n",
"\n",
"### Summary\n",
"In summary, this code sets up a sophisticated RAG pipeline combining a retrieval system and a GPT-4 language model. It's designed to process questions by retrieving relevant context, formulating a prompt for the language model, and then parsing the language model's response to produce a final answer. This approach is particularly useful for answering specific questions where context from large datasets or documents is essential for generating accurate responses.\n"
],
"metadata": {
"id": "PJWq_FVvFCKF"
},
"id": "PJWq_FVvFCKF"
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2489de4-51e3-48b4-bbcd-ed9171deadf3",
"metadata": {
"id": "f2489de4-51e3-48b4-bbcd-ed9171deadf3"
},
"outputs": [],
"source": [
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"# Prompt template\n",
"template = \"\"\"Answer the question based only on the following context, which can include text and tables:\n",
"{context}\n",
"Question: {question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# LLM\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
"\n",
"# RAG pipeline\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90e3d100-10e8-4ee6-ae46-2480b1524ec8",
"metadata": {
"id": "90e3d100-10e8-4ee6-ae46-2480b1524ec8",
"outputId": "bb6e9215-7ad8-4fdd-8fae-6ed064ba498e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The number of training tokens for LLaMA2 is 2 trillion.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 46
}
],
"source": [
"chain.invoke(\"What is the number of training tokens for LLaMA2?\")"
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"What is the number of training tokens for LLaMA1?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "Fgl8I3QM-2Qk",
"outputId": "ec2c7f0f-c59c-41e2-df8b-a583ff6de1fc"
},
"id": "Fgl8I3QM-2Qk",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The number of training tokens for Llama 1 ranges from 1.0T to 1.4T depending on the model parameters.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 34
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"What is the Commonsense score for the 40B version of Llama 2?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "Yaha_H0bKNT8",
"outputId": "871f0a65-4c61-43a7-aed2-bcbae9c400e5"
},
"id": "Yaha_H0bKNT8",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The text does not provide information on the Commonsense score for the 40B version of Llama 2.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 35
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"What is the world knowledge score for the 7B version of Falcon?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "9WfQJQoAKvhh",
"outputId": "c22d212e-a530-45fd-bd35-8d29e86b1f6b"
},
"id": "9WfQJQoAKvhh",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The world knowledge score for the 7B version of Falcon is 56.1.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 36
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"what is the final reward function we use during optimization?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 53
},
"id": "FjTE16SnLdlS",
"outputId": "203bfa75-f602-485a-fa69-3c779acbf478"
},
"id": "FjTE16SnLdlS",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The final reward function used during optimization is R(g | p) = ˜Rc(g | p) − βDKL(πθ(g | p) ∥ π0(g | p)). This function contains a penalty term for diverging from the original policy π0.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 37
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"What is the Math MMLU score for the 65B version of Llama 2?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "bdaTSL8-Ggw_",
"outputId": "3cc64793-0732-4be4-f294-534656107c7c"
},
"id": "bdaTSL8-Ggw_",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The Math MMLU score for the 65B version of Llama 2 is 78.9.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 38
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"How does Llama 2 perform on GSM8K (8-shot) compared to PaLM-2-L?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "P7yDMdNxGg5E",
"outputId": "51ad43c2-8286-4783-9692-767b5a4eafe8"
},
"id": "P7yDMdNxGg5E",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The text does not provide specific information on how Llama 2 performs on GSM8K (8-shot) compared to PaLM-2-L.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 39
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"How does the ToxiGen score of the 7B MPT Falcon compare to the 70B Llama 2?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "kRxF5XptGg89",
"outputId": "8e529317-ceb4-4b24-e5f9-3b3594243fe9"
},
"id": "kRxF5XptGg89",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The ToxiGen score of the 7B MPT Falcon is 22.61, while the 70B Llama 2 has a ToxiGen score of 24.60.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 40
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"For the dataset with an average of 1.0 turn per dialogue, what is the average number of tokens per example?\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "mPxSnQlSG4Zn",
"outputId": "15ed8b8e-3b16-4363-8ece-5d0ca0e31253"
},
"id": "mPxSnQlSG4Zn",
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'The dataset with an average of 1.0 turn per dialogue has an average of 371.1 tokens per example.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 41
}
]
},
{
"cell_type": "code",
"source": [
"chain.invoke(\"What is the average number of turns per dialogue for the dataset with the highest average number of tokens per example?\")"
],
"metadata": {
"id": "2SyTMhU6NDuk"
},
"id": "2SyTMhU6NDuk",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "37f46054-e239-4ba8-af81-22d0d6a9bc32",
"metadata": {
"id": "37f46054-e239-4ba8-af81-22d0d6a9bc32"
},
"source": [
"We can check the [trace](https://smith.langchain.com/public/4739ae7c-1a13-406d-bc4e-3462670ebc01/r) to see what chunks were retrieved:\n",
"\n",
"This includes Table 1 of the paper, showing the Tokens used for training.\n",
"\n",
"```\n",
"Training Data Params Context GQA Tokens LR Length 7B 2k 1.0T 3.0x 10-4 See Touvron et al. 13B 2k 1.0T 3.0 x 10-4 LiaMa 1 (2023) 33B 2k 14T 1.5 x 10-4 65B 2k 1.4T 1.5 x 10-4 7B 4k 2.0T 3.0x 10-4 Liama 2 A new mix of publicly 13B 4k 2.0T 3.0 x 10-4 available online data 34B 4k v 2.0T 1.5 x 10-4 70B 4k v 2.0T 1.5 x 10-4\n",
"```"
]
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# Questions based on the tables provided earlier\n",
"questions = [\n",
" \"What is the number of training tokens for the 33B version of Llama 2?\",\n",
" \"Does the 70B version of Llama 2 use the GQA (Guided Question Answering) format?\",\n",
" \"What is the learning rate for the 7B version of Llama 1?\",\n",
" \"What is the Commonsense score for the 40B version of Llama 2?\",\n",
" \"How does the World Reading score of the 30B Model MPT Falcon compare to the 70B Llama 1?\",\n",
" \"What is the Math MMLU score for the 65B version of Llama 2?\",\n",
" \"What is the performance score of GPT-4 on Natural Questions (1-shot)?\",\n",
" \"How does Llama 2 perform on GSM8K (8-shot) compared to PaLM-2-L?\",\n",
" \"What is the score of PaLM on the BIG-Bench Hard (3-shot) benchmark?\",\n",
" \"What is the average number of turns per dialogue for the dataset with the highest average number of tokens per example?\",\n",
" \"Which dataset has the lowest average number of tokens in the response, and what is this average?\",\n",
" \"For the dataset with an average of 1.0 turn per dialogue, what is the average number of tokens per example?\",\n",
" \"What is the TruthfulQA score for the 65B version of Llama 1?\",\n",
" \"How does the ToxiGen score of the 7B MPT Falcon compare to the 70B Llama 2?\",\n",
" \"What is the TruthfulQA score for the 13B version of Llama 2?\"\n",
"]\n",
"\n",
"# Correct answers for the questions\n",
"correct_answers = [\n",
" \"1.4 trillion training tokens\", \"Yes\", \"3.0 × 10^−4\", \"69.2\", \"The 30B Model MPT Falcon scores 64.9, while the 70B Llama 1 scores 71.9\",\n",
" \"60.5\", \"86.1\", \"Llama 2 scores 56.8, PaLM-2-L scores 80.7\", \"65.7\", \"3.9 turns per dialogue\", \"35.1 tokens\",\n",
" \"Ranges from 237.2 to 440.2 tokens\", \"48.71\", \"7B MPT Falcon scores 22.32, 70B Llama 2 scores 24.60\", \"41.86\"\n",
"]\n",
"\n",
"# List to store LLM answers\n",
"llm_answers = []\n",
"\n",
"# Looping through the questions and getting LLM answers\n",
"for question in questions:\n",
" llm_answer = chain.invoke(question)\n",
" llm_answers.append(llm_answer)\n",
"# Simulating LLM responses (since I can't actually call an LLM here)\n",
"# For demonstration, assume all answers from the LLM are correct\n",
"#llm_responses = correct_answers # In a real scenario, this would be replaced with actual LLM responses\n",
"\n",
"# Comparing LLM responses with the correct answers\n",
"results = []\n",
"for q, correct_answer, llm_answer in zip(questions, correct_answers, llm_answers):\n",
" result = \"Correct\" if correct_answer == llm_answer else \"Failed\"\n",
" results.append({\n",
" \"Question\": q,\n",
" \"Expected Answer\": correct_answer,\n",
" \"LLM Answer\": llm_answer,\n",
" \"Result\": result\n",
" })\n",
"\n",
"# Creating a DataFrame\n",
"df_results = pd.DataFrame(results)\n",
"\n",
"# Print the DataFrame\n",
"print(df_results)"
],
"metadata": {
"id": "kR5O1zDUCzje",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "1923ffd0-655c-4217-e129-2beca32dd4b0"
},
"id": "kR5O1zDUCzje",
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Question \\\n",
"0 What is the number of training tokens for the ... \n",
"1 Does the 70B version of Llama 2 use the GQA (G... \n",
"2 What is the learning rate for the 7B version o... \n",
"3 What is the Commonsense score for the 40B vers... \n",
"4 How does the World Reading score of the 30B Mo... \n",
"5 What is the Math MMLU score for the 65B versio... \n",
"6 What is the performance score of GPT-4 on Natu... \n",
"7 How does Llama 2 perform on GSM8K (8-shot) com... \n",
"8 What is the score of PaLM on the BIG-Bench Har... \n",
"9 What is the average number of turns per dialog... \n",
"10 Which dataset has the lowest average number of... \n",
"11 For the dataset with an average of 1.0 turn pe... \n",
"12 What is the TruthfulQA score for the 65B versi... \n",
"13 How does the ToxiGen score of the 7B MPT Falco... \n",
"14 What is the TruthfulQA score for the 13B versi... \n",
"\n",
" Expected Answer \\\n",
"0 1.4 trillion training tokens \n",
"1 Yes \n",
"2 3.0 × 10^−4 \n",
"3 69.2 \n",
"4 The 30B Model MPT Falcon scores 64.9, while th... \n",
"5 60.5 \n",
"6 86.1 \n",
"7 Llama 2 scores 56.8, PaLM-2-L scores 80.7 \n",
"8 65.7 \n",
"9 3.9 turns per dialogue \n",
"10 35.1 tokens \n",
"11 Ranges from 237.2 to 440.2 tokens \n",
"12 48.71 \n",
"13 7B MPT Falcon scores 22.32, 70B Llama 2 scores... \n",
"14 41.86 \n",
"\n",
" LLM Answer Result \n",
"0 The 33B version of Llama 2 was trained on 2.0 ... Failed \n",
"1 Yes, the 70B version of Llama 2 uses the GQA (... Failed \n",
"2 The text does not provide information on the l... Failed \n",
"3 The text does not provide information on the C... Failed \n",
"4 The text does not provide information on the W... Failed \n",
"5 The text does not provide information on the M... Failed \n",
"6 The text does not provide information on the p... Failed \n",
"7 The text does not provide specific information... Failed \n",
"8 The score of PaLM on the BIG-Bench Hard (3-sho... Failed \n",
"9 The dataset with the highest average number of... Failed \n",
"10 The dataset with the lowest average number of ... Failed \n",
"11 The dataset with an average of 1.0 turn per di... Failed \n",
"12 The TruthfulQA score for the 65B version of Ll... Failed \n",
"13 The ToxiGen score of the 7B MPT Falcon is 14.5... Failed \n",
"14 The TruthfulQA score for the 13B version of Ll... Failed \n"
]
}
]
}
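,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exact-string comparison above marks every row as \"Failed\", even where the model's answer agrees with the expected one, because the two strings are never literally identical. A gentler check is to look for the expected figures inside the model's answer. The sketch below is only a rough heuristic (a proper setup would use something like an LLM-as-judge grader) and assumes the `correct_answers` and `llm_answers` lists built above:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def loosely_correct(expected: str, answer: str) -> bool:\n",
"    # If the expected answer contains numbers, require all of them in the model answer;\n",
"    # otherwise fall back to a case-insensitive substring check.\n",
"    numbers = re.findall(r\"\\d+(?:\\.\\d+)?\", expected)\n",
"    if numbers:\n",
"        return all(n in answer for n in numbers)\n",
"    return expected.lower() in answer.lower()\n",
"\n",
"loose_results = [loosely_correct(c, a) for c, a in zip(correct_answers, llm_answers)]\n",
"print(f\"Loosely correct: {sum(loose_results)}/{len(loose_results)}\")\n",
"```"
]
}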
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"colab": {
"provenance": [],
"machine_shape": "hm",
"gpuType": "T4",
"include_colab_link": true
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"2414b346e197486fa3ec261eaafda277": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_c44ad3dfaa1d43b5b88a3443508d37c9",
"IPY_MODEL_2dfe21d7b0d541da86fdceb86303aa4b",
"IPY_MODEL_3d9d125614844a50857210c5c76db292"
],
"layout": "IPY_MODEL_a45028a5ec0347c99f780a83b9c3b1a2"
}
},
"c44ad3dfaa1d43b5b88a3443508d37c9": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_6122a72d4b0b4a06a60e03cfba2828e5",
"placeholder": "​",
"style": "IPY_MODEL_f539fe338e1c4df4afa61c060e6168c6",
"value": "config.json: 100%"
}
},
"2dfe21d7b0d541da86fdceb86303aa4b": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_036c66235c7644f181d021e64da90afd",
"max": 1469,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_529a898eff1648538816e73d8e39a678",
"value": 1469
}
},
"3d9d125614844a50857210c5c76db292": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_632910e4507e4f51b3cfa6c9c877e42d",
"placeholder": "​",
"style": "IPY_MODEL_4f3e28c3cdb64e72b3f9bfd5f29a30b6",
"value": " 1.47k/1.47k [00:00&lt;00:00, 40.6kB/s]"
}
},
"a45028a5ec0347c99f780a83b9c3b1a2": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"6122a72d4b0b4a06a60e03cfba2828e5": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"f539fe338e1c4df4afa61c060e6168c6": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"036c66235c7644f181d021e64da90afd": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"529a898eff1648538816e73d8e39a678": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"632910e4507e4f51b3cfa6c9c877e42d": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"4f3e28c3cdb64e72b3f9bfd5f29a30b6": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"978c6b3be9484f42850fa61702319a53": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_baa7ba9b80bd44f5afcaea7784f44c13",
"IPY_MODEL_1862cdfbd1f94bba881b7bab356d6435",
"IPY_MODEL_cba9ca4311844f3d8c0609382c1cbd04"
],
"layout": "IPY_MODEL_e8de0a5cb25d46e6875f9c3297fa9bcd"
}
},
"baa7ba9b80bd44f5afcaea7784f44c13": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_73d4003bf1a34370a57b9e12b823602b",
"placeholder": "​",
"style": "IPY_MODEL_b1f644ccd2274c8abb7517d09ffc0f55",
"value": "model.safetensors: 100%"
}
},
"1862cdfbd1f94bba881b7bab356d6435": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_12ae4e014abc4e29811b85a432773585",
"max": 115434268,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_24c252c09a0247d78759f47601925606",
"value": 115434268
}
},
"cba9ca4311844f3d8c0609382c1cbd04": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_b4151ced4b3044d9b78239e0d8d8f84e",
"placeholder": "​",
"style": "IPY_MODEL_0781be1837704abbb819a32954116868",
"value": " 115M/115M [00:00&lt;00:00, 126MB/s]"
}
},
"e8de0a5cb25d46e6875f9c3297fa9bcd": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"73d4003bf1a34370a57b9e12b823602b": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"b1f644ccd2274c8abb7517d09ffc0f55": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"12ae4e014abc4e29811b85a432773585": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"24c252c09a0247d78759f47601925606": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"b4151ced4b3044d9b78239e0d8d8f84e": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"0781be1837704abbb819a32954116868": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"57b0d6e3c92a4c18b007691ebf2dfe1d": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_1232eb4459124199b4f89e39a53e2676",
"IPY_MODEL_9b3f9d1288a14edfb427cfded0afce69",
"IPY_MODEL_7d78a77b0006457a89fb3e888200299c"
],
"layout": "IPY_MODEL_35c1ac9c1c0948938dbf245634c32c11"
}
},
"1232eb4459124199b4f89e39a53e2676": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_864da1504d0847919eaae3936273232d",
"placeholder": "​",
"style": "IPY_MODEL_738425dc04df44a6a8eee244560d3f36",
"value": "model.safetensors: 100%"
}
},
"9b3f9d1288a14edfb427cfded0afce69": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_9bcf29de053145c0936b610b4168b968",
"max": 46807446,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_59252baca46e4262a4583731ccd4c8fb",
"value": 46807446
}
},
"7d78a77b0006457a89fb3e888200299c": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_914cfc1097f0469db308542d35b0baa8",
"placeholder": "​",
"style": "IPY_MODEL_ea1d06499e3e4690aaba0f867a2bebd9",
"value": " 46.8M/46.8M [00:00&lt;00:00, 136MB/s]"
}
},
"35c1ac9c1c0948938dbf245634c32c11": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"864da1504d0847919eaae3936273232d": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"738425dc04df44a6a8eee244560d3f36": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"9bcf29de053145c0936b610b4168b968": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"59252baca46e4262a4583731ccd4c8fb": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"914cfc1097f0469db308542d35b0baa8": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"ea1d06499e3e4690aaba0f867a2bebd9": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
}
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}