wey-gu/KnowledgeGraphIndex_vs_VectorStoreIndex_vs_CustomIndex_combined.ipynb

## KnowledgeGraphIndex_vs_VectorStoreIndex_vs_CustomIndex_combined.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "105a2123",
   "metadata": {},
   "source": [
    "## Custom Index combining KG Index and VectorStore Index\n",
    "\n",
    "Now let's demo how KG Index could be used. We will create a VectorStore Index, KG Index and a Custom Index combining the two.\n",
    "\n",
    "Below digrams are showing how in-context learning works:\n",
    "\n",
    "```\n",
    "          in-context learning with Llama Index\n",
    "                  ┌────┬────┬────┬────┐                  \n",
    "                  │ 1  │ 2  │ 3  │ 4  │                  \n",
    "                  ├────┴────┴────┴────┤                  \n",
    "                  │  Docs/Knowledge   │                  \n",
    "┌───────┐         │        ...        │       ┌─────────┐\n",
    "│       │         ├────┬────┬────┬────┤       │         │\n",
    "│       │         │ 95 │ 96 │    │    │       │         │\n",
    "│       │         └────┴────┴────┴────┘       │         │\n",
    "│ User  │─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─▶   LLM   │\n",
    "│       │                                     │         │\n",
    "│       │                                     │         │\n",
    "└───────┘    ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐  └─────────┘\n",
    "    │          ┌──────────────────────────┐        ▲     \n",
    "    └────────┼▶│  Tell me ....., please   │├───────┘     \n",
    "               └──────────────────────────┘              \n",
    "             │ ┌────┐ ┌────┐               │             \n",
    "               │ 3  │ │ 96 │                             \n",
    "             │ └────┘ └────┘               │             \n",
    "              ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ \n",
    "```\n",
    "\n",
    "With VectorStoreIndex, we create embeddings of each node(chunk), and find TopK related ones towards a given question during the query. In the above diagram, nodes `3` and `96` were fetched as the TopK related nodes, used to help answer the user query. \n",
    "\n",
    "With KG Index, we will extract relationships between entities, representing concise facts from each node. It would look something like this:\n",
    "\n",
    "```\n",
    "Node Split and Embedding\n",
    "\n",
    "┌────┬────┬────┬────┐\n",
    "│ 1  │ 2  │ 3  │ 4  │\n",
    "├────┴────┴────┴────┤\n",
    "│  Docs/Knowledge   │\n",
    "│        ...        │\n",
    "├────┬────┬────┬────┤\n",
    "│ 95 │ 96 │    │    │\n",
    "└────┴────┴────┴────┘\n",
    "```\n",
    "\n",
    "Then, if we zoomed in of it:\n",
    "\n",
    "```\n",
    "       Node Split and Embedding, with Knowledge Graph being extracted\n",
    "\n",
    "┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐\n",
    "│ .─.       .─.    │  .─.       .─.   │            .─.   │  .─.       .─.   │\n",
    "│( x )─────▶ y )   │ ( x )─────▶ a )  │           ( j )  │ ( m )◀────( x )  │\n",
    "│ `▲'       `─'    │  `─'       `─'   │            `─'   │  `─'       `─'   │\n",
    "│  │     1         │        2         │        3    │    │        4         │\n",
    "│ .─.              │                  │            .▼.   │                  │\n",
    "│( z )─────────────┼──────────────────┼──────────▶( i )─┐│                  │\n",
    "│ `◀────┐          │                  │            `─'  ││                  │\n",
    "├───────┼──────────┴──────────────────┴─────────────────┼┴──────────────────┤\n",
    "│       │                      Docs/Knowledge           │                   │\n",
    "│       │                            ...                │                   │\n",
    "│       │                                               │                   │\n",
    "├───────┼──────────┬──────────────────┬─────────────────┼┬──────────────────┤\n",
    "│  .─.  └──────.   │  .─.             │                 ││  .─.             │\n",
    "│ ( x ◀─────( b )  │ ( x )            │                 └┼▶( n )            │\n",
    "│  `─'       `─'   │  `─'             │                  │  `─'             │\n",
    "│        95   │    │   │    96        │                  │   │    98        │\n",
    "│            .▼.   │  .▼.             │                  │   ▼              │\n",
    "│           ( c )  │ ( d )            │                  │  .─.             │\n",
    "│            `─'   │  `─'             │                  │ ( x )            │\n",
    "└──────────────────┴──────────────────┴──────────────────┴──`─'─────────────┘\n",
    "```\n",
    "\n",
    "Where, knowledge, the more granular spliting and information with higher density, optionally multi-hop of `x -> y`, `i -> j -> z -> x` etc... across many more nodes(chunks) than K(in TopK search) could be inlucded in Retrievers. And we believe there are cases that this additional work matters.\n",
    "\n",
    "Let's show examples of that now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "061a39e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For OpenAI\n",
    "\n",
    "import os\n",
    "os.environ['OPENAI_API_KEY'] = \"INSERT OPENAI KEY\"\n",
    "\n",
    "import logging\n",
    "import sys\n",
    "\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO) # logging.DEBUG for more verbose output\n",
    "\n",
    "from llama_index import (\n",
    "    KnowledgeGraphIndex,\n",
    "    LLMPredictor,\n",
    "    ServiceContext,\n",
    "    SimpleDirectoryReader,\n",
    ")\n",
    "from llama_index.storage.storage_context import StorageContext\n",
    "from llama_index.graph_stores import NebulaGraphStore\n",
    "\n",
    "\n",
    "from langchain import OpenAI\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "\n",
    "# define LLM\n",
    "# NOTE: at the time of demo, text-davinci-002 did not have rate-limit errors\n",
    "llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name=\"text-davinci-002\"))\n",
    "service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5363496e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For Azure OpenAI\n",
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import os\n",
    "import json\n",
    "import openai\n",
    "from langchain.llms import AzureOpenAI\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from llama_index import LangchainEmbedding\n",
    "from llama_index import (\n",
    "    VectorStoreIndex,\n",
    "    SimpleDirectoryReader,\n",
    "    KnowledgeGraphIndex,\n",
    "    LLMPredictor,\n",
    "    ServiceContext\n",
    ")\n",
    "\n",
    "from llama_index.storage.storage_context import StorageContext\n",
    "from llama_index.graph_stores import NebulaGraphStore\n",
    "\n",
    "import logging\n",
    "import sys\n",
    "\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO) # logging.DEBUG for more verbose output\n",
    "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "\n",
    "openai.api_type = \"azure\"\n",
    "openai.api_base = \"https://<foo-bar>.openai.azure.com\"\n",
    "openai.api_version = \"2022-12-01\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"youcannottellanyone\"\n",
    "openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
    "\n",
    "llm = AzureOpenAI(\n",
    "    deployment_name=\"<foo-bar-deployment>\",\n",
    "    temperature=0,\n",
    "    model_kwargs={\n",
    "        \"api_key\": openai.api_key,\n",
    "        \"api_base\": openai.api_base,\n",
    "        \"api_type\": openai.api_type,\n",
    "        \"api_version\": openai.api_version,\n",
    "    }\n",
    ")\n",
    "llm_predictor = LLMPredictor(llm=llm)\n",
    "\n",
    "# You need to deploy your own embedding model as well as your own chat completion model\n",
    "embedding_llm = LangchainEmbedding(\n",
    "    OpenAIEmbeddings(\n",
    "        model=\"text-embedding-ada-002\",\n",
    "        deployment=\"nebula_docs_2_embedding\",\n",
    "        openai_api_key=openai.api_key,\n",
    "        openai_api_base=openai.api_base,\n",
    "        openai_api_type=openai.api_type,\n",
    "        openai_api_version=openai.api_version,\n",
    "    ),\n",
    "    embed_batch_size=1,\n",
    ")\n",
    "\n",
    "service_context = ServiceContext.from_defaults(\n",
    "    llm_predictor=llm_predictor,\n",
    "    embed_model=embedding_llm,\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b17442e",
   "metadata": {},
   "source": [
    "## Prepare for NebulaGraph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6cf0e92",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install nebula3-python\n",
    "\n",
    "os.environ['NEBULA_USER'] = \"root\"\n",
    "os.environ['NEBULA_PASSWORD'] = \"nebula\"\n",
    "os.environ['NEBULA_ADDRESS'] = \"127.0.0.1:9669\" # assumed we have NebulaGraph installed locally\n",
    "\n",
    "# Assume that the graph has already been created\n",
    "    # Create a NebulaGraph cluster with:\n",
    "    # Option 0: `curl -fsSL nebula-up.siwei.io/install.sh | bash`\n",
    "    # Option 1: NebulaGraph Docker Extension https://hub.docker.com/extensions/weygu/nebulagraph-dd-ext\n",
    "# and that the graph space is called \"test\"\n",
    "    # If not, create it with the following commands from NebulaGraph's console:\n",
    "    # CREATE SPACE llamaindex(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);\n",
    "    # :sleep 10;\n",
    "    # USE llamaindex;\n",
    "    # CREATE TAG entity();\n",
    "    # CREATE EDGE rel(predicate string);\n",
    "\n",
    "space_name = \"llamaindex\"\n",
    "edge_types, rel_prop_names = [\"rel\"], [\"predicate\"] # default, could be omit if create from an empty kg\n",
    "tags = [\"entity\"] # default, could be omit if create from an empty kg"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9cb7083",
   "metadata": {},
   "source": [
    "## Load Data from Wikipedia"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c376da93",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import download_loader\n",
    "\n",
    "WikipediaReader = download_loader(\"WikipediaReader\")\n",
    "\n",
    "loader = WikipediaReader()\n",
    "\n",
    "documents = loader.load_data(pages=['2023 in science'], auto_suggest=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc21c03b",
   "metadata": {},
   "source": [
    "## Create KnowledgeGraphIndex Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c05437d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 21204 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 21204 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 21204 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 3953 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 3953 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 3953 tokens\n"
     ]
    }
   ],
   "source": [
    "graph_store = NebulaGraphStore(space_name=space_name, edge_types=edge_types, rel_prop_names=rel_prop_names, tags=tags)\n",
    "storage_context = StorageContext.from_defaults(graph_store=graph_store)\n",
    "\n",
    "kg_index = KnowledgeGraphIndex.from_documents(\n",
    "    documents,\n",
    "    storage_context=storage_context,\n",
    "    max_triplets_per_chunk=10,\n",
    "    service_context=service_context,\n",
    "    space_name=space_name,\n",
    "    edge_types=edge_types,\n",
    "    rel_prop_names=rel_prop_names,\n",
    "    tags=tags,\n",
    "    include_embeddings=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c3efdc4",
   "metadata": {},
   "source": [
    "## Create VectorStoreIndex Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "474c2251",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 15419 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 15419 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 15419 tokens\n"
     ]
    }
   ],
   "source": [
    "vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ac0ee74",
   "metadata": {},
   "source": [
    "## Define a CustomRetriever\n",
    "\n",
    "The purpose of this demo was to test the effectiveness of using Knowledge Graph queries for retrieving information that is distributed across multiple nodes in small pieces. To achieve this, we adopted a simple approach: performing retrieval on both sources and then combining them into a single context to be sent to LLM.\n",
    "\n",
    "Thanks to the flexible abstraction provided by Llama Index Retriever, implementing this approach was relatively straightforward. We created a new class called `CustomRetriever` which retrieves data from both `VectorIndexRetriever` and `KGTableRetriever`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a9516c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import QueryBundle\n",
    "from llama_index import QueryBundle\n",
    "# import NodeWithScore\n",
    "from llama_index.data_structs import NodeWithScore\n",
    "# Retrievers \n",
    "from llama_index.retrievers import BaseRetriever, VectorIndexRetriever, KGTableRetriever\n",
    "\n",
    "from typing import List\n",
    "\n",
    "\n",
    "class CustomRetriever(BaseRetriever):\n",
    "    \"\"\"Custom retriever that performs both Vector search and Knowledge Graph search\"\"\"\n",
    "    \n",
    "    def __init__(\n",
    "        self,\n",
    "        vector_retriever: VectorIndexRetriever,\n",
    "        kg_retriever: KGTableRetriever,\n",
    "        mode: str = \"OR\"\n",
    "    ) -> None:\n",
    "        \"\"\"Init params.\"\"\"\n",
    "        \n",
    "        self._vector_retriever = vector_retriever\n",
    "        self._kg_retriever = kg_retriever\n",
    "        if mode not in (\"AND\", \"OR\"):\n",
    "            raise ValueError(\"Invalid mode.\")\n",
    "        self._mode = mode\n",
    "        \n",
    "    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]: \n",
    "        \"\"\"Retrieve nodes given query.\"\"\"\n",
    "        \n",
    "        vector_nodes = self._vector_retriever.retrieve(query_bundle)\n",
    "        kg_nodes = self._kg_retriever.retrieve(query_bundle)\n",
    "\n",
    "        vector_ids = {n.node.get_doc_id() for n in vector_nodes}\n",
    "        kg_ids = {n.node.get_doc_id() for n in kg_nodes}\n",
    "        \n",
    "        combined_dict = {n.node.get_doc_id(): n for n in vector_nodes}\n",
    "        combined_dict.update({n.node.get_doc_id(): n for n in kg_nodes})\n",
    "        \n",
    "        if self._mode == \"AND\":\n",
    "            retrieve_ids = vector_ids.intersection(kg_ids)\n",
    "        else:\n",
    "            retrieve_ids = vector_ids.union(kg_ids)\n",
    "\n",
    "        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]\n",
    "        return retrieve_nodes\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbe22b2b",
   "metadata": {},
   "source": [
    "Next, we will create instances of the Vector and KG retrievers, which will be used in the instantiation of the Custom Retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecb2d2bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import ResponseSynthesizer\n",
    "from llama_index.query_engine import RetrieverQueryEngine\n",
    "\n",
    "# create custom retriever\n",
    "vector_retriever = VectorIndexRetriever(index=vector_index)\n",
    "kg_retriever = KGTableRetriever(index=kg_index, retriever_mode='keyword', include_text=False)\n",
    "custom_retriever = CustomRetriever(vector_retriever, kg_retriever)\n",
    "\n",
    "# create response synthesizer\n",
    "response_synthesizer = ResponseSynthesizer.from_args(\n",
    "    service_context=service_context,\n",
    "    response_mode=\"tree_summarize\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "541608be",
   "metadata": {},
   "source": [
    "## Create Query Engines\n",
    "\n",
    "To enable comparsion, we also create `vector_query_engine`, `kg_keyword_query_engine` together with our `custom_query_engine`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc59b511",
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_query_engine = RetrieverQueryEngine(\n",
    "    retriever=custom_retriever,\n",
    "    response_synthesizer=response_synthesizer,\n",
    ")\n",
    "\n",
    "vector_query_engine = vector_index.as_query_engine()\n",
    "\n",
    "kg_keyword_query_engine = kg_index.as_query_engine(\n",
    "    # setting to false uses the raw triplets instead of adding the text from the corresponding nodes\n",
    "    include_text=False,  \n",
    "    retriever_mode='keyword',\n",
    "    response_mode=\"tree_summarize\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6cd7e11",
   "metadata": {},
   "source": [
    "## Query with different retrievers\n",
    "\n",
    "With the above query engines created for corresponding retrievers, let's see how they perform.\n",
    "\n",
    "First, we go with the pure knowledge graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a09de04",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.indices.knowledge_graph.retriever:> Starting query: Tell me events about NASA\n",
      "> Starting query: Tell me events about NASA\n",
      "> Starting query: Tell me events about NASA\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Query keywords: ['NASA', 'events']\n",
      "> Query keywords: ['NASA', 'events']\n",
      "> Query keywords: ['NASA', 'events']\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "nasa ['public release date', 'mid-2023']\n",
      "nasa ['announces', 'future space telescope programs']\n",
      "nasa ['publishes images of', 'debris disk']\n",
      "nasa ['discovers', 'exoplanet lhs 475 b']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "nasa ['public release date', 'mid-2023']\n",
      "nasa ['announces', 'future space telescope programs']\n",
      "nasa ['publishes images of', 'debris disk']\n",
      "nasa ['discovers', 'exoplanet lhs 475 b']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "nasa ['public release date', 'mid-2023']\n",
      "nasa ['announces', 'future space telescope programs']\n",
      "nasa ['publishes images of', 'debris disk']\n",
      "nasa ['discovers', 'exoplanet lhs 475 b']\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 159 tokens\n",
      "> [get_response] Total LLM token usage: 159 tokens\n",
      "> [get_response] Total LLM token usage: 159 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 159 tokens\n",
      "> [get_response] Total LLM token usage: 159 tokens\n",
      "> [get_response] Total LLM token usage: 159 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "NASA announced future space telescope programs in mid-2023, published images of a debris disk, and discovered an exoplanet called LHS 475 b.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = kg_keyword_query_engine.query(\n",
    "    \"Tell me events about NASA\"\n",
    ")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adb2311e",
   "metadata": {},
   "source": [
    "Then the vector store approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df1d1976",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 5 tokens\n",
      "> [retrieve] Total embedding token usage: 5 tokens\n",
      "> [retrieve] Total embedding token usage: 5 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1892 tokens\n",
      "> [get_response] Total LLM token usage: 1892 tokens\n",
      "> [get_response] Total LLM token usage: 1892 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "NASA scientists report evidence for the existence of a second Kuiper Belt, which the New Horizons spacecraft could potentially visit during the late 2020s or early 2030s.\n",
       "NASA is expected to release the first study on UAP in mid-2023.\n",
       "NASA's Venus probe is scheduled to be launched and to arrive on Venus in October, partly to search for signs of life on Venus.\n",
       "NASA is expected to start the Vera Rubin Observatory, the Qitai Radio Telescope, the European Spallation Source and the Jiangmen Underground Neutrino.\n",
       "NASA scientists suggest that a space sunshade could be created by mining the lunar soil and launching it towards the Sun to form a shield against global warming.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = vector_query_engine.query(\n",
    "    \"Tell me events about NASA\"\n",
    ")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ac8c99a",
   "metadata": {},
   "source": [
    "Finally, let's do with the one with both vector store and knowledge graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb325682",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 5 tokens\n",
      "> [retrieve] Total embedding token usage: 5 tokens\n",
      "> [retrieve] Total embedding token usage: 5 tokens\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Starting query: Tell me events about NASA\n",
      "> Starting query: Tell me events about NASA\n",
      "> Starting query: Tell me events about NASA\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Query keywords: ['NASA', 'events']\n",
      "> Query keywords: ['NASA', 'events']\n",
      "> Query keywords: ['NASA', 'events']\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "nasa ['public release date', 'mid-2023']\n",
      "nasa ['announces', 'future space telescope programs']\n",
      "nasa ['publishes images of', 'debris disk']\n",
      "nasa ['discovers', 'exoplanet lhs 475 b']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "nasa ['public release date', 'mid-2023']\n",
      "nasa ['announces', 'future space telescope programs']\n",
      "nasa ['publishes images of', 'debris disk']\n",
      "nasa ['discovers', 'exoplanet lhs 475 b']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "nasa ['public release date', 'mid-2023']\n",
      "nasa ['announces', 'future space telescope programs']\n",
      "nasa ['publishes images of', 'debris disk']\n",
      "nasa ['discovers', 'exoplanet lhs 475 b']\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2046 tokens\n",
      "> [get_response] Total LLM token usage: 2046 tokens\n",
      "> [get_response] Total LLM token usage: 2046 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2046 tokens\n",
      "> [get_response] Total LLM token usage: 2046 tokens\n",
      "> [get_response] Total LLM token usage: 2046 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "NASA announces future space telescope programs on May 21.\n",
       "NASA publishes images of debris disk on May 23.\n",
       "NASA discovers exoplanet LHS 475 b on May 25.\n",
       "NASA scientists present evidence for the existence of a second Kuiper Belt on May 29.\n",
       "NASA confirms the start of the next El Niño on June 8.\n",
       "NASA produces the first X-ray of a single atom on May 31.\n",
       "NASA reports the first successful beaming of solar energy from space down to a receiver on the ground on June 1.\n",
       "NASA scientists report evidence that Earth may have formed in just three million years on June 14.\n",
       "NASA scientists report the presence of phosphates on Enceladus, moon of the planet Saturn, on June 14.\n",
       "NASA's Venus probe is scheduled to be launched and to arrive on Venus in October.\n",
       "NASA's MBR Explorer is announced by the United Arab Emirates Space Agency on May 29.\n",
       "NASA's Vera Rubin Observatory is expected to start in 2023.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = custom_query_engine.query(\n",
    "    \"Tell me events about NASA\"\n",
    ")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb2eb936",
   "metadata": {},
   "source": [
    "## Comparison of results\n",
    "\n",
    "Let's put results together with their LLM tokens during the query process:\n",
    "\n",
    "> Tell me events about NASA.\n",
    "\n",
    "|        | VectorStore                                                  | Knowledge Graph + VectorStore                                | Knowledge Graph                                              |\n",
    "| ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |\n",
    "| Answer | NASA scientists report evidence for the existence of a second Kuiper Belt, which the New Horizons spacecraft could potentially visit during the late 2020s or early 2030s. NASA is expected to release the first study on UAP in mid-2023. NASA's Venus probe is scheduled to be launched and to arrive on Venus in October, partly to search for signs of life on Venus. NASA is expected to start the Vera Rubin Observatory, the Qitai Radio Telescope, the European Spallation Source and the Jiangmen Underground Neutrino. NASA scientists suggest that a space sunshade could be created by mining the lunar soil and launching it towards the Sun to form a shield against global warming. | NASA announces future space telescope programs on May 21. **NASA publishes images of debris disk on May 23. NASA discovers exoplanet LHS 475 b on May 25.** NASA scientists present evidence for the existence of a second Kuiper Belt on May 29. NASA confirms the start of the next El Niño on June 8. NASA produces the first X-ray of a single atom on May 31. NASA reports the first successful beaming of solar energy from space down to a receiver on the ground on June 1. NASA scientists report evidence that Earth may have formed in just three million years on June 14. NASA scientists report the presence of phosphates on Enceladus, moon of the planet Saturn, on June 14. NASA's Venus probe is scheduled to be launched and to arrive on Venus in October. NASA's MBR Explorer is announced by the United Arab Emirates Space Agency on May 29. NASA's Vera Rubin Observatory is expected to start in 2023. | NASA announced future space telescope programs in mid-2023, published images of a debris disk, and discovered an exoplanet called LHS 475 b. |\n",
    "| Cost   | 1897 tokens                                                  | 2046 Tokens                                                  | 159 Tokens                                                   |\n",
    "\n",
    "\n",
    "And we could see there are indeed some knowledges added with the help of Knowledge Graph retriever:\n",
    "\n",
    "- NASA publishes images of debris disk on May 23.\n",
    "- NASA discovers exoplanet LHS 475 b on May 25.\n",
    "\n",
    "The additional cost, however, does not seem to be very significant, at `7.28%`: `(2046-1897)/2046`.\n",
    "\n",
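    "Written out, the overhead can be measured two ways. The first line is the figure quoted above; the second, relative to the VectorStore-only baseline, is an added calculation from the same two token counts:\n",
    "\n",
    "$$\n",
    "\\frac{2046 - 1897}{2046} = \\frac{149}{2046} \\approx 7.28\\%\n",
    "$$\n",
    "\n",
    "$$\n",
    "\\frac{2046 - 1897}{1897} = \\frac{149}{1897} \\approx 7.85\\%\n",
    "$$\n",
    "\n",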
    "Furthermore, the answer from the knowledge graph alone is extremely concise (only 159 tokens used!) but still informative."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ecc23e9",
   "metadata": {},
   "source": [
    "## Not all cases are advantageous\n",
    "\n",
    "Of course, many other questions do not depend on fine-grained pieces of knowledge scattered across chunks. In these cases, the extra Knowledge Graph retriever may not be that helpful. Let's try this question: \"Tell me events about ChatGPT\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5312e43b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 7 tokens\n",
      "> [retrieve] Total embedding token usage: 7 tokens\n",
      "> [retrieve] Total embedding token usage: 7 tokens\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Starting query: Tell me events about ChatGPT\n",
      "> Starting query: Tell me events about ChatGPT\n",
      "> Starting query: Tell me events about ChatGPT\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Query keywords: ['events', 'ChatGPT']\n",
      "> Query keywords: ['events', 'ChatGPT']\n",
      "> Query keywords: ['events', 'ChatGPT']\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "chatgpt ['is', 'language model']\n",
      "chatgpt ['outperform', 'human doctors']\n",
      "chatgpt ['has', '100 million active users']\n",
      "chatgpt ['released on', '30 nov 2022']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "chatgpt ['is', 'language model']\n",
      "chatgpt ['outperform', 'human doctors']\n",
      "chatgpt ['has', '100 million active users']\n",
      "chatgpt ['released on', '30 nov 2022']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "chatgpt ['is', 'language model']\n",
      "chatgpt ['outperform', 'human doctors']\n",
      "chatgpt ['has', '100 million active users']\n",
      "chatgpt ['released on', '30 nov 2022']\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2045 tokens\n",
      "> [get_response] Total LLM token usage: 2045 tokens\n",
      "> [get_response] Total LLM token usage: 2045 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2045 tokens\n",
      "> [get_response] Total LLM token usage: 2045 tokens\n",
      "> [get_response] Total LLM token usage: 2045 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "ChatGPT is a chatbot and text-generating AI released on 30 November 2022. It quickly became highly popular, with some estimating that only two months after its launch, it had 100 million active users. Potential applications of ChatGPT include solving or supporting school writing assignments, malicious social bots (e.g. for misinformation, propaganda, and scams), and providing inspiration (e.g. for artistic writing or in design or ideation in general). There was extensive media coverage of views that regard ChatGPT as a potential step towards AGI or sentient machines, also extending to some academic works. Google released chatbot Bard due to effects of the ChatGPT release, with potential for integration into its Web search and, like ChatGPT software, also as a software development helper tool (21 Mar). DuckDuckGo released the DuckAssist feature integrated into its search engine that summarizes information from Wikipedia to answer search queries that are questions (8 Mar). The experimental feature was shut down without explanation on 12 April. Around the same time, a proprietary feature by scite.ai was released that delivers answers that use research papers and provide citations for the quoted paper(s). An open letter \"Pause Giant AI Experiments\" by the Future of Life</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = custom_query_engine.query(\n",
    "    \"Tell me events about ChatGPT\"\n",
    ")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92120738",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.indices.knowledge_graph.retriever:> Starting query: Tell me events about ChatGPT\n",
      "> Starting query: Tell me events about ChatGPT\n",
      "> Starting query: Tell me events about ChatGPT\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Query keywords: ['events', 'ChatGPT']\n",
      "> Query keywords: ['events', 'ChatGPT']\n",
      "> Query keywords: ['events', 'ChatGPT']\n",
      "INFO:llama_index.indices.knowledge_graph.retriever:> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "chatgpt ['is', 'language model']\n",
      "chatgpt ['outperform', 'human doctors']\n",
      "chatgpt ['has', '100 million active users']\n",
      "chatgpt ['released on', '30 nov 2022']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "chatgpt ['is', 'language model']\n",
      "chatgpt ['outperform', 'human doctors']\n",
      "chatgpt ['has', '100 million active users']\n",
      "chatgpt ['released on', '30 nov 2022']\n",
      "> Extracted relationships: The following are knowledge triplets in max depth 2 in the form of `subject [predicate, object, predicate_next_hop, object_next_hop ...]`\n",
      "chatgpt ['is', 'language model']\n",
      "chatgpt ['outperform', 'human doctors']\n",
      "chatgpt ['has', '100 million active users']\n",
      "chatgpt ['released on', '30 nov 2022']\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 150 tokens\n",
      "> [get_response] Total LLM token usage: 150 tokens\n",
      "> [get_response] Total LLM token usage: 150 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 150 tokens\n",
      "> [get_response] Total LLM token usage: 150 tokens\n",
      "> [get_response] Total LLM token usage: 150 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "ChatGPT is a language model that outperforms human doctors and has 100 million active users. It was released on 30 November 2022.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = kg_keyword_query_engine.query(\n",
    "    \"Tell me events about ChatGPT\"\n",
    ")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aee74efa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 7 tokens\n",
      "> [retrieve] Total embedding token usage: 7 tokens\n",
      "> [retrieve] Total embedding token usage: 7 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1956 tokens\n",
      "> [get_response] Total LLM token usage: 1956 tokens\n",
      "> [get_response] Total LLM token usage: 1956 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "ChatGPT (released on 30 Nov 2022) is a chatbot and text-generating AI, and a large language model that quickly became highly popular. It is estimated that only two months after its launch, it had 100 million active users. Applications may include solving or supporting school writing assignments, malicious social bots (e.g. for misinformation, propaganda, and scams), and providing inspiration (e.g. for artistic writing or in design or ideation in general).\n",
       "In response to the ChatGPT release, Google released chatbot Bard (21 Mar) with potential for integration into its Web search and, like ChatGPT software, also as a software development helper tool. DuckDuckGo released the DuckAssist feature integrated into its search engine that summarizes information from Wikipedia to answer search queries that are questions (8 Mar). The experimental feature was shut down without explanation on 12 April.\n",
       "Around the time, a proprietary feature by scite.ai was released that delivers answers that use research papers and provide citations for the quoted paper(s).\n",
       "An open letter \"Pause Giant AI Experiments\" by the Future of Life Institute calls for \"AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = vector_query_engine.query(\n",
    "    \"Tell me events about ChatGPT\"\n",
    ")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41ee9f9b",
   "metadata": {},
   "source": [
    "## Comparison of results\n",
    "\n",
    "We can see that adding the Knowledge Graph brings no distinct advantage for this question.\n",
    "\n",
    "> Question: Tell me events about ChatGPT.\n",
    "\n",
    "|        | VectorStore                                                  | Knowledge Graph + VectorStore                                | Knowledge Graph                                              |\n",
    "| ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |\n",
    "| Answer | ChatGPT (released on 30 Nov 2022) is a chatbot and text-generating AI, and a large language model that quickly became highly popular. It is estimated that only two months after its launch, it had 100 million active users. Applications may include solving or supporting school writing assignments, malicious social bots (e.g. for misinformation, propaganda, and scams), and providing inspiration (e.g. for artistic writing or in design or ideation in general). In response to the ChatGPT release, Google released chatbot Bard (21 Mar) with potential for integration into its Web search and, like ChatGPT software, also as a software development helper tool. DuckDuckGo released the DuckAssist feature integrated into its search engine that summarizes information from Wikipedia to answer search queries that are questions (8 Mar). The experimental feature was shut down without explanation on 12 April. Around the time, a proprietary feature by scite.ai was released that delivers answers that use research papers and provide citations for the quoted paper(s). An open letter \"Pause Giant AI Experiments\" by the Future of Life Institute calls for \"AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT- | ChatGPT is a chatbot and text-generating AI released on 30 November 2022. It quickly became highly popular, with some estimating that only two months after its launch, it had 100 million active users. Potential applications of ChatGPT include solving or supporting school writing assignments, malicious social bots (e.g. for misinformation, propaganda, and scams), and providing inspiration (e.g. for artistic writing or in design or ideation in general). There was extensive media coverage of views that regard ChatGPT as a potential step towards AGI or sentient machines, also extending to some academic works. Google released chatbot Bard due to effects of the ChatGPT release, with potential for integration into its Web search and, like ChatGPT software, also as a software development helper tool (21 Mar). DuckDuckGo released the DuckAssist feature integrated into its search engine that summarizes information from Wikipedia to answer search queries that are questions (8 Mar). The experimental feature was shut down without explanation on 12 April. Around the same time, a proprietary feature by scite.ai was released that delivers answers that use research papers and provide citations for the quoted paper(s). An open letter \"Pause Giant AI Experiments\" by the Future of Life | ChatGPT is a language model that outperforms human doctors and has 100 million active users. It was released on 30 November 2022. |\n",
    "| Cost   | 1963 Tokens                                                  | 2045 Tokens                                                  | 150 Tokens                                                   |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f68f0e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2023_Science_Wikipedia_KnowledgeGraph.html\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"100%\"\n",
       "            height=\"600px\"\n",
       "            src=\"2023_Science_Wikipedia_KnowledgeGraph.html\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x161ed9a30>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Render the knowledge graph as an interactive pyvis network\n",
    "from pyvis.network import Network\n",
    "\n",
    "g = kg_index.get_networkx_graph(200)\n",
    "net = Network(notebook=True, cdn_resources=\"in_line\", directed=True)\n",
    "net.from_nx(g)\n",
    "net.show(\"2023_Science_Wikipedia_KnowledgeGraph.html\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "556b81cb",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}