{
"cells": [
{
"cell_type": "markdown",
"id": "c4c04f33",
"metadata": {},
"source": [
"# Extracting the Top TV Shows from IMDB (HTML)\n",
"\n",
"The top TV shows on IMDB: https://www.imdb.com/chart/toptv/\n",
"\n",
"In this notebook, I am using Kor and LangChain to:\n",
"1. Pull the raw, unstructured HTML of IMDB's Top 250 TV shows\n",
"2. Turn that raw HTML into a structured DataFrame\n",
"\n",
"I hope you find this code useful. Please follow me on https://twitter.com/virattt for more tutorials like this.\n",
"\n",
"Happy learning! :)"
]
},
{
"cell_type": "markdown",
"id": "0ba1c899",
"metadata": {},
"source": [
"# Step 1. Download TV show HTML data from IMDB\n",
"\n",
"Each TV show's name, release year, and rating are embedded in the HTML at the URL below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c14d191",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = \"https://www.imdb.com/chart/toptv/\"\n",
"# IMDB may reject the default requests User-Agent, so send a browser-like one\n",
"headers = {\"User-Agent\": \"Mozilla/5.0\"}\n",
"response = requests.get(url, headers=headers)\n",
"response.raise_for_status()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ab792281",
"metadata": {},
"outputs": [],
"source": [
"from langchain.schema import Document\n",
"\n",
"# Use LangChain to create a document out of the HTML response\n",
"document = Document(page_content=response.text)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "eb1066f7",
"metadata": {},
"outputs": [],
"source": [
"from kor.documents.html import MarkdownifyHTMLProcessor\n",
"\n",
"# Use Kor to convert the HTML document to Markdown\n",
"md = MarkdownifyHTMLProcessor().process(document)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ce35105c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"# Split the document into smaller pieces, so we can feed it into our LLM later\n",
"split_docs = RecursiveCharacterTextSplitter().split_documents([md])"
]
},
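{
"cell_type": "markdown",
"id": "5f0a2b1c",
"metadata": {},
"source": [
"Before moving on, it helps to see how many chunks the splitter produced, since each chunk becomes an LLM call in Step 3 (and drives the cost). A minimal sanity check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c1d3e2f",
"metadata": {},
"outputs": [],
"source": [
"# How many chunks will be sent to the LLM, and what does one look like?\n",
"print(f\"Number of chunks: {len(split_docs)}\")\n",
"print(split_docs[0].page_content[:200])"
]
},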
{
"cell_type": "markdown",
"id": "e7c25be2",
"metadata": {},
"source": [
"# Step 2. Create a model to extract TV show data from the HTML document"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b4cb2e8f",
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field, validator\n",
"from typing import Optional\n",
"\n",
"# Extend the BaseModel\n",
"class ImdbTvShowModel(BaseModel):\n",
"    name: str = Field(\n",
"        description=\"The name of the TV show\",\n",
"    )\n",
"    release_year: Optional[str] = Field(\n",
"        description=\"The year the TV show was released\",\n",
"    )\n",
"    rating: Optional[str] = Field(\n",
"        description=\"The rating of the TV show\",\n",
"    )\n",
"\n",
"    @validator(\"name\")\n",
"    def name_must_not_be_empty(cls, v):\n",
"        if not v:\n",
"            raise ValueError(\"TV show name must not be empty\")\n",
"        return v"
]
},
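{
"cell_type": "markdown",
"id": "7d2e4f30",
"metadata": {},
"source": [
"As a quick sanity check with made-up values, we can construct the model by hand and confirm that the `name` validator rejects empty names (pydantic's `ValidationError` subclasses `ValueError`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e3f5a41",
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: a valid record\n",
"print(ImdbTvShowModel(name=\"Planet Earth II\", release_year=\"2016\", rating=\"9.4\"))\n",
"\n",
"# An empty name should fail validation\n",
"try:\n",
"    ImdbTvShowModel(name=\"\", release_year=None, rating=None)\n",
"except ValueError as err:\n",
"    print(err)"
]
},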
{
"cell_type": "code",
"execution_count": 6,
"id": "3d3e3019",
"metadata": {},
"outputs": [],
"source": [
"from kor import from_pydantic\n",
"\n",
"# Provide an example to the ImdbTvShowModel of what a TV show looks like within the HTML data.\n",
"schema, extraction_validator = from_pydantic(\n",
"    ImdbTvShowModel,\n",
"    description=\"Information about a top-rated TV show on IMDB\",\n",
"    # Here, I only provide 1 example. Feel free to provide more examples for better results!\n",
"    examples=[\n",
"        (\n",
"            \"Planet Earth II (2016) 9.4\",\n",
"            {\n",
"                \"name\": \"Planet Earth II\",\n",
"                \"release_year\": \"2016\",\n",
"                \"rating\": \"9.4\",\n",
"            },\n",
"        )\n",
"    ],\n",
"    many=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2b888c50",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from pydantic import BaseModel, Field, validator\n",
"from kor import extract_from_documents, from_pydantic, create_extraction_chain\n",
"\n",
"# Optional: SeleniumURLLoader is an alternative way to fetch the page. Its HTML\n",
"# parsing (via unstructured) relies on the nltk data downloaded below.\n",
"from langchain.document_loaders import SeleniumURLLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84b4fa97",
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"\n",
"# Tokenizer data used by unstructured-based loaders like SeleniumURLLoader\n",
"nltk.download('punkt')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc27c4df",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"nltk.download('averaged_perceptron_tagger')"
]
},
{
"cell_type": "markdown",
"id": "28311f09",
"metadata": {},
"source": [
"# Step 3. Begin TV show data extraction"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d161d43d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"# Create the language model. I'm using the basic ChatOpenAI model here. \n",
"llm = ChatOpenAI(temperature=0)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cb9613d8",
"metadata": {},
"outputs": [],
"source": [
"from kor import create_extraction_chain\n",
"\n",
"# Create the Kor extraction chain\n",
"chain = create_extraction_chain(\n",
"    llm,\n",
"    schema,\n",
"    encoder_or_encoder_class=\"csv\",\n",
"    validator=extraction_validator,\n",
"    input_formatter=\"triple_quotes\",\n",
")"
]
},
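{
"cell_type": "markdown",
"id": "9f4a6b52",
"metadata": {},
"source": [
"Before kicking off the full (and potentially expensive) extraction, it's worth smoke-testing the chain on a single hand-written line. This is a minimal sketch that assumes Kor exposes the parsed result under the `data` key of the chain's output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b5c7d63",
"metadata": {},
"outputs": [],
"source": [
"# Smoke test on one hand-written line (assumes the parsed output lives under \"data\")\n",
"print(chain.run(\"Planet Earth II (2016) 9.4\")[\"data\"])"
]
},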
{
"cell_type": "code",
"execution_count": null,
"id": "a454e089",
"metadata": {},
"outputs": [],
"source": [
"from langchain.callbacks import get_openai_callback\n",
"from kor import extract_from_documents\n",
"\n",
"# Begin extraction. NOTE: This can be super expensive. Use at your own risk.\n",
"with get_openai_callback() as cb:\n",
"    document_extraction_results = await extract_from_documents(\n",
"        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True\n",
"    )\n",
"    print(f\"Total Tokens: {cb.total_tokens}\")\n",
"    print(f\"Prompt Tokens: {cb.prompt_tokens}\")\n",
"    print(f\"Completion Tokens: {cb.completion_tokens}\")\n",
"    print(f\"Successful Requests: {cb.successful_requests}\")\n",
"    print(f\"Total Cost (USD): ${cb.total_cost}\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1a33ab9c",
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"\n",
"# Parse the extraction results into a single flat list. Because we passed\n",
"# return_exceptions=True above, skip any chunks that failed outright.\n",
"validated_data = list(\n",
"    itertools.chain.from_iterable(\n",
"        extraction[\"validated_data\"]\n",
"        for extraction in document_extraction_results\n",
"        if not isinstance(extraction, Exception)\n",
"    )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fee31531",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Create the DataFrame!\n",
"df = pd.DataFrame(record.dict() for record in validated_data)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "328793e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>release_year</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Planet Earth II</td>\n",
" <td>2016</td>\n",
" <td>9.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Breaking Bad</td>\n",
" <td>2008</td>\n",
" <td>9.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Planet Earth</td>\n",
" <td>2006</td>\n",
" <td>9.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Band of Brothers</td>\n",
" <td>2001</td>\n",
" <td>9.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Chernobyl</td>\n",
" <td>2019</td>\n",
" <td>9.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>256</th>\n",
" <td>Foyle's War</td>\n",
" <td>2002</td>\n",
" <td>8.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257</th>\n",
" <td>Gintama</td>\n",
" <td>2005</td>\n",
" <td>8.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258</th>\n",
" <td>Black Books</td>\n",
" <td>2000</td>\n",
" <td>8.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>259</th>\n",
" <td>The Great British Baking Show</td>\n",
" <td>2010</td>\n",
" <td>8.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>260</th>\n",
" <td>X-Men: The Animated Series</td>\n",
" <td>1992</td>\n",
" <td>8.4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>261 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
"                              name release_year rating\n",
"0                  Planet Earth II         2016    9.4\n",
"1                     Breaking Bad         2008    9.4\n",
"2                     Planet Earth         2006    9.4\n",
"3                 Band of Brothers         2001    9.4\n",
"4                        Chernobyl         2019    9.3\n",
"..                             ...          ...    ...\n",
"256                    Foyle's War         2002    8.4\n",
"257                        Gintama         2005    8.4\n",
"258                    Black Books         2000    8.4\n",
"259  The Great British Baking Show         2010    8.4\n",
"260     X-Men: The Animated Series         1992    8.4\n",
"\n",
"[261 rows x 3 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Voila! Done. DataFrame created from unstructured HTML data.\n",
"df"
]
},
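{
"cell_type": "markdown",
"id": "1c6d8e74",
"metadata": {},
"source": [
"One caveat: the splitter's overlapping chunks can cause the same show to be extracted more than once, which is why there are 261 rows above even though the chart has 250 entries. An optional cleanup, assuming a show's name and release year identify it uniquely:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d7e9f85",
"metadata": {},
"outputs": [],
"source": [
"# Drop duplicate rows that can arise from overlapping chunks\n",
"df = df.drop_duplicates(subset=[\"name\", \"release_year\"]).reset_index(drop=True)\n",
"len(df)"
]
},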
{
"cell_type": "code",
"execution_count": null,
"id": "93db3b63",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}