Extract data in tables from a PDF
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "BP8bdWSbBRqw"
},
"source": [
"## Extract tables from a PDF\n",
"\n",
"This notebook extracts tables from a PDF with the `unstructured` library and then asks GPT-4 to parse each table row through a function call. See\n",
"the [documentation page](https://unstructured-io.github.io/unstructured) for a full description\n",
"of the features in the library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jZp37lfueaeZ"
},
"outputs": [],
"source": [
"# Install system requirements\n",
"!apt-get -qq install poppler-utils tesseract-ocr\n",
"# Upgrade Pillow to the latest version\n",
"%pip install -q --user --upgrade pillow\n",
"# Install Python packages; the last cell uses the legacy `openai.ChatCompletion` API (openai<1.0)\n",
"%pip install -q \"unstructured[pdf]\" \"openai<1.0\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "F2JBqRZgtq2Z"
},
"outputs": [],
"source": [
"# Download the sample PDF and partition it into elements\n",
"import requests\n",
"from io import BytesIO\n",
"from unstructured.partition.pdf import partition_pdf\n",
"\n",
"\n",
"URL = \"https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf\"\n",
"\n",
"response = requests.get(URL)\n",
"response.raise_for_status()  # fail early if the download did not succeed\n",
"pdf_file = BytesIO(response.content)\n",
"# infer_table_structure=True fills metadata.text_as_html for Table elements\n",
"elements = partition_pdf(file=pdf_file, infer_table_structure=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<table><thead><th>Number of Coils</th><th>Number</th><th>of Paperclips</th></thead><tr><td></td><td></td><td>3,5,4</td></tr><tr><td></td><td>11,</td><td>10, 12</td></tr><tr><td></td><td>15,</td><td>13, 14</td></tr></table>\n",
"<table><thead><th>Number of Coils</th><th>Number</th><th>of Paperclips</th></thead><tr><td></td><td></td><td>3,5,4</td></tr><tr><td></td><td>11,</td><td>10, 12</td></tr><tr><td></td><td>15,</td><td>13, 14</td></tr></table>\n",
"<table><thead><th>Number of Coils</th><th>Number</th><th>of Paperclips</th></thead><tr><td></td><td></td><td>3,5,4</td></tr><tr><td></td><td>11,</td><td>10, 12</td></tr><tr><td></td><td>15,</td><td>13, 14</td></tr></table>\n"
]
}
],
"source": [
"# Keep only the elements that were classified as tables\n",
"tables = [el for el in elements if el.category == \"Table\"]\n",
"\n",
"# Print the HTML rendering of each table\n",
"for table in tables:\n",
"    print(table.metadata.text_as_html)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# Description and JSON Schema properties of the object that `parse_object` should receive\n",
"description = \"The following object contains information about cars and their drivers.\"\n",
"properties = {\n",
"    \"driver\": {\n",
"        \"type\": \"string\",\n",
"        \"description\": \"The name of the driver\",\n",
"    },\n",
"    \"car\": {\n",
"        \"type\": \"string\",\n",
"        \"description\": \"The name of the car\",\n",
"    },\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"role\": \"assistant\",\n",
" \"content\": \"The given table does not match with the description for the `parse_object` function. The function is designed to take objects containing information about cars and their drivers. However, the provided table seems to contain information about coils and paperclips. I'm afraid we can't use the `parse_object` function for this table unless the table includes drivers and cars information.\"\n",
"}\n",
"{\n",
" \"role\": \"assistant\",\n",
" \"content\": \"As an AI, I'm unable to interact directly with the HTML, but I can provide instructions on how one can extract the required data and call the `parse_object` function.\\n\\nIf you're using JavaScript, you could utilize the DOM API or jQuery to parse the HTML table and call the function. Here's a basic step-by-step guide:\\n\\n1. Select the table element using document.querySelector or equivalent method.\\n2. Select the rows within this table using HTMLTableRowElement's `rowIndex` property.\\n3. Iterate over each table row.\\n4. Within each row, extract the data from the `Driver` and `Car` cells (those cells are the second and third ones respectively counting starts from `0`).\\n5. Call the `parse_object` function passing the appropriate driver and car.\\n\\nThe JavaScript code would be something like the following:\\n\\n```JavaScript\\nlet table = document.querySelector(\\\"table\\\");\\nlet rows = table.rows;\\n\\nfor (let i = 1; i < rows.length; i++) {\\n let driver = rows[i].cells[1].textContent;\\n let car = rows[i].cells[2].textContent;\\n\\n functions.parse_object({driver: driver, car: car});\\n}\\n```\\n\\nPlease replace \\\"table\\\" in `document.querySelector(\\\"table\\\")` with a specific selector if there are multiple tables on the page.\\n\\nIt's worth noting that you might want to amend this code to handle conditions where the table does not follow the required format. The provided code assumes that every row has the correct format. Additionally, if the `parse_object` function belongs to another system or it is an async function, additional handling/connection would be necessary.\"\n",
"}\n",
"{\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Your request cannot be fulfilled as it's incorrect. The `parse_object` function is designed to handle information relating to cars and their drivers. However, the given HTML table involves time and distance data, unrelated to cars or their drivers. It appears there is a mismatch between the function and the data. Please provide data relevant to the function or a function that is intended to parse such table data.\"\n",
"}\n"
]
}
],
"source": [
"# Legacy SDK interface: `openai.ChatCompletion` requires openai<1.0.\n",
"# Assumes the OPENAI_API_KEY environment variable is set.\n",
"import openai\n",
"\n",
"\n",
"def build_messages(html):\n",
"    \"\"\"Build the chat messages that ask the model to parse one HTML table.\"\"\"\n",
"    return [\n",
"        {\n",
"            \"role\": \"system\",\n",
"            \"content\": \"Don't make assumptions about what values to plug into functions. You can call the function if there is just a subset of the required parameters.\",\n",
"        },\n",
"        {\n",
"            \"role\": \"user\",\n",
"            \"content\": f\"Given the following HTML table, please parse each row into a JSON object by calling the `parse_object` function. The table might not be correctly formatted:\\n\\n{html}\",\n",
"        },\n",
"    ]\n",
"\n",
"\n",
"# Send each extracted table to the model together with the function schema\n",
"for table in tables:\n",
"    html = table.metadata.text_as_html\n",
"    response = openai.ChatCompletion.create(\n",
"        model=\"gpt-4\",\n",
"        messages=build_messages(html),\n",
"        functions=[\n",
"            {\n",
"                \"name\": \"parse_object\",\n",
"                \"description\": f\"This function is called for objects that fit to this description: {description}\",\n",
"                \"parameters\": {\n",
"                    \"type\": \"object\",\n",
"                    \"properties\": properties,\n",
"                },\n",
"            },\n",
"        ],\n",
"    )\n",
"    print(response[\"choices\"][0][\"message\"])"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
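
The notebook above prints each table's `text_as_html` and hands the raw HTML to the model. As a complementary sketch, not part of the original notebook, the same HTML can be turned into a list of rows with only the Python standard library, which makes it easier to inspect what was extracted before any model call. The names `TableRowParser` and `html_table_to_rows` are hypothetical helpers introduced here for illustration; the sketch assumes the markup uses only `thead`, `tr`, `th`, and `td` tags, as in the output recorded above.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tr"):
            # A header without an inner <tr> is treated as a row of its own.
            self._flush_row()
            self._row = []
        elif tag in ("th", "td"):
            self._flush_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._flush_cell()
        elif tag in ("thead", "tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows; the header row, if any, comes first."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><thead><th>Number of Coils</th><th>Number</th><th>of Paperclips</th></thead>"
    "<tr><td></td><td></td><td>3,5,4</td></tr>"
    "<tr><td></td><td>11,</td><td>10, 12</td></tr></table>"
)
rows = html_table_to_rows(sample)
```

In the notebook this helper could be called as `html_table_to_rows(table.metadata.text_as_html)` for each entry in `tables`. Empty cells are kept as empty strings so that column positions stay aligned across rows.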