Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output
tl;dr this demo shows how to call OpenAI's gpt-4o-mini model, provide it with the URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data model — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large-scale data-gathering projects involving public documents.
OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.
For example, given a Congressional financial disclosure report, with assets defined in a table like this:
You define the data model you're expecting to extract, either in JSON schema or (as this demo does) via the pydantic library:
from typing import Union

from pydantic import BaseModel


class Asset(BaseModel):
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
    assets: list[Asset]
From the field names alone (the above example is basic; there are ways to provide detailed descriptions for each data field), OpenAI's API infers how your data model relates to the actual document you're trying to parse, and produces the extracted data in JSON format:
{
  "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
  "owner": "JT",
  "location": "St. Helena/Napa, CA, US",
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "Grape Sales",
  "income_low": 100001,
  "income_high": 1000000,
  "tx_gt_1000": false
},
{
  "asset_name": "25 Point Lobos - Commercial Property [RP]",
  "owner": "SP",
  "location": "San Francisco/San Francisco, CA, US",
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "Rent",
  "income_low": 100001,
  "income_high": 1000000,
  "tx_gt_1000": false
}
This demo gist provides code and results for two scenarios:
- Financial disclosure reports: this is a data-tables-in-PDF problem where you'd typically have to use a PDF parsing library like pdfplumber and write your own data parsing methods.
- Newspaper police blotter: this is irregular, free-form information — brief descriptions of reported crime incidents, written by a human reporter — where you'd typically employ humans to read, interpret, and do data entry.
Note: these are very basic examples, using the bare minimum of instructions to the API (e.g. "Extract the text from this image") and relatively little code to define the expected data schema. That said, the results are surprisingly solid.
Each example has the Python script used to produce the corresponding JSON output. To re-run these scripts on your own, the first thing you need to do is to create your own OpenAI developer account at platform.openai.com, then:
- Put a couple of bucks into your account balance. Both of these examples use around 30,000-50,000 tokens, i.e. they cost about half a cent to execute.
- Create an API key
- Set it as your $OPENAI_API_KEY environment variable
  - Alternatively, you can paste your key into the api_key argument, i.e. replace client = OpenAI() with client = OpenAI(api_key='Yourkeyhere')
Then install the OpenAI Python SDK and pydantic:
pip install openai pydantic
For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. If you want to test a URL of your choosing, simply change the INPUT_URL variable at the top of the script.
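Under the hood, each script boils down to a single call to the SDK's parse() helper, passing the pydantic model as the response_format and the screenshot URL as an image message. Here's a minimal sketch of that pattern; the example URL is a placeholder, and the exact prompt and layout approximate the linked scripts rather than reproducing them:

from typing import Union

from openai import OpenAI
from pydantic import BaseModel


class Asset(BaseModel):
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
    assets: list[Asset]


INPUT_URL = "https://example.com/disclosure-page.png"  # placeholder screenshot URL

client = OpenAI()  # reads the $OPENAI_API_KEY environment variable

response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=DisclosureReport,  # the pydantic model supplies the JSON schema
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the text from this image"},
                {"type": "image_url", "image_url": {"url": INPUT_URL}},
            ],
        }
    ],
)

# .parsed is a DisclosureReport instance; dump it as indented JSON
print(response.choices[0].message.parsed.model_dump_json(indent=2))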
- The script: extract-financial-disclosure.py
- The results: output-financial-disclosure.json
The following screenshot is taken from the PDF of the full report, which can be found at disclosures-clerk.house.gov. Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
As shown in the following snippet, the results look accurate and as expected. Note that it also correctly parses the "Location" and "Description" fields (when they exist), even though those fields aren't provided in tabular format (i.e. they're globbed into the "Asset" description as free-form text).
It also understands that tx_gt_1000 corresponds to the "Tx. > $1,000?" header, and that that field contains checkboxes. Even though the sample page has no examples of checked checkboxes, the model correctly infers that tx_gt_1000 is false.
{
  "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
  "owner": "OL",
  "location": "New York, NY, US",
  "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.",
  "asset_value_low": 1000001,
  "asset_value_high": 5000000,
  "income_type": "Partnership Income",
  "income_low": 50001,
  "income_high": 100000,
  "tx_gt_1000": false
},
{
  "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
  "owner": "SP",
  "location": null,
  "description": null,
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "None",
  "income_low": null,
  "income_high": null,
  "tx_gt_1000": false
},
It's also nice that I didn't have to do even the minimum of "data prep": I gave it a screenshot of the report page — the top third of which has info I don't need — and it "knew" that it should only care about the data under the "Schedule A: Assets and Unearned Income" header.
If I were scraping financial disclosures for real, I would make use of json-schema's "description" attribute, which can be defined via Pydantic like this:
from typing import Union

from pydantic import BaseModel, Field


class Asset(BaseModel):
    asset_name: str = Field(
        description="The name of the asset, under the 'Asset' header"
    )
    owner: str = Field(
        description="Under the 'Owner' header, a 2-letter abbreviation, e.g. SP, DC, JT"
    )
    location: Union[str, None] = Field(
        description="Some records have 'Location:' text as part of the 'Asset' header"
    )
    description: Union[str, None] = Field(
        description="Some records have 'Description:' text as part of the 'Asset' header"
    )
    asset_value_low: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    asset_value_high: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    income_type: str = Field(description="Under the 'Income Type(s)' field")
    income_low: Union[int, None] = Field(
        description="Under the 'Income' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    income_high: Union[int, None] = Field(
        description="Under the 'Income' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    tx_gt_1000: bool = Field(
        description="Under the 'Tx. > $1,000?' header: True if the checkbox is checked, False if it is empty"
    )


class DisclosureReport(BaseModel):
    assets: list[Asset]
But as you can see from the result JSON, OpenAI's model seems "smart" enough to understand a basic data-copying task without specific instructions.
I was curious how well the model would do without any instruction, i.e. when you don't bother to define a pydantic model and instead pass in a response format of {"type": "json_object"}:
response = client.beta.chat.completions.parse(
    response_format={"type": "json_object"},
    model="gpt-4o-mini",
    messages=input_messages
)
The answer: just fine. You can see the code and full results here:
Without a defined schema, the model treated the entire document (not just the Assets Schedule) as data:
{
  "document": {
    "title": "Financial Disclosure Report",
    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
    "filer_information": {
      "name": "Hon. Nancy Pelosi",
      "status": "Member",
      "state_district": "CA11"
    },
    "filing_information": {
      "filing_type": "Annual Report",
      "filing_year": "2023",
      "filing_date": "05/15/2024"
    },
    "schedule_a": {
      "title": "Schedule A: Assets and 'Unearned' Income",
      "assets": [
        {
          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
          "owner": "JT",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Grape Sales",
          "income": "$100,001 - $1,000,000",
          "location": "St. Helena/Napa, CA, US"
        },
It left the values as text, e.g. "value": "$5,000,001 - $25,000,000" versus "asset_value_low": 5000001. And it left out the optional data fields, e.g. location and description, for entries that didn't have them:
{
  "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
  "owner": "SP",
  "value": "$1,000,001 - $5,000,000",
  "income_type": "Partnership Income",
  "income": "$50,001 - $100,000",
  "location": "New York, NY, US",
  "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
},
{
  "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
  "owner": "SP",
  "value": "$5,000,001 - $25,000,000",
  "income_type": "None"
},
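If you went the schema-less route, converting those dollar-range strings back into numbers would be on you. A hypothetical helper (not part of the demo scripts) would look something like:

def parse_dollar_range(value: str) -> tuple[int, int]:
    """Turn a string like '$5,000,001 - $25,000,000' into (5000001, 25000000)."""
    low, high = value.split(" - ")
    return (
        int(low.replace("$", "").replace(",", "")),
        int(high.replace("$", "").replace(",", "")),
    )

print(parse_dollar_range("$5,000,001 - $25,000,000"))  # (5000001, 25000000)

With the Structured Outputs schema, that post-processing happens inside the API call instead.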
As I said at the beginning of this section, the report screenshot comes from a PDF with actual text — most Congressional disclosure filings in the past 5 years have used the e-filing system, which inherently results in more regular data even when the output is PDF.
So I tried using Structured Outputs on a screenshot of a 2008-era report, and the results were pretty solid.
The main caveat is that I had to rotate the page orientation by 90 degrees. The model did try to parse the vertically-oriented page, and got about half of the values right — which is probably one of the worst-case scenarios (you'd prefer the model to completely flub things, so that you could at least catch the failure with automated error checks).
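If you had a batch of sideways scans, a one-off pre-processing step with an image library such as Pillow (my choice here, not something the demo scripts do) would handle the rotation:

from PIL import Image

# Hypothetical filenames: rotate a sideways scan upright before passing its URL to the API
img = Image.open("report-2008-sideways.png")
img.rotate(90, expand=True).save("report-2008-upright.png")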
- The script: extract-police-blotter.py
- The results: output-police-blotter.json
The screenshot was taken from the Stanford Daily archives: https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article
For reasons that are explained in detail below, this example isn't meant to be a reasonable test of the model's capabilities. But it's a fun experiment to see how well the model performs with something that was never meant to be "data" and is inherently riddled with data quality issues.
Consider what the data point of a basic crime incident report might contain:
- When: a date and time
- Where: a place
- Who:
  - a victim
  - a suspect
- What: the crime the suspect allegedly committed
It's easy to come up with many variations and edge cases:
- No specific time: e.g. "computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months"
- No listed place: it's unclear if the reporter purposefully omitted it, or if it was left off the original police report.
- No suspect ("an alcohol-related medical call") or no victim (e.g. "an accidental fire call"). Or multiple suspects and multiple victims.
Unlike the financial disclosure example, the input data is freeform narrative text. The onus is entirely on us to define what a blotter report is, which ends up requiring defining what a crime incident is. Not surprisingly, the corresponding Pydantic code is a lot more verbose, and I bet that if you asked 1,000 journalists to write a definition, they'd all be different.
Here's what mine looks like:
# Define the data structures in Pydantic:
# an Incident involves several Persons (victims, perpetrators)
from pydantic import BaseModel, Field


class Person(BaseModel):
    description: str
    gender: str
    is_student: bool


# Pydantic docs on field descriptions:
# https://docs.pydantic.dev/latest/concepts/fields/
class Incident(BaseModel):
    date: str
    time: str
    location: str
    summary: str = Field(description="""Brief summary, less than 30 chars""")
    category: str = Field(
        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
    )
    property_damage: str = Field(
        description="""If a property crime, then a description of what was stolen/damaged/lost"""
    )
    arrest_made: bool
    perpetrators: list[Person]
    victims: list[Person]
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )


class Blotter(BaseModel):
    incidents: list[Incident]
I ask the model to provide an incident_text field, i.e. the verbatim text from which it extracted the incident data point. This is helpful for evaluating the experiment. But for an actual data project, you might want to omit it, as it adds to the number of output tokens and API cost:
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )
The resulting incident_text field extracted from the above snippet is basically correct:
A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
However, it leaves off the 11:40 p.m., which is at the beginning of the printed incident, and is something that I normally would like to include because I want to know everything the model looked at when extracting the data point. The 11:40 p.m. time is correctly included in the rest of the data output:
{
  "date": "April 2",
  "time": "11:40 p.m.",
  "location": "Rains apartments",
  "summary": "Bike vandalized",
  "category": "property",
  "property_damage": "Wheel of bike",
  "arrest_made": false,
  "perpetrators": [
    {
      "description": "Two unknown suspects",
      "gender": "unknown",
      "is_student": false
    }
  ],
  "victims": [
    {
      "description": "A graduate student in the School of Education",
      "gender": "unknown",
      "is_student": true
    }
  ],
  "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
}
As with the financial disclosure report, my script provides a screenshot and leaves it up to OpenAI's model to figure out what's going on. I was pleasantly surprised at how well gpt-4o-mini did in gleaning structure from a newspaper print listicle, with instructions as basic as: "Extract the text from this image"
For example, at first glance it seems that every incident in the blotter has a date (in the subhed) and a time (at the beginning of the graf). But under "Thursday, April 1", you can see that pattern already broken:
Is that second graf ("A female administrator in Materials Science...") a continuation of the 9:30 p.m. incident where a "man reported that someone removed his rear license plate"?
Most human readers, after reading both paragraphs — and then the rest of the blotter — will realize that these are 2 separate incidents. But there's nothing at all in the structure of the text to indicate that. Before I ran this experiment, I thought I would have to provide detailed parsing instructions to the model, e.g.
What you are reading is a police blotter, a list of reported incidents that police were called to. Every paragraph should be treated as a separate incident. Most incidents, but not all, begin with a timestamp, e.g. "11:20 p.m".
But the model saw on its own that there are 2 incidents, and that the second one happened on April 1 at an unspecified time.
{
  "date": "April 1",
  "time": "9:30 p.m.",
  "location": "Toyon parking lot",
  "summary": "License plate stolen",
  "category": "property",
  "property_damage": "rear license plate",
  "arrest_made": false,
  "perpetrators": [],
  "victims": [
    {
      "description": "Man",
      "gender": "unknown",
      "is_student": false
    }
  ],
  "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
},
{
  "date": "April 1",
  "time": "unknown",
  "location": "unknown",
  "summary": "Unauthorized purchase reported",
  "category": "other",
  "property_damage": "computer equipment",
  "arrest_made": false,
  "perpetrators": [],
  "victims": [
    {
      "description": "Female administrator",
      "gender": "female",
      "is_student": false
    }
  ],
  "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
},
By my count, there are 19 incidents in this issue of the Stanford Daily's police blotter, and the API correctly returns 19 different incidents.
Again, the data model is inherently messy, and I put in minimal effort to describe what an "incident" is, such as the variety of situations and edge cases. That, plus the inherent limitations of the data, are the root cause of most of the model's problems.
For example, I intended the perpetrators and victims to be lists of proper nouns or simple nouns, so that we could ask questions like: "how many incidents involved multiple people?" Given the following incident text:
A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
— this is how the model parsed the suspects:
"perpetrators": [
{
"description": "Two unknown suspects",
"gender": "unknown",
"is_student": false
}
]
For a data project, I might have preferred a result that would easily yield a count of 2, e.g.:
"perpetrators": [
{
"description": "Unknown suspect",
"gender": "unknown",
"is_student": false
}
{
"description": "Unknown suspect",
"gender": "unknown",
"is_student": false
}
]
But how should the model know what I'm trying to do sans specific instructions? I think most humans, given the same minimalist instructions, would have also recorded "Two unknown suspects".
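For what it's worth, once you have the output JSON, counting people per incident is a few lines of Python. This sketch assumes the top-level structure implied by the Blotter model (an "incidents" list in output-police-blotter.json):

import json

with open("output-police-blotter.json") as f:
    blotter = json.load(f)

for incident in blotter["incidents"]:
    n_people = len(incident["perpetrators"]) + len(incident["victims"])
    print(n_people, "-", incident["summary"])

With "Two unknown suspects" stored as a single Person, the bike-vandalism incident counts as 2 people instead of 3, which is exactly the ambiguity described above.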
However, the model greatly struggled with filling out the perpetrators and victims lists, frequently mistaking the suspect/perpetrator for the victim when there was no specific victim mentioned:
A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license.
"victims": [
{
"description": "A male undergraduate",
"gender": "male",
"is_student": true
}
]
It goes without saying that the model also missed the mark when the narrative was more complicated. For example, in the case of the unauthorized purchases at Fry's:
A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months.
The "female administrator" is not the victim, but the person who reported the crime. The victim would be Stanford University, or more specifically, its MScE department.
I'm not surprised the model had problems with identifying victims and suspects, though I'm unsure how much extra instruction would be needed to get reliable results from a general model.
One thing that the model frequently and inexplicably erred on was classifying people's gender.
This is how I defined a Person using pydantic:
class Person(BaseModel):
    description: str
    gender: str
    is_student: bool
Even when the subject's noun has an obvious gender, the model would inexplicably flub it:
A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot.
"victims": [
{
"description": "A man",
"gender": "unknown",
"is_student": false
}
]
It was worse when the subject's noun did not indicate gender, but the rest of the sentence did:
A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
"victims": [
{
"description": "A graduate student in the School of Education",
"gender": "unknown",
"is_student": true
}
],
I'm not sure what the issue is. It might be remedied by providing explicit and thorough instructions and examples, but this seemed like a much easier thing to infer than the other things that OpenAI's model was able to infer on its own.
With so many things left to the interpretation of the LLM, it was no surprise that I got different results every time I ran the extract-police-blotter.py script, especially when it comes to the categorization of crimes.
In the data specification, I did attempt to describe for the model what I wanted for category:
    category: str = Field(
        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
    )
Given the option of saying "other", the model seemed eager to use it for any slightly vague situation. It classified the unauthorized purchases at Fry's as "other", even though embezzlement would better fit under property crimes by the FBI's UCR definition. Maybe this could be fixed by providing the model with detailed examples and definitions of statutes and criminal code?
But ultimately, as I said from the start, the model's performance is bounded by the limitations and errors in the source data. For example, an incident where someone gets hit on the head with a bottle seems to me obviously "violent", i.e. assault:
A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested.
However, the model thinks it is "other":
{
  "date": "April 4",
  "time": "3:05 a.m.",
  "location": "Sigma Alpha Epsilon",
  "summary": "Altercation reported",
  "category": "other",
  "property_damage": "None",
  "arrest_made": false,
  "perpetrators": [
    {
      "description": "Two undergraduate suspects",
      "gender": "unknown",
      "is_student": true
    }
  ],
  "victims": [
    {
      "description": "A male undergraduate",
      "gender": "male",
      "is_student": true
    }
  ],
  "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
}
But is the model necessarily wrong? Two "suspects" were apparently identified, but no one was actually arrested. I took this to mean that the suspects fled and hadn't been located at the time of the report. But maybe it's something more benign: an "altercation" happened, but when the cops arrived, everyone was cool including the guy who got hit by the bottle, thus no allegation of assault for police to act on or file as part of their UCR statistics. Ultimately we have to guess the author's intent.
The OpenAI model's performance here wouldn't work for a real data project — but again, this was just a toy experiment, and doesn't represent what you'd get if you spent more than 10 minutes thinking about the data model, never mind picked a data source slightly more structured than a newspaper listicle. I think OpenAI's model would work very well for something with more substantive text and formal structure, such as obituaries.
I'd like to offer a couple of suggestions that could enhance the effectiveness and reliability of your approach:
- Omitting the system message for Structured Outputs: When using Structured Outputs with OpenAI's API, you can omit the system message that specifies "JSON" output. That requirement was primarily relevant to the older response_format={"type": "json_object"} mode. With Structured Outputs, the API inherently understands and adheres to the defined schema without needing explicit instructions to format the response as JSON.
- Using Enum or typing.Literal for constrained parameters: To ensure that parameters with limited, predefined options (like the category field in your police blotter example) strictly adhere to those options, define them using Python's Enum or typing.Literal. The LLM's constrained generation mechanism masks all tokens except those allowed by the Enum or Literal, so only the specified values can be generated. This not only enforces the constraints within your data model, it also serializes those fields as enums in the JSON schema, guaranteeing that the model only produces the specified values and eliminating the risk of unexpected or invalid entries.
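Here's a minimal sketch of both approaches, applied to the category field from the police blotter example (class and member names are illustrative, not taken from the original scripts):

from enum import Enum
from typing import Literal

from pydantic import BaseModel, Field


# Implementing with Enum: the allowed categories become enum members,
# and pydantic serializes the field as an enum in the JSON schema.
class Category(str, Enum):
    violent = "violent"
    property_crime = "property"
    traffic = "traffic"
    call_for_service = "call for service"
    other = "other"


class IncidentWithEnum(BaseModel):
    category: Category = Field(description="Type of report, broadly speaking")


# Implementing with typing.Literal: the same constraint, without a separate class.
class IncidentWithLiteral(BaseModel):
    category: Literal[
        "violent", "property", "traffic", "call for service", "other"
    ] = Field(description="Type of report, broadly speaking")

With either definition, the structured output mode can only ever emit one of those five strings for category, so misspellings or novel categories are impossible by construction.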