Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output
tl;dr this demo shows how to call OpenAI's gpt-4o-mini model, provide it with the URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data model — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large-scale data-gathering projects involving public documents.
OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.
For example, given a Congressional financial disclosure report, with assets defined in a table like this:
You define the data model you're expecting to extract, either in JSON schema or (as this demo does) via the pydantic library:
from typing import Union

from pydantic import BaseModel


class Asset(BaseModel):
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
    assets: list[Asset]
From the field names alone (the above example is basic; there are ways to provide detailed descriptions for each data field), OpenAI's API infers how your data model relates to the actual document you're trying to parse, and produces the extracted data in JSON format:
{
  "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
  "owner": "JT",
  "location": "St. Helena/Napa, CA, US",
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "Grape Sales",
  "income_low": 100001,
  "income_high": 1000000,
  "tx_gt_1000": false
},
{
  "asset_name": "25 Point Lobos - Commercial Property [RP]",
  "owner": "SP",
  "location": "San Francisco/San Francisco, CA, US",
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "Rent",
  "income_low": 100001,
  "income_high": 1000000,
  "tx_gt_1000": false
}
This demo gist provides code and results for two scenarios:
- Financial disclosure reports: this is a data-tables-in-PDF problem where you'd typically have to use a PDF parsing library like pdfplumber and write your own data parsing methods.
- Newspaper police blotter: this is irregular, free-form information — brief descriptions of reported crime incidents, written by a human reporter — where you'd typically employ humans to read, interpret, and do data entry.
Note: these are very basic examples, using the bare minimum of instructions to the API (e.g. "Extract the text from this image") and relatively little code to define the expected data schema. That said, the results are surprisingly solid.
Each example has the Python script used to produce the corresponding JSON output. To re-run these scripts on your own, the first thing you need to do is to create your own OpenAI developer account at platform.openai.com, then:
- Put a couple of bucks into your account balance. Both of these examples use around 30,000-50,000 tokens, i.e. they cost about half a cent to execute.
- Create an API key
- Set it as your $OPENAI_API_KEY environment variable
  - Alternatively, you can paste your key into the api_key argument, i.e. replace client = OpenAI() with client = OpenAI(api_key='Yourkeyhere')
Then install the OpenAI Python SDK and pydantic:
pip install openai pydantic
For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. If you want to test a URL of your choosing, simply change the INPUT_URL variable at the top of the script.
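Under the hood, each script boils down to a single call to the SDK's parse() helper, passing the pydantic model as the response_format and the screenshot URL as an image message. Here's a minimal sketch of that pattern; the example URL is a placeholder, and the exact prompt and layout approximate the linked scripts rather than reproducing them:

from typing import Union

from openai import OpenAI
from pydantic import BaseModel


class Asset(BaseModel):
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
    assets: list[Asset]


INPUT_URL = "https://example.com/disclosure-page.png"  # placeholder screenshot URL

client = OpenAI()  # reads the $OPENAI_API_KEY environment variable

response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=DisclosureReport,  # the pydantic model supplies the JSON schema
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the text from this image"},
                {"type": "image_url", "image_url": {"url": INPUT_URL}},
            ],
        }
    ],
)

# .parsed is a DisclosureReport instance; dump it as indented JSON
print(response.choices[0].message.parsed.model_dump_json(indent=2))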
- The script: extract-financial-disclosure.py
- The results: output-financial-disclosure.json
The following screenshot is taken from the PDF of the full report, which can be found at disclosures-clerk.house.gov. Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
As shown in the following snippet, the results look accurate and as expected. Note that it also correctly parses the "Location" and "Description" fields (when they exist), even though those fields aren't provided in tabular format (i.e. they're globbed into the "Asset" description as free-form text).
It also understands that tx_gt_1000 corresponds to the "Tx. > $1,000?" header, and that that field contains checkboxes. Even though the sample page has no examples of checked checkboxes, the model correctly infers that tx_gt_1000 is false.
{
  "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
  "owner": "OL",
  "location": "New York, NY, US",
  "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.",
  "asset_value_low": 1000001,
  "asset_value_high": 5000000,
  "income_type": "Partnership Income",
  "income_low": 50001,
  "income_high": 100000,
  "tx_gt_1000": false
},
{
  "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
  "owner": "SP",
  "location": null,
  "description": null,
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "None",
  "income_low": null,
  "income_high": null,
  "tx_gt_1000": false
},
It's also nice that I didn't have to do even the minimum of "data prep": I gave it a screenshot of the report page — the top third of which has info I don't need — and it "knew" that it should only care about the data under the "Schedule A: Assets and Unearned Income" header.
If I were scraping financial disclosures for real, I would make use of json-schema's "description" attribute, which can be defined via Pydantic like this:
from typing import Union

from pydantic import BaseModel, Field


class Asset(BaseModel):
    asset_name: str = Field(
        description="The name of the asset, under the 'Asset' header"
    )
    owner: str = Field(
        description="Under the 'Owner' header, a 2-letter abbreviation, e.g. SP, DC, JT"
    )
    location: Union[str, None] = Field(
        description="Some records have 'Location:' text as part of the 'Asset' header"
    )
    description: Union[str, None] = Field(
        description="Some records have 'Description:' text as part of the 'Asset' header"
    )
    asset_value_low: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    asset_value_high: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    income_type: str = Field(description="Under the 'Income Type(s)' field")
    income_low: Union[int, None] = Field(
        description="Under the 'Income' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    income_high: Union[int, None] = Field(
        description="Under the 'Income' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    tx_gt_1000: bool = Field(
        description="Under the 'Tx. > $1,000?' header: True if the checkbox is checked, False if it is empty"
    )


class DisclosureReport(BaseModel):
    assets: list[Asset]
But as you can see from the result JSON, OpenAI's model seems "smart" enough to understand a basic data-copying task without specific instructions.
I was curious how well the model would do without any instruction, i.e. when you don't bother to define a pydantic model and instead pass in a response format of {"type": "json_object"}:
response = client.beta.chat.completions.parse(
    response_format={"type": "json_object"},
    model="gpt-4o-mini",
    messages=input_messages
)
The answer: just fine. You can see the code and full results here:
Without a defined schema, the model treated the entire document (not just the Assets Schedule) as data:
{
  "document": {
    "title": "Financial Disclosure Report",
    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
    "filer_information": {
      "name": "Hon. Nancy Pelosi",
      "status": "Member",
      "state_district": "CA11"
    },
    "filing_information": {
      "filing_type": "Annual Report",
      "filing_year": "2023",
      "filing_date": "05/15/2024"
    },
    "schedule_a": {
      "title": "Schedule A: Assets and 'Unearned' Income",
      "assets": [
        {
          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
          "owner": "JT",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Grape Sales",
          "income": "$100,001 - $1,000,000",
          "location": "St. Helena/Napa, CA, US"
        },
It left the values as text, e.g. "value": "$5,000,001 - $25,000,000" versus "asset_value_low": 5000001. And it left out the optional data fields, e.g. location and description, for entries that didn't have them:
{
  "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
  "owner": "SP",
  "value": "$1,000,001 - $5,000,000",
  "income_type": "Partnership Income",
  "income": "$50,001 - $100,000",
  "location": "New York, NY, US",
  "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
},
{
  "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
  "owner": "SP",
  "value": "$5,000,001 - $25,000,000",
  "income_type": "None"
},
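If you went the schema-less route, converting those dollar-range strings back into numbers would be on you. A hypothetical helper (not part of the demo scripts) would look something like:

def parse_dollar_range(value: str) -> tuple[int, int]:
    """Turn a string like '$5,000,001 - $25,000,000' into (5000001, 25000000)."""
    low, high = value.split(" - ")
    return (
        int(low.replace("$", "").replace(",", "")),
        int(high.replace("$", "").replace(",", "")),
    )

print(parse_dollar_range("$5,000,001 - $25,000,000"))  # (5000001, 25000000)

With the Structured Outputs schema, that post-processing happens inside the API call instead.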
As I said at the beginning of this section, the report screenshot comes from a PDF with actual text — most Congressional disclosure filings in the past 5 years have used the e-filing system, which inherently results in more regular data even when the output is PDF.
So I tried using Structured Outputs on a screenshot of a 2008-era report, and the results were pretty solid.
The main caveat is that I had to rotate the page orientation by 90 degrees. The model did try to parse the vertically-oriented page, and got about half of the values right — which is probably one of the worst-case scenarios (you'd prefer the model to completely flub things, so that you could at least catch the failure with automated error checks).
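If you had a batch of sideways scans, a one-off pre-processing step with an image library such as Pillow (my choice here, not something the demo scripts do) would handle the rotation:

from PIL import Image

# Hypothetical filenames: rotate a sideways scan upright before passing its URL to the API
img = Image.open("report-2008-sideways.png")
img.rotate(90, expand=True).save("report-2008-upright.png")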
- The script: extract-police-blotter.py
- The results: output-police-blotter.json
The screenshot was taken from the Stanford Daily archives: https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article
For reasons that are explained in detail below, this example isn't meant to be a reasonable test of the model's capabilities. But it's a fun experiment to see how well the model performs with something that was never meant to be "data" and is inherently riddled with data quality issues.
Consider what the data point of a basic crime incident report might contain:
- When: a date and time
- Where: a place
- Who:
  - a victim
  - a suspect
- What: the crime the suspect allegedly committed
It's easy to come up with many variations and edge cases:
- No specific time: e.g. "computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months"
- No listed place: it's unclear if the reporter purposefully omitted it, or if it was left off the original police report.
- No suspect ("an alcohol-related medical call") or no victim (e.g. "an accidental fire call"). Or multiple suspects and multiple victims.
Unlike the financial disclosure example, the input data is freeform narrative text. The onus is entirely on us to define what a blotter report is, which ends up requiring defining what a crime incident is. Not surprisingly, the corresponding Pydantic code is a lot more verbose, and I bet that if you asked 1,000 journalists to write a definition, they'd all be different.
Here's what mine looks like:
# Define the data structures in Pydantic:
# an Incident involves several Persons (victims, perpetrators)
from pydantic import BaseModel, Field


class Person(BaseModel):
    description: str
    gender: str
    is_student: bool


# Pydantic docs on field descriptions:
# https://docs.pydantic.dev/latest/concepts/fields/
class Incident(BaseModel):
    date: str
    time: str
    location: str
    summary: str = Field(description="""Brief summary, less than 30 chars""")
    category: str = Field(
        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
    )
    property_damage: str = Field(
        description="""If a property crime, then a description of what was stolen/damaged/lost"""
    )
    arrest_made: bool
    perpetrators: list[Person]
    victims: list[Person]
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )


class Blotter(BaseModel):
    incidents: list[Incident]
I ask the model to provide an incident_text field, i.e. the verbatim text from which it extracted the incident data point. This is helpful for evaluating the experiment. But for an actual data project, you might want to omit it, as it adds to the number of output tokens and API cost:
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )
The resulting incident_text field extracted from the above snippet is basically correct:
A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
However, it leaves off the 11:40 p.m., which is at the beginning of the printed incident, and is something that I normally would like to include because I want to know everything the model looked at when extracting the data point. The 11:40 p.m. time is correctly included in the rest of the data output:
{
  "date": "April 2",
  "time": "11:40 p.m.",
  "location": "Rains apartments",
  "summary": "Bike vandalized",
  "category": "property",
  "property_damage": "Wheel of bike",
  "arrest_made": false,
  "perpetrators": [
    {
      "description": "Two unknown suspects",
      "gender": "unknown",
      "is_student": false
    }
  ],
  "victims": [
    {
      "description": "A graduate student in the School of Education",
      "gender": "unknown",
      "is_student": true
    }
  ],
  "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
}
As with the financial disclosure report, my script provides a screenshot and leaves it up to OpenAI's model to figure out what's going on. I was pleasantly surprised at how well gpt-4o-mini did in gleaning structure from a newspaper print listicle, with instructions as basic as: "Extract the text from this image"
For example, at first glance it seems that every incident in the blotter has a date (in the subhed) and a time (at the beginning of the graf). But under "Thursday, April 1", you can see that pattern already broken:
Is that second graf ("A female administrator in Materials Science...") a continuation of the 9:30 p.m. incident where a "man reported that someone removed his rear license plate"?
Most human readers, after reading both paragraphs — and then the rest of the blotter — will realize that these are 2 separate incidents. But there's nothing at all in the structure of the text to indicate that. Before I ran this experiment, I thought I would have to provide detailed parsing instructions to the model, e.g.
What you are reading is a police blotter, a list of reported incidents that police were called to. Every paragraph should be treated as a separate incident. Most incidents, but not all, begin with a timestamp, e.g. "11:20 p.m".
But the model saw on its own that there are 2 incidents, and that the second one happened on April 1 at an unspecified time.
{
  "date": "April 1",
  "time": "9:30 p.m.",
  "location": "Toyon parking lot",
  "summary": "License plate stolen",
  "category": "property",
  "property_damage": "rear license plate",
  "arrest_made": false,
  "perpetrators": [],
  "victims": [
    {
      "description": "Man",
      "gender": "unknown",
      "is_student": false
    }
  ],
  "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
},
{
  "date": "April 1",
  "time": "unknown",
  "location": "unknown",
  "summary": "Unauthorized purchase reported",
  "category": "other",
  "property_damage": "computer equipment",
  "arrest_made": false,
  "perpetrators": [],
  "victims": [
    {
      "description": "Female administrator",
      "gender": "female",
      "is_student": false
    }
  ],
  "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
},
By my count, there are 19 incidents in this issue of the Stanford Daily's police blotter, and the API correctly returns 19 different incidents.
Again, the data model is inherently messy, and I put in minimal effort to describe what an "incident" is, such as the variety of situations and edge cases. That, plus the inherent limitations of the data, are the root cause of most of the model's problems.
For example, I intended the perpetrators and victims to be lists of proper nouns or simple nouns, so that we could ask questions like: "how many incidents involved multiple people?" Given the following incident text:
A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
— this is how the model parsed the suspects:
"perpetrators": [
{
"description": "Two unknown suspects",
"gender": "unknown",
"is_student": false
}
]
For a data project, I might have preferred a result that would easily yield a count of 2, e.g.:
"perpetrators": [
{
"description": "Unknown suspect",
"gender": "unknown",
"is_student": false
}
{
"description": "Unknown suspect",
"gender": "unknown",
"is_student": false
}
]
But how should the model know what I'm trying to do sans specific instructions? I think most humans, given the same minimalist instructions, would have also recorded "Two unknown suspects".
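For what it's worth, once you have the output JSON, counting people per incident is a few lines of Python. This sketch assumes the top-level structure implied by the Blotter model (an "incidents" list in output-police-blotter.json):

import json

with open("output-police-blotter.json") as f:
    blotter = json.load(f)

for incident in blotter["incidents"]:
    n_people = len(incident["perpetrators"]) + len(incident["victims"])
    print(n_people, "-", incident["summary"])

With "Two unknown suspects" stored as a single Person, the bike-vandalism incident counts as 2 people instead of 3, which is exactly the ambiguity described above.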
However, the model greatly struggled with filling out the perpetrators and victims lists, frequently mistaking the suspect/perpetrator for the victim when there was no specific victim mentioned:
A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license.
"victims": [
{
"description": "A male undergraduate",
"gender": "male",
"is_student": true
}
]
It goes without saying that the model also missed the mark when the narrative was more complicated. For example, in the case of the unauthorized purchases at Fry's:
A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months.
The "female administrator" is not the victim, but the person who reported the crime. The victim would be Stanford University, or more specifically, its MScE department.
I'm not surprised the model had problems with identifying victims and suspects, though I'm unsure how much extra instruction would be needed to get reliable results from a general model.
One thing that the model frequently and inexplicably erred on was classifying people's gender.
This is how I defined a Person using pydantic:
class Person(BaseModel):
    description: str
    gender: str
    is_student: bool
Even when the subject's noun has an obvious gender, the model would inexplicably flub it:
A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot.
"victims": [
{
"description": "A man",
"gender": "unknown",
"is_student": false
}
]
It was worse when the subject's noun did not indicate gender, but the rest of the sentence did:
A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.
"victims": [
{
"description": "A graduate student in the School of Education",
"gender": "unknown",
"is_student": true
}
],
I'm not sure what the issue is. It might be remedied by providing explicit and thorough instructions and examples, but this seemed like a much easier thing to infer than the other things that OpenAI's model was able to infer on its own.
With so many things left to the interpretation of the LLM, it was no surprise that I got different results every time I ran the extract-police-blotter.py script, especially when it comes to the categorization of crimes.
In the data specification, I did attempt to describe for the model what I wanted for category:
    category: str = Field(
        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
    )
Given the option of saying "other", the model seemed eager to use it for any slightly vague situation. It classified the unauthorized purchases at Fry's as "other", even though embezzlement would better fit under property crimes by the FBI's UCR definition. Maybe this could be fixed by providing the model with detailed examples and definitions of statutes and criminal code?
But ultimately, as I said from the start, the model's performance is bounded by the limitations and errors in the source data. For example, an incident where someone gets hit on the head with a bottle seems to me obviously "violent", i.e. assault:
A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested.
However, the model thinks it is "other":
{
  "date": "April 4",
  "time": "3:05 a.m.",
  "location": "Sigma Alpha Epsilon",
  "summary": "Altercation reported",
  "category": "other",
  "property_damage": "None",
  "arrest_made": false,
  "perpetrators": [
    {
      "description": "Two undergraduate suspects",
      "gender": "unknown",
      "is_student": true
    }
  ],
  "victims": [
    {
      "description": "A male undergraduate",
      "gender": "male",
      "is_student": true
    }
  ],
  "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
}
But is the model necessarily wrong? Two "suspects" were apparently identified, but no one was actually arrested. I took this to mean that the suspects fled and hadn't been located at the time of the report. But maybe it's something more benign: an "altercation" happened, but when the cops arrived, everyone was cool including the guy who got hit by the bottle, thus no allegation of assault for police to act on or file as part of their UCR statistics. Ultimately we have to guess the author's intent.
The OpenAI model's performance here wouldn't work for a real data project — but again, this was just a toy experiment, and doesn't represent what you'd get if you spent more than 10 minutes thinking about the data model, never mind picked a data source slightly more structured than a newspaper listicle. I think OpenAI's model would work very well for something with more substantive text and formal structure, such as obituaries.
I'd like to offer a couple of suggestions that could enhance the effectiveness and reliability of your approach:
- Omitting the system message for Structured Outputs: When using Structured Outputs with OpenAI's API, you can omit the system message that specifies "JSON" output. That requirement was primarily relevant to the older response_format={"type": "json_object"} mode. With Structured Outputs, the API inherently understands and adheres to the defined schema without needing explicit instructions to format the response as JSON.
- Using Enum or typing.Literal for constrained parameters: To ensure that parameters with limited, predefined options (like the category field in your police blotter example) strictly adhere to those options, define them using Python's Enum or typing.Literal. The LLM's constrained generation mechanism masks all tokens except those allowed by the Enum or Literal, so only the specified values can be generated. This not only enforces the constraints within your data model, it also serializes those fields as enums in the JSON schema, guaranteeing that the model only produces the specified values and eliminating the risk of unexpected or invalid entries.
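Here's a minimal sketch of both approaches, applied to the category field from the police blotter example (class and member names are illustrative, not taken from the original scripts):

from enum import Enum
from typing import Literal

from pydantic import BaseModel, Field


# Implementing with Enum: the allowed categories become enum members,
# and pydantic serializes the field as an enum in the JSON schema.
class Category(str, Enum):
    violent = "violent"
    property_crime = "property"
    traffic = "traffic"
    call_for_service = "call for service"
    other = "other"


class IncidentWithEnum(BaseModel):
    category: Category = Field(description="Type of report, broadly speaking")


# Implementing with typing.Literal: the same constraint, without a separate class.
class IncidentWithLiteral(BaseModel):
    category: Literal[
        "violent", "property", "traffic", "call for service", "other"
    ] = Field(description="Type of report, broadly speaking")

With either definition, the structured output mode can only ever emit one of those five strings for category, so misspellings or novel categories are impossible by construction.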