Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: List/Describe here

PDF Samples

How would you categorize the PDFs?

Sample documents

PDF URL | Document Title
https://www.dropbox.com/sh/cwb299emamxnqcc/982J2HOqyr/Electronic/N00029139_2012.pdf | House of Representatives Financial Disclosure - image

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other: scanned

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

Tool | How we used it
ABBYY | used the curl_recognize.sh shell script

Notes

Ross found that the Python script returned 404 errors when he ran it on a text PDF, so we used the shell script instead. The script had to be updated with the application ID and password, which ABBYY provides by email after you register an application. The script ran quickly and recovered about 60% of the document. The rest would need to be done by hand.

How

How did you extract the desired data that produced the best results?

Improvements

What would have to be changed/added to the tool or process to achieve success?

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

Convenient, and MUCH better than having to retype everything.

Code

Please list code, tips and how-tos of your processing pipeline.

Produce RTF (to retain formatting):

    ./curl_recognize.sh N00029139_2012.pdf N00029139_2012.rtf -rtf

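Produce markdown (textutil is macOS-only; it writes to a file by default, so -stdout is needed for the pipe to pandoc to receive anything):

    textutil -convert html -stdout N00029139_2012.rtf | pandoc -f html -t markdown -o result.md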
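For more than one PDF, the two steps can be wrapped in a small helper. This is only a sketch, not something we ran during the challenge. It assumes curl_recognize.sh sits in the current directory with credentials already filled in, and that textutil (macOS) and pandoc are installed. The function names rtf_name, md_name and convert_one and the DRY_RUN variable are made up for illustration. The -stdout option is included so that textutil writes the HTML to the pipe instead of to a file, and the markdown file is named after each PDF rather than result.md.

```shell
#!/bin/sh
# Sketch: batch wrapper around the two commands above.
# Hypothetical helper names; DRY_RUN defaults to 1, which only prints
# the commands so they can be checked before anything is uploaded.

# Derive the output names from the PDF name (foo.pdf -> foo.rtf / foo.md).
rtf_name() { printf '%s\n' "${1%.pdf}.rtf"; }
md_name()  { printf '%s\n' "${1%.pdf}.md"; }

convert_one() {
    pdf=$1
    rtf=$(rtf_name "$pdf")
    md=$(md_name "$pdf")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show what would be executed.
        echo "./curl_recognize.sh $pdf $rtf -rtf"
        echo "textutil -convert html -stdout $rtf | pandoc -f html -t markdown -o $md"
    else
        # Step 1: OCR the PDF to RTF with the ABBYY sample script.
        # Step 2: RTF -> HTML on stdout -> markdown via pandoc.
        ./curl_recognize.sh "$pdf" "$rtf" -rtf &&
            textutil -convert html -stdout "$rtf" |
            pandoc -f html -t markdown -o "$md"
    fi
}
```

After sourcing the file, convert_one N00029139_2012.pdf prints the two commands, and DRY_RUN=0 convert_one N00029139_2012.pdf runs them. A loop such as: for f in *.pdf; do convert_one "$f"; done would cover a whole folder.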

Results are available at: https://github.com/mroswell/pdf-liberation-examples
