Who

Who is working together?

| Name | Email | Twitter | Organization |
| --- | --- | --- | --- |
| Aaron Williamson | aaron@copiesofcopies.org | @copiesofcopies | N/A |

Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: List/Describe here

PDF Samples

How would you categorize the PDFs?

Sample documents

| PDF URL | Document Title |
| --- | --- |
| http://www.nyc.gov/html/dcp/pdf/env_review/rockefeller/noc_deis.pdf | Rockefeller University New River Building and Fitness Center Draft Environmental Impact Statement 11/1/2013 |

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text (only signatures)

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other: formal letter with headers & footers

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

| Tool | How we used it |
| --- | --- |
| Tabula | Tabula seems to work for extracting most of the tables, if they're in the correct orientation. |
| Tesseract | Tesseract makes quick work of the letters themselves. |
| ruby-tesseract-ocr | This library promises access to the orientation of individual elements of a scanned document, so it could be useful for identifying rotated tables, but so far I haven't had any luck with it. This is where I'm hung up. |
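
For reference, here's a minimal sketch of the Tesseract side of this, assuming poppler's `pdftoppm` and the `tesseract` CLI are installed; the file names are placeholders, not the actual pipeline:

```ruby
# Rasterize each PDF page, then OCR it with Tesseract.
# Assumes poppler-utils (pdftoppm) and tesseract are on the PATH.
require "fileutils"

pdf = "noc_deis.pdf"                     # placeholder input file
out_dir = "pages"
FileUtils.mkdir_p(out_dir)

# 300 DPI PNGs named pages/page-1.png, pages/page-2.png, ...
system("pdftoppm", "-png", "-r", "300", pdf, File.join(out_dir, "page"))

Dir[File.join(out_dir, "page-*.png")].sort.each do |png|
  base = png.sub(/\.png\z/, "")
  system("tesseract", png, base)         # writes OCR text to <base>.txt
end
```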

Notes

How

How did you extract the desired data that produced the best results?

Tesseract is great for the text, Tabula for the tables, but I'm still working on putting them together to render the whole document.
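
For the Tabula half, a batch sketch along these lines should work, assuming the tabula-java command-line jar is available (I've been using the Tabula app itself, so the jar name and flags here are assumptions about the CLI rather than what I actually ran):

```ruby
# Batch-extract tables to CSV with a tabula-java command-line jar.
# Jar path and flags are assumptions; adjust to whatever build you have.
pdf = "noc_deis.pdf"
jar = "tabula.jar"                       # e.g. a jar-with-dependencies build

system("java", "-jar", jar,
       "--pages", "all",
       "--format", "CSV",
       "--outfile", "tables.csv",
       pdf)
```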

Improvements

What would have to be changed/added to the tool or process to achieve success?

I need a way to identify rotated tables. I've looked at a couple of techniques:

  • Tesseract run with psm -1 will output an estimated rotation value (I think 0 = none, 1 = rotated 90 deg. clockwise, 2 = 180 deg., 3 = 270 deg.) to STDOUT. That output could be grepped for the value, but I'd far prefer to access it through a library call, which is why I've been trying to get ruby-tesseract-ocr to work. (A rough sketch of the grep-style approach follows this list.)
  • Someone suggested a hack for determining orientation from the image itself: apply a Gaussian blur, shrink the text, and convert to GIF; then rotate the image 90 deg. and convert again. Since GIF compresses better horizontally, the version in the correct orientation should be smaller. There are several problems with this technique, not least that it (presumably) can't distinguish tables in the correct orientation from ones flipped 180 deg.
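
Here's that sketch, assuming a recent Tesseract where `--psm 0` runs orientation/script detection (OSD) and prints lines like `Orientation in degrees: 90` and `Rotate: 270` to standard output (older 3.x builds spell the flag `-psm` and may write the same information to an `.osd` file instead):

```ruby
require "open3"

# Run Tesseract in OSD-only mode and parse the suggested rotation.
# Returns degrees (0, 90, 180, 270) or nil if detection fails.
def detected_rotation(image_path)
  out, err, status = Open3.capture3("tesseract", image_path, "stdout", "--psm", "0")
  return nil unless status.success?
  osd = out + err                        # some builds print OSD info to stderr
  osd =~ /Rotate:\s*(\d+)/ ? Regexp.last_match(1).to_i : nil
end

rotation = detected_rotation("pages/page-3.png")   # placeholder path
if rotation && rotation != 0
  # Re-rotate the page image before handing it to Tabula/Tesseract,
  # e.g. with ImageMagick. Double-check the rotation direction for your build.
  system("convert", "pages/page-3.png", "-rotate", rotation.to_s,
         "pages/page-3-upright.png")
end
```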

Once I've solved the reoriented table problem, I still need to figure out the best way to insert the tables into the rendered text.
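
One direction I'm considering for that assembly step, as a sketch only (the per-page file-naming convention here is made up): render each Tabula CSV as a pipe-delimited table and append it to that page's OCR'd text.

```ruby
require "csv"

# Rough assembly pass: for each OCR'd page, append any tables Tabula
# extracted for that page as pipe-delimited (Markdown-ish) rows.
# The page-N.txt / page-N-table-M.csv naming is a placeholder convention.
def render_page(n)
  text = File.read("pages/page-#{n}.txt")
  tables = Dir["tables/page-#{n}-table-*.csv"].sort.map do |csv|
    CSV.read(csv).map { |row| "| " + row.join(" | ") + " |" }.join("\n")
  end
  ([text] + tables).join("\n\n")
end

page_count = Dir["pages/page-*.txt"].length
document = (1..page_count).map { |n| render_page(n) }.join("\n\n")
File.write("rendered.md", document)
```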

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

This is a guess -- I haven't rendered any complete documents yet, because I haven't started on the (relatively) simpler problem of rendering the full document together with the tables that are already correctly oriented. I don't think that part will be horribly difficult once the table-orientation problem is solved.

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

Code

Please list code, tips and howto's of your processing pipeline.

No useful code yet; tips are above. I'll keep hacking on this and report back.
