Who

Who is working together?

Name           Email  Twitter  Organization
Joshua Snyder  .      .        Independent

Challenge

Which challenge are you working on?

  • Federal Communications Commission Daily Releases

PDF Samples

How would you categorize the PDFs?

Sample documents

Document Title: Federal Communications Commission DA 13-2423
PDF URL: http://transition.fcc.gov/Daily_Releases/Daily_Business/2013/db1220/DA-13-2423A1.pdf

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other:

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

Tool                      How we used it
tesseract                 Evaluating OCR re-processing
tesseract_html_parser.py  New custom script that regenerates an XHTML output file with text and positional data
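The parser script has not been published yet (see Improvements below), so what follows is only a minimal sketch of the general approach, under the assumption that tesseract emits standard hOCR: each recognized word sits in a span whose title attribute carries a "bbox x0 y0 x1 y1" bounding box, which can be re-emitted as an absolutely positioned div. All names here are illustrative, not taken from the actual script.

```python
# Hypothetical sketch of an hOCR-to-positioned-divs parser; the real
# tesseract_html_parser.py is unpublished, so this only illustrates the idea.
import re
import sys

# tesseract marks recognized words with class 'ocrx_word'; word spans do not
# nest further spans, so a non-greedy match to the next </span> is safe.
WORD_RE = re.compile(r"<span([^>]*ocrx_word[^>]*)>(.*?)</span>", re.S)
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
TAG_RE = re.compile(r"<[^>]+>")

def hocr_to_positioned_divs(hocr):
    divs = []
    for attrs, body in WORD_RE.findall(hocr):
        m = BBOX_RE.search(attrs)  # the bounding box lives in the title attribute
        if not m:
            continue
        x0, y0, x1, y1 = map(int, m.groups())
        text = TAG_RE.sub("", body).strip()  # drop inner markup like <strong>
        divs.append(
            f'<div style="position:absolute; left:{x0}px; top:{y0}px; '
            f'width:{x1 - x0}px; height:{y1 - y0}px;">{text}</div>')
    return ('<html><body style="position:relative;">\n'
            + "\n".join(divs)
            + "\n</body></html>")

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(hocr_to_positioned_divs(f.read()))
```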

Notes

In the long run it would be useful to have an OCR tool that could be run locally and handle any PDF, including text and tables embedded in images. Preserving positional data, so the original layout can be reproduced, would be important. The FCC PDF used here was already cleanly formatted, which makes it easier to compare the output against the original and verify the accuracy of a new tool.

How

How did you extract the desired data that produced the best results? Three steps from the command line on a Linux laptop (sketched in code after the list):

  1. Ran Ghostscript to create a TIFF file from the PDF
  2. Ran tesseract to create an XHTML file embedded with hOCR metadata
  3. Ran tesseract_html_parser.py to re-format the output with positional div tags
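As a concrete illustration, here is a minimal Python driver for those three steps, assuming gs and tesseract are on the PATH. File names are placeholders; note that newer tesseract versions write base.hocr while older ones write base.html.

```python
# Minimal sketch of the three-step pipeline (names are placeholders).
import subprocess
import sys

def pdf_to_hocr(pdf_path, base="page"):
    # Step 1: rasterize the PDF to a 300 dpi multi-page TIFF with Ghostscript.
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4", "-r300",
         f"-sOutputFile={base}.tif", pdf_path],
        check=True)
    # Step 2: OCR the TIFF; the 'hocr' config makes tesseract emit XHTML
    # with hOCR positional metadata instead of plain text.
    subprocess.run(["tesseract", f"{base}.tif", base, "hocr"], check=True)
    # Older tesseract versions write {base}.html rather than {base}.hocr.
    return f"{base}.hocr"

if __name__ == "__main__":
    hocr_file = pdf_to_hocr(sys.argv[1])
    # Step 3 would then be: python tesseract_html_parser.py <hocr_file>
    print(hocr_file)
```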

Improvements

What would have to be changed/added to the tool or process to achieve success? The TODO list is long:

  1. Thorough verification of the output still needs to be performed.
  2. The Python script itself has yet to be uploaded to GitHub and needs a link on this page.
  3. Rather than running separate steps, the process could be more tightly integrated by calling tesseract as a library (a sketch follows this list).
  4. This code could eventually be integrated into a more sophisticated tool with a multi-platform installer and UI.
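As a rough illustration of item 3, the OCR step could be invoked in-process instead of via the command line. This sketch uses the third-party pytesseract wrapper, which is an assumption on my part; binding tesseract's C API directly would be the tighter tie-in.

```python
# Hedged sketch: replace the command-line OCR step with a library call.
from PIL import Image
import pytesseract

def tiff_to_hocr(tiff_path):
    # image_to_pdf_or_hocr returns the hOCR document as bytes.
    return pytesseract.image_to_pdf_or_hocr(
        Image.open(tiff_path), extension="hocr")

if __name__ == "__main__":
    with open("page.hocr", "wb") as out:
        out.write(tiff_to_hocr("page.tif"))
```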

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

The translated text looks close to the original, and the formatting seems about right. Subscripts and superscripts cannot be represented in plain text. Horizontal lines were not translated into dashes or underscores. The one test PDF contained few of the troublesome images, fonts, or tables that are known to trip up tesseract.

There are so many possible variations on the desired outputs that I was unsure what needed to be produced in the end.

Code

Please list code, tips, and how-tos for your processing pipeline.

TODO
