1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
How would you categorize the PDFs?
| PDF URL | Document Title |
|---|---|
| http://transition.fcc.gov/Daily_Releases/Daily_Business/2013/db1220/DA-13-2423A1.pdf | Federal Communications Commission DA 13-2423 |
What tool(s) are you using to extract the data?
| Tool | How we used it |
|---|---|
| tesseract | Evaluating OCR re-processing |
| tesseract_html_parser.py | New custom app used to regenerate XHTML output file with text and positional data |
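The `tesseract_html_parser.py` code itself isn't included in this pad. As a minimal sketch of the idea, assuming tesseract's standard hOCR output (each word is a `<span class="ocrx_word">` whose `title` attribute carries `bbox x0 y0 x1 y1` coordinates), text and positional data could be recovered like this:

```python
# Sketch only -- not the actual tesseract_html_parser.py. Pulls each
# recognized word and its bounding box out of hOCR (XHTML) output.
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1)) tuples
        self._bbox = None    # bbox of the word span currently open
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
            self._bbox = tuple(map(int, m.groups())) if m else None
    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

# Tiny inline hOCR fragment in place of a real tesseract output file:
sample = ('<span class="ocrx_word" title="bbox 10 20 90 40">Federal</span>'
          '<span class="ocrx_word" title="bbox 100 20 260 40">Communications</span>')
p = HocrWords()
p.feed(sample)
# p.words → [('Federal', (10, 20, 90, 40)), ('Communications', (100, 20, 260, 40))]
```

Keeping the bounding boxes alongside the text is what makes it possible to regenerate output that preserves the original layout, as described above.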
In the long run it would be useful to have an OCR tool that could be run locally and handle any PDF, including text and tables embedded in images. Preserving positional data, so the original layout can be properly reproduced, would be important. The FCC PDF was already well formatted, which makes it easier to verify the accuracy of a new tool by comparing its output against the original.
How did you extract the desired data that produced the best results? Three steps from a command line on a Linux laptop:
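The three commands themselves were not written down in this pad. As a hypothetical reconstruction, based on the tools listed above, the pipeline would plausibly rasterize the PDF, OCR the page images to hOCR, then run the custom parser; the exact binaries and flags (`pdftoppm`, tesseract's `hocr` config) are assumptions, not a record of what the team ran:

```python
# Hypothetical sketch of the three-step command line, built as argv lists
# so the pipeline is easy to inspect; the tool names and flags are guesses.
def ocr_pipeline(pdf_path, out_base="page"):
    return [
        # 1. Rasterize the PDF to a 300-dpi TIFF (poppler's pdftoppm).
        ["pdftoppm", "-r", "300", "-tiff", pdf_path, out_base],
        # 2. OCR the page image to hOCR XHTML (tesseract's "hocr" config).
        ["tesseract", out_base + "-1.tif", out_base, "hocr"],
        # 3. Regenerate XHTML with text and positional data via the custom app.
        ["python", "tesseract_html_parser.py", out_base + ".hocr"],
    ]

if __name__ == "__main__":
    for cmd in ocr_pipeline("DA-13-2423A1.pdf"):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True) would execute each step in order
```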
What would have to be changed/added to the tool or process to achieve success? The TODO list is long.
How fast is the data extracted?
The translated text looks close to the original, and the formatting seems about right. Subscripts and superscripts cannot be represented in plain text. Horizontal lines were not translated into dashes/underscores. The one test PDF contained few of the troublesome images, fonts, or tables that are known to trip up tesseract.
There are so many possible variations on the desired outputs that I was unsure what needed to be produced in the end.