Who

Who is working together?

Name           Email  Twitter  Organization
Joshua Snyder  .      .        Independent

Challenge

Which challenge are you working on?

  • Federal Communications Commission Daily Releases

PDF Samples

How would you categorize the PDFs?

Sample documents

Document Title: Federal Communications Commission DA 13-2423
PDF URL: http://transition.fcc.gov/Daily_Releases/Daily_Business/2013/db1220/DA-13-2423A1.pdf

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other:

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

Tool                      How we used it
tesseract                 Evaluating OCR re-processing
tesseract_html_parser.py  New custom script that regenerates an XHTML output file with text and positional data
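The parser script has not been published yet (see Improvements below), so what follows is only a minimal sketch of the general approach, under the assumption that tesseract emits standard hOCR: each recognized word sits in a span whose title attribute carries a "bbox x0 y0 x1 y1" bounding box, which can be re-emitted as an absolutely positioned div. All names here are illustrative, not taken from the actual script.

```python
# Hypothetical sketch of an hOCR-to-positioned-divs parser; the real
# tesseract_html_parser.py is unpublished, so this only illustrates the idea.
import re
import sys

# tesseract marks recognized words with class 'ocrx_word'; word spans do not
# nest further spans, so a non-greedy match to the next </span> is safe.
WORD_RE = re.compile(r"<span([^>]*ocrx_word[^>]*)>(.*?)</span>", re.S)
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
TAG_RE = re.compile(r"<[^>]+>")

def hocr_to_positioned_divs(hocr):
    divs = []
    for attrs, body in WORD_RE.findall(hocr):
        m = BBOX_RE.search(attrs)  # the bounding box lives in the title attribute
        if not m:
            continue
        x0, y0, x1, y1 = map(int, m.groups())
        text = TAG_RE.sub("", body).strip()  # drop inner markup like <strong>
        divs.append(
            f'<div style="position:absolute; left:{x0}px; top:{y0}px; '
            f'width:{x1 - x0}px; height:{y1 - y0}px;">{text}</div>')
    return ('<html><body style="position:relative;">\n'
            + "\n".join(divs)
            + "\n</body></html>")

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(hocr_to_positioned_divs(f.read()))
```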

Notes

In the long run it would be useful to have an OCR tool that could be run locally and handle any PDF, including text and tables embedded in images. Preserving positional data, so the original layout can be reproduced, would be important. The FCC PDF used here was already cleanly formatted, which makes it easier to compare the output against the original and verify the accuracy of a new tool.

How

How did you extract the desired data that produced the best results? Three steps from the command line on a Linux laptop (sketched in code after the list):

  1. Ran Ghostscript to create a TIFF file from the PDF
  2. Ran tesseract to create an XHTML file embedded with hOCR metadata
  3. Ran tesseract_html_parser.py to re-format the output with positional div tags
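As a concrete illustration, here is a minimal Python driver for those three steps, assuming gs and tesseract are on the PATH. File names are placeholders; note that newer tesseract versions write base.hocr while older ones write base.html.

```python
# Minimal sketch of the three-step pipeline (names are placeholders).
import subprocess
import sys

def pdf_to_hocr(pdf_path, base="page"):
    # Step 1: rasterize the PDF to a 300 dpi multi-page TIFF with Ghostscript.
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4", "-r300",
         f"-sOutputFile={base}.tif", pdf_path],
        check=True)
    # Step 2: OCR the TIFF; the 'hocr' config makes tesseract emit XHTML
    # with hOCR positional metadata instead of plain text.
    subprocess.run(["tesseract", f"{base}.tif", base, "hocr"], check=True)
    # Older tesseract versions write {base}.html rather than {base}.hocr.
    return f"{base}.hocr"

if __name__ == "__main__":
    hocr_file = pdf_to_hocr(sys.argv[1])
    # Step 3 would then be: python tesseract_html_parser.py <hocr_file>
    print(hocr_file)
```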

Improvements

What would have to be changed/added to the tool or process to achieve success? The TODO list is long:

  1. Thorough verification of the output still needs to be performed.
  2. The Python script itself has yet to be uploaded to GitHub and needs a link on this page.
  3. Rather than running separate steps, the process could be more tightly integrated by calling tesseract as a library (a sketch follows this list).
  4. This code could eventually be integrated into a more sophisticated tool with a multi-platform installer and UI.
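As a rough illustration of item 3, the OCR step could be invoked in-process instead of via the command line. This sketch uses the third-party pytesseract wrapper, which is an assumption on my part; binding tesseract's C API directly would be the tighter tie-in.

```python
# Hedged sketch: replace the command-line OCR step with a library call.
from PIL import Image
import pytesseract

def tiff_to_hocr(tiff_path):
    # image_to_pdf_or_hocr returns the hOCR document as bytes.
    return pytesseract.image_to_pdf_or_hocr(
        Image.open(tiff_path), extension="hocr")

if __name__ == "__main__":
    with open("page.hocr", "wb") as out:
        out.write(tiff_to_hocr("page.tif"))
```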

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

The translated text looks close to the original, and the formatting seems about right. Subscripts and superscripts cannot be represented in plain text. Horizontal lines were not translated into dashes or underscores. The one test PDF contained few of the troublesome images, fonts, or tables that are known to trip up tesseract.

There are so many possible variations on the desired outputs that I was unsure what needed to be produced in the end.

Code

Please list code, tips, and how-tos for your processing pipeline.

TODO
