Who

Who is working together?

| Name | Email | Twitter | Organization |
| --- | --- | --- | --- |
| Aaron Williamson | aaron@copiesofcopies.org | @copiesofcopies | N/A |

Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: List/Describe here

PDF Samples

How would you categorize the PDFs?

Sample documents

| PDF URL | Document Title |
| --- | --- |
| http://www.nyc.gov/html/dcp/pdf/env_review/rockefeller/noc_deis.pdf | Rockefeller University New River Building and Fitness Center Draft Environmental Impact Statement 11/1/2013 |

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text (only signatures)

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other: formal letter with headers & footers

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

| Tool | How we used it |
| --- | --- |
| Tabula | Tabula seems to work for extracting most of the tables, if they're in the correct orientation. |
| Tesseract | Tesseract makes quick work of the letters themselves. |
| ruby-tesseract-ocr | This library promises access to the orientation of individual elements of a scanned document, so it could be useful for identifying rotated tables, but so far I haven't had any luck with it. This is where I'm hung up. |
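
For reference, here's a minimal sketch of the Tesseract side of this, assuming poppler's `pdftoppm` and the `tesseract` CLI are installed; the file names are placeholders, not the actual pipeline:

```ruby
# Rasterize each PDF page, then OCR it with Tesseract.
# Assumes poppler-utils (pdftoppm) and tesseract are on the PATH.
require "fileutils"

pdf = "noc_deis.pdf"                     # placeholder input file
out_dir = "pages"
FileUtils.mkdir_p(out_dir)

# 300 DPI PNGs named pages/page-1.png, pages/page-2.png, ...
system("pdftoppm", "-png", "-r", "300", pdf, File.join(out_dir, "page"))

Dir[File.join(out_dir, "page-*.png")].sort.each do |png|
  base = png.sub(/\.png\z/, "")
  system("tesseract", png, base)         # writes OCR text to <base>.txt
end
```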

Notes

How

How did you extract the desired data that produced the best results?

Tesseract is great for the text, Tabula for the tables, but I'm still working on putting them together to render the whole document.
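
For the Tabula half, a batch sketch along these lines should work, assuming the tabula-java command-line jar is available (I've been using the Tabula app itself, so the jar name and flags here are assumptions about the CLI rather than what I actually ran):

```ruby
# Batch-extract tables to CSV with a tabula-java command-line jar.
# Jar path and flags are assumptions; adjust to whatever build you have.
pdf = "noc_deis.pdf"
jar = "tabula.jar"                       # e.g. a jar-with-dependencies build

system("java", "-jar", jar,
       "--pages", "all",
       "--format", "CSV",
       "--outfile", "tables.csv",
       pdf)
```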

Improvements

What would have to be changed/added to the tool or process to achieve success?

I need a way to identify rotated tables. I've looked at a couple of techniques:

  • Tesseract run with psm -1 will output an estimated rotation value (I think 0 = none, 1 = rotated 90 deg. clockwise, 2 = 180 deg., 3 = 270 deg.) to STDOUT. That output could be grepped for the value, but I'd far prefer to access it through a library call, which is why I've been trying to get ruby-tesseract-ocr to work. (A rough sketch of the grep-style approach follows this list.)
  • Someone suggested a hack for determining orientation from the image itself: apply a Gaussian blur, shrink the text, and convert to GIF; then rotate the image 90 deg. and convert again. Since GIF compresses better horizontally, the version in the correct orientation should be smaller. There are several problems with this technique, not least that it (presumably) can't distinguish tables in the correct orientation from ones flipped 180 deg.
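
Here's that sketch, assuming a recent Tesseract where `--psm 0` runs orientation/script detection (OSD) and prints lines like `Orientation in degrees: 90` and `Rotate: 270` to standard output (older 3.x builds spell the flag `-psm` and may write the same information to an `.osd` file instead):

```ruby
require "open3"

# Run Tesseract in OSD-only mode and parse the suggested rotation.
# Returns degrees (0, 90, 180, 270) or nil if detection fails.
def detected_rotation(image_path)
  out, err, status = Open3.capture3("tesseract", image_path, "stdout", "--psm", "0")
  return nil unless status.success?
  osd = out + err                        # some builds print OSD info to stderr
  osd =~ /Rotate:\s*(\d+)/ ? Regexp.last_match(1).to_i : nil
end

rotation = detected_rotation("pages/page-3.png")   # placeholder path
if rotation && rotation != 0
  # Re-rotate the page image before handing it to Tabula/Tesseract,
  # e.g. with ImageMagick. Double-check the rotation direction for your build.
  system("convert", "pages/page-3.png", "-rotate", rotation.to_s,
         "pages/page-3-upright.png")
end
```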

Once I've solved the reoriented table problem, I still need to figure out the best way to insert the tables into the rendered text.
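
One direction I'm considering for that assembly step, as a sketch only (the per-page file-naming convention here is made up): render each Tabula CSV as a pipe-delimited table and append it to that page's OCR'd text.

```ruby
require "csv"

# Rough assembly pass: for each OCR'd page, append any tables Tabula
# extracted for that page as pipe-delimited (Markdown-ish) rows.
# The page-N.txt / page-N-table-M.csv naming is a placeholder convention.
def render_page(n)
  text = File.read("pages/page-#{n}.txt")
  tables = Dir["tables/page-#{n}-table-*.csv"].sort.map do |csv|
    CSV.read(csv).map { |row| "| " + row.join(" | ") + " |" }.join("\n")
  end
  ([text] + tables).join("\n\n")
end

page_count = Dir["pages/page-*.txt"].length
document = (1..page_count).map { |n| render_page(n) }.join("\n\n")
File.write("rendered.md", document)
```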

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

This is a guess -- I haven't rendered any complete documents yet, because I haven't started on the (relatively) simpler problem of rendering the full document together with the tables that are already correctly oriented. I don't think that part will be horribly difficult once the table-orientation problem is solved.

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

Code

Please list code, tips and howto's of your processing pipeline.

No useful code yet; tips are above. I'll keep hacking on this and report back.
