mroswell/1_pdfliberation_hackathon_activity.md

## 1_pdfliberation_hackathon_activity.md

      
Start Here

1. Log into GitHub

2. Fork this Gist

3. Edit your version to share your team's activity

PDF Liberation Hackpad

IRC: https://webchat.freenode.net/ Channel: #sunlightlabs

GitHub Markdown-Cheatsheet

  
## 2_who.md

      
Who

Who is working together?


| Name | Email | Twitter | Organization |
| --- | --- | --- | --- |
| Marjorie Roswell | mroswell@gmail.com | @mroswell | Organization Name |


## 3_challenge.md

      
Challenge

Which challenge are you working on?

- Amnesty International Annual Reports – Torture Incident Database
- Comprehensive Annual Financial Reports
- Federal Communications Commission Daily Releases
- House of Representatives Financial Disclosures (OpenSecrets.org)
- IRS Form 990 – Not-for-Profit Organization Reports
- New York City Council and Community Board Documents
- New York City Economic Development Commission Monthly Snapshot
- New York City Environmental Impact Statements
- US Foreign Aid Reports (USAID)
- Other: List/Describe here


## 4_pdfs.md

      
PDF Samples

How would you categorize the PDFs?

Sample documents


| PDF URL | Document Title |
| --- | --- |
| https://www.dropbox.com/sh/cwb299emamxnqcc/982J2HOqyr/Electronic/N00029139_2012.pdf | House of Representatives Financial Disclosure - image |


Content category


- Disclosure (filing, forms, report, ...)
- Legislative doc (laws, analysis, ...)
- Financial (statements, reports)
- Government statistical data
- Non-Government statistical data
- Press (press releases, statements, ...)
- Government reports
- Non-Government reports
- Directory
- Other:

Number of pages


- 1 page
- 2 to 9 pages
- 10+ pages
- 100+ pages

Other observations


- Collection includes PDFs made from scanned documents
- PDFs include handwritten text

PDF Generation


- Human authored
- Machine generated
- God only knows


## 5_data.md

      
Type of data embedded in PDF


- Simple table of data
- Complex table of data
- Multiple tables of data from document
- Table data greater than one page in length
- Highly-structured form data
- Loosely-structured form data
- Has human-written text
- Structured text of a report (e.g., headings, subheadings, ...)
- Other: scanned


Desired output of data


- CSV
- JSON
- Text version (e.g., Markdown)


## 6_tools.md

      
Tool

What tool(s) are you using to extract the data?


| Tool | How we used it |
| --- | --- |
| ABBYY | Used the `curl_recognize.sh` shell script |


Notes

Ross found that the Python script returned 404 errors when he ran it on a text PDF, so we used the shell script instead. The script first had to be updated with the application ID and password, which ABBYY provides by email when you register an application. The script ran quickly and recovered about 60% of the document. The rest would need to be done by hand.

  
## 7_how.md

      
How

How did you extract the desired data that produced the best results?

Improvements

What would have to be changed/added to the tool or process to achieve success?

  
## 8_results.md

      
Results quality


- 99%+
- 90%+
- 80%+
- 50% to 75%
- less than 50%
- utter crap

Speed

How fast is the data extracted?

- < 10 seconds
- < 30 seconds
- < 1 minute
- < 5 minutes
- < 10 minutes
- < 20 minutes
- Other:

Notes

Convenient, and MUCH better than having to retype everything.

  
## 9_code.md

      
Code

Please list code, tips, and how-tos for your processing pipeline.

Produce RTF (to retain formatting):

    ./curl_recognize.sh N00029139_2012.pdf N00029139_2012.rtf -rtf

Produce Markdown (`-stdout` makes `textutil` send the HTML down the pipe instead of writing `N00029139_2012.html` to disk):

    textutil -convert html -stdout N00029139_2012.rtf | pandoc -f html -t markdown -o result.md

Results are available at:
https://github.com/mroswell/pdf-liberation-examples
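
To run the same two steps over several PDFs, the commands can be wrapped in a small shell function. This is only a sketch under a few assumptions: the function name `liberate_pdf`, the `DRY_RUN` switch, and the one-Markdown-file-per-PDF naming are illustrative and not part of the original pipeline. It also passes `-stdout` to `textutil` so that the HTML goes to the pipe rather than to a file. `curl_recognize.sh`, `textutil` (macOS), and `pandoc` must be available for a real run.

```shell
#!/bin/sh
# Sketch: run (or, with DRY_RUN=1, just print) the OCR-to-Markdown steps per PDF.
# liberate_pdf and DRY_RUN are hypothetical names introduced for illustration.

liberate_pdf() {
    pdf=$1
    base=${pdf%.pdf}      # strip the .pdf extension
    rtf="$base.rtf"       # intermediate RTF keeps the formatting
    md="$base.md"         # assumed naming: one Markdown file per PDF

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be executed.
        echo "./curl_recognize.sh $pdf $rtf -rtf"
        echo "textutil -convert html -stdout $rtf | pandoc -f html -t markdown -o $md"
        return 0
    fi

    # Step 1: OCR the PDF to RTF with the ABBYY shell script.
    ./curl_recognize.sh "$pdf" "$rtf" -rtf || return 1
    # Step 2: RTF to HTML on stdout, then HTML to Markdown.
    textutil -convert html -stdout "$rtf" | pandoc -f html -t markdown -o "$md"
}

# Process every PDF named on the command line.
for pdf in "$@"; do
    liberate_pdf "$pdf" || echo "failed: $pdf" >&2
done
```

Saved under a name of your choice (say `liberate.sh`, a hypothetical filename), `DRY_RUN=1 sh liberate.sh *.pdf` previews the commands before any OCR requests are sent.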