Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: List/Describe here

PDF Samples

How would you categorize the PDFs?

Sample documents

PDF URL: http://www.domain.org/docs/docurl.pdf
Document Title: Report of Economic Data 2012

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other:

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

  • Automator: used to extract raw text from the PDF
  • Python: used to extract the torture-related data from the full text

Notes

How

We used an OS X Automator workflow to extract the full text (the PDFs were several hundred pages long) and then used Python to extract the country names and the sections about torture.
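For reference, here is a minimal sketch of how the Python stage might look, assuming the extracted plain text puts each country's entry under a line containing only the country name. The country list, keyword pattern, and function names are illustrative, not the actual script.

```python
import re
import sys

# Illustrative subset; the real script would use a full list of country names.
COUNTRIES = ["Afghanistan", "Brazil", "Chad", "Denmark", "Egypt"]

# Rough pattern for torture-related paragraphs.
TORTURE_KEYWORDS = re.compile(r"\btortur\w*|ill-treatment\b", re.IGNORECASE)

def split_by_country(text):
    """Split the extracted report text into {country: section} chunks,
    assuming each country entry starts with the country name on its own line."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in COUNTRIES:
            current = stripped
            sections[current] = []
        elif current:
            sections[current].append(line)
    return {c: "\n".join(lines) for c, lines in sections.items()}

def torture_paragraphs(section):
    """Return only the paragraphs that mention torture-related terms."""
    paragraphs = re.split(r"\n\s*\n", section)
    return [p for p in paragraphs if TORTURE_KEYWORDS.search(p)]

if __name__ == "__main__":
    text = open(sys.argv[1], encoding="utf-8").read()
    for country, section in split_by_country(text).items():
        for para in torture_paragraphs(section):
            print(country, "::", para[:80])
```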

Improvements

I've only tried this with the machine-generated PDFs, not with the scanned-image (but OCR'd) PDFs; the scripts will likely need tweaking for those. For the really old reports, automated parsing isn't feasible, and a human will need to read over and interpret the data.

Extracting rich text (RTF?) instead of plain text would make it easier to find headings and subheadings, rather than relying on heuristics around line length and capitalization.
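The length-and-capitalization heuristic mentioned above might look roughly like this (a sketch only; the thresholds and rules in the actual script may differ):

```python
def looks_like_heading(line, max_words=6):
    """Heuristic heading detector for plain-text extraction: short lines
    that are entirely upper-case, or title-case with no sentence
    punctuation, are treated as headings/subheadings."""
    stripped = line.strip()
    if not stripped or stripped.endswith((".", ",", ";")):
        return False
    words = stripped.split()
    if len(words) > max_words:
        return False
    return stripped.isupper() or all(w[0].isupper() for w in words if w[0].isalpha())
```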

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

Less than 30 seconds per document, but there are many documents.

Code

  • Find non-scanned PDFs using Google
  • Select PDFs in the Finder
  • Run the Automator workflow
  • Run the Python script (it doesn't yet take a parameter for the input file, so I've been editing the script on each run; a sketch of a parameterized version is below)
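A sketch of how the script could take its input file on the command line instead of being edited for each run. The entry point and the CSV output are hypothetical; extract_torture_data stands in for the existing parsing logic, which isn't shown here.

```python
import csv
import sys

def extract_torture_data(text):
    """Stand-in for the existing parsing logic (country names plus
    torture-related sections); here it just grabs matching lines."""
    return [("Example", line.strip())
            for line in text.splitlines() if "tortur" in line.lower()]

def main():
    # Take the input text file as a command-line argument instead of
    # editing the script for each document.
    if len(sys.argv) < 2:
        sys.exit("usage: python extract.py extracted_text.txt [output.csv]")
    with open(sys.argv[1], encoding="utf-8") as f:
        text = f.read()
    out_path = sys.argv[2] if len(sys.argv) > 2 else "output.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["country", "excerpt"])
        writer.writerows(extract_torture_data(text))

if __name__ == "__main__":
    main()
```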