gregelin/1_pdfliberation_hackathon_activity.md

## 1_pdfliberation_hackathon_activity.md

      
Start Here

1. Log into GitHub

2. Fork this Gist

3. Edit your version to share your team's activity

PDF Liberation Hackpad

IRC: https://webchat.freenode.net/ Channel: #sunlightlabs

GitHub Markdown-Cheatsheet

  
## 2_who.md

      
Who

Who is working together?


| Name | Email | Twitter | Organization |
| --- | --- | --- | --- |
| First Last | myemail@somedomain.com | @mytwitterid | Organization Name |


## 3_challenge.md

      
Challenge

Which challenge are you working on?

- [ ] Amnesty International Annual Reports – Torture Incident Database
- [ ] Comprehensive Annual Financial Reports
- [ ] Federal Communications Commission Daily Releases
- [ ] House of Representatives Financial Disclosures (OpenSecrets.org)
- [ ] IRS Form 990 – Not-for-Profit Organization Reports
- [ ] New York City Council and Community Board Documents
- [ ] New York City Economic Development Commission Monthly Snapshot
- [ ] New York City Environmental Impact Statements
- [ ] US Foreign Aid Reports (USAID)
- [ ] Other: List/Describe here


## 4_pdfs.md

      
PDF Samples

How would you categorize the PDFs?

Sample documents


| PDF URL | Document Title |
| --- | --- |
| http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012 |


Content category


- [ ] Disclosure (filing, forms, report, ...)
- [ ] Legislative doc (laws, analysis, ...)
- [ ] Financial (statements, reports)
- [ ] Government statistical data
- [ ] Non-Government statistical data
- [ ] Press (press releases, statements, ...)
- [ ] Government reports
- [ ] Non-Government reports
- [ ] Directory
- [ ] Other:

Number of pages


- [ ] 1 page
- [ ] 2 to 9 pages
- [ ] 10+ pages
- [ ] 100+ pages

Other observations


- [ ] Collection includes PDFs made from scanned documents
- [ ] PDFs include hand-written text

PDF Generation


- [ ] Human authored
- [ ] Machine generated
- [ ] God only knows


## 5_data.md

      
Type of data embedded in PDF


- [ ] Simple table of data
- [ ] Complex table of data
- [ ] Multiple tables of data from document
- [ ] Table data greater than one page in length
- [ ] Highly-structured form data
- [ ] Loosely-structured form data
- [ ] Has human-written text
- [ ] Structured text of a report (e.g., headings, subheadings, ...)
- [ ] Other:


Desired output of data


- [ ] CSV
- [ ] JSON
- [ ] text version (e.g., markdown)


## 6_tools.md

      
Tool

What tool(s) are you using to extract the data?


| Tool | How we used it |
| --- | --- |
| Tabula | We used it to manually select and extract a table of data |


Notes


## 7_how.md

      
How

How did you extract the desired data that produced the best results?

Improvements

What would have to be changed/added to the tool or process to achieve success?

  
## 8_results.md

      
Results quality


- [ ] 99%+
- [ ] 90%+
- [ ] 80%+
- [ ] 50% to 75%
- [ ] less than 50%
- [ ] utter crap

Speed

How fast is the data extracted?

- [ ] < 10 seconds
- [ ] < 30 seconds
- [ ] < 1 minute
- [ ] < 5 minutes
- [ ] < 10 minutes
- [ ] < 20 minutes
- [ ] Other:

Notes


## 9_code.md

      
Code

Please list code, tips, and how-tos for your processing pipeline.
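
Example entry (replace with your own): a minimal sketch of one pipeline step, assuming the table rows have already been pulled out of the PDF by whatever tool you used. The `rows` data and the helper names `rows_to_csv` and `rows_to_json` are hypothetical placeholders, not part of any tool named in this template; the step only shows how extracted rows might be written to the CSV and JSON outputs listed earlier.

```python
import csv
import io
import json

# Hypothetical rows standing in for a table extracted from a PDF.
rows = [
    ["Year", "Revenue"],
    ["2011", "1,200"],
    ["2012", "1,350"],
]


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows):
    """Treat the first row as the header and emit one JSON object per data row."""
    header, *data = rows
    return json.dumps([dict(zip(header, row)) for row in data], indent=2)


if __name__ == "__main__":
    print(rows_to_csv(rows))
    print(rows_to_json(rows))
```

Values are kept as strings on purpose, so formatting such as thousands separators from the source PDF survives until you decide how to clean it.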