ruxi/1_pdfliberation_hackathon_activity.md

## 1_pdfliberation_hackathon_activity.md

      
    Start Here

1. Log into GitHub

2. Fork this Gist

3. Edit your version to share your team's activity
PDF Liberation Hackpad

IRC: https://webchat.freenode.net/ Channel: #sunlightlabs

GitHub Markdown-Cheatsheet

  
## 2_who.md

      
    Who

Who is working together?


| Name | Email | Twitter | Organization |
|---|---|---|---|
| Anna Lukasiak | adlukasiak@gmail.com | @adlukasiak | Open JC |


## 3_challenge.md

      
    Challenge

Which challenge are you working on?

 Amnesty International Annual Reports – Torture Incident Database
 Comprehensive Annual Financial Reports
 Federal Communications Commission Daily Releases
 House of Representatives Financial Disclosures (OpenSecrets.org)
 IRS Form 990 – Not-for-Profit Organization Reports
 New York City Council and Community Board Documents
 New York City Economic Development Commission Monthly Snapshot
 New York City Environmental Impact Statements
 US Foreign Aid Reports (USAID)
 Other: List/Describe here - Jersey City, NJ Historical budget and other financial data from http://www.cityofjerseycity.com/pub-info.aspx?id=2430


## 4_pdfs.md

      
    PDF Documents

How would you categorize the PDFs?


| File name | Text or image | No. of pages | Category |
|---|---|---|---|
| CY2011AnnualAudit.pdf | image | 304 | Audit |
| CY2011AnnualDebtStatement.pdf | image | 17 | Annual Debt Statement |
| CY2011AnnualFinancialStatement.pdf | text | 94 | Financial Statements |
| CY2011Budget(Introduced).pdf | image | 66 | Budget |
| CY2011BudgetProjections.pdf | image | 2 | Budget Summaries and Projections |
| CY2011SurplusProjections.pdf | image | 1 | Budget Summaries and Projections |
| CY2012AnnualAudit.pdf | image | 333 | Audit |
| CY2012AnnualDebtStatement.pdf | image | 18 | Annual Debt Statement |
| CY2012Budget(Adopted).pdf | image | 73 | Budget |
| CY2012Budget(Introduced).pdf | image | 72 | Budget |
| CY2012BudgetAmendmentIntroduced.pdf | image | 7 | Budget |
| CY2013Budget(Adopted).pdf | image | 72 | Budget |
| CY2013Budget(Introduced).pdf | image | 147 | Budget |
| FY2006AnnualAudit.pdf | text | 223 | Audit |
| FY2007AnnualAudit.pdf | text | 213 | Audit |
| FY2007AnnualFinancialStatement.pdf | image | 82 | Financial Statements |
| FY2007Budget(Adopted).pdf | image | 91 | Budget |
| FY2008AnnualAudit.pdf | text | 250 | Audit |
| FY2008Budget(Adopted).pdf | image | 75 | Budget |
| FY2008Budget(Introduced).pdf | image | 73 | Budget |
| FY2009AnnualAudit.pdf | image | 240 | Audit |
| FY2009Budget(Adopted).pdf | image | 77 | Budget |
| FY2010AnnualAudit.pdf | image | 236 | Audit |
| TY2010AnnualAudit.pdf | text | 268 | Audit |
| TY2010AnnualFinancialStatement.pdf | image | 91 | Financial Statements |
| TY2010CorrectiveActionPlan.pdf | image | 14 | Audit |
| FY2010AnnualFinancialStatement.pdf | image | 88 | Financial Statements |
| FY2010Budget(Adopted).pdf | image | 75 | Budget |
| FY2010Budget(Introduced).pdf | image | 75 | Budget |
| FY2010TransitionYearBudget(Adopted).pdf | image | 65 | Budget |
| AnnualFinancialStatement2012.pdf | image | 102 | Financial Statements |
| AnnualFinancialStatement2009.pdf | image | 87 | Financial Statements |
| AnnualFinancialStatement2008.pdf | image | 89 | Financial Statements |
| correctiveactionplan2008.pdf | image | 24 | Audit |
| correctiveactionplan2007.pdf | image | 29 | Audit |
| correctiveactionplan2006.pdf | image | 32 | Audit |
| CY2011BudgetIntroduced.pdf | image | 66 | Budget |


| Category | No. of files | No. of pages |
|---|---|---|
| text | 22 | 2379 |
| image | 15 | 1492 |
| total | 37 | 3871 |


Content category


 Disclosure (filing, forms, report, ...)
 Legislative doc (laws, analysis, ...)
 Financial (statements, reports)
 Government statistical data
 Non-Government statistical data
 Press (press releases, statements, ...)
 Government reports
 Non-Government reports
 Directory
 Other:

Number of pages


 1 page
 2 to 9 pages
 10+ pages
 100+ pages

Other observations


 Collection includes PDFs made from scanned documents
 PDFs include hand-written text

PDF Generation


 Human authored
 Machine generated
 God only knows


## 5_data.md

      
    Type of data embedded in PDF


 Simple table of data
 Complex table of data
 Multiple tables of data from document
 Table data greater than one page in length
 Highly-structured form data
 Loosely-structured form data
 Has human-written text
 Structured text of a report (e.g., headings, subheadings, ...)
 Other:


Desired output of data


 CSV
 JSON
 text version (e.g., markdown)


## 6_tools.md

      
    Tools

What tool(s) are you using to extract the data?


| Tool | How we used it |
|---|---|
| ABBYY Cloud OCR SDK | The Python script calls the ABBYY API for files that are not searchable. |
| Tabula | To test, we used it to manually select and extract a table of data. With over 30 files, we are looking to automate this. |
| Python script | We are building a Python script to automate all steps for the 37 files. |


Notes

The Python script will grab file names from the official Jersey City website, call the ABBYY and Tabula APIs, and scrape the results. To make the budget data useful for interactive visualization, the ideal output is a set of hierarchical JSON files. Producing them requires extracting the revenue and spending data while ignoring the subtotals, and linking each spending number to an account, program, division, and department.
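The nesting described in the note could be sketched roughly as follows. This is a minimal illustration, not the team's script: the column names (`department`, `division`, `program`, `account`, `amount`), the rule that a subtotal row has "total" in its account label, and the sample rows are all assumptions made up for the example. Real Tabula output would need its own column mapping.

```python
import csv
import io
import json

# Hypothetical column names; real extracted CSVs will need their own mapping.
LEVELS = ["department", "division", "program", "account"]


def is_subtotal(row):
    """Assumption: subtotal rows carry the word 'total' in the account label."""
    return "total" in row["account"].lower()


def build_hierarchy(rows):
    """Nest rows as department > division > program > account."""
    root = {"name": "budget", "children": []}
    for row in rows:
        if is_subtotal(row):
            continue  # skip subtotals so amounts are not counted twice
        node = root
        for level in LEVELS:
            label = row[level].strip()
            child = next((c for c in node["children"] if c["name"] == label), None)
            if child is None:
                child = {"name": label, "children": []}
                node["children"].append(child)
            node = child
        # The account is a leaf: drop its empty child list and store the amount.
        node.pop("children", None)
        amount = float(row["amount"].replace(",", ""))
        node["amount"] = node.get("amount", 0.0) + amount
    return root


# Invented sample rows, for illustration only.
SAMPLE_CSV = """department,division,program,account,amount
Public Works,Parks,Maintenance,Salaries,"1,200.00"
Public Works,Parks,Maintenance,Supplies,300.00
Public Works,Parks,Maintenance,Total Maintenance,"1,500.00"
Public Safety,Fire,Operations,Salaries,"2,000.00"
"""

tree = build_hierarchy(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(json.dumps(tree, indent=2))
```

The `name`/`children` shape is one common layout for tree visualizations; any other nesting would work as long as each amount stays attached to its full department, division, program, and account path.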

  
## 7_how.md

      
    How

How did you extract the desired data that produced the best results?

1. Created a Python script that grabs the file names from the website.
2. Downloaded the files to a local project directory, `./downloads`. Beware, the process takes 30 minutes!
3. Ran the ABBYY Cloud OCR SDK API on files that are not searchable to convert them to searchable PDFs.
4. The script uses `strings filename | grep Font` to test whether a file is searchable.
5. Manually ran command-line Tabula on a few sample files. This needs to be automated.
6. Still need to create a data scraper to convert the CSV into hierarchical JSON files.
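The searchability test in the steps above (`strings filename | grep Font`) can be approximated in Python without shelling out. The sketch below is a rough analogue rather than the team's code: the function name `looks_searchable` is hypothetical, and the two demo files are stand-in byte strings, not real PDFs. Like the original shell test, it is only a heuristic, since a PDF whose font dictionaries sit inside compressed streams may not show the literal string.

```python
import os
import tempfile


def looks_searchable(path, chunk_size=1 << 20):
    """Rough analogue of `strings FILE | grep Font`.

    Returns True if the byte sequence b"Font" occurs anywhere in the file.
    """
    needle = b"Font"
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last few bytes so a match split across chunks is found.
            tail = window[-(len(needle) - 1):]


# Stand-in byte strings for the demo; these are not valid PDF files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "text.pdf": b"%PDF-1.4\n<< /Type /Font /Subtype /Type1 >>\n",
        "scan.pdf": b"%PDF-1.4\n<< /Type /XObject /Subtype /Image >>\n",
    }
    results = {}
    for name, payload in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as out:
            out.write(payload)
        results[name] = looks_searchable(path)

print(results)
```

Files for which the check returns False would be the ones sent on to OCR.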

Improvements

What would have to be changed/added to the tool or process to achieve success?

1. Complete steps 5 and 6 above (automate the Tabula runs and build the CSV-to-JSON scraper).
2. Resolve the ABBYY OCR API hanging on documents over 100 pages.
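One possible workaround for the long-document problem, untested here and not something the notes confirm, would be to submit large files in page ranges that each stay under 100 pages. The sketch below only computes the ranges; splitting the PDF and calling ABBYY are left out, and `page_ranges` with its `max_pages` default of 99 is a hypothetical helper.

```python
def page_ranges(total_pages, max_pages=99):
    """Split pages 1..total_pages into consecutive (first, last) ranges.

    Each range covers at most max_pages pages. The default of 99 is an
    assumption chosen to stay under the 100-page mark mentioned above.
    """
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# CY2012AnnualAudit.pdf has 333 pages according to the table of PDFs.
print(page_ranges(333))  # [(1, 99), (100, 198), (199, 297), (298, 333)]
print(page_ranges(17))   # [(1, 17)]
```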


## 8_results.md

      
    Results quality


 99%+
 90%+
 80%+
 50% to 75%
 less than 50%
 utter crap

Speed

How fast is the data extracted?

 < 10 seconds
 < 30 seconds
 < 1 minute
 < 5 minutes
 < 10 minutes
 < 20 minutes
 Other: Downloading the 37 files takes about 30 minutes. The ABBYY OCR process takes about 1.5 minutes for a 20-page document and roughly 5 minutes for a document of under 100 pages. Documents over 100 pages need special handling, which still has to be developed.

Notes


## 9_code.md

      
    Code

https://github.com/pdfliberation/Jersey-City-Budget-PDF-Liberation