Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: Jersey City, NJ historical budget and other financial data from http://www.cityofjerseycity.com/pub-info.aspx?id=2430

PDF Documents

How would you categorize the PDFs?

| File name | Text or image | Pages | Category |
| --- | --- | --- | --- |
| CY2011AnnualAudit.pdf | image | 304 | Audit |
| CY2011AnnualDebtStatement.pdf | image | 17 | Annual Debt Statement |
| CY2011AnnualFinancialStatement.pdf | text | 94 | Financial Statements |
| CY2011Budget(Introduced).pdf | image | 66 | Budget |
| CY2011BudgetProjections.pdf | image | 2 | Budget Summaries and Projections |
| CY2011SurplusProjections.pdf | image | 1 | Budget Summaries and Projections |
| CY2012AnnualAudit.pdf | image | 333 | Audit |
| CY2012AnnualDebtStatement.pdf | image | 18 | Annual Debt Statement |
| CY2012Budget(Adopted).pdf | image | 73 | Budget |
| CY2012Budget(Introduced).pdf | image | 72 | Budget |
| CY2012BudgetAmendmentIntroduced.pdf | image | 7 | Budget |
| CY2013Budget(Adopted).pdf | image | 72 | Budget |
| CY2013Budget(Introduced).pdf | image | 147 | Budget |
| FY2006AnnualAudit.pdf | text | 223 | Audit |
| FY2007AnnualAudit.pdf | text | 213 | Audit |
| FY2007AnnualFinancialStatement.pdf | image | 82 | Financial Statements |
| FY2007Budget(Adopted).pdf | image | 91 | Budget |
| FY2008AnnualAudit.pdf | text | 250 | Audit |
| FY2008Budget(Adopted).pdf | image | 75 | Budget |
| FY2008Budget(Introduced).pdf | image | 73 | Budget |
| FY2009AnnualAudit.pdf | image | 240 | Audit |
| FY2009Budget(Adopted).pdf | image | 77 | Budget |
| FY2010AnnualAudit.pdf | image | 236 | Audit |
| TY2010AnnualAudit.pdf | text | 268 | Audit |
| TY2010AnnualFinancialStatement.pdf | image | 91 | Financial Statements |
| TY2010CorrectiveActionPlan.pdf | image | 14 | Audit |
| FY2010AnnualFinancialStatement.pdf | image | 88 | Financial Statements |
| FY2010Budget(Adopted).pdf | image | 75 | Budget |
| FY2010Budget(Introduced).pdf | image | 75 | Budget |
| FY2010TransitionYearBudget(Adopted).pdf | image | 65 | Budget |
| AnnualFinancialStatement2012.pdf | image | 102 | Financial Statements |
| AnnualFinancialStatement2009.pdf | image | 87 | Financial Statements |
| AnnualFinancialStatement2008.pdf | image | 89 | Financial Statements |
| correctiveactionplan2008.pdf | image | 24 | Audit |
| correctiveactionplan2007.pdf | image | 29 | Audit |
| correctiveactionplan2006.pdf | image | 32 | Audit |
| CY2011BudgetIntroduced.pdf | image | 66 | Budget |

| Category | Files | Pages |
| --- | --- | --- |
| text | 22 | 2379 |
| image | 15 | 1492 |
| total | 37 | 3871 |

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structure text of a report (e.g., headings, subheadings, ...)
  • Other:

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tools

What tool(s) are you using to extract the data?

| Tool | How we used it |
| --- | --- |
| ABBYY Cloud OCR SDK | The Python script calls the ABBYY API for files that are not searchable. |
| Tabula | As a test, we used it to manually select and extract a table of data. With over 30 files, we are looking to automate this. |
| Python script | We are building a Python script to automate all steps for the 37 files. |

Notes

The Python script will grab file names from the official Jersey City website, call the ABBYY and Tabula APIs, and scrape the results. To make the budget data useful for interactive visualization, the ideal output is a set of hierarchical JSON files. That requires extracting the revenue and spending data, ignoring the subtotals, and linking each spending number to an account, program, division, and department.
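To illustrate the hierarchical JSON idea, here is a minimal sketch, not the script we actually built. It assumes the extracted rows are already flat records with hypothetical keys `department`, `division`, `program`, `account`, and `amount`, and it assumes subtotal rows can be recognized by the word "total" in the account label. Real budget tables will likely need a more careful subtotal rule.

```python
import json

# Hypothetical column names, ordered from the top of the hierarchy down.
LEVELS = ("department", "division", "program", "account")


def is_subtotal(row):
    # Assumption: subtotal rows carry "total" somewhere in the account label.
    return "total" in row["account"].lower()


def build_hierarchy(rows, root_name="budget"):
    """Nest flat rows as department > division > program > account."""
    root = {"name": root_name, "children": []}
    for row in rows:
        if is_subtotal(row):
            continue  # subtotals would double count in a visualization
        node = root
        for level in LEVELS[:-1]:
            name = row[level]
            child = next((c for c in node["children"] if c["name"] == name), None)
            if child is None:
                child = {"name": name, "children": []}
                node["children"].append(child)
            node = child
        # The leaf holds the account name and its amount.
        node["children"].append({"name": row["account"], "amount": row["amount"]})
    return root


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    sample = [
        {"department": "Public Safety", "division": "Police",
         "program": "Patrol", "account": "Salaries", "amount": 100.0},
        {"department": "Public Safety", "division": "Police",
         "program": "Patrol", "account": "Total Patrol", "amount": 100.0},
    ]
    print(json.dumps(build_hierarchy(sample), indent=2))
```

The `name`/`children` layout is one common shape for tree visualizations; any other nesting would work as long as each amount stays attached to its full account path.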

How

How did you extract the desired data that produced the best results?

  1. Created a Python script that grabs the file names from the website
  2. Downloaded the files to a local project directory, ./downloads. Beware, the process takes 30 minutes!
  3. Ran the ABBYY Cloud OCR SDK API on files that are not searchable to convert them to searchable PDFs
  4. The script uses strings filename | grep Font to test whether a file is searchable
  5. Manually ran command-line Tabula on a few sample files. This needs to be automated.
  6. Still need to create a data scraper to convert the CSV into hierarchical JSON files.
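Steps 1 and 4 above can be sketched roughly as follows. This is an illustrative sketch using only the standard library, not our actual script: `find_pdf_links` and `is_searchable` are hypothetical helper names, and the byte search is a pure-Python stand-in for `strings filename | grep Font`. Like the original command, it is only a heuristic, since a PDF can keep its font objects inside compressed streams.

```python
from html.parser import HTMLParser
from pathlib import Path


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .pdf file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.links.append(value)


def find_pdf_links(html):
    """Step 1: return the PDF links found in a page's HTML, in page order."""
    parser = PdfLinkParser()
    parser.feed(html)
    return parser.links


def is_searchable(pdf_path):
    """Step 4: treat a PDF as searchable if its raw bytes mention 'Font'."""
    return b"Font" in Path(pdf_path).read_bytes()
```

Files for which `is_searchable` returns False would be the ones sent to OCR in step 3.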

Improvements

What would have to be changed/added to the tool or process to achieve success?

  • Complete steps 5 and 6 above
  • Resolve the ABBYY OCR API hanging on documents over 100 pages
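One possible workaround for the large-document hang, which we have not tried, would be to split each long PDF into smaller pieces before OCR. The sketch below only computes the page ranges; `page_chunks` is a hypothetical helper, the 100-page limit is taken from the observation above, and the actual splitting would still need a separate PDF tool.

```python
def page_chunks(total_pages, max_pages=100):
    """Return inclusive, 1-based (first, last) page ranges of at most max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    # CY2011AnnualAudit.pdf has 304 pages according to the table above.
    print(page_chunks(304))
```

For the 304-page audit this yields four ranges: 1 to 100, 101 to 200, 201 to 300, and 301 to 304.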

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other: Downloading the 37 files takes about 30 minutes. The ABBYY OCR process runs about 1.5 minutes for a 20-page document and about 5 minutes for a document under 100 pages. Documents over 100 pages need special handling, which still has to be developed.

Notes
