1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Who is working together?
Name | Email | Twitter | Organization |
---|---|---|---|
Marjorie Roswell | mroswell@gmail.com | @mroswell | Organization Name |
Which challenge are you working on?
How would you categorize the PDFs?
PDF URL | Document Title |
---|---|
https://www.dropbox.com/sh/cwb299emamxnqcc/982J2HOqyr/Electronic/N00029139_2012.pdf | House of Representatives Financial Disclosure - image |
What tool(s) are you using to extract the data?
Tool | How we used it |
---|---|
ABBYY | used the curl_recognize.sh shell script |
Ross found that the Python script, when he ran it on a text PDF, returned 404 errors, so we used the shell script instead. The script needed to be updated with the application ID and password, which ABBYY provides by email when you register an application. It ran quickly and extracted about 60% of the document. The rest would need to be done by hand.
Please list code, tips and howto's of your processing pipeline.
Produce RTF (to retain formatting): `./curl_recognize.sh N00029139_2012.pdf N00029139_2012.rtf -rtf`
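Produce Markdown: `textutil -convert html -stdout N00029139_2012.rtf | pandoc -f html -t markdown -o result.md` (without `-stdout`, textutil writes an `.html` file next to the input and sends nothing down the pipe)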
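The two steps above could be wrapped in a small script so they can be repeated for other PDFs. This is only a sketch, not the script the team ran: the function names `out_name` and `liberate_pdf` are made up for illustration, the Markdown file is named after the input rather than `result.md`, and it assumes `curl_recognize.sh` sits in the current directory with the ABBYY application ID and password already filled in, with `textutil` (macOS) and `pandoc` installed.

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not part of any tool.

# Derive an output file name by swapping the .pdf extension.
# $1 = input PDF path, $2 = new extension (without the dot)
out_name() {
    printf '%s.%s\n' "${1%.pdf}" "$2"
}

# Run OCR and conversion for a single PDF.
# $1 = input PDF path
liberate_pdf() {
    pdf=$1
    rtf=$(out_name "$pdf" rtf)
    md=$(out_name "$pdf" md)

    # Step 1: OCR the PDF to RTF so formatting is retained.
    # Assumes curl_recognize.sh already holds the ABBYY credentials.
    ./curl_recognize.sh "$pdf" "$rtf" -rtf || return 1

    # Step 2: RTF -> HTML -> Markdown.
    # -stdout makes textutil write the HTML to the pipe instead of a file.
    textutil -convert html -stdout "$rtf" | pandoc -f html -t markdown -o "$md"
}
```

After sourcing the file, `liberate_pdf N00029139_2012.pdf` would be expected to leave `N00029139_2012.rtf` and `N00029139_2012.md` next to the input.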
Results are available at: https://github.com/mroswell/pdf-liberation-examples