1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Who is working together?
Name | Email | Twitter | Organization |
---|---|---|---|
Aaron Williamson | aaron@copiesofcopies.org | @copiesofcopies | N/A |
Which challenge are you working on?
How would you categorize the PDFs?
PDF URL | Document Title |
---|---|
http://www.nyc.gov/html/dcp/pdf/env_review/rockefeller/noc_deis.pdf | Rockefeller University New River Building and Fitness Center Draft Environmental Impact Statement 11/1/2013 |
What tool(s) are you using to extract the data?
Tool | How we used it |
---|---|
Tabula | Tabula seems to work for extracting most of the tables if they're in the correct orientation |
Tesseract | Tesseract makes quick work of the letters themselves |
ruby-tesseract-ocr | This library promises access to the orientation of individual elements of a scanned document, so it could be useful for identifying rotated tables, but so far I haven't had any luck with it. This is where I'm hung up. |
How did you extract the desired data that produced the best results?
Tesseract is great for the text, Tabula for the tables, but I'm still working on putting them together to render the whole document.
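One possible way to put the two outputs together, sketched under heavy assumptions: suppose every run of OCR text and every extracted table can be tagged with a page number and a vertical position on that page. Neither Tesseract nor Tabula is called here, and the `Fragment` type, its fields, `merge_fragments`, and `render` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of one page: a run of OCR text or an extracted table (hypothetical)."""
    page: int      # 1-based page number
    top: float     # distance from the top of the page, in any consistent unit
    kind: str      # "text" or "table"
    content: str   # plain text, or the table already rendered as text


def merge_fragments(text_fragments, table_fragments):
    """Interleave text and tables in reading order (page, then top position)."""
    combined = list(text_fragments) + list(table_fragments)
    return sorted(combined, key=lambda f: (f.page, f.top))


def render(fragments):
    """Join the ordered fragments into one document string."""
    return "\n\n".join(f.content for f in fragments)


if __name__ == "__main__":
    # Made-up fragments, only to show the ordering.
    texts = [Fragment(1, 0.0, "text", "Intro paragraph."),
             Fragment(1, 500.0, "text", "Closing paragraph.")]
    tables = [Fragment(1, 250.0, "table", "col_a | col_b\n1 | 2")]
    print(render(merge_fragments(texts, tables)))
```

Sorting on `(page, top)` keeps the merge independent of which tool produced a fragment, so a table that had to be rotated before extraction would slot in the same way as any other.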
What would have to be changed/added to the tool or process to achieve success?
I need a way to identify rotated tables. I've looked at a couple of techniques:
psm -1 will output an estimated rotation value to STDOUT (I think that 0 = none, 1 = rotated 90 deg. clockwise, 2 = 180 deg., 3 = 270 deg.), so the output could be grepped for the value. I'd far prefer to access the value through a library call, though, which is why I've been trying to get ruby-tesseract-ocr to work.
Once I've solved the reoriented table problem, I still need to figure out the best way to insert the tables into the rendered text.
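The grep-style approach to reading the rotation value could be sketched roughly as follows. This is a minimal sketch, not a tested integration: it assumes the captured output contains a line of the form `Orientation: N`, which should be checked against what the installed Tesseract version actually prints. The names `parse_orientation` and `correction_degrees` are hypothetical, and the 0 to 3 mapping is the guess stated above, not a confirmed fact.

```python
import re

# Guessed meaning of the rotation codes (taken from the notes above, unverified).
ROTATION_DEGREES_CLOCKWISE = {0: 0, 1: 90, 2: 180, 3: 270}

# Assumed report line format, e.g. "Orientation: 1". Adjust to the real output.
ORIENTATION_LINE = re.compile(r"^\s*Orientation:\s*([0-3])\s*$", re.MULTILINE)


def parse_orientation(report):
    """Return the rotation code (0-3) found in the report text, or None."""
    match = ORIENTATION_LINE.search(report)
    return int(match.group(1)) if match else None


def correction_degrees(code):
    """Clockwise degrees to rotate a page so that it ends up upright."""
    return (360 - ROTATION_DEGREES_CLOCKWISE[code]) % 360


if __name__ == "__main__":
    # Made-up sample text, not real Tesseract output.
    sample = "some banner line\nOrientation: 1\n"
    code = parse_orientation(sample)
    if code is not None:
        print(code, correction_degrees(code))
```

Keeping the parsing separate from however the output is captured means the same function would still apply if the value later comes from a library call instead of STDOUT.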
This is a guess -- I haven't rendered any complete documents because I haven't started working on the (relatively) less complex problem of rendering the entire document + the already-correctly-oriented tables. I don't think this will be horribly difficult once the table orientation problem is solved.
How fast is the data extracted?