Marvinmw/texttopdftools.md

Command Line PDF to Text Tools

pdftotext

Documentation: http://www.foolabs.com/xpdf/download.html
Mac install

brew install poppler
Basic Run Command

pdftotext document.pdf text_file.txt
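
For a whole folder of PDFs, the basic command can be wrapped in a loop. This is a minimal sketch and not part of the original notes: the function name `convert_dir` is hypothetical, and it assumes `pdftotext` is on the PATH. The `-layout` option asks pdftotext to keep the physical page layout; drop it for plain reading-order output.

```shell
#!/bin/sh
# Convert every PDF in a directory to a .txt file next to it.
# convert_dir is a hypothetical helper name; pdftotext must be installed.
convert_dir() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        # -layout keeps columns and indentation roughly as they appear on the page.
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt" || echo "failed: $pdf" >&2
    done
}
```

Calling `convert_dir ~/papers` (the path is only an example) would write `name.txt` next to each `name.pdf` in that folder.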
pdf2txt.py

http://www.unixuser.org/~euske/python/pdfminer/
http://euske.github.io/pdfminer/index.html
Mac install

pip install pdfminer
Basic Run Command

pdf2txt.py -o text_file.txt document.pdf
calibre ebook-convert (didn't work)

http://calibre-ebook.com/
(http://manual.calibre-ebook.com/cli/ebook-convert.html)
Mac install

http://calibre-ebook.com/download_osx OSX download
Basic Run Command (without changing user profile)

/Applications/calibre.app/Contents/MacOS/ebook-convert document.pdf document_calibre.txt
Tika

http://tika.apache.org/
Mac install

Download the jar file
Basic Run Command

java -jar tika-app-1.7.jar --text document.pdf > text_file.txt
Ghostscript

http://www.ghostscript.com/
Mac install

brew install ghostscript
Basic Run Command

gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=text_file.txt document.pdf
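
When only part of a document is needed, the same txtwrite command can be limited to a page range. The sketch below is an illustration rather than part of the original notes: `gs_pages` is a hypothetical wrapper name, `-dFirstPage` and `-dLastPage` select the range, and `-q` suppresses the startup banner.

```shell
#!/bin/sh
# Extract a page range from a PDF as text with Ghostscript's txtwrite device.
# Usage: gs_pages document.pdf first_page last_page output.txt
# gs_pages is a hypothetical helper name; gs must be installed.
gs_pages() {
    gs -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite \
       -dFirstPage="$2" -dLastPage="$3" \
       -sOutputFile="$4" "$1"
}
```

For example, `gs_pages document.pdf 2 5 pages_2_to_5.txt` would write only pages 2 through 5.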
pdf2line

It is based on "pdftotext" from the Xpdf suite, but with a different
layout algorithm that preserves relative column position and line spacing.
Tesseract & GS method

Mac install

brew install ghostscript
brew install tesseract
Basic Run Command

Use a script that converts the PDF into TIFF images with Ghostscript and then converts the TIFF images to text with Tesseract. For more info, see http://benschmidt.org/dighist13/?page_id=129
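
The two-step method can be sketched as a small shell function. This is not the script from the linked page: the name `ocr_pdf`, the `tiffgray` device, and the 300 dpi resolution are assumptions chosen for illustration, and both `gs` and `tesseract` must be on the PATH.

```shell
#!/bin/sh
# OCR a scanned PDF: Ghostscript renders each page to TIFF, Tesseract reads each TIFF.
# Usage: ocr_pdf scanned.pdf output.txt
# ocr_pdf is a hypothetical helper name; device and resolution are assumptions.
ocr_pdf() {
    pdf="$1"
    out="$2"
    work=$(mktemp -d) || return 1
    # Render one grayscale TIFF per page at 300 dpi (%03d becomes the page number).
    gs -q -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300 \
       -sOutputFile="$work/page-%03d.tif" "$pdf" || { rm -rf "$work"; return 1; }
    : > "$out"
    for tif in "$work"/page-*.tif; do
        [ -e "$tif" ] || continue
        # Tesseract appends .txt to the output base name it is given.
        tesseract "$tif" "${tif%.tif}" >/dev/null 2>&1 && cat "${tif%.tif}.txt" >> "$out"
    done
    rm -rf "$work"
}
```

Running `ocr_pdf document.pdf text_file.txt` would concatenate the recognized text of all pages, in page order, into one file. Pages that Tesseract fails on are skipped silently in this sketch.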

pdfbox

https://pdfbox.apache.org/1.8/commandline.html
Mac install

Download the jar file
Basic Run Command

java -jar pdfbox-app-1.8.8.jar ExtractText document.pdf textfile.txt
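
Since results vary by tool and by document, it can help to run several of the extractors above on the same PDF and compare how much text each one produced. The following is a rough sketch under stated assumptions: `compare_tools` and the output file names are hypothetical, only the tools installed on the PATH (pdftotext, pdf2txt.py, gs) are covered, and a word count is only a crude signal of extraction quality.

```shell
#!/bin/sh
# Run each installed extractor on the same PDF and print "<word count><TAB><file>".
# Usage: compare_tools document.pdf [output_dir]
# compare_tools and the output file names are hypothetical.
compare_tools() {
    pdf="$1"
    out="${2:-.}"
    if command -v pdftotext >/dev/null 2>&1; then
        pdftotext "$pdf" "$out/pdftotext.txt"
    fi
    if command -v pdf2txt.py >/dev/null 2>&1; then
        pdf2txt.py -o "$out/pdfminer.txt" "$pdf"
    fi
    if command -v gs >/dev/null 2>&1; then
        gs -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite \
           -sOutputFile="$out/ghostscript.txt" "$pdf"
    fi
    # Report only the outputs that were actually written.
    for txt in "$out/pdftotext.txt" "$out/pdfminer.txt" "$out/ghostscript.txt"; do
        [ -f "$txt" ] || continue
        printf '%s\t%s\n' "$(wc -w < "$txt" | tr -d '[:space:]')" "${txt##*/}"
    done
}
```

A count near zero for one tool next to large counts for the others suggests that tool failed on the document; a count near zero for every tool suggests a scanned PDF that needs the OCR route.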
Useful links

OCR tools

http://jocr.sourceforge.net/
http://www.gnu.org/software/ocrad/ocrad.html
https://wiki.gnome.org/action/show/Apps/OCRFeeder?action=show&redirect=OCRFeeder
Best practices

http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml
https://help.ubuntu.com/community/OCR