@Marvinmw
Forked from geramirez/texttopdftools.md
Created November 24, 2022 09:47
Command Line PDF to Text Tools

pdftotext

Documentation: http://www.foolabs.com/xpdf/download.html (Xpdf; the Homebrew poppler package ships its own pdftotext, derived from Xpdf)

Mac install

brew install poppler

Basic Run Command

pdftotext document.pdf text_file.txt
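To apply the basic command to a whole folder, a small wrapper can help. The following is a minimal sketch, assuming `pdftotext` is on the PATH; the function names `pdftotext_command` and `convert_folder` are hypothetical, and `-layout` is the pdftotext option that tries to keep the physical page layout.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the pdftotext argv for one PDF; the .txt lands next to the input."""
    pdf = Path(pdf_path)
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        cmd.append("-layout")
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def convert_folder(folder, keep_layout=False):
    """Run pdftotext on every *.pdf in folder and return the text paths written."""
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = pdftotext_command(pdf, keep_layout)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        written.append(Path(cmd[-1]))
    return written
```

Calling `convert_folder("scans", keep_layout=True)` would write one `.txt` file beside each PDF in a (hypothetical) `scans` directory.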

pdf2txt.py

Documentation: http://www.unixuser.org/~euske/python/pdfminer/ and http://euske.github.io/pdfminer/index.html

Mac install

pip install pdfminer

Basic Run Command

pdf2txt.py -o text_file.txt document.pdf

calibre ebook-convert (didn't work)

http://calibre-ebook.com/ (http://manual.calibre-ebook.com/cli/ebook-convert.html)

Mac install

OS X download: http://calibre-ebook.com/download_osx

Basic Run Command (without changing user profile)

/Applications/calibre.app/Contents/MacOS/ebook-convert document.pdf document_calibre.txt

Tika

http://tika.apache.org/

Mac install

Download the jar file

Basic Run Command

java -jar tika-app-1.7.jar --text document.pdf > text_file.txt

Ghostscript

http://www.ghostscript.com/

Mac install

brew install ghostscript

Basic Run Command

gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=text_file.txt document.pdf
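Ghostscript can also restrict extraction to a page range with `-dFirstPage` and `-dLastPage`. The sketch below only builds the command line; `gs_text_command` is a hypothetical helper name, and `-q` is added here to suppress the startup banner.

```python
import shlex


def gs_text_command(pdf_path, out_path, first_page=None, last_page=None):
    """Build a Ghostscript txtwrite argv, optionally limited to a page range."""
    cmd = ["gs", "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite"]
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    # Ghostscript expects its options before the input file.
    cmd += [f"-sOutputFile={out_path}", str(pdf_path)]
    return cmd


def as_shell(cmd):
    """Render the argv as a single shell-quoted line for copy and paste."""
    return shlex.join(cmd)
```

Passing the list to `subprocess.run(..., check=True)` would execute it; `as_shell` is only for printing the equivalent one-liner.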

pdf2line

It is based on "pdftotext" from the Xpdf suite, but with a different layout algorithm that preserves relative column position and line spacing.

Tesseract & GS method

Mac Install

brew install ghostscript tesseract

Basic Run Command

Use a script that converts the PDF to TIFF with Ghostscript and then runs Tesseract on the TIFF to produce text. For more info: http://benschmidt.org/dighist13/?page_id=129
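The two steps can be sketched as follows. This is not the script from the linked page, only an illustration under assumptions: `gs` and `tesseract` are on the PATH, the `tiffgray` device and 300 dpi are arbitrary choices, and the function names and the `page_%03d.tif` naming pattern are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf_to_tiff_command(pdf_path, tiff_pattern, dpi=300):
    """Ghostscript argv that renders each page to a grayscale TIFF.

    A %03d in tiff_pattern makes Ghostscript write one file per page.
    """
    return ["gs", "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=tiffgray",
            f"-r{dpi}", f"-sOutputFile={tiff_pattern}", str(pdf_path)]


def tiff_to_text_command(tiff_path, lang="eng"):
    """Tesseract argv; tesseract appends .txt to the output base itself."""
    tiff = Path(tiff_path)
    return ["tesseract", str(tiff), str(tiff.with_suffix("")), "-l", lang]


def ocr_pdf(pdf_path, work_dir):
    """Render the PDF to TIFF pages, OCR each page, and return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_tiff_command(pdf_path, work / "page_%03d.tif"), check=True)
    pages = []
    for tiff in sorted(work.glob("page_*.tif")):
        subprocess.run(tiff_to_text_command(tiff), check=True)
        pages.append(tiff.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

This route is mainly useful for scanned PDFs that contain no text layer, where the other tools return nothing.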

pdfbox

https://pdfbox.apache.org/1.8/commandline.html

Mac install

Download the jar file

Basic Run Command

java -jar pdfbox-app-1.8.8.jar ExtractText document.pdf textfile.txt
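Since the tools above give different results on the same file, it can help to run whichever ones are installed side by side. The sketch below is one hedged way to do that: the command templates are copied from the basic run commands in this list, while `COMMANDS`, `installed`, `fill`, and `run_all` are hypothetical names, and output size is only a rough proxy for extraction quality.

```python
import shutil
import subprocess
from pathlib import Path

# argv templates taken from the basic run commands in this list
COMMANDS = {
    "pdftotext": ["pdftotext", "{pdf}", "{out}"],
    "pdf2txt.py": ["pdf2txt.py", "-o", "{out}", "{pdf}"],
    "gs": ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
           "-sOutputFile={out}", "{pdf}"],
}


def installed(commands=COMMANDS):
    """Return only the entries whose executable is found on PATH."""
    return {name: argv for name, argv in commands.items()
            if shutil.which(argv[0])}


def fill(argv, pdf, out):
    """Substitute the input and output paths into an argv template."""
    return [part.format(pdf=pdf, out=out) for part in argv]


def run_all(pdf, commands=COMMANDS):
    """Run every installed tool on pdf; return output size in bytes per tool."""
    sizes = {}
    for name, argv in installed(commands).items():
        out = Path(f"{Path(pdf).stem}_{name}.txt")
        result = subprocess.run(fill(argv, pdf, out), capture_output=True)
        if result.returncode == 0 and out.exists():
            sizes[name] = out.stat().st_size
    return sizes
```

A tool that returns an empty or much smaller file than the others on the same PDF is a hint to inspect its output by hand.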

Useful links

OCR tools

http://jocr.sourceforge.net/

http://www.gnu.org/software/ocrad/ocrad.html

https://wiki.gnome.org/action/show/Apps/OCRFeeder?action=show&redirect=OCRFeeder

Best practices

http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml

https://help.ubuntu.com/community/OCR
