dllud / pdfocr
Created February 9, 2014 01:10
pdfocr - script to transform a PDF containing a scanned book into a searchable PDF
#!/bin/bash
# This is a script to transform a PDF containing a scanned book into a searchable PDF.
# Based on a previous script and many good tips by Konrad Voelkel:
# http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
# http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
# Depends on convert (ImageMagick), pdftk and hocr2pdf (ExactImage).
# $ sudo apt-get install imagemagick pdftk exactimage
# You also need at least one OCR engine, either tesseract or cuneiform.
# $ sudo apt-get install tesseract-ocr