Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable
#!/bin/sh
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick, enscript, ps2pdf
# Would be nice to use hocr2pdf instead so that the text lines up with the PDF image.
# http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
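# Keep a backup of the original; the input file is overwritten at the end.
cp "$1" "$1.bak"
# Split the PDF into one file per page (4 digits so pages still sort correctly past page 99).
pdftk "$1" burst output tesspage_%04d.pdf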
for file in tesspage_*.pdf
do
    PAGE=$(basename "$file" .pdf)
    # Convert the PDF page into a monochrome 600 dpi TIFF file
    convert -density 600 "$file" -monochrome "$PAGE.tif"
    # OCR the TIFF file and save the text to output.txt
    tesseract "$PAGE.tif" output
    # Turn the text file output by tesseract into a PDF, then put it in the background of the original page
    enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output "new-$file"
    # Clean up only the files created for this page
    rm -f output.txt output.pdf "$PAGE.tif"
    rm "$file"
done
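# Join the OCRed pages back under the original file name, then remove the per-page files
pdftk new-tesspage_*.pdf cat output "$1" && rm -f new-tesspage_*.pdf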

@norpol commented May 12, 2017

Make sure you read this script before using it. It removes files via wildcards (output*, *.tif), so it can delete unrelated files in the current directory.
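
One way to limit the damage from wildcard cleanup is to do all per-page work in a private scratch directory and delete only that directory afterwards. The following is a minimal sketch, not part of the original gist: `with_workdir` is a hypothetical helper name, and it assumes `mktemp -d` is available (it is on GNU and BSD systems, but it is not required by POSIX).

```shell
#!/bin/sh
# Sketch only: with_workdir is a hypothetical helper, not part of the gist.
# It runs a command inside a fresh scratch directory and removes that
# directory afterwards, so wildcards like "output*" or "*.tif" can only
# ever match files the command itself created.
with_workdir() {
    workdir=$(mktemp -d) || return 1
    # Subshell: the caller's working directory is left unchanged.
    ( cd "$workdir" && "$@" )
    status=$?
    rm -rf "$workdir"
    # Pass the command's exit status back to the caller.
    return "$status"
}
```

For example, `with_workdir sh -c 'touch output.txt page.tif; rm output* *.tif'` leaves every file in the directory it was started from untouched. A script built this way would need to refer to its input and output PDFs by absolute path, since the commands run from inside the scratch directory.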

@scruss commented Jun 4, 2018

tesseract can now produce a PDF with embedded text directly, using the pdf config option. It's used something like this:

tesseract input.tif outputbase pdf

which would create outputbase.pdf
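
Building on that, the enscript/ps2pdf/background steps could be replaced by one tesseract call per page. This is a rough sketch under assumptions, not a tested replacement for the gist: `ocr_pdf` is a hypothetical function name, the 300 dpi density is an assumed value (the gist uses 600), and the `tesspage_`/`ocr-` file names are illustrative. It writes its intermediate files to the current directory, so it is best run from an empty one.

```shell
#!/bin/sh
# Sketch only: ocr_pdf is a hypothetical function, not part of the gist.
# Usage: ocr_pdf input.pdf output.pdf
ocr_pdf() {
    src=$1
    dst=$2
    # Split the input into one PDF per page.
    pdftk "$src" burst output tesspage_%04d.pdf || return 1
    for page in tesspage_*.pdf; do
        base=${page%.pdf}
        # Rasterize the page; 300 dpi is an assumed value.
        convert -density 300 "$page" "$base.tif" || return 1
        # The "pdf" config makes tesseract write ocr-$base.pdf with a text layer.
        tesseract "$base.tif" "ocr-$base" pdf || return 1
    done
    # Join the per-page searchable PDFs into the final document.
    pdftk ocr-tesspage_*.pdf cat output "$dst"
}
```

Note that the pages in the result are built from the rasterized TIFFs rather than from the original PDF pages, so file size and image quality depend on the density chosen.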
