@wcaleb
Created November 6, 2013 14:41
Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable
#!/bin/sh
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick, enscript, ps2pdf
# Would be nice to use hocr2pdf instead so that the text lines up with the PDF image.
# http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
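# Keep a backup of the original, since the final step overwrites it
cp "$1" "$1.bak"
# Split into one PDF per page (four digits so pages past 99 still sort in order)
pdftk "$1" burst output tesspage_%04d.pdf
for file in tesspage_*.pdf
do
    PAGE=$(basename "$file" .pdf)
    # Convert the PDF page into a TIFF file
    convert -monochrome -density 600 "$file" "$PAGE".tif
    # OCR the TIFF file and save text to output.txt
    tesseract "$PAGE".tif output
    # Turn text file output by tesseract into a PDF, then put it in background of original page
    enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output new-"$file"
    # Clean up only the files this iteration created
    rm -f output.txt output.pdf "$PAGE".tif "$file"
done
# Merge the OCRed pages over the original, then remove the per-page files and pdftk's burst report
pdftk new-tesspage_*.pdf cat output "$1" && rm -f new-tesspage_*.pdf doc_data.txt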
@norpol

norpol commented May 12, 2017

Make sure you read this script before using it. It removes files via wildcards.
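
One way to limit the risk of wildcard deletes is to do the intermediate work in a throwaway directory and remove only that directory afterwards. The following is a minimal sketch, not part of the original gist; `run_in_scratch_dir` is a hypothetical helper name, and it assumes `mktemp -d` is available (it is not strictly POSIX, though most Linux and BSD systems ship it).

```shell
#!/bin/sh
# Hypothetical helper: run a command inside a fresh scratch directory,
# then delete that directory and nothing else.
run_in_scratch_dir() {
    # Assumption: mktemp -d exists on this system.
    scratch=$(mktemp -d) || return 1
    status=0
    # Subshell keeps the caller's working directory unchanged.
    ( cd "$scratch" && "$@" ) || status=$?
    # Remove only the directory created above; no wildcards touch the caller's files.
    rm -rf "$scratch"
    return "$status"
}

# Usage: run_in_scratch_dir some_command arg1 arg2
```

Because the command runs from inside the scratch directory, any input or output paths passed to it would need to be absolute.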

@scruss

scruss commented Jun 4, 2018

tesseract can now produce PDF with embedded text directly using the PDF config option. It's used something like this:

tesseract input.tif outputbase pdf

which would create outputbase.pdf
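
As a rough illustration of the command above, here is a small wrapper that derives the output base from the image name. `ocr_image_to_pdf` is a made-up name for this sketch, and the file names in the usage line are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: OCR one page image into a searchable PDF
# using tesseract's "pdf" config, as described in the comment above.
ocr_image_to_pdf() {
    image=$1
    # Strip the extension to get the output base; tesseract appends ".pdf" itself.
    base=${image%.*}
    tesseract "$image" "$base" pdf
}

# Usage: ocr_image_to_pdf page_01.tif    (would write page_01.pdf)
```

Note that the resulting PDF is built from the image passed to tesseract, so it replaces the original page rather than layering text behind it the way the gist's pdftk step does.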

@ramack19

tesseract can now produce PDF with embedded text directly using the PDF config option. It's used something like this:

tesseract input.tif outputbase pdf

which would create outputbase.pdf

scruss,
Thank you for stating that! That simplifies the process significantly! Plus I now have all the packages on our server needed to convert PDFs to embedded-text PDFs. I do not have to go through our IT approval process to get ocrmypdf installed; tesseract can do it.
Thanks!

@fprochazka

I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract that adds some extra features. It's available natively in Linux repos.
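
For anyone who has not used it, a minimal sketch of an ocrmypdf call follows. `make_searchable` is a hypothetical wrapper name, and the choice of `--skip-text` and `-l eng` is an assumption about what a typical scanned English document would need, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: make a scanned PDF searchable with ocrmypdf.
make_searchable() {
    input=$1
    output=$2
    # --skip-text leaves pages that already contain text untouched;
    # -l selects the tesseract language pack (English assumed here).
    ocrmypdf --skip-text -l eng "$input" "$output"
}

# Usage: make_searchable scan.pdf scan-searchable.pdf
```

Unlike the gist, this writes to a separate output file, so the original PDF is left as it was.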

@scruss

scruss commented Mar 8, 2023

ocrmypdf

That's what I mostly use now. But this gist served me well for years

@ramack19

ramack19 commented Mar 9, 2023

I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract that adds some extra features. It's available natively in Linux repos.

Available...yes and no. ocrmypdf isn't in all corporate repos, while tesseract is more widely available. I ran into this at a former workplace that did a lot of DoD-type work and had a pretty restrictive Linux VM. ocrmypdf wasn't readily available, but tesseract was.
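
A script that has to run on both kinds of machines could check which tool is installed before choosing a path. This is only a sketch; `pick_ocr_tool` is a hypothetical name, and the default preference order (ocrmypdf first, then tesseract) is an assumption based on the comments above.

```shell
#!/bin/sh
# Hypothetical helper: print the first available command from a list of candidates.
pick_ocr_tool() {
    # With no arguments, fall back to the assumed preference order.
    [ $# -gt 0 ] || set -- ocrmypdf tesseract
    for tool in "$@"; do
        # command -v succeeds only if the name resolves to something runnable.
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    # Nothing from the list is installed.
    return 1
}

# Usage:
#   tool=$(pick_ocr_tool) || { echo "no OCR tool found" >&2; exit 1; }
```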
