@davidpfahler
Created September 26, 2021 18:28
OCR a PDF file (i.e. make it searchable) using Tesseract (super hacky, use at your own risk!)
#!/bin/sh
mkdir -p __searchable__
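# resolve the input to an absolute path (handles relative and absolute arguments)
case "$1" in
  /*) y="$1" ;;
  *) y="$(pwd)/$1" ;;
esac
echo "Will create a searchable PDF for $y"
x=$(basename "$y")
name=${x%.*}
# temporary working directory; abort if it already exists, because it is deleted at the end
mkdir "$name" || exit 1
cd "$name" || exit 1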
# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"
# process each page
for f in *.jpg; do
  # OCR the page (German language data, automatic page segmentation) into a one-page PDF with a text layer
  tesseract "$f" "${f%.*}" -l deu --psm 3 pdf
  rm "$f"
done
# combine all pages back to a single file
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -q -sDEVICE=pdfwrite -sOutputFile="../__searchable__/${name}.pdf" *.pdf
cd ..
rm -rf "${name}"
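
The script assumes that `gs` and `tesseract` are installed and that the output name is the input file name without its last extension. A minimal sketch of how those two assumptions could be checked up front follows. It is not part of the original gist, and `check_tools` and `output_name` are hypothetical helper names introduced only for illustration.

```sh
#!/bin/sh
# Hypothetical helper (not in the original script): report missing tools.
# Prints each command that is not found on PATH to stderr and returns the
# number of missing commands as the exit status (0 means all were found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Hypothetical helper: derive the output base name the same way the script
# does, by stripping the directory part and then the last extension.
output_name() {
    x=$(basename "$1")
    echo "${x%.*}"
}

# Possible use near the top of the script:
#   check_tools gs tesseract || exit 1
#   name=$(output_name "$1")
```

Running the check before any directory is created means a missing tool stops the script before it leaves a half-filled working directory behind.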