@mlovic
Created September 6, 2016 17:15
Shell script that takes a PDF of scanned text and uses the Tesseract OCR library to produce a text file
#!/bin/bash
#
# Takes a PDF of scanned text and uses the Tesseract OCR library to produce
# a text version.
# Requires pdftk, ImageMagick (convert), and tesseract on the PATH.
set -e
[ -z "$1" ] && echo "USAGE: pdf2txt INPUT_PATH [OUTPUT_PATH]" && exit 1
input="$1"
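tmpdir=$(mktemp -d /tmp/ocr.XXXX)
# Remove the temporary directory even if a later step fails under set -e.
trap 'rm -rf "$tmpdir"' EXIT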
pdftk "$input" dump_data | grep NumberOfPages
echo 'splitting up pdf...'
pdftk "$input" burst output "$tmpdir/%04d.pdf"
# burst also writes a doc_data.txt report; -f avoids aborting if it is absent.
rm -f "$tmpdir/doc_data.txt"
for f in "$tmpdir"/*.pdf
do
    echo "Converting $f"
    # Resource limits go first so they apply while the page is being read.
    convert -limit memory 3GB -limit disk 10GB -density 300 "$f" -quality 90 "${f%.*}.png"
done
echo 'Done converting'
#convert -density 300 "$input" -quality 90 -limit memory 3GB -limit disk 10GB $tmpdir/%04d.png
for f in "$tmpdir"/*.png
do
    echo "Reading $f"
    # tesseract appends .txt to the output base itself, so pass it without an extension.
    tesseract "$f" "${f%.*}" > /dev/null 2>&1
done
if [ -z "$2" ]
then
    # Default to the input path with its extension replaced by .txt.
    output="${input%.*}.txt"
else
    output="$2"
fi
echo "" > "$output"
counter=1
for f in "$tmpdir"/*.txt
do
    echo -e "\n## Page $counter\n" >> "$output"
    cat "$f" >> "$output"
    counter=$((counter+1))
done
rm -rf "$tmpdir"
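
The script assumes pdftk, convert, and tesseract are all installed, and it derives the default output name from the input path. As a rough sketch of how those two steps could be factored into helpers, something like the following might work. The function names `missing_tools` and `default_output` are hypothetical and do not appear in the original script.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not part of the original script.

# Print the name of each given command that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Derive the default output path by replacing the input's extension with .txt,
# using the same ${input%.*}.txt expansion as the script.
default_output() {
    echo "${1%.*}.txt"
}

# Possible usage near the top of the script (assumed, not from the original):
#   missing=$(missing_tools pdftk convert tesseract)
#   [ -n "$missing" ] && echo "Missing tools: $missing" && exit 1
#   output="${2:-$(default_output "$input")}"
```

Checking for the tools up front means a missing dependency is reported before any pages are split, rather than partway through the run.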