@staeff
Created March 18, 2015 20:07
OCR with tesseract on the command line
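#!/bin/sh
# OCR a range of pages from a scanned PDF into a single text file.
# Requires ImageMagick (convert) and tesseract.
STARTPAGE=1 # page number of the first page of the PDF you wish to convert
ENDPAGE=86 # page number of the last page of the PDF you wish to convert
SOURCE=book.pdf # file name of the PDF
OUTPUT=book.txt # final output file
RESOLUTION=600 # resolution the scanner used (the higher, the better)
touch "$OUTPUT" # create the output file if missing; existing text is kept and appended to
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    # ImageMagick numbers PDF pages from 0, hence i - 1
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" page.tif
    echo "processing page $i"
    tesseract page.tif tempoutput # writes tempoutput.txt
    cat tempoutput.txt >> "$OUTPUT"
done
rm -f page.tif tempoutput.txt # remove the temporary files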
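
The subtraction of 1 in the convert call is there because ImageMagick numbers PDF pages from 0, while STARTPAGE and ENDPAGE count from 1. As a minimal sketch, that conversion can be isolated in a small helper. The name page_selector is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): build the ImageMagick
# input selector "file[index]" for a one-based page number.
page_selector() {
    # $1 = PDF file name, $2 = one-based page number
    printf '%s[%d]\n' "$1" "$(($2 - 1))"
}

page_selector book.pdf 1    # prints book.pdf[0]
page_selector book.pdf 86   # prints book.pdf[85]
```

Inside the loop, the result could be passed to convert as "$(page_selector "$SOURCE" "$i")" in place of the inline arithmetic.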