@allixender
Created August 29, 2015 10:32
Parallel for Tesseract OCR and multi-page PDFs
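#!/bin/bash
# ocrpdftotext: OCR a multi-page PDF to text, running Tesseract on the pages in parallel
# adjusted implementation of http://ubuntuforums.org/showthread.php?t=880471
# requires: pdfinfo, convert (ImageMagick), tesseract, GNU parallel
DPI=300
TESS_LANG=eng
FILENAME="$1"
TMP_NAME=$(basename "$FILENAME" .pdf)
OUTPUT_FILENAME="${TMP_NAME}-out-${DPI}.txt"
# page count, taken from the "Pages:" line of pdfinfo
PAGES=$(pdfinfo "$FILENAME" | grep '^Pages:' | sed -r 's/^[^0-9]*([0-9]+)$/\1/')
# render each page to its own TIFF (ImageMagick page indices are zero-based)
for i in $(seq 1 "$PAGES"); do
  convert -density "${DPI}" -depth 8 -background white -flatten +matte "${FILENAME}[$((i - 1))]" "${TMP_NAME}-${i}.tif"
done
# OCR all pages in parallel; tesseract writes {.}.txt next to each TIFF
parallel tesseract {} {.} -l "${TESS_LANG}" ::: "${TMP_NAME}"-*.tif
# concatenate the page texts in numeric page order (overwrites an existing output file)
for i in $(seq 1 "$PAGES"); do cat "${TMP_NAME}-${i}.txt"; done > "${OUTPUT_FILENAME}"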
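
A note on the last line of the script: the page texts are joined by looping over page numbers instead of using a glob over the `.txt` files, because a glob sorts names as strings and would put page 10 before page 2. Below is a minimal sketch of that step as a standalone function. The name `concat_pages` and its two arguments are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print the per-page
# text files PREFIX-1.txt .. PREFIX-N.txt to stdout in numeric page order.
#   $1 = file name prefix, e.g. "mydoc"
#   $2 = number of pages
concat_pages() {
    local prefix=$1 pages=$2 i
    for ((i = 1; i <= pages; i++)); do
        cat "${prefix}-${i}.txt"
    done
}
```

Assuming the per-page files already exist, it could be called as `concat_pages "$TMP_NAME" "$PAGES" > "$OUTPUT_FILENAME"`.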
@allixender

When using programs that use GNU Parallel to process data for publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
