@gtfierro
Last active May 11, 2022 16:57
Quick shell script for parallel OCR on PDFs using ghostscript and tesseract
#!/bin/bash
# requires ghostscript (http://www.ghostscript.com/)
# requires ImageMagick
# requires tesseract (https://code.google.com/p/tesseract-ocr/)
# requires GNU parallel (https://www.gnu.org/software/parallel/)
# all of these are typically available through yum/apt/brew/etc.
# number of cores over which the process will be parallelized (first argument to the script)
num_cores=$1
# converts the first 20 pages ([0-19]) of each PDF into a TIFF image so that tesseract can read it
find . -name '*.pdf' | parallel --gnu -j "$num_cores" convert -depth 8 -density 200 {}[0-19] {}.tif
# runs OCR on the found TIFF files and converts them to text. Assumes English, but you can supply
# extra arguments to tesseract
find . -name '*.tif' | parallel -j "$num_cores" tesseract -l eng {} {}
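
The job count handed to -j comes straight from the first script argument. If that value ends up empty or non-numeric, parallel will most likely stop with a parsing error like the one reported in the comments. A minimal sketch of a guard, assuming a hypothetical helper named pick_jobs that is not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): accept the job count only
# if it consists of digits, so parallel never receives an empty or bogus -j.
pick_jobs() {
  case $1 in
    '' | *[!0-9]*)
      # empty string, or contains a character that is not a digit
      echo "usage: run_ocr.sh <num_cores>" >&2
      return 1
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}
```

In the script this could be called as num_cores=$(pick_jobs "$1") || exit 1 before the first find, so a missing argument fails early with a usage message.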
@wchen38

wchen38 commented Mar 2, 2018

Hey, I tried to run your script and got the error below. Is there any way you can help me resolve it?
I have a PDF and this script in the same folder, and I ran:
$ bash run_ocr.sh 1
which failed with:
"parallel: Error: Parsing of --jobs/-j/--max-procs/-P failed."

@ole-tange

ole-tange commented Mar 20, 2018

If you just want to run 1 proc per CPU core:

find . -name '*.pdf' | parallel convert -depth 8 -density 200 {} {.}.tif
find . -name '*.tif' | parallel tesseract -l eng {} {}
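
These commands differ from the script in how output files are named: in GNU parallel, {} is the input line as-is and {.} is the input line with its extension removed, and tesseract appends .txt to whatever output base it is given. A rough bash sketch of the resulting names, using hypothetical helper functions and assuming paths of the form find prints here (for example ./dir/scan.pdf):

```shell
#!/bin/bash
# Illustrative helpers with made-up names; they only mimic the naming and
# assume the last path component always has an extension.

# What "{}.tif" yields: the whole input path plus ".tif".
tif_keep_ext() { printf '%s.tif\n' "$1"; }

# What "{.}.tif" yields: the input path minus its last extension, plus ".tif".
tif_strip_ext() { printf '%s.tif\n' "${1%.*}"; }

# tesseract writes its text output to "<output base>.txt".
txt_from_base() { printf '%s.txt\n' "$1"; }
```

So for ./a/scan.pdf the script's {}.tif form gives ./a/scan.pdf.tif, while the {.}.tif form gives ./a/scan.tif, and the OCR text lands in the same name with .txt appended.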

@jon4thin

If you just want to run 1 proc per CPU core:

find . -name '*.pdf' | parallel convert -depth 8 -density 200 {} {.}.tif
find . -name '*.tif' | parallel tesseract -l eng {} {}

What exactly did you do to run each job on a separate core?
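
For context: nothing in the quoted commands assigns jobs to particular cores. When -j is omitted, GNU parallel defaults to roughly one job per CPU core (whether that means core or hardware thread depends on the parallel version), and the operating system's scheduler decides where each job actually runs. A small sketch for checking what that default corresponds to on a given machine, assuming a hypothetical helper named cpu_count:

```shell
#!/bin/bash
# Hypothetical helper: print the number of processing units available,
# which approximates the job count parallel uses when -j is not given.
cpu_count() {
  local n
  if command -v nproc >/dev/null 2>&1; then
    nproc                                   # GNU coreutils
  elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) && [ -n "$n" ]; then
    printf '%s\n' "$n"                      # fallback, e.g. on macOS
  else
    echo 1                                  # conservative last resort
  fi
}
```

Passing -j "$(cpu_count)" explicitly should then behave much like leaving -j out.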