Install ImageMagick for image conversion:
brew install imagemagick
Install tesseract for OCR:
brew install tesseract --all-languages
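Or install without --all-languages and add language data manually as needed. Newer Homebrew releases no longer accept formula options such as --all-languages; there, the extra languages come from the separate tesseract-lang formula.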
Make sure the input image is a grayscale .tif and fairly large. ~500x150 was too small, while ~2000x500 worked very well.
convert input.png -resize 400% -type Grayscale input.tif
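The fixed 400% above suits an image around 500 px wide. As a rough sketch, the percentage could instead be computed from the actual width. The scale_percent helper and the 2000 px target are assumptions for illustration, taken from the sizes mentioned above, not something tesseract requires:

```shell
#!/bin/sh
# scale_percent WIDTH TARGET (hypothetical helper)
# Prints the integer -resize percentage, rounded up, that brings WIDTH px
# to at least TARGET px. Prints 100 when the image is already wide enough.
scale_percent() {
    width=$1
    target=$2
    if [ "$width" -ge "$target" ]; then
        echo 100
    else
        echo $(( (target * 100 + width - 1) / width ))
    fi
}

# Example use: only runs when input.png and ImageMagick are both present.
if [ -f input.png ] && command -v identify >/dev/null 2>&1; then
    w=$(identify -format '%w' input.png)
    convert input.png -resize "$(scale_percent "$w" 2000)%" -type Grayscale input.tif
fi
```

For a 500 px wide image this yields the same 400% used above.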
OCR it. The default language is English. Language codes are three characters long; see man tesseract.
tesseract -l eng input.tif output
This creates output.txt.
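With several images, the same command can be run in a loop. A minimal sketch, where out_base and ocr_all are hypothetical helper names and each file name is assumed to end in an extension such as .tif:

```shell
#!/bin/sh
# out_base FILE (hypothetical helper)
# Strips the final extension, giving the output base passed to tesseract.
out_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_all LANG FILE... (hypothetical helper)
# OCRs each file; tesseract appends .txt to the output base itself.
ocr_all() {
    lang=$1
    shift
    for f in "$@"; do
        tesseract -l "$lang" "$f" "$(out_base "$f")"
    done
}
```

For example, ocr_all eng page1.tif page2.tif would write page1.txt and page2.txt next to the images.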
If you landed here looking to convert a scanned PDF to an OCRable format:
I found that ImageMagick's PDF-to-TIFF output was garbled and distorted. I couldn't find the right flag to increase the resolution, so I tried Ghostscript instead (which ImageMagick might use under the hood):
gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=output.tif myscan.pdf -c quit
Buyer beware: read the Ghostscript manual first:
man gs
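For a multi-page scan, Ghostscript can write one TIFF per page when the output file name contains a %d-style pattern, and each page can then be OCRed separately. A sketch under those assumptions; page_pattern, pdf_to_tiffs and ocr_pages are hypothetical helper names, and the 300 dpi default simply mirrors the command above:

```shell
#!/bin/sh
# page_pattern PREFIX (hypothetical helper)
# Prints the per-page output name pattern, e.g. scan-%03d.tif.
page_pattern() {
    printf '%s\n' "$1-%03d.tif"
}

# pdf_to_tiffs PDF PREFIX [DPI] (hypothetical helper)
# Renders each PDF page to PREFIX-001.tif, PREFIX-002.tif, ...
pdf_to_tiffs() {
    pdf=$1
    prefix=$2
    dpi=${3:-300}
    gs -q -r"${dpi}x${dpi}" -dNOPAUSE -dBATCH -sDEVICE=tiffg4 \
       -sOutputFile="$(page_pattern "$prefix")" "$pdf"
}

# ocr_pages PREFIX [LANG] (hypothetical helper)
# Runs tesseract on every rendered page, writing PREFIX-NNN.txt files.
ocr_pages() {
    prefix=$1
    lang=${2:-eng}
    for f in "$prefix"-*.tif; do
        [ -f "$f" ] || continue   # skip the unexpanded glob when nothing matched
        tesseract -l "$lang" "$f" "${f%.tif}"
    done
}
```

Something like pdf_to_tiffs myscan.pdf page followed by ocr_pages page would then leave one text file per page.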