OCR on OS X with tesseract

Install ImageMagick for image conversion:

brew install imagemagick

Install tesseract for OCR:

brew install tesseract --all-languages

Or install without --all-languages and install them manually as needed.

Make sure the input image is a grayscale .tif and fairly large. ~500x150 was too small, while ~2000x500 worked very well.

convert input.png -resize 400% -type Grayscale input.tif
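
The 400% factor above suits an input roughly 500 pixels wide. For other sizes, one rough way to pick the factor is to scale toward a target width. The sketch below assumes a target of about 2000 pixels (taken from the sizes mentioned above); the helper name resize_percent is made up for illustration.

```shell
# resize_percent WIDTH [TARGET]: print the integer percentage that scales
# WIDTH up to roughly TARGET pixels (default 2000, an assumed target).
# Never prints less than 100, so large images are not shrunk.
resize_percent() {
    local width="$1"
    local target="${2:-2000}"
    local pct=$(( (target * 100 + width - 1) / width ))  # integer division, rounded up
    if [ "$pct" -lt 100 ]; then
        pct=100
    fi
    echo "$pct"
}

# Example usage (requires ImageMagick):
#   width=$(identify -format '%w' input.png)
#   convert input.png -resize "$(resize_percent "$width")%" -type Grayscale input.tif
```

Whether upscaling helps at all depends on the image, so it may be worth comparing OCR results with and without the resize.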

OCR it. The default language is English. Language codes are three characters long; see man tesseract.

tesseract -l eng input.tif output

This creates output.txt.
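
To process several images at once, the two commands can be wrapped in a loop. This is a minimal sketch rather than part of the original steps: the function names ocr_basename and ocr_images are hypothetical, and it assumes convert and tesseract are on the PATH and that each input file name has an extension.

```shell
# ocr_basename FILE: print FILE without its final extension,
# e.g. scans/page1.png -> scans/page1 (assumes FILE has an extension).
ocr_basename() {
    printf '%s\n' "${1%.*}"
}

# ocr_images LANG FILE...: convert each image to an enlarged grayscale TIFF
# and run tesseract on it, writing <basename>.txt next to the input.
ocr_images() {
    local lang="$1"
    shift
    local img base
    for img in "$@"; do
        base="$(ocr_basename "$img")"
        convert "$img" -resize 400% -type Grayscale "$base.tif" || return 1
        tesseract -l "$lang" "$base.tif" "$base" || return 1
    done
}

# Example usage:
#   ocr_images eng scans/*.png
```

Each input then gets its own text file instead of every run overwriting output.txt.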

robindang commented May 2, 2014

thank you

scruss commented Jun 27, 2014

I needed to build this as a prerequisite:

brew install leptonica --with-libtiff

before Tesseract would load TIFF files.

I'm looking forward to Tesseract 3.03 under Homebrew (I think you can build it with --HEAD) as it supports writing the image + text as a PDF:

tesseract infile.tif outfile pdf

brikis98 commented Apr 21, 2015

Thanks for posting this, saved me lots of time :)

GregBaugues commented Jun 19, 2015

Also thank you. This was great.

coolya commented Oct 8, 2015

👍

mikedewar commented Jan 8, 2016

👍

sterlingwes commented Jan 31, 2016

If you landed here looking to convert a scanned PDF to an OCRable format:

I found that imagemagick's PDF-to-TIFF output was all garbled / distorted. Couldn't find the right flag to increase the resolution, so I tried Ghostscript instead (which imagemagick might use under the hood):

gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=output.tif myscan.pdf -c quit

buyer beware, do: man gs first
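
A scanned PDF usually has several pages. As a hedged sketch building on the command above: Ghostscript expands a %d-style pattern in -sOutputFile to the page number, so each page can land in its own TIFF. The function name pdf_to_tiffs and the prefix-NNN.tif naming are illustrative choices, not from the original comment.

```shell
# pdf_to_tiffs PDF [PREFIX]: render each page of PDF to PREFIX-001.tif,
# PREFIX-002.tif, ... at 300 dpi (PREFIX defaults to "page").
pdf_to_tiffs() {
    local pdf="$1"
    local prefix="${2:-page}"
    # %03d in -sOutputFile makes Ghostscript write one numbered file per page;
    # -dBATCH makes gs exit once the input has been processed.
    gs -q -r300x300 -dNOPAUSE -dBATCH -sDEVICE=tiffg4 \
        -sOutputFile="${prefix}-%03d.tif" "$pdf"
}

# Example usage:
#   pdf_to_tiffs myscan.pdf scan
```

The resulting per-page TIFFs can then be passed to tesseract one at a time.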

pconerly commented Feb 28, 2016

FYI, the --all-languages flag has been deprecated in favor of --with-all-languages. Thanks for the guide!

stoivo commented Apr 26, 2016

Is there a way I can install only the Norwegian language?

flyingmrwang commented Apr 26, 2016

I followed your tutorial, but it shows:
"dyld: Library not loaded: /usr/local/opt/leptonica/lib/liblept.4.dylib
Referenced from: /usr/local/bin/tesseract
Reason: image not found
Trace/BPT trap: 5"
Can anyone tell me how to solve it?

wangchuande commented May 31, 2016

thanks a lot

chuckyukai commented Jul 13, 2016

tesseract: --all-languages was deprecated; using --with-all-languages instead!

yangboz commented Nov 22, 2016

Warning: tesseract: --all-languages was deprecated; using --with-all-languages instead!

ekovacs commented Jan 30, 2017

👍 🥇
thank you! you saved me much frustration and time!!!

JohnTian commented May 27, 2017

👍 😄

Sentinel-Prime commented Jun 22, 2017

Thanks a lot, you saved my life.

Sentinel-Prime commented Jun 23, 2017

What if I want to convert images to text files in bulk?

DJLunacy commented Jul 17, 2017

Does converting a png to tiff increase the odds of accurate OCR?

BVijayKrishna commented Sep 28, 2017

It's working, thanks for the help.

HjoshM commented Feb 16, 2018

A long time ago, I installed tesseract 3.05.01 for OCR using Homebrew:

brew install --with-training-tools tesseract

How do I update it to the latest? I thought by regularly running the following, this would be done:

brew update
brew upgrade
brew outdated

However, my tesseract has not been updated at all...

kenshinji commented Mar 28, 2018

@flyingmrwang did you figure it out? I encountered the same issue.

olimorris commented Oct 29, 2018

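Love this tip. I created a bash function to make it even easier to run on English text:

function extract-text {
    # Usage: extract-text IMAGE; writes the OCR result to output.txt
    local filepath="$1"
    # Upscale and convert to a grayscale TIFF before OCR
    convert "$filepath" -resize 400% -type Grayscale "$filepath.tif"
    tesseract -l eng "$filepath.tif" output
}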

sshaw commented Nov 24, 2018

For imagemagick: stable 7.0.8-14 on OS X 10.9.5 I had to install with --with-fontconfig:

brew install imagemagick --with-fontconfig

Compare your results with and without the resize, as resizing can use more resources without improving the OCR.

For OCR improvements see Improving Quality or try training.

simonkeng commented Jan 13, 2019

Is there a brew route for getting & running the latest version of tesseract (LSTM-based, 4.0.0)?

I'm currently using:

$ tesseract --version
tesseract 3.05.01
 leptonica-1.74.4
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

..on macOS, High Sierra 10.13.4 17E202 x86_64. Thanks!

gabedot commented Feb 5, 2019

All languages option is not working

brew install tesseract --all-languages
Error: invalid option: --all-languages

brew --version
Homebrew 2.0.0
Homebrew/homebrew-core (git revision a761; last commit 2019-02-05)
Homebrew/homebrew-cask (git revision 1e6e6; last commit 2019-02-05)

varenc commented Feb 6, 2019

@gabedot

Homebrew recently decided to remove all options from the homebrew-core formulae. As of right now, though, tesseract includes all languages by default, so just remove the option and you should get all languages. This makes tesseract 680MB by default, so I think this should change in the future.

In the short to medium term, you can install tesseract with all-language support from a pinned version of the formula:

brew install https://github.com/Homebrew/homebrew-core/raw/10708da5492fa4da6fbf2618210681953219409f/Formula/tesseract.rb

That's just a reference to one particular version of the formula, though, so it won't receive future updates.
