@jvillemare
Last active September 26, 2021 18:29
OCR images on macOS with one command and open-source Tesseract

OCR scan images on macOS, free and easy

Scanning images with OCR (Optical Character Recognition) is immensely helpful: later, you can find an image just by searching for the text inside it. OCR is big money, so of course, there's no easy, free way to do it with a nice UI. Many OCR apps cost $10, $20, or more, which is unreasonable.

Tesseract is a free, open-source OCR application that many of the paid apps "borrow", repackage, and sell at a high markup. Unfortunately, when I say application, I mean a command-line interface, so it's not terribly intuitive. But we can simplify it. And in the process, spite Adobe and others for trying to resell something that's so incredibly helpful:

Open the Terminal app, type the following command, and hit Enter to install Tesseract:

brew install tesseract
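
To confirm the install worked, a quick check along these lines can help. This is only a sketch: check_tesseract is a made-up helper name (not part of Tesseract or Homebrew), while --version and --list-langs are regular tesseract options.

```shell
# check_tesseract is a hypothetical helper name used only for illustration.
check_tesseract() {
  if command -v tesseract >/dev/null 2>&1; then
    tesseract --version      # print the installed version
    tesseract --list-langs   # list installed language data; the alias below needs "eng"
  else
    echo "tesseract not found; check that the brew install step finished" >&2
    return 1
  fi
}
```

Paste the function into Terminal and run check_tesseract. If it reports that tesseract was not found, the Homebrew step below is probably what's missing.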

If that didn't work, you probably don't have Homebrew installed. Run the following command first, then retry brew install tesseract:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

This command comes from the Homebrew website. Homebrew is a package manager, like apt or apt-get on Linux, that installs ("brews") applications for you.

Now, we need to add an aliased command. We can do that by opening the bash startup file in the nano editor:

cd ~ && nano .bash_profile

This opens the script that runs every time Terminal starts a bash shell.

On macOS, you might be using the newer default shell, zsh (Z shell). I recommend you switch back to bash (since it's superior):

  1. Click Terminal in the menu bar, in the upper-left corner
  2. Click 'Preferences...' ('Settings...' on newer macOS versions)
  3. Find 'Shells open with' and select 'Command (complete path)'
  4. Enter /bin/bash in the command field. Restart Terminal, and retry the above command.
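
If you would rather stay on zsh, the same alias line should also work there; it just belongs in a different startup file. The sketch below shows which file goes with which shell. startup_file_for is a hypothetical helper name, and the mapping assumes Terminal's default behaviour on macOS (bash reads ~/.bash_profile, zsh reads ~/.zshrc).

```shell
# startup_file_for is a hypothetical helper; it maps a shell path to the
# startup file that shell reads for Terminal sessions on macOS.
startup_file_for() {
  case "$1" in
    */zsh)  echo "$HOME/.zshrc" ;;
    */bash) echo "$HOME/.bash_profile" ;;
    *)      echo "unknown shell: $1" >&2; return 1 ;;
  esac
}
```

Running startup_file_for "$SHELL" prints the file to edit for your current login shell.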

Now, in the .bash_profile file, append this line at the bottom:

alias convertpdf='for i in *; do tesseract "$i" "$i" -l eng pdf; done'

This basically means that every time you run the aliased command convertpdf, bash will run every file in the current directory through tesseract, using the English language data (-l eng) and writing a PDF named after each file.
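
One quirk of looping over * is that non-image files (including PDFs from an earlier run) get passed to tesseract too, and the output for shot.png is named shot.png.pdf. If that bothers you, a function along these lines is one possible variant. It is a sketch, not the alias from this guide: convertimages is a made-up name, the list of extensions is an assumption about what your images look like, and it is written for bash.

```shell
# convertimages is a hypothetical name; a bash function variant of the alias
# that only touches common image types and drops the old extension.
convertimages() {
  for i in *.png *.jpg *.jpeg *.tif *.tiff; do
    [ -f "$i" ] || continue              # bash leaves unmatched patterns as-is; skip them
    tesseract "$i" "${i%.*}" -l eng pdf  # shot.png -> shot.pdf
  done
}
```

It would go in .bash_profile just like the alias. Note that shot.png and shot.jpg in the same folder would both write shot.pdf, so the second would overwrite the first.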

Hit Ctrl + X, then y, then Enter to save the file and exit nano.

Restart Terminal. Congratulations, it's set up!

Usage Example

Now say you took a lot of screenshots of something. Put them in a folder on your Desktop; let's say you called it screenshots. Open the Terminal app and change directory to that folder (cd ~/Desktop/screenshots/). Once in that folder, just type convertpdf, and every image will be converted to a PDF.
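
If you want to see what convertpdf is going to produce before running it, a dry run like the following can help. It is only a sketch, and convertpdf_preview is a made-up helper name; it never calls tesseract, it just prints each file next to the PDF name the alias would create for it.

```shell
# convertpdf_preview is a hypothetical helper: a dry run that lists what the
# convertpdf alias would produce, without calling tesseract.
convertpdf_preview() {
  for i in *; do
    [ -f "$i" ] || continue               # subfolders (and an empty folder) produce no PDF
    printf '%s -> %s.pdf\n' "$i" "$i"     # tesseract appends .pdf to the output base
  done
}
```

Run it from inside the screenshots folder. Anything in the list that isn't an image is a file tesseract will complain about.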


The Sad Facts

Tesseract is a one-trick pony, so it only converts images. If you use that exact command, it will convert those images to PDFs with overlaid, searchable text, a gold standard that not many "free" online OCR converters offer.

What's bad is that it converts every single image to its own individual PDF. And now you have a new problem: You probably want to combine the PDFs instead of having tens or hundreds of PDFs of the same document.
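
One possible way around this, going by the tesseract manual page: the input argument can also be a plain text file that lists image paths, one per line, and the results are then combined into a single output file. The sketch below builds such a list and hands it to tesseract. combinepdf is a made-up name, the extension list is an assumption, and I'd treat the behaviour as something to try on a small folder first rather than a guarantee for every Tesseract version.

```shell
# combinepdf is a hypothetical helper name. It relies on tesseract accepting a
# text file of image paths (one per line) as its input argument.
combinepdf() {
  local out list rc
  out="${1:-combined}"                               # output base; tesseract adds .pdf
  list="$(mktemp "${TMPDIR:-/tmp}/ocr-list.XXXXXX")" || return 1
  for i in *.png *.jpg *.jpeg *.tif *.tiff; do
    [ -f "$i" ] && printf '%s\n' "$i" >> "$list"     # one image path per line
  done
  tesseract "$list" "$out" -l eng pdf
  rc=$?
  rm -f "$list"                                      # clean up the temporary list
  return "$rc"
}
```

Running combinepdf notes inside the folder would aim to write notes.pdf with one page per image. Pages follow the shell's alphabetical order within each extension, so shot10.png sorts before shot2.png; zero-padded names (shot02.png) avoid that.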

Unfortunately, there's no app on the Mac App Store that:

  1. Is free
  2. Does NOT contain in-app purchases
  3. Combines PDFs
  4. Preserves the text overlay layer that makes searchable PDFs actually useful

This seems like a supremely low bar to hit, but life is often disappointing. You might think the "free" Adobe Acrobat program might be able to combine PDFs. Since, you know, Adobe invented PDFs in 1993, and they're widely used. About 20% of the Panama Papers were PDFs. But unfortunately, the 500 megabyte Adobe Acrobat program will not combine PDFs unless you A) sign into an Adobe account, and B) pay the same cost as a monthly Netflix subscription.

The native Preview app lets you combine PDFs, but it doesn't preserve the text overlay layer.

There are other hacky solutions like this online, such as this gist of a shell script, this repo of a Python script, and others. I tested the Python script, and it doesn't work (even with some tinkering). The shell script looks over-engineered. The solution presented here is simple and general enough that it should work across different macOS versions, and hopefully into the future.

I recommend you just organize these many PDFs into a folder and give it a sensible name, so it's easy to find when you search for it later.

@davidpfahler

Hi @jvillemare, I hope this helps with your "sad facts". Cheers!
