jvillemare/readme.md

OCR scan images on macOS for free, and easily

Scanning images with OCR (Optical Character Recognition) is immensely helpful: later,
you can find an image just by searching for the text inside it.
OCR is big money, so of course, there's no easy, free way to do it with a nice UI. Many of
these apps cost $10, $20, or more, which is unreasonable.
Tesseract is a free, open-source OCR application that many of the paid apps "borrow",
repackage, and sell at a high markup. Unfortunately, when I say application, I mean
a command line interface. So, it's not terribly intuitive. But we can simplify it
and, in the process, spite Adobe and others for trying to resell something that's so
incredibly helpful.
Open the Terminal app, type the following, and hit Enter to install Tesseract:
brew install tesseract

If that didn't work, you probably don't have Homebrew installed. Run the
following command first, then retry the install:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

This command comes from the Homebrew website. Homebrew is basically a package
manager, like apt or apt-get on Linux, that installs ("brews") applications for you.
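If you are not sure which of the two installs you still need, a quick check along these lines might help. This is only a sketch: `have_cmd` and `check_ocr_setup` are made-up helper names, not part of Homebrew or Tesseract, and the function only prints advice; it installs nothing.

```shell
# Return success if the named command is on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print what still needs installing (hypothetical helper; prints advice only).
check_ocr_setup() {
  if ! have_cmd brew; then
    echo "Homebrew is missing: run the install command from the Homebrew website first"
  elif ! have_cmd tesseract; then
    echo "Homebrew found, tesseract missing: run 'brew install tesseract'"
  else
    echo "tesseract is installed"
  fi
}
```

Paste it into Terminal and run check_ocr_setup to see which case applies to you.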
Now, we need to add an aliased command. We can do that with:
cd ~ && nano .bash_profile

This opens .bash_profile, the script that runs every time you start a bash shell in Terminal.
On macOS, you might be using the new default, zsh (Z shell). I recommend you
switch back to bash (since it's superior) by:

Clicking Terminal in the menu bar in the upper-left corner
Clicking 'Preferences...' ('Settings...' on newer versions of macOS)
Finding 'Shells open with' and selecting 'Command (complete path)'
Entering /bin/bash in the command field
Restarting Terminal, and retrying the above command
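If you are unsure which shell Terminal is actually launching, something like the following could tell you, along with which startup file an alias would belong in. Treat it as a sketch under assumptions: `profile_for_shell` is a made-up name, and the zsh branch assumes that the same alias line would also work from ~/.zshrc if you would rather not switch shells, which this guide does not otherwise cover.

```shell
# Print the startup file an alias would go in for a given shell path.
# Assumption: ~/.zshrc for zsh, ~/.bash_profile for anything else.
profile_for_shell() {
  case "$1" in
    */zsh) echo "$HOME/.zshrc" ;;
    *)     echo "$HOME/.bash_profile" ;;
  esac
}

# Show the shell Terminal launched and the file to edit for it.
echo "Login shell: ${SHELL:-unknown}"
echo "Edit: $(profile_for_shell "${SHELL:-}")"
```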

Now, in the .bash_profile file, append this line at the bottom:
alias convertpdf='for i in *; do tesseract "$i" "$i" -l eng pdf; done'

This basically means that every time you run the aliased command convertpdf,
bash will run every file in the current directory through tesseract and write a PDF next to it.
Hit Ctrl + X, then hit y and Enter to save the file.
Restart Terminal. Congratulations, it's set up!
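Because the alias loops over every file, it also hands non-image files (and, on a second run, the PDFs it already made) to tesseract, which will most likely just complain about them. A possible variant that skips those is sketched below. The function name `convertimages` and the list of extensions are my own assumptions; the tesseract call itself is the same one the alias uses.

```shell
# Hypothetical alternative to the alias: OCR only image files in the current folder.
convertimages() {
  for i in *; do
    # Skip anything that is not a regular file with a common image extension.
    [ -f "$i" ] || continue
    case "$i" in
      *.png|*.PNG|*.jpg|*.JPG|*.jpeg|*.JPEG|*.tif|*.TIF|*.tiff|*.TIFF) ;;
      *) continue ;;
    esac
    # Same call as the alias: the output base is the input name,
    # so shot.png becomes shot.png.pdf.
    tesseract "$i" "$i" -l eng pdf
  done
}
```

It could go in .bash_profile alongside the alias, and is run the same way: cd into the folder and type convertimages.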
Usage Example

Now say you took a lot of screenshots of something. Put
them in a folder on your Desktop. Let's say you called this folder
screenshots. Open the Terminal app, and change directory
(cd ~/Desktop/screenshots/) to it. Once in that folder, just type convertpdf,
and every image will be converted to a PDF.
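With a big folder, it can be hard to tell whether every screenshot actually got its PDF. A small check like this one might help. It is a bash sketch with a made-up name (`missing_pdfs`), and it assumes the alias's naming, where shot.png produces shot.png.pdf, and only looks at .png, .jpg, and .jpeg files.

```shell
# List images in a folder that have no matching "<name>.pdf" next to them.
# Assumes the alias's naming, where shot.png produces shot.png.pdf.
missing_pdfs() {
  dir="${1:-.}"
  for img in "$dir"/*.png "$dir"/*.jpg "$dir"/*.jpeg; do
    [ -f "$img" ] || continue          # unmatched patterns stay literal in bash; skip them
    [ -f "$img.pdf" ] || echo "$img"   # print images that were not converted
  done
}
```

Run missing_pdfs inside the screenshots folder after convertpdf finishes; no output means every image it looked at has a PDF.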

The Sad Facts

Tesseract is a one-trick pony, so it only converts images. If you use
that exact command, it will convert those images to PDFs with overlaid,
searchable text, a gold standard that not many "free" online OCR converters
meet.
What's bad is that it converts every single image to its own individual
PDF. And now you have a new problem: you probably want to combine the PDFs
instead of having tens or hundreds of PDFs of the same document.
Unfortunately, I couldn't find an app on the Mac App Store that:

Is free
Does NOT contain in-app purchases
Combines PDFs
Preserves the text overlay layer that makes searchable PDFs actually useful

This seems like a supremely low bar to hit, but life is often disappointing.
You might think the "free" Adobe Acrobat program would be able to combine PDFs
since, you know, Adobe invented PDFs in 1993,
and they're widely used. About 20% of the Panama Papers were PDFs.
But unfortunately, the 500 megabyte Adobe Acrobat program will not combine PDFs
unless you A) sign into an Adobe account, and B) pay about the same as a monthly Netflix subscription.
The native Preview app lets you combine PDFs, but it doesn't preserve the text overlay
layer.
There are other hacky solutions online, like this gist of a shell script,
this repo of a Python script,
and others. But I tested the Python script, and it doesn't work (even with some
tinkering). The shell script looks over-engineered. The solution presented
here is simple and general enough that it should work across different versions of macOS,
and hopefully into the future.
I recommend you just organize these many PDFs into a folder, name it something smart, and
it will be helpful when searching for it later.
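If separate PDFs turn out to be too unwieldy, one possibility that avoids a merge tool entirely is to have tesseract build a single PDF in the first place. The Tesseract manual describes passing a text file that lists one image name per line as the input, with the results combined into one output file. I have not tested the sketch below against a real folder, so treat it as an assumption-laden starting point: `convertcombined` and the `.filelist.txt` name are made up, and it only picks up .png files, in the alphabetical order the shell lists them.

```shell
# Hypothetical helper: OCR every PNG in the current folder into ONE PDF.
# Assumes tesseract accepts a text file of image names as its input.
convertcombined() {
  out="${1:-combined}"            # output base name; tesseract appends .pdf
  list="$out.filelist.txt"        # temporary list of images, one per line
  : > "$list"
  for i in *.png; do
    [ -f "$i" ] && echo "$i" >> "$list"
  done
  if [ -s "$list" ]; then
    tesseract "$list" "$out" -l eng pdf
  else
    echo "no PNG files found" >&2
  fi
  rm -f "$list"
}
```

Running convertcombined notes inside the screenshots folder would then be expected to produce a single notes.pdf, with pages in the same order as the file names sort.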