James Villemarette jvillemare

jvillemare / convert.py
Created May 17, 2021 12:50
Basic Python Script for running Tesseract OCR on PDFs
import os # for magick and tesseract commands
import time # for epoch time
import calendar # for epoch time
from PyPDF2 import PdfFileMerger
dir_files = [f for f in os.listdir(".") if os.path.isfile(os.path.join(".", f))]
epoch_time = int(calendar.timegm(time.gmtime()))
print(dir_files)
for file in dir_files: # look at every file in the current directory
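
The preview stops at the loop header, so the original loop body is not shown here. As a rough sketch only, one iteration of such a loop might rasterize a PDF with ImageMagick and pass the result to Tesseract. This assumes `magick` and `tesseract` are both on the `PATH`. The helper names (`is_pdf`, `build_commands`, `run_commands`) and the output file naming are hypothetical and not taken from the gist. The sketch uses `subprocess` rather than `os.system`, and it does not cover merging with `PdfFileMerger`.

```python
import os
import subprocess


def is_pdf(name):
    """Return True when the file name has a .pdf extension (case-insensitive)."""
    return name.lower().endswith(".pdf")


def build_commands(pdf_name, epoch_time):
    """Build the two commands for one PDF (hypothetical helper, assumed naming)."""
    stem = os.path.splitext(pdf_name)[0]
    tiff_name = f"{stem}_{epoch_time}.tiff"
    out_base = f"{stem}_{epoch_time}_ocr"
    # Rasterize the PDF at 300 DPI into a TIFF that Tesseract can read.
    magick_cmd = ["magick", "-density", "300", pdf_name, tiff_name]
    # The trailing "pdf" config asks Tesseract to write out_base + ".pdf".
    tesseract_cmd = ["tesseract", tiff_name, out_base, "pdf"]
    return [magick_cmd, tesseract_cmd]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Inside the loop, this could be used as `if is_pdf(file): run_commands(build_commands(file, epoch_time))`. Passing argument lists instead of a shell string means file names containing spaces need no extra quoting.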
jvillemare / readme.md
Last active September 26, 2021 18:29
OCR images on macOS with one command and open-source Tesseract

OCR-scan images on macOS, free and easy

Running OCR (Optical Character Recognition) on your images makes them searchable: you can later find an image using only the text that appears in it. OCR is big money, so of course there's no easy, free way to do it with a nice UI. Many of these apps cost $10, $20, or more, which is unreasonable.

Tesseract is a free, open-source OCR application that many of the paid apps "borrow", repackage, and sell at a high markup. Unfortunately, when I say application, I mean a command-line interface, so it's not terribly intuitive. But we can simplify it.
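
To give a feel for what "simplifying" can look like, here is a minimal sketch that wraps the Tesseract command line in a small Python function. It assumes `tesseract` is installed and on the `PATH` (on macOS, for example, via Homebrew). The function names `tesseract_command` and `ocr_image` are made up for this illustration and are not part of Tesseract.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng", searchable_pdf=False):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG [pdf]."""
    image = Path(image_path)
    # Tesseract appends the extension (.txt or .pdf) to the output base itself.
    out_base = image.with_suffix("")
    cmd = ["tesseract", str(image), str(out_base), "-l", lang]
    if searchable_pdf:
        # The "pdf" config produces a searchable PDF instead of plain text.
        cmd.append("pdf")
    return cmd


def ocr_image(image_path, **options):
    """Run Tesseract on one image; raises CalledProcessError if it fails."""
    subprocess.run(tesseract_command(image_path, **options), check=True)
```

With this in place, `ocr_image("receipt.png")` would write `receipt.txt` next to the image, and `ocr_image("receipt.png", searchable_pdf=True)` would write `receipt.pdf`.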