@jvillemare
Created May 17, 2021 12:50
Basic Python Script for running Tesseract OCR on PDFs
import os # for magick and tesseract commands
import time # for epoch time
import calendar # for epoch time
from PyPDF2 import PdfFileMerger

dir_files = [f for f in os.listdir(".") if os.path.isfile(os.path.join(".", f))]
epoch_time = int(calendar.timegm(time.gmtime()))
print(dir_files)
for file in dir_files: # look at every file in the current directory
    if file.endswith('.pdf'): # if it is a PDF, use it
        print('Working on converting: ' + file)
        # setup
        file = file.replace('.pdf', '') # get just the filepath without the extension
        folder = str(int(epoch_time)) + '_' + file # generate a folder name for temporary images
        combined = folder + '/' + file # come up with temporary export path
        # create folder
        if not os.path.exists(folder): # make the temporary folder
            os.makedirs(folder)
        # convert PDF to PNG(s)
        magick = 'convert -density 150 "' + file + '.pdf" "' + combined + '-%04d.png"'
        print(magick)
        os.system(magick)
        # convert PNG(s) to PDF(s) with OCR data
        pngs = [f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))]
        for pic in pngs:
            if pic.endswith('.png'):
                combined_pic = folder + '/' + pic
                print(combined_pic)
                tesseract = 'tesseract "' + combined_pic + '" "' + combined_pic + '-ocr" PDF'
                print(tesseract)
                os.system(tesseract)
        # combine OCR'd PDFs into one
        # os.listdir returns names in arbitrary order, so sort to keep the pages in sequence
        ocr_pdfs = sorted(f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f)))
        merger = PdfFileMerger()
        for pdf in ocr_pdfs:
            if pdf.endswith('.pdf'):
                merger.append(folder + '/' + pdf)
        merger.write(file + '-ocr-combined.pdf')
        merger.close()
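
The script builds each command as a quoted string and hands it to `os.system`, so a filename containing a double quote would break the command. One possible alternative, offered as a sketch and not as the author's method, is to pass an argument list to `subprocess.run`. The helper name `run_command` is hypothetical.

```python
import subprocess
import sys


def run_command(args):
    """Echo a command given as a list of arguments, run it, and return its exit code."""
    print(" ".join(args))
    # No shell is involved, so filenames need no manual quoting.
    return subprocess.run(args, check=False).returncode


if __name__ == "__main__":
    # Harmless stand-in so the sketch runs without ImageMagick or Tesseract installed.
    run_command([sys.executable, "-c", "print('ok')"])
```

In the script, a call such as `run_command(["convert", "-density", "150", file + ".pdf", combined + "-%04d.png"])` would take the place of the `magick` string and `os.system(magick)`.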
@aliafshany

hi
Could you do this in non-English languages as well?

best

@BromTeque

@aliafshany

Add the language parameter "-l <lan>" to the tesseract command on line 30 and you should be good to go.

tesseract = 'tesseract "' + combined_pic + '" "' + combined_pic + '-ocr" -l <lan> PDF'

Remember to install the corresponding language pack for tesseract-ocr.
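
As a minimal sketch of that change, the command can be assembled with the language as an optional parameter. The helper name `build_tesseract_command` and the example code `deu` are illustrative assumptions, not part of the gist.

```python
def build_tesseract_command(image_path, lang=None):
    """Build the argument list for one OCR pass that writes a searchable PDF.

    Hypothetical helper: mirrors the gist's tesseract command, plus an optional -l flag.
    """
    command = ["tesseract", image_path, image_path + "-ocr"]
    if lang:
        # e.g. "deu"; the matching language data must be installed
        command += ["-l", lang]
    command.append("pdf")  # config name that selects PDF output
    return command
```

The returned list can be passed to `subprocess.run`, which avoids quoting the paths by hand.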

@coder-curious

@BromTeque
Hi, I want to run the script on PDF pages containing both English and a non-English language. I realized this can be done by joining the language codes with '+', but the order also assigns relative priority to the languages (based on which one comes before and after the + symbol). The way to resolve that is probably langdetect, but I can't figure out how to code it. Can you please help me?
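
One way this could be approached, as a rough sketch only: rank the detected languages by probability and join their Tesseract codes with '+', most likely first. The function name `language_argument` and the mapping table are hypothetical. The sketch also assumes some sample text is already available to run detection on, for example from a first OCR pass.

```python
# Hypothetical mapping from two-letter codes (the style langdetect reports)
# to Tesseract language codes; extend it for the languages you need.
TESSERACT_CODES = {"en": "eng", "de": "deu", "fr": "fra", "hi": "hin"}


def language_argument(scores, fallback="eng"):
    """Turn (two_letter_code, probability) pairs into a value for tesseract's -l flag.

    The most probable language comes first; unknown codes are skipped.
    """
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    codes = []
    for iso_code, _probability in ranked:
        tess_code = TESSERACT_CODES.get(iso_code)
        if tess_code and tess_code not in codes:
            codes.append(tess_code)
    return "+".join(codes) if codes else fallback
```

With langdetect installed, the pairs could come from `[(r.lang, r.prob) for r in detect_langs(sample_text)]`, and the result would go after `-l` in the tesseract command.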

@BromTeque

@coder-curious

I don't think I'll be able to help you. I'm a very busy individual at the moment. It sounds like you almost got it. A bit more tinkering and you'll probably figure it out. Good luck!

@kentsin

kentsin commented Jun 23, 2022

Thank you. The video is very helpful

What if I want txt instead of PDF?

Thanks again.
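
A possible sketch for text output: Tesseract writes plain text when the `txt` config is named instead of `PDF`, producing one `.txt` file per page image. The pages can then be concatenated in filename order, with no PDF merger involved. Both helper names are hypothetical.

```python
import os


def text_command(image_path):
    """Argument list for one OCR pass that writes <image_path>-ocr.txt."""
    return ["tesseract", image_path, image_path + "-ocr", "txt"]


def combine_text_pages(folder, output_path):
    """Concatenate every .txt file in folder, in sorted name order, into output_path."""
    pages = sorted(f for f in os.listdir(folder) if f.endswith(".txt"))
    with open(output_path, "w", encoding="utf-8") as out:
        for name in pages:
            with open(os.path.join(folder, name), encoding="utf-8") as page:
                out.write(page.read())
    return pages
```

Because the page images are numbered with `%04d`, sorting the filenames keeps the pages in document order.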

@LetMeInRuhe28

Hi, how do I add .tif files to your code with ImageMagick?
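
One option, sketched under assumptions: ImageMagick's `convert` reads TIFF input the same way it reads PDF, so the extension check could be widened and the rest of the pipeline left alone. The names `SOURCE_EXTENSIONS`, `split_source`, and `build_convert_command` are hypothetical.

```python
import os

# Extensions the loop would accept in place of the single '.pdf' check.
SOURCE_EXTENSIONS = (".pdf", ".tif", ".tiff")


def split_source(filename):
    """Return (stem, extension) if the file should be processed, else None."""
    stem, ext = os.path.splitext(filename)
    if ext.lower() in SOURCE_EXTENSIONS:
        return stem, ext
    return None


def build_convert_command(filename, folder, density=150):
    """Argument list that asks ImageMagick to write numbered PNG pages into folder."""
    stem, _ext = os.path.splitext(filename)
    pattern = os.path.join(folder, stem + "-%04d.png")
    return ["convert", "-density", str(density), filename, pattern]
```

Tesseract can also read TIFF images directly, so for TIFF input the PNG conversion step may be unnecessary.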

@imperor1

imperor1 commented Mar 8, 2023

Hey guys, note that PdfFileMerger is deprecated.
Change "from PyPDF2 import PdfFileMerger" to "from PyPDF2 import PdfMerger" and "PdfFileMerger()" to "PdfMerger()".
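
For a script that has to run against both older and newer PyPDF2 releases, one possible sketch is to look up whichever class name the installed version exposes. The helper name `resolve_merger_class` is hypothetical, and it returns `None` when the library is not installed.

```python
import importlib


def resolve_merger_class(module_name="PyPDF2"):
    """Return the merger class the installed library exposes, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Prefer the newer name; fall back to the older one for old releases.
    for name in ("PdfMerger", "PdfFileMerger"):
        merger_class = getattr(module, name, None)
        if merger_class is not None:
            return merger_class
    return None
```

In the script, `merger = resolve_merger_class()()` would then replace `merger = PdfFileMerger()`, after checking that the result is not `None`.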