@mermelstein
Created May 3, 2024 00:46
Extract text from a PDF with OCR when the text isn't easy to copy
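from pdf2image import convert_from_path  # requires Poppler to be installed
import pytesseract  # requires the Tesseract binary to be installed

# Convert the PDF to a list of PIL images, one per page
images = convert_from_path('path_to_pdf.pdf')

# Run Tesseract OCR on each page and save the text to its own file
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img, lang='eng')
    # Write UTF-8 explicitly so non-ASCII characters survive on any platform
    with open(f'page_{i+1}.txt', 'w', encoding='utf-8') as f:
        f.write(text)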
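If one combined file is more convenient than one file per page, the same approach can be wrapped in a small function. This is only a sketch: `join_pages` and `ocr_pdf_to_file` are hypothetical names, not part of the original gist, and the `dpi=300` default is an assumption (pdf2image renders at a lower resolution by default; higher values may help OCR accuracy at the cost of speed and memory). It still needs the Tesseract and Poppler binaries installed alongside the Python packages.

```python
from pathlib import Path


def join_pages(page_texts):
    """Merge per-page OCR output into one string with page markers."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # strip() drops the leading/trailing whitespace OCR output often carries
        parts.append(f"--- page {number} ---\n{text.strip()}\n")
    return "\n".join(parts)


def ocr_pdf_to_file(pdf_path, out_path, dpi=300, lang="eng"):
    """OCR every page of pdf_path and write the merged text to out_path.

    Hypothetical helper; dpi=300 is an assumed default, not from the gist.
    Returns the number of pages processed.
    """
    # Imported lazily so join_pages stays usable without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)
    texts = [pytesseract.image_to_string(img, lang=lang) for img in images]
    Path(out_path).write_text(join_pages(texts), encoding="utf-8")
    return len(texts)
```

Calling `ocr_pdf_to_file('path_to_pdf.pdf', 'output.txt')` would then produce a single text file with a marker line before each page.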