@vgalin
Last active February 8, 2021 18:00
From paper to searchable PDF

From sheet of paper to searchable PDF file

0. Prerequisites

If you don't have it yet, get pip by installing Python.

Then use pip to install the following two tools:

| Name | PyPI link | Install command (paste this into your terminal) |
| --- | --- | --- |
| img2pdf | https://pypi.org/project/img2pdf/ | pip install img2pdf |
| OCRmyPDF | https://pypi.org/project/ocrmypdf/4.1/ | pip install ocrmypdf |

Note: if installing OCRmyPDF fails on your Windows machine, you may want to use WSL or OCRmyPDF's Docker image instead (see the OCRmyPDF installation documentation).
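Before moving on, it can help to confirm that both tools actually ended up on your PATH. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is made up for this example, and it assumes pip installed the `img2pdf` and `ocrmypdf` commands under those names.

```python
import shutil

# Command names assumed to be installed by `pip install img2pdf ocrmypdf`.
REQUIRED_TOOLS = ["img2pdf", "ocrmypdf"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("img2pdf and ocrmypdf are both available.")
```

If a tool is reported as missing right after a successful install, the directory pip installs scripts into is probably not on your PATH.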

1. Scan the pages or take pictures of them.

If you don't have a scanner or don't want to spend too much time on this step, you can take pictures of the pages with, for example, your phone's camera. You'll get better results if you hold the camera directly above the pages.

2. Transfer the pictures to your computer

You can use a USB cable, cloud storage, etc.
Put all the pictures into a single folder on your computer.

3. Run img2pdf

Use img2pdf to combine the pictures into one PDF file (out.pdf).
From the directory you put all the pictures in, run the following command:

img2pdf *.jpg -o out.pdf
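One thing to watch for: the shell expands `*.jpg` in alphabetical order, so a file named `page10.jpg` comes before `page2.jpg` and the pages can end up shuffled in the PDF. The sketch below is one possible workaround, not part of img2pdf itself: it sorts the file names in "natural" order and prints the equivalent command. The function names and the `page…` file names are hypothetical.

```python
import re
import shlex
from pathlib import Path


def natural_key(name):
    """Split a name into text and number chunks so 'page2' sorts before 'page10'."""
    chunks = re.split(r"(\d+)", name)
    # With a capturing group, odd positions hold the digit runs.
    return [int(chunk) if index % 2 else chunk.lower()
            for index, chunk in enumerate(chunks)]


def build_img2pdf_command(image_names, output="out.pdf"):
    """Build the img2pdf command line with the pages in natural order."""
    ordered = sorted(image_names, key=natural_key)
    return ["img2pdf", *ordered, "-o", output]


if __name__ == "__main__":
    # Collect the .jpg files of the current directory, as `*.jpg` would.
    names = [path.name for path in Path(".").glob("*.jpg")]
    print(shlex.join(build_img2pdf_command(names)))
```

You can paste the printed command into your terminal, or pass the list to `subprocess.run(command, check=True)` to run it directly.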

4. Run OCRmyPDF

Use OCRmyPDF to perform OCR (Optical Character Recognition) on the PDF file you just generated.
From the directory you put all the pictures in, run the following command:

ocrmypdf out.pdf result.pdf --deskew --clean --remove-background --sidecar

This command will create two files:

  • result.pdf, a searchable and copy-pastable PDF
  • result.pdf.txt, a text file containing all the OCR-ed sentences, words and characters.
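The sidecar text file is handy for quick searches without opening the PDF. Here is a small sketch that reports which pages mention a given word. It assumes that pages in the sidecar file are separated by form feed characters (`\f`), which is worth checking against your own output; the function name and the search term `invoice` are only examples.

```python
from pathlib import Path


def pages_containing(sidecar_text, word):
    """Return 1-based page numbers whose OCR text contains `word` (case-insensitive)."""
    # Assumption: pages in the sidecar file are separated by form feed characters.
    pages = sidecar_text.split("\f")
    needle = word.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


if __name__ == "__main__":
    sidecar = Path("result.pdf.txt")
    if sidecar.exists():
        text = sidecar.read_text(encoding="utf-8")
        # "invoice" is a placeholder search term.
        print(pages_containing(text, "invoice"))
```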

Note: You can adjust ocrmypdf's flags to get a result that suits you better. For instance, if you don't want a .txt file alongside your PDF, remove the --sidecar flag. You can also try removing the --clean flag if your original document contains colored content that you want to keep, but this may degrade the OCR quality.
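If you find yourself switching these flags on and off often, a tiny wrapper can save some typing. This is only a sketch: `build_ocrmypdf_command` and its `clean` and `sidecar` parameters are names invented here, and the flags it emits are exactly the ones from the command in step 4.

```python
import shlex


def build_ocrmypdf_command(source="out.pdf", target="result.pdf",
                           clean=True, sidecar=True):
    """Build the step 4 ocrmypdf command, with --clean and --sidecar optional."""
    command = ["ocrmypdf", source, target, "--deskew"]
    if clean:
        command.append("--clean")
    command.append("--remove-background")
    if sidecar:
        # Left last and without a value, as in the original command.
        command.append("--sidecar")
    return command


if __name__ == "__main__":
    # Example: keep the cleaning step but skip the .txt file.
    print(shlex.join(build_ocrmypdf_command(sidecar=False)))
```

The printed line can be pasted into your terminal, or the list can be handed to `subprocess.run(command, check=True)` to run ocrmypdf from Python.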
