If you don't have it yet, get pip by installing Python.
Then use pip to install the following two tools:
Name | PyPI link | Install command (paste this into your terminal) |
---|---|---|
img2pdf | https://pypi.org/project/img2pdf/ | pip install img2pdf |
OCRmyPDF | https://pypi.org/project/ocrmypdf/4.1/ | pip install ocrmypdf |
Note: if installing OCRmyPDF fails on your Windows machine, you may want to use WSL or OCRmyPDF's Docker image (more info here).
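Before moving on, it can help to confirm that both tools ended up on your PATH. The following is a minimal sketch, not part of either tool: `missing_tools` is a hypothetical helper name, and it relies only on the standard `command -v` shell builtin.

```shell
# Print the name of every tool given as an argument that is NOT found on PATH.
# Prints nothing when all of them are available.
# (missing_tools is a hypothetical helper name used for this sketch.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools img2pdf ocrmypdf` should print nothing once both installs have succeeded. If a name is printed, the matching `pip install` probably failed, or pip's script directory is not on your PATH.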
If you don't have a scanner or don't want to spend too much time on this step, you can take pictures of the pages using (for example) your phone's camera. You'll get better results by holding the camera directly above the pages.
To transfer the pictures to your computer, you can use a USB cable, cloud storage, etc.
Put all the pictures into a single folder on your computer.
Use img2pdf to combine the pictures into one PDF file (out.pdf).
From the directory you put all the pictures in, run the following command:
img2pdf *.jpg -o out.pdf
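One thing to watch for: the shell expands `*.jpg` in alphabetical order, so a file named `10.jpg` comes before `2.jpg` and the pages can end up out of order in the PDF. The sketch below shows one possible way around this, assuming GNU `sort` (for its `-V` natural-sort flag), an `xargs` that supports `-0`, and file names without newlines. `list_pages` and `make_pdf` are hypothetical helper names, not part of img2pdf.

```shell
# Print the .jpg files of directory $1 in natural order (2.jpg before 10.jpg).
# Assumes GNU sort (-V flag) and file names that contain no newlines.
list_pages() {
    ( cd "$1" && printf '%s\n' *.jpg | sort -V )
}

# Combine the pages of directory $1 into $1/out.pdf, in natural order.
# The list is NUL-separated before reaching xargs so that file names
# containing spaces are passed to img2pdf intact.
make_pdf() {
    ( cd "$1" && list_pages . | tr '\n' '\0' | xargs -0 img2pdf -o out.pdf )
}
```

From the pictures folder, `list_pages .` lets you check the order first, and `make_pdf .` then builds out.pdf. If your camera already names files with zero-padded numbers (for example IMG_0002.jpg, IMG_0010.jpg), the plain `*.jpg` command gives the same order.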
Use OCRmyPDF to perform OCR (Optical Character Recognition) on the PDF file you just generated.
From the directory you put all the pictures in, run the following command:
ocrmypdf out.pdf result.pdf --deskew --clean --remove-background --sidecar
This command will create two files:
- result.pdf, a searchable and copy-pastable PDF
- result.pdf.txt, a text file containing all the OCR-ed sentences, words and characters.
Note: You can play with ocrmypdf's flags to get a result that better suits you.
For instance, if you don't want a .txt file with your PDF, remove the --sidecar flag.
You can also try removing the --clean flag if your original document contained colored content that you want to keep, but this may result in worse OCR.
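If you experiment with the flags often, a small helper can make it easier to switch them on and off. This is only a sketch: `ocr_flags`, `CLEAN` and `SIDECAR` are hypothetical names invented for this example, and the flags themselves are copied from the command above (their availability may vary between OCRmyPDF versions).

```shell
# Print the ocrmypdf flags used in this guide, one per line.
# CLEAN=0 drops --clean; SIDECAR=0 drops --sidecar (both default to 1).
# ocr_flags, CLEAN and SIDECAR are hypothetical names used only in this sketch.
ocr_flags() {
    printf '%s\n' --deskew --remove-background
    if [ "${CLEAN:-1}" = 1 ]; then printf '%s\n' --clean; fi
    # --sidecar is printed last: it accepts an optional file name, so it
    # should not be followed directly by the input or output file.
    if [ "${SIDECAR:-1}" = 1 ]; then printf '%s\n' --sidecar; fi
}
```

For example, `ocrmypdf out.pdf result.pdf $(ocr_flags)` reproduces the command above, while `SIDECAR=0 ocrmypdf out.pdf result.pdf $(SIDECAR=0 ocr_flags)` or simply `ocrmypdf out.pdf result.pdf $(SIDECAR=0 ocr_flags)` skips the .txt file. The `$(...)` is left unquoted on purpose so that each flag becomes a separate argument.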