vgalin/paper_to_pdf.md

From sheet of paper to searchable PDF file

0. Prerequisites

If you don't have pip yet, get it by installing Python (pip is normally included).
Then use pip to install the following two tools:

| Name | PyPI link | Install command (paste this into your terminal) |
| --- | --- | --- |
| img2pdf | https://pypi.org/project/img2pdf/ | `pip install img2pdf` |
| OCRmyPDF | https://pypi.org/project/ocrmypdf/4.1/ | `pip install ocrmypdf` |

Note: if installing OCRmyPDF fails on your Windows machine, you may want to use WSL or OCRmyPDF's Docker image instead (see the OCRmyPDF installation documentation for details).
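
Before moving on, it can help to confirm that both tools ended up on your PATH. The snippet below is a minimal sketch, not part of either tool: `check_tools` is a hypothetical helper name, and it only checks that the commands exist, not that they work.

```shell
# check_tools: print whether each named command is available on PATH.
# Returns 1 if at least one of them is missing, 0 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools img2pdf ocrmypdf || echo "Install the missing tools with pip before continuing."
```

If a tool is reported missing right after `pip install`, the directory pip installs scripts into may not be on your PATH.
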
1. Scan the pages or take pictures of them.

If you don't have a scanner, or don't want to spend too much time on this step, you can take pictures of the pages with, for example, your phone's camera.
You'll get better results by holding the camera directly above the pages.
2. Transfer the pictures to your computer

You can use a USB cable, cloud storage, etc.

Put all the pictures into a single folder on your computer.
3. Run img2pdf

Use img2pdf to combine your pictures into one PDF file (out.pdf).

From the directory you put all the pictures in, run the following command:
img2pdf *.jpg -o out.pdf
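
One thing to watch: the shell expands `*.jpg` in alphabetical order, so a file named page10.jpg is passed before page2.jpg and the pages can end up out of order. The sketch below previews the pictures in natural (numeric-aware) order instead. It assumes a `sort` that supports `-V` (GNU coreutils does); `pages_in_order` and the page*.jpg names are hypothetical, used only for illustration.

```shell
# pages_in_order [DIR]: print the .jpg files of DIR (default: current directory),
# one per line, sorted so that page2.jpg comes before page10.jpg.
pages_in_order() {
  dir=${1:-.}
  for f in "$dir"/*.jpg; do
    # Skip the unexpanded pattern when the folder contains no .jpg files.
    [ -e "$f" ] && printf '%s\n' "$f"
  done | sort -V   # assumption: sort supports -V (version sort)
}

# Preview the page order before building the PDF.
pages_in_order .
```

If the preview looks right, the same list can be handed to img2pdf, for instance with `pages_in_order . | tr '\n' '\0' | xargs -0 img2pdf -o out.pdf` (this assumes an `xargs` that supports `-0`). Renaming the files with zero-padded numbers (page01.jpg, page02.jpg, ...) avoids the problem altogether.
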
4. Run OCRmyPDF

Use OCRmyPDF to perform OCR (Optical Character Recognition) on the PDF file you just generated.

From the directory you put all the pictures in, run the following command:
ocrmypdf out.pdf result.pdf --deskew --clean --remove-background --sidecar
This command will create two files:

result.pdf, a searchable and copy-pastable PDF
result.pdf.txt, a text file containing all the OCR-ed sentences, words and characters.
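
A quick way to sanity-check the OCR is to look at the sidecar text file. The following is a rough sketch under stated assumptions: `sidecar_word_count` is a hypothetical helper, and "invoice" is only a placeholder search term.

```shell
# sidecar_word_count FILE: print the number of words in the OCR text file.
sidecar_word_count() {
  wc -w < "$1" | tr -d '[:space:]'
}

sidecar=result.pdf.txt   # file name produced by the command above
if [ -f "$sidecar" ]; then
  echo "OCR extracted $(sidecar_word_count "$sidecar") words"
  # Search the recognized text; replace "invoice" with a word you expect to find.
  grep -n -i "invoice" "$sidecar" || echo "no match"
else
  echo "No sidecar file found; was --sidecar passed?"
fi
```

A word count far below what the pages visibly contain may indicate that the pictures were too blurry or skewed for OCR to work well.
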

Note: You can adjust ocrmypdf's flags to get a result that suits you better.
For instance, if you don't want a .txt file alongside your PDF, remove the --sidecar flag.
You can also try removing the --clean flag if your original document contains colored content that you want to keep, though this may reduce OCR quality.
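
If you run this workflow often, steps 3 and 4 can be wrapped in a small shell function. This is only a sketch: `paper_to_pdf` is a hypothetical name, the flags are the ones used in step 4, and the pictures are taken in plain `*.jpg` order as in step 3.

```shell
# paper_to_pdf DIR: run img2pdf and then ocrmypdf on the .jpg files in DIR.
# Runs in a subshell so the caller's working directory is left unchanged.
paper_to_pdf() {
  (
    cd "$1" || exit 1
    # Collect the pictures; stop early if the folder has none.
    set -- *.jpg
    [ -e "$1" ] || { echo "no .jpg files in $PWD" >&2; exit 1; }
    # Only run OCR if the PDF was built successfully.
    img2pdf "$@" -o out.pdf &&
      ocrmypdf out.pdf result.pdf --deskew --clean --remove-background --sidecar
  )
}

# Example call (hypothetical folder):
# paper_to_pdf ~/scans/contract
```

Edit the flag list inside the function to match the variations described in the note above.
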