@ppurka
Last active August 1, 2022 14:44
Scanning with OCR on Linux

Scanning documents to PDF with OCR text on Linux

Software Required

Below is the software I used, on a Gentoo Linux installation.

  • Scanner software: For example, xsane. I used scanimage, the command-line program from sane-backends. These are the package names in Gentoo Linux; some other Linux distributions use the same names.
  • Scantailor: This is a graphical application that processes batches of images using predefined settings. It helps split double-page scans into single-page images and also helps clean up the images. The website is on GitHub.
  • Tesseract: This is the main program; it runs OCR on images. It reads an image file and can output a PDF file with the recognized text embedded. The website is on GitHub.
  • pdfunite: This program, from the Poppler project, joins multiple PDF files into one.

The Process

First, we scan the pages of the document, or the two-page spreads of a book, into TIFF images using scanimage.

  • Use scanimage -L to get the list of devices.

  • The available resolutions and scan-area coordinates depend on the device that is doing the scanning. To list the options for a device, run the following command:

    scanimage --help -d "<device name>"

    In the case of my scanner, it output the following:

    Scan mode:
      --mode Lineart|Gray|Color [Lineart]
          Selects the scan mode (e.g., lineart, monochrome, or color).
      --resolution 75|100|200|300|600|1200dpi [75]
          Sets the resolution of the scanned image.
      --source Flatbed [Flatbed]
          Selects the scan source (such as a document-feeder).
    Advanced:
      --brightness 0..2000 [1000] [advanced]
          Controls the brightness of the acquired image.
      --contrast 0..2000 [1000] [advanced]
          Controls the contrast of the acquired image.
      --compression JPEG [JPEG] [advanced]
          Selects the scanner compression method for faster scans, possibly at
          the expense of image quality.
      --jpeg-quality 0..100 [inactive]
          Sets the scanner JPEG compression factor. Larger numbers mean better
          compression, and smaller numbers mean better image quality.
    Geometry:
      -l 0..215.9mm [0]
          Top-left x position of scan area.
      -t 0..297.011mm [0]
          Top-left y position of scan area.
      -x 0..215.9mm [215.9]
          Width of scan-area.
      -y 0..297.011mm [297.011]
          Height of scan-area.
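
If only the supported resolutions are of interest, they can be pulled out of that help text. The following is a minimal sketch, not part of my original workflow: the function name supported_resolutions is made up for illustration, and it assumes the option is printed as a '|'-separated list ending in "dpi", as in the output above. Scanners that report a range instead of a list are not handled and produce no output.

```shell
#!/bin/bash
# Print the resolutions offered by a scanner, one per line.
# Reads the output of `scanimage --help -d "<device name>"` on stdin.
# Assumption: the values are a '|'-separated list ending in "dpi".
supported_resolutions() {
    sed -n 's/^ *--resolution \([0-9|]*\)dpi.*/\1/p' | tr '|' '\n'
}

# Example using the line from the output above (no scanner needed):
printf '%s\n' '      --resolution 75|100|200|300|600|1200dpi [75]' | supported_resolutions
```

With a real device, the help output would be piped in directly: scanimage --help -d "<device name>" | supported_resolutions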
    

In my case, I chose a resolution of 300 dpi as a good compromise between scanning speed and output quality. It also seems to be a resolution at which tesseract gives good results. At 600 dpi, scanning took roughly 70 seconds per page and produced a file of 30 MB per page. In contrast, 300 dpi took 8 seconds per page and produced a file of only 7.5 MB per page. The final PDF file created by tesseract had a size of about 60 KB per page. I also adjusted the origin, width, and height of the scan area to match the size of the document I was scanning. Finally, the mode is set to Gray instead of Color because I intend to run OCR using tesseract.
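
To see what these per-page figures mean for a whole document, the totals can be worked out with a few lines of bash. This is only a rough sketch: the helper name estimate_scan is made up, and the numbers are the ones measured on my scanner, so they will differ for other devices and page sizes.

```shell
#!/bin/bash
# Rough totals for scanning a number of pages, using the per-page figures
# quoted above. Sizes are kept in tenths of a MB because bash arithmetic
# is integer-only.
estimate_scan() {        # estimate_scan <pages> <dpi>
    local -i pages=$1 dpi=$2 secs_per_page tenths_mb_per_page
    case $dpi in
        300) secs_per_page=8;  tenths_mb_per_page=75  ;;
        600) secs_per_page=70; tenths_mb_per_page=300 ;;
        *)   echo "no measurement for ${dpi} dpi" >&2; return 1 ;;
    esac
    local -i secs=$(( pages * secs_per_page ))
    local -i tenths=$(( pages * tenths_mb_per_page ))
    printf '%d pages at %d dpi: about %d s of scanning, %d.%d MB of TIFF\n' \
        "$pages" "$dpi" "$secs" $(( tenths / 10 )) $(( tenths % 10 ))
}

estimate_scan 50 300
estimate_scan 50 600
```

For 50 scans this gives 400 seconds and 375 MB at 300 dpi, against 3500 seconds and 1500 MB at 600 dpi.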

Command used:

scanimage -d "<device name>" \
    --mode Gray --resolution 300 --format=tiff -t 10 -x 200 -y 280 > out.tiff

Example (two page) output from scanimage:

initial scan

Next, we run scantailor to process all the TIFF images output by scanimage. This is a strictly graphical application. By default, the output is written to the "project directory/out" directory.

Unfortunately, I couldn't find a way to batch process all images. Each step has to be applied by moving "up" or "down" through the images one at a time. A script using xdotool can automate the mouse clicks; xdotool clicks at the current pointer position. For example, if we have 50 images to process, then we can do this:

#             every 1000ms   50 times  left click
xdotool click --delay 1000 --repeat 50 1
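
The repeat count does not have to be typed by hand. The sketch below counts the scanned images and prints the matching xdotool command. The helper name click_command is made up, and it assumes the scans are the *.tiff files written by scanimage. The count is only a starting point, since scantailor may show more entries once double pages have been split.

```shell
#!/bin/bash
# Print an xdotool command that clicks once per scanned image.
# $1: directory holding the *.tiff scans (default: current directory)
# $2: delay between clicks in milliseconds (default: 1000)
click_command() {
    local dir=${1:-.} delay=${2:-1000}
    local -a scans
    shopt -s nullglob            # an empty directory gives a count of 0
    scans=( "$dir"/*.tiff )
    shopt -u nullglob
    echo "xdotool click --delay $delay --repeat ${#scans[@]} 1"
}

click_command . 1000
```

The command is only printed, so it can be checked before it is run.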

In keeping with the 300 dpi setting at which the image was scanned, the output from scantailor is also saved at 300 dpi. This can be set in the default parameters setting.

Initial view when opening the scantailor project. Because of my default settings, the image is automatically rotated anti-clockwise upon opening. We need to click on each step on the left, one by one, to apply the steps. Then we can move down through the images on the right.

before scantailor

Output after completing all the steps in scantailor. Note that the two pages have been separated into individual pages.

after scantailor

Next, we run tesseract on each image output by scantailor. In my case each scanned image contained two pages, and scantailor automatically detected the page boundary and split the image, so tesseract has to be run on each of the resulting files. For example, an image out.tiff in the "project directory" is output as out_1L.tif and out_2R.tif in the out/ subdirectory. We run the following tesseract commands. Here, -l gives the language, followed by the input file name, the basename of the output file, and finally the format of the output file.

tesseract -l eng out_1L.tif out_1L pdf
tesseract -l eng out_2R.tif out_2R pdf

This outputs out_1L.pdf and out_2R.pdf, each with the OCR'd text embedded and selectable inside the PDF file. We can now join the two files using pdfunite.

pdfunite out_*.pdf out.pdf
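
One thing to watch with a glob such as out_*.pdf is page order: the shell expands it in lexicographic order, so a name like out_10L.pdf sorts before out_2R.pdf unless the numbers are zero-padded. A possible workaround, sketched here with a made-up helper name and assuming GNU sort is available, is to put the names in version order first.

```shell
#!/bin/bash
# List file names in natural (version) order, one per line, so that
# out_2R.pdf comes before out_10L.pdf. Needs GNU sort for -V.
# Assumes the file names contain no newlines.
pages_in_order() {
    printf '%s\n' "$@" | sort -V
}

pages_in_order out_10L.pdf out_1L.pdf out_2R.pdf
```

The sorted names could then be read into an array, for instance with mapfile -t pdfs < <(pages_in_order out_*.pdf), and passed on as pdfunite "${pdfs[@]}" out.pdf.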

Example output showing that the text can be searched in Okular:

final PDF output

Scripts used

I used some scripts to help scan and process several documents in one go.

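  1. The first bash script scans documents and numbers the output files automatically. The output files are written to the current directory.

    # Scan pages one at a time into sequentially numbered TIFF files.
    declare     device="<device name>"   # as reported by scanimage -L
    declare     prefix="out"
    declare -i  start=1
    while true; do
        fname="${prefix}_$(printf "%04d" "$start")"
        # Only advance the counter when the scan succeeded.
        if scanimage -d "$device" --mode Gray --resolution 300 --format=tiff \
                     -t 10 -x 200 -y 280 > "$fname.tiff"; then
            ((start++))
            echo -ne "Output file $fname.tiff. Press Enter to continue with next scan. Ctrl-C to cancel\r"
        else
            echo -ne "Scan failed for $fname.tiff. Press Enter to retry. Ctrl-C to cancel\r"
        fi
        read -r -n 1 -s
    done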
  2. The second bash script is run after scantailor. It automatically creates an OCR'd PDF file from the output of scantailor, which is expected in the out/ subdirectory of the current directory.

    pushd out >& /dev/null || exit 1
    # First and last page files, in natural (version) sort order.
    declare firstpage="$( ls -v1 *.tif | head -n 1 )"
    declare lastpage="$(  ls -v1 *.tif | tail -n 1 )"
    # Prefix is the text before the first underscore.
    declare prefix="${lastpage%%_*}"
    
    # Page number is the text between the first and second underscore.
    firstpage="${firstpage#*_}"; firstpage="${firstpage%%_*}"
    lastpage="${lastpage#*_}";   lastpage="${lastpage%%_*}"
    
    for tif in *.tif; do
        echo -ne "Creating PDF with OCR for $tif\r"
        tesseract -l eng "$tif" "${tif%.tif}" pdf >& /dev/null
    done
    echo -e "Final PDF output in file: ${prefix}_${firstpage}_${lastpage}.pdf"
    pdfunite *.pdf "${prefix}_${firstpage}_${lastpage}.pdf"
    popd >& /dev/null
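
The file naming in both scripts rests on printf zero-padding and on bash parameter expansion. The sketch below isolates those pieces so they can be tried without a scanner. The function names are made up for illustration, and the example name out_0007_1L.tif assumes that scantailor appends a suffix such as _1L to the name of the scanned file.

```shell
#!/bin/bash
# Name of the N-th scan, zero-padded to four digits as in the first script.
# Pass the number without leading zeros (printf would read 0008 as octal).
scan_name() {            # scan_name <prefix> <number>
    printf '%s_%04d\n' "$1" "$2"
}

# Text before the first underscore, as ${lastpage%%_*} does.
name_prefix() {
    echo "${1%%_*}"
}

# Text between the first and second underscore, i.e. the page number.
page_number() {
    local rest=${1#*_}   # drop everything up to the first underscore
    echo "${rest%%_*}"   # drop everything from the next underscore on
}

scan_name out 7
name_prefix out_0007_1L.tif
page_number out_0007_1L.tif
```

Because the numbers are zero-padded, a plain glob such as *.tif or *.pdf already lists the pages in the right order, which is what the second script relies on when it calls pdfunite.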