ppurka/scanning_with_OCR-00.md

## scanning_with_OCR-00.md

      
            
          
Scanning documents to PDF with OCR text on Linux

Software Required

Below is the list of software I used, all on a Gentoo Linux installation.

Scanner software: For example, xsane. I used the scanimage command-line
program from sane-backends. These are the package names in Gentoo Linux, and
in some other Linux distributions as well.
Scantailor: This is a graphical application that can batch process images
to predefined settings. It helps split double-page scans into single-page
images and also helps clean up the images. The website is on GitHub.
Tesseract: This is the main program that runs OCR on images. It reads an
image file and can output a PDF file with the recognized text embedded. The
website is on GitHub.
pdfunite: This program, from the Poppler project, joins multiple PDF files
into one.

The Process

First, we scan the pages of the document, or the two-page spreads of a book,
into TIFF images using scanimage.


Use scanimage -L to get the list of devices.
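
The device name is the part between the backtick and the closing quote on each line of the device list. A minimal sketch of extracting it with sed, assuming the usual "device `NAME' is a DESCRIPTION" line format; the function name `list_devices` and the sample line are hypothetical:

```shell
# Print only the device names from "scanimage -L" style output on stdin.
# Assumes each device line looks like:  device `NAME' is a DESCRIPTION
list_devices() {
    sed -n "s/^device \`\(.*\)' is .*/\1/p"
}

# Hypothetical sample line; on a real system run:  scanimage -L | list_devices
printf '%s\n' "device \`epson2:libusb:001:004' is a Epson GT-1500 flatbed scanner" | list_devices
```

The printed name is what goes in place of "<device name>" in the commands that follow.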


The supported resolutions and the coordinates of the scan area depend on the
device that is doing the scanning. To list the options that a device
supports, run the following command:
scanimage --help -d "<device name>"
In the case of my scanner, it output the following:
Scan mode:
  --mode Lineart|Gray|Color [Lineart]
      Selects the scan mode (e.g., lineart, monochrome, or color).
  --resolution 75|100|200|300|600|1200dpi [75]
      Sets the resolution of the scanned image.
  --source Flatbed [Flatbed]
      Selects the scan source (such as a document-feeder).
Advanced:
  --brightness 0..2000 [1000] [advanced]
      Controls the brightness of the acquired image.
  --contrast 0..2000 [1000] [advanced]
      Controls the contrast of the acquired image.
  --compression JPEG [JPEG] [advanced]
      Selects the scanner compression method for faster scans, possibly at
      the expense of image quality.
  --jpeg-quality 0..100 [inactive]
      Sets the scanner JPEG compression factor. Larger numbers mean better
      compression, and smaller numbers mean better image quality.
Geometry:
  -l 0..215.9mm [0]
      Top-left x position of scan area.
  -t 0..297.011mm [0]
      Top-left y position of scan area.
  -x 0..215.9mm [215.9]
      Width of scan-area.
  -y 0..297.011mm [297.011]
      Height of scan-area.
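
If the help output is long, the resolution line can be picked out mechanically. A small sketch, assuming the option is printed as a "|"-separated list ending in "dpi" as in the output above (a scanner that reports a range such as "50..1200dpi" would produce no output here); `supported_resolutions` is a hypothetical helper name:

```shell
# Print the resolutions offered by the scanner, separated by spaces.
# Reads "scanimage --help -d <device>" style output on stdin.
supported_resolutions() {
    sed -n 's/^ *--resolution \([0-9|]*\)dpi.*/\1/p' | tr '|' ' '
}

# Fed with the line from the output above:
printf '%s\n' '  --resolution 75|100|200|300|600|1200dpi [75]' | supported_resolutions
# prints: 75 100 200 300 600 1200
```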


In my case, I chose a resolution of 300 DPI as a good compromise between
scanning speed, file size, and output quality. It also seems to be a resolution
at which tesseract gives good results. At a resolution of 600 DPI, scanning took
roughly 70 seconds per page and produced a file of 30 MB per page! In contrast,
a resolution of 300 DPI produced a file of only 7.5 MB per page and took 8
seconds per page. The final PDF file created by tesseract had a size of about
60 KB per page. I also adjusted the origin, width, and height of the scanned
area based on the size of the document I was scanning. Finally, the mode is set
to Gray instead of Color because I intend to run OCR using tesseract, which
does not need the color information.
Command used:
scanimage -d "<device name>" \
    --mode Gray --resolution 300 --format=tiff -t 10 -x 200 -y 280 > out.tiff
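
The file sizes follow directly from the scan area and the resolution. A rough sketch of the arithmetic, assuming an uncompressed 8-bit grayscale image (one byte per pixel) and ignoring the small TIFF header; `estimate_scan_bytes` is a hypothetical helper name:

```shell
# Estimate the uncompressed size, in bytes, of an 8-bit grayscale scan.
# Arguments: width in mm, height in mm, resolution in DPI.
estimate_scan_bytes() {
    awk -v w="$1" -v h="$2" -v dpi="$3" 'BEGIN {
        px_w = int(w / 25.4 * dpi)    # pixels across (25.4 mm per inch)
        px_h = int(h / 25.4 * dpi)    # pixels down
        printf "%d\n", px_w * px_h    # one byte per pixel
    }'
}

estimate_scan_bytes 200 280 300
estimate_scan_bytes 200 280 600
```

For the 200 mm by 280 mm area used here, this gives about 7.8 million bytes at 300 DPI and about 31 million bytes at 600 DPI, in line with the file sizes quoted above. Doubling the resolution quadruples the size.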
Example (two page) output from scanimage:

Next, we run scantailor to process all the TIFF images output by scanimage.
This is a strictly graphical application. By default, the output is written to
the out/ subdirectory of the project directory.
Unfortunately, I couldn't find a way to apply the steps to all images in one go.
We need to step "up" or "down" through the images to apply each step. We can
script this using xdotool to automate the mouse clicks. For example, if we have
50 images to process, then we can do this:
#             every 1000ms   50 times  left click
xdotool click --delay 1000 --repeat 50 1
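
The repeat count can also be taken from the number of scanned files instead of being typed by hand. A small sketch, assuming the scans are the *.tiff files in a single directory and that one click per image is enough; `count_tiffs` is a hypothetical helper, and the command is only printed here, not run:

```shell
# Count the *.tiff files directly inside a directory (default: current one).
count_tiffs() {
    find "${1:-.}" -maxdepth 1 -type f -name '*.tiff' | wc -l | tr -d ' '
}

# Show the xdotool command that would click once per scanned image.
echo xdotool click --delay 1000 --repeat "$(count_tiffs)" 1
```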
Keeping with the 300 DPI setting at which the image was scanned, the output from
scantailor is also saved at 300 DPI. This can be set in the default parameters
setting.
Initial input when opening the scantailor project. Because of my default
settings, the image is automatically rotated anti-clockwise upon opening. We
need to click on each step on the left, one by one, to apply the steps. Then we
can step down through the images on the right.

Output after completing all the steps in scantailor. Note that the two pages
have been separated into individual pages.

Next, we run the tesseract program on each image output by scantailor. In my
case each scanned image contained two pages, and scantailor automatically
detected the page boundary and wrote one output image per page. Hence, tesseract
needs to be run on each output image. For example, an image out.tiff in the
"project directory" is output as out_1L.tif and out_2R.tif in the out/
subdirectory. We run the following tesseract commands. Here, -l gives the
language, followed by the input file name, then the basename of the output
file, and finally the format of the output file.
tesseract -l eng out_1L.tif out_1L pdf
tesseract -l eng out_2R.tif out_2R pdf
This outputs out_1L.pdf and out_2R.pdf, each with the OCR'd text embedded and
selectable inside the PDF file. We can now join the two files using pdfunite.
pdfunite out_*.pdf out.pdf
Example output showing that the text can be searched in
Okular:
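
The same check can be made from the command line. A minimal sketch, assuming pdftotext is installed (it ships with the same Poppler utilities as pdfunite); `pdf_has_text` is a hypothetical helper and "the" is just a placeholder search word:

```shell
# Succeed if the text layer of a PDF contains the given word (case-insensitive).
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | grep -qiF -- "$2"
}

if pdf_has_text out.pdf "the"; then
    echo "out.pdf contains searchable text"
else
    echo "no match: missing file, missing text layer, or word not present"
fi
```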

Scripts used

I used some scripts to help scan and process several documents in one go.


The first bash script scans documents one page at a time and numbers the
output files automatically. The output files are written to the current
directory.
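#!/bin/bash
# Scan pages one at a time into numbered TIFF files in the current directory.
# Assumption: the device name (from scanimage -L) is passed as the first argument.
declare     device="${1:?usage: $0 <device name>}"
declare     prefix="out"
declare -i  start=1
while true; do
    fname="${prefix}_$(printf "%04d" "$start")"
    # Advance the page counter only if the scan succeeded.
    scanimage -d "$device" --mode Gray --resolution 300 --format=tiff \
              -t 10 -x 200 -y 280 > "$fname".tiff &&
         ((start++))
    echo -ne "Output file $fname.tiff. Press Enter to continue with next scan. Ctrl-C to cancel\r"
    read -n 1 -s    # wait for a key press; the key itself is discarded
done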
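
The zero-padded numbering is what keeps the files in page order later on, since shell globs such as *.tif and *.pdf sort names as text. A small sketch of the naming scheme on its own; `next_name` is a hypothetical helper that mirrors the printf call in the script:

```shell
# Build the file name for a given prefix and page number, padded to 4 digits.
next_name() {
    printf '%s_%04d.tiff\n' "$1" "$2"
}

next_name out 7     # prints: out_0007.tiff
next_name out 12    # prints: out_0012.tiff
```

Without the padding, out_12.tiff would sort before out_7.tiff and the pages would be joined out of order.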


The second bash script is run after scantailor. It automatically creates an
OCR'd PDF file from the output of scantailor, which it expects in the out/
subdirectory of the current directory.
#!/bin/bash
# Run OCR on every page image written by scantailor, then join the PDFs.
pushd out >& /dev/null || exit 1
declare firstpage="$( ls -v1 *.tif | head -n 1 )"
declare lastpage="$(  ls -v1 *.tif | tail -n 1 )"
declare prefix="${lastpage%%_*}"

# Keep only the page number, i.e. the second "_"-separated field of the name.
firstpage="${firstpage#*_}"; firstpage="${firstpage%%_*}"
lastpage="${lastpage#*_}";   lastpage="${lastpage%%_*}"

for tif in *.tif; do
    echo -ne "Creating PDF with OCR for $tif\r"
    tesseract -l eng "$tif" "${tif%.tif}" pdf >& /dev/null
done
echo -e "Final PDF output in file: ${prefix}_${firstpage}_${lastpage}.pdf"
pdfunite *.pdf "${prefix}_${firstpage}_${lastpage}.pdf"
popd >& /dev/null
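
The parameter expansions in this script are easier to follow on a concrete name. A short sketch, assuming scantailor output names of the form prefix_number_side.tif; the file name below is hypothetical:

```shell
name="out_0001_1L.tif"    # hypothetical scantailor output file name

prefix="${name%%_*}"      # remove the longest suffix matching "_*"  -> out
page="${name#*_}"         # remove the shortest prefix matching "*_" -> 0001_1L.tif
page="${page%%_*}"        # remove the longest suffix matching "_*"  -> 0001
base="${name%.tif}"       # remove the ".tif" extension              -> out_0001_1L

echo "$prefix $page $base"    # prints: out 0001 out_0001_1L
```

With scans numbered 0001 to 0050, the joined file would therefore be named out_0001_0050.pdf.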


## scanning_with_OCR-01.png

      
            
          
## scanning_with_OCR-02.png

      
            
          
## scanning_with_OCR-03.png

      
            
          
## scanning_with_OCR-04.png

      