Below is the list of software I used, all on a Gentoo Linux installation.
- Scanner software: For example, xsane. I used the scanimage command-line tool from sane-backends. These are the package names in Gentoo Linux, and in some other Linux distributions as well.
- Scantailor: This is a graphical application that processes a batch of images using predefined settings. It helps split double-page scans into single-page images and also helps clean up the image. The project is hosted on GitHub.
- Tesseract: This is the main program that runs OCR on images. It reads an image file and can output a PDF with the recognized text embedded. The project is hosted on GitHub.
- pdfunite: This program, from the Poppler project, joins multiple PDF files.
First, we scan the pages of the document, or the two-page spreads of a book, into TIFF images using scanimage.
- Use scanimage -L to get the list of devices.
- The supported resolutions and the coordinates of the scan area depend on the device that is doing the scanning. To list the options your device supports, run the following command:
scanimage --help -d "<device name>"
In the case of my scanner, it output the following:
Scan mode:
    --mode Lineart|Gray|Color [Lineart]
        Selects the scan mode (e.g., lineart, monochrome, or color).
    --resolution 75|100|200|300|600|1200dpi [75]
        Sets the resolution of the scanned image.
    --source Flatbed [Flatbed]
        Selects the scan source (such as a document-feeder).
Advanced:
    --brightness 0..2000 [1000] [advanced]
        Controls the brightness of the acquired image.
    --contrast 0..2000 [1000] [advanced]
        Controls the contrast of the acquired image.
    --compression JPEG [JPEG] [advanced]
        Selects the scanner compression method for faster scans, possibly at
        the expense of image quality.
    --jpeg-quality 0..100 [inactive]
        Sets the scanner JPEG compression factor. Larger numbers mean better
        compression, and smaller numbers mean better image quality.
Geometry:
    -l 0..215.9mm [0]
        Top-left x position of scan area.
    -t 0..297.011mm [0]
        Top-left y position of scan area.
    -x 0..215.9mm [215.9]
        Width of scan-area.
    -y 0..297.011mm [297.011]
        Height of scan-area.
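The geometry limits in the help output above can be checked before scanning. The following is a minimal sketch, not part of scanimage: fits_scan_area is a hypothetical helper, and it assumes that the offset plus the length of the scan window should stay within the reported maximum (some backends may simply clamp the values instead).

```shell
# fits_scan_area OFFSET LENGTH MAX
# Succeeds if the span from OFFSET to OFFSET+LENGTH lies within 0..MAX (mm).
# Hypothetical helper; awk is used because the limits are decimal numbers.
fits_scan_area() {
    awk -v off="$1" -v len="$2" -v max="$3" \
        'BEGIN { exit !(off >= 0 && len > 0 && off + len <= max) }'
}

# Limits from the help output of the author's scanner:
# 215.9 mm wide, 297.011 mm tall.
fits_scan_area 0 200 215.9    && echo "horizontal: ok"
fits_scan_area 10 280 297.011 && echo "vertical: ok"
fits_scan_area 30 280 297.011 || echo "vertical: 30 + 280 mm exceeds 297.011 mm"
```

The first two calls correspond to a window with -x 200 and -t 10 -y 280; the last one shows a window that would run past the bottom edge of the bed.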
In my case, I chose a resolution of 300 dpi as a good compromise between scanning
speed and output quality. It also seems to be a resolution at which tesseract
gives good results. At 600 dpi, scanning took roughly 70 seconds per page and
produced a file of 30 MB per page! In contrast, 300 dpi produced a file of only
7.5 MB per page and took 8 seconds per page. The final PDF file created by
tesseract had a size of about 60 KB per page. I also adjusted the origin, width,
and height of the scanned image, based on the size of the document I was
scanning. Finally, the mode is set to Gray instead of Color because I intend to
run OCR using tesseract.
Command used:
scanimage -d "<device name>" \
--mode Gray --resolution 300 --format=tiff -t 10 -x 200 -y 280 > out.tiff
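The file sizes reported above follow from simple arithmetic. As a rough sketch, assuming an uncompressed 8-bit grayscale image (one byte per pixel, TIFF headers ignored) and the 200 mm by 280 mm scan area from the command, the size can be estimated like this. The helper name gray_scan_bytes is made up for illustration.

```shell
# gray_scan_bytes WIDTH_MM HEIGHT_MM DPI
# Estimate the uncompressed size in bytes of an 8-bit grayscale scan.
# Hypothetical helper: assumes 1 byte per pixel and ignores file headers.
gray_scan_bytes() (
    width_px=$(( $1 * $3 * 10 / 254 ))     # mm -> inches (25.4 mm) -> pixels
    height_px=$(( $2 * $3 * 10 / 254 ))
    echo $(( width_px * height_px ))
)

for dpi in 300 600; do
    bytes=$(gray_scan_bytes 200 280 "$dpi")
    echo "$dpi dpi: $bytes bytes"
done
```

Under these assumptions the estimate is about 7.8 million bytes at 300 dpi and about 31 million bytes at 600 dpi, which is in line with the 7.5 MB and 30 MB files mentioned earlier. Doubling the resolution quadruples the pixel count, so the size grows by a factor of four.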
Example (two page) output from scanimage:
Next, we run scantailor to process all the TIFF images output by scanimage. This is a strictly graphical application. By default, the output is written to the out/ subdirectory of the project directory.
Unfortunately, I couldn't find a way to batch process all images. We need to move "up" or "down" through the images and apply each step to each one. A script using xdotool can automate the mouse clicks. For example, if we have 50 images to process, then we can do this:
# left-click 50 times, once every 1000 ms
xdotool click --delay 1000 --repeat 50 1
Keeping with the 300 DPI setting at which the image was scanned, the output from scantailor is also saved at 300 DPI. This can be set in the default parameters setting.
Initial input when opening the scantailor project. Because of my default settings, the image is automatically rotated anti-clockwise upon opening. We need to click on each step on the left one by one to apply the steps. Then we can go down through the images on the right.
Output after completing all the steps in scantailor. Note that the two pages have been separated into individual pages.
Next up is to run the tesseract program on each image output by scantailor. In
my case each scanned input image had two pages, and scantailor automatically
detected the page boundary and split the output accordingly. Hence, tesseract
needs to be run on each output image. For example, an image out.tiff in the
project directory is output as out_1L.tif and out_2R.tif in the out/
subdirectory. We run the following tesseract commands. Here, -l provides the
language, then we give the input file name, followed by the basename of the
output file and finally the format of the output file.
tesseract -l eng out_1L.tif out_1L pdf
tesseract -l eng out_2R.tif out_2R pdf
This outputs out_1L.pdf and out_2R.pdf, which contain the OCR'd text embedded
and selectable inside the PDF file. We can now join both files using pdfunite.
pdfunite out_*.pdf out.pdf
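To see how the per-page tesseract calls and the final join fit together, here is a small dry-run sketch. It only prints the commands instead of running them, so it works without tesseract or pdfunite installed. The function name ocr_commands is an assumption for illustration, and the sketch does not quote file names that contain spaces.

```shell
# ocr_commands MERGED_PDF PAGE.tif...
# Print (do not run) one tesseract command per page image, followed by
# the pdfunite command that joins the resulting PDFs in the given order.
# Hypothetical helper; the body runs in a subshell so no variables leak.
ocr_commands() (
    merged="$1"; shift
    pdfs=""
    for tif in "$@"; do
        base="${tif%.tif}"                 # out_1L.tif -> out_1L
        printf 'tesseract -l eng %s %s pdf\n' "$tif" "$base"
        pdfs="$pdfs $base.pdf"
    done
    printf 'pdfunite%s %s\n' "$pdfs" "$merged"
)

ocr_commands out.pdf out_1L.tif out_2R.tif
```

Piping the printed lines to a shell would execute them; keeping it as a dry run makes it easy to check the page order before committing to a long OCR run.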
Example output showing that the text can be searched in Okular:
I used some scripts to help scan and process several documents in one go.
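- The first bash script scans documents, numbering the output files automatically. The output files are written to the current directory.
# Placeholder: set this to a device name reported by scanimage -L
declare device="<device name>"
declare prefix="out"
declare -i start=1
while true; do
    # Zero-padded file name, e.g. out_0001
    fname=${prefix}_$(printf "%04d" $start)
    # Advance the counter only if the scan succeeded
    scanimage -d "$device" --mode Gray --resolution 300 --format=tiff \
        -t 10 -x 200 -y 280 > "$fname".tiff && ((start++))
    echo -ne "Output file $fname.tiff. Press Enter to continue with next scan. Ctrl-C to cancel\r"
    # Wait for a key press before the next scan
    read -n 1 -s -r
done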
- The second bash script is run after scantailor. It automatically creates an OCR'd PDF file from the output of scantailor, which is in the out/ subdirectory of the current directory.
pushd out >& /dev/null || exit 1
# First and last page files in natural (version) sort order
declare firstpage="$( ls -v1 *.tif | head -n 1 )"
declare lastpage="$( ls -v1 *.tif | tail -n 1 )"
# e.g. out_0001_1L.tif -> prefix "out", page number "0001"
declare prefix="${lastpage%%_*}"
firstpage="${firstpage#*_}"; firstpage="${firstpage%%_*}"
lastpage="${lastpage#*_}"; lastpage="${lastpage%%_*}"
for tif in *.tif; do
    echo -ne "Creating PDF with OCR for $tif\r"
    tesseract -l eng "$tif" "${tif%.tif}" pdf >& /dev/null
done
echo -e "Final PDF output in file: ${prefix}_${firstpage}_${lastpage}.pdf"
pdfunite *.pdf "${prefix}_${firstpage}_${lastpage}.pdf"
popd >& /dev/null
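The naming scheme that both scripts rely on can be tried out in isolation. The sketch below assumes scan files named like out_0001.tiff and scantailor output named like out_0001_1L.tif; the function names scan_name and merged_pdf_name are hypothetical and exist only for this illustration.

```shell
# scan_name PREFIX N
# Name used for the N-th scan. The number is zero-padded to four digits
# so that shell globs list the pages in scan order.
scan_name() {
    printf '%s_%04d.tiff\n' "$1" "$2"
}

# merged_pdf_name FIRST_FILE LAST_FILE
# Build the merged PDF name from the prefix and the page numbers of the
# first and last scantailor output files. Runs in a subshell.
merged_pdf_name() (
    first="$1"; last="$2"
    prefix="${last%%_*}"                        # out_0025_2R.tif -> out
    first="${first#*_}"; first="${first%%_*}"   # out_0001_1L.tif -> 0001
    last="${last#*_}";   last="${last%%_*}"     # out_0025_2R.tif -> 0025
    echo "${prefix}_${first}_${last}.pdf"
)

scan_name out 1
merged_pdf_name out_0001_1L.tif out_0025_2R.tif
```

Here `%%_*` removes everything from the first underscore onward, and `#*_` removes everything up to and including the first underscore, so applying them in sequence isolates the page number between the first two underscores. This only works if the prefix itself contains no underscore.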