** Digitization of books
:Sources:
- File formats:
+ [[https://www.succeed-project.eu/outputs][Succeed Project - Outputs]] (Support Action Centre of Competence in
Digitisation)
* [[file:~/Gustavo/Library/Outros/Digitalização/Succeed_2014_Recommendations for metadata and data formats for online availability and long-term preservation.pdf][Succeed_Recommendations for data formats for long-term preservation]]
+ https://en.wikipedia.org/wiki/TIFF
:END:
- Scan
+ with ~scanimage~ batch mode:
#+begin_src bash :results none
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--brightness 5 \
--contrast 15 \
--batch="%04d-recto.pnm" \
--batch-increment=2 \
--batch-start=1
#+end_src
* Resolution: 300 (normal), 400-600 (small font). I have done some light
testing, and increasing the resolution to 600 does not appreciably improve
OCR results, but does take a large toll on scan and processing time. I
also did some light testing with a smaller resolution (150), and the only
potential gain from it would be skipping the potentially distortive step
of rescaling the image; but the visual "feel" of both cases is
indistinguishable, and a resolution of 300+ is recommended by ~tesseract~.
* Mode: "True Gray" for best results, both of the OCR and the visual of
the page. "Black & White" is another possibility, in principle, may be
useful for file size reasons and, specially for xerox source
(unfortunately, we cannot set contrast with it). Don't use "Gray[Error
Diffusion]" though, if needed, use "Black & White" and a higher
resolution.
* Do the cropping at scan time, by setting proper scan coordinates. This
is done once per book and should be the same for all pages, so that they
all have the same size.
* Always include the "recto/verso" in the file name, even when not
scanning sideways (and thus, no rotation is needed). It is easier to
recover from ADF mistakes.
* Adjust the values of =batch-increment= (2 or -2) and =batch-start= as
needed (see the verso sketch below). And take care: ~scanimage~ overwrites
existing files without warning, and there's no way to prevent it.
* Brightness/contrast should be set relatively high, so that most of the
inter-word/inter-line gray areas are gone, but not overly so. How much is
needed depends on the color of the paper: brightness in the 0-15 range and
contrast in the 10-30 range usually produce good results. Take care with
very light/small fonts (a superscript, a subscript, a derivative, etc.):
they may become barely legible when brightness and contrast are too high.
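* For instance, after the recto pass above, the verso pass could look like
the sketch below. It assumes the stack is fed back into the ADF so that the
last verso side comes out first; =--batch-start= should then be the highest
even page number of the set (250 is just a placeholder value).
#+begin_src bash :results none
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--brightness 5 \
--contrast 15 \
--batch="%04d-verso.pnm" \
--batch-increment=-2 \
--batch-start=250
#+end_src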
:Alternate_methods:
* with ~scanadf~:
#+begin_src bash :results none
scanadf --device-name="brother4:bus2;dev1" \
--mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
-l 0 -t 0 -x 210 -y 297 \
--brightness 10 \
--contrast 20 \
--output-file "%04d.pnm"
#+end_src
- This does pretty much the same as ~scanimage~ batch mode; we can run
a script after each image, but cannot change the step of the file name
counter...
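- As an illustration of the per-image script, a minimal sketch:
=rotate-page.sh= is a hypothetical helper here, and ~scanadf~ passes it the
image file name as its only argument.
#+begin_src bash :results none
# Hypothetical per-page hook: scanadf runs it once for each scanned image.
cat > rotate-page.sh <<'EOF'
#!/bin/sh
# $1 is the freshly scanned image; write a rotated copy next to it.
pnmflip -cw "$1" > "${1%.pnm}-1up.pnm"
EOF
chmod +x rotate-page.sh
scanadf --device-name="brother4:bus2;dev1" \
--mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
-l 0 -t 0 -x 210 -y 297 \
--brightness 10 \
--contrast 20 \
--scan-script ./rotate-page.sh \
--output-file "%04d.pnm"
#+end_src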
* with ~gscan2pdf~:
- Even though ~gscan2pdf~ is somewhat constrained in how post-processing
tools can be set up, it can be convenient to organize the scanning of a
large set: to shuffle facing and reverse sides, and to batch crop the
images with some visual feedback.
- The deal breaker in the post-processing area is that ~gscan2pdf~ is
bound to the hOCR method of interweaving the OCR text and the scanned
image, and its results are just bad. Certainly much inferior to what
~tesseract~ itself is able to achieve with its =pdf= config.
- Ah... and so much for convenience: it just hangs on large sets.
:END:
- Copy the files to single numbering
#+begin_src bash
for f in *-recto.pnm; do cp "$f" "${f/%-recto.pnm/-1up.pnm}" ; done
for f in *-verso.pnm; do cp "$f" "${f/%-verso.pnm/-1up.pnm}" ; done
#+end_src
+ *or* rotate (if needed)
#+begin_src bash
for f in *-recto.pnm; do pnmflip -cw "$f" > "${f/%-recto.pnm/-1up.pnm}" ; done
for f in *-verso.pnm; do pnmflip -ccw "$f" > "${f/%-verso.pnm/-1up.pnm}" ; done
#+end_src
- Check scan
#+begin_src bash
img2pdf -o checkscan.pdf *-1up.pnm
#+end_src
+ check the page *sequence*, to see whether the ADF skipped anything.
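+ a cheap complementary check, assuming recto and verso were scanned in
separate passes as above, is to compare the file counts of both sides:
#+begin_src bash
ls *-recto.pnm | wc -l
ls *-verso.pnm | wc -l
#+end_src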
- OCR (tesseract)
:Sources:
+ [[https://tesseract-ocr.github.io/tessdoc/FAQ.html#how-to-process-multiple-images-in-a-single-run][tessdoc - Process multiple images in a single run]]
+ [[https://tesseract-ocr.github.io/tessdoc/FAQ.html#how-do-i-integrate-original-image-file-and-detected-text-into-pdf][tessdoc - Integrate original image file and detected text into pdf]]
+ https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-545995625
:END:
#+begin_src bash
ls *-1up.pnm > ocrlist.txt
tesseract ocrlist.txt ocrtext --dpi 300 -l por -c textonly_pdf=1 pdf
#+end_src
+ we should specify the resolution explicitly, since ~tesseract~ is not able
to retrieve it directly from the =.pnm= format, and would have to guess.
+ and don't forget to specify the correct *language*.
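+ the codes accepted by =-l= are those of the language data actually
installed; ~tesseract~ can list them:
#+begin_src bash
tesseract --list-langs
#+end_src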
- Downscale pnm
#+begin_src bash
for f in *-1up.pnm; do
pnmscale 0.5 "$f" > "${f/%-1up.pnm/-low.pnm}"
done
#+end_src
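+ to confirm the downscale did what was expected, the dimensions of the
resulting files can be inspected with ~pamfile~ (or the older ~pnmfile~)
from netpbm:
#+begin_src bash
pamfile *-low.pnm | head
#+end_src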
- Convert pnm to tif
#+begin_src bash
for f in *-low.pnm; do
pnmtotiff "$f" > "${f/%.pnm/.tif}"
done
#+end_src
+ Another alternative would be to use =pnmtopng= to convert the files to =.png=
and then bundle them into a =.pdf= using =img2pdf=. Using =zip= compression for
the tiffs, the file sizes seem to be equivalent, and =tiff= appears to be
preferred for archival purposes in general. Note, though, that the PDF does
not embed the actual file, but rather transforms the image into an internal
format: ~pdfimages~, for example, cannot know that an image in the PDF was
originally a =tiff=, and we can only extract one by explicitly specifying
the format. Hopefully these transformations are lossless... but all this
means that using either =tiff= or =png= here doesn't really make any
difference.
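+ For reference, that alternative route would be something along these
lines, replacing the tiff steps below while keeping the same output name
(a sketch, reusing the file name pattern established so far):
#+begin_src bash
for f in *-low.pnm; do
pnmtopng "$f" > "${f/%.pnm/.png}"
done
img2pdf -o ocrimage.pdf *-low.png
#+end_src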
- Convert tif to pdf
#+begin_src bash
tiffcp -c zip *-low.tif ocrimage.tif
tiff2pdf -z -r o -x 150 -y 150 -o ocrimage.pdf ocrimage.tif
#+end_src
+ I'm using =zip= compression here; the alternative would be =lzw= (not
available for ~tiff2pdf~?), but zip seems to produce smaller files
(ca. 15-20% smaller).
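+ If in doubt, the two compressions can be compared directly on the
intermediate tiff, since ~tiffcp~ does support =lzw=:
#+begin_src bash
tiffcp -c zip *-low.tif ocrimage-zip.tif
tiffcp -c lzw *-low.tif ocrimage-lzw.tif
ls -lh ocrimage-zip.tif ocrimage-lzw.tif
#+end_src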
- Check number of pages
#+begin_src bash
pdfinfo ocrimage.pdf | grep Pages
pdfinfo ocrtext.pdf | grep Pages
#+end_src
- Underlay OCR to page scans on pdf
#+begin_src bash
qpdf ocrimage.pdf --underlay ocrtext.pdf -- ocrboth.pdf
#+end_src
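+ a quick sanity check that the text layer survived the underlay is to
extract the text of the first page with ~pdftotext~ (from poppler-utils):
#+begin_src bash
pdftotext -f 1 -l 1 ocrboth.pdf - | head
#+end_src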
- Scan covers
#+begin_src bash
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="24bit Color" \
--resolution=150 \
--source="FlatBed" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--output-file="cover.pnm"
pnmflip -cw cover.pnm > cover-1up.pnm
pnmtotiff cover-1up.pnm > cover.tif
tiff2pdf -z -r o -x 150 -y 150 -o cover.pdf cover.tif
#+end_src
+ Use *the same crop coordinates* as the original scan for the regular pages.
+ If it exists, do the same for the back cover: =back.pdf=.
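+ A sketch for the back cover, assuming the same flatbed settings (the
~pnmflip~ direction may need to change, depending on how the back cover is
placed on the flatbed):
#+begin_src bash
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="24bit Color" \
--resolution=150 \
--source="FlatBed" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--output-file="back.pnm"
pnmflip -cw back.pnm > back-1up.pnm
pnmtotiff back-1up.pnm > back.tif
tiff2pdf -z -r o -x 150 -y 150 -o back.pdf back.tif
#+end_src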
- Bundle pages
#+begin_src bash
qpdf --empty --pages cover.pdf ocrboth.pdf back.pdf -- final.pdf
#+end_src
- Clean (unpaper)
:Sources:
+ https://github.com/unpaper/unpaper/blob/main/doc/basic-concepts.md
+ https://github.com/unpaper/unpaper/blob/main/doc/image-processing.md
+ https://github.com/unpaper/unpaper/blob/main/doc/file-formats.md
+ https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
+ https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy
:END:
+ I had started from the premise that pre-processing the scanned images
would be a necessity and would improve results significantly. However,
while this may be true for lower quality scans (e.g. from xerox copies),
it is *not* the case for scans of the original books made with the ADF.
There, there is no significant gain in OCR results, no significant change
in file sizes, and no significant change in the "sharpness"/"readability"
of the images either. Despite that, ~unpaper~ requires a lot of extra
effort, and carries some risk: the number of errors is significant, so the
results must be checked page by page and, of course, such a visual
examination is itself error prone. Errors abound in deskewing (with some
very bad results), are significant in centering/mask-scan too and, finally,
the noise/gray filters are to be feared for removing things from the pages
we would not want removed. All in all, only use ~unpaper~ if really needed.
+ Thus, only for sources which require it:
#+begin_src bash
unpaper --layout single --dpi 300 --verbose --overwrite \
"%04d.pnm" "%04d_unp.pnm"
#+end_src
+ The filter defaults of unpaper are usually good for a well scanned page,
such as the ones we should have at this point. One thing that may need
adjustment is when the spaces between the lines have some "shadow" gray
areas. Ideally, this is better handled by increasing the contrast at the
scanning step. But, failing that, we can improve things with tighter
settings for the gray and blur filters (see the combined example at the end
of this list). Settings such as
=--grayfilter-size 5 --grayfilter-step 2 --grayfilter-threshold 0.6= and/or
=--blurfilter-size 5 --blurfilter-step 2 --blurfilter-intensity 0.1= do
improve this kind of problem. The grayfilter is actually more effective
here, provided we are working with a true gray image, of course. The
noisefilter may also be of use, but so far I perceive little effect from
it on grayscale images (it is prominent though for B&W images).
+ Another situation which may require adjustment is the use of "highlighter"
and pencil annotations in the original (shame on me...). The grayfilter
makes ugly dents on them, but it turns out it is not a good idea to
disable the filter completely. Hence, let it work for the border areas
and restrict it for the block of text by reducing the granularity of the
filter, e.g. =--grayfilter-size 100=.
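+ By way of illustration, a run combining the command above with the
tighter gray and blur filter settings mentioned earlier would look like
this (tune the values to the material at hand):
#+begin_src bash
unpaper --layout single --dpi 300 --verbose --overwrite \
--grayfilter-size 5 --grayfilter-step 2 --grayfilter-threshold 0.6 \
--blurfilter-size 5 --blurfilter-step 2 --blurfilter-intensity 0.1 \
"%04d.pnm" "%04d_unp.pnm"
#+end_src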
- Check clean step
#+begin_src bash
img2pdf -o checkclean.pdf *_unp.pnm
#+end_src
+ check the visuals of each page, to see whether ~unpaper~ messed anything
up in some obvious way.