** Digitization of books
:Sources:
- File formats:
  + [[https://www.succeed-project.eu/outputs][Succeed Project - Outputs]] (Support Action Centre of Competence in
    Digitisation)
    * [[file:~/Gustavo/Library/Outros/Digitalização/Succeed_2014_Recommendations for metadata and data formats for online availability and long-term preservation.pdf][Succeed_Recommendations for data formats for long-term preservation]]
  + https://en.wikipedia.org/wiki/TIFF
:END:
- Scan
  + with ~scanimage~ batch mode:
    #+begin_src bash :results none
      scanimage --device-name="brother4:bus2;dev1" \
          --format=pnm --mode="True Gray" \
          --resolution=300 \
          --source="Automatic Document Feeder(centrally aligned)" \
          --progress \
          -l 0 -t 0 -x 210 -y 297 \
          --brightness 5 \
          --contrast 15 \
          --batch="%04d-recto.pnm" \
          --batch-increment=2 \
          --batch-start=1
    #+end_src
  * Resolution: 300 (normal), 400-600 (small font). I have done some light
    testing, and increasing the resolution to 600 does not appreciably
    improve OCR results, but does take a large toll on scan and processing
    time. I also did some light testing with smaller resolutions (150); the
    only potential gain there would be skipping the potentially distortive
    rescaling step, but the visual "feel" of both cases is
    indistinguishable, and a resolution of 300+ is recommended by
    ~tesseract~.
  * Mode: "True Gray" gives the best results, both for the OCR and for the
    visual quality of the page. "Black & White" is another possibility; in
    principle it may be useful for file size reasons, especially for xerox
    sources (unfortunately, we cannot set the contrast with it). Don't use
    "Gray[Error Diffusion]" though; if needed, use "Black & White" and a
    higher resolution.
  * Do the cropping at scan time, by setting proper scan coordinates. This
    is done once per book and should be the same for all pages, so that
    they have the same size.
  * Always include "recto/verso" in the file name, even when not scanning
    sideways (and thus no rotation is needed). It makes it easier to
    recover from ADF mistakes.
  * Adjust the numbers at =batch-increment= (2 or -2) and =batch-start= as
    needed. And take care: ~scanimage~ overwrites without warning, and
    there's no way to prevent it.
  * Brightness/contrast should be set relatively high, so that most of the
    inter-word/inter-line gray areas are gone, but not overly so. How much
    is needed depends on the color of the paper: brightness in the 0-15
    range and contrast in the 10-30 range usually produce good results.
    Take care with very light/small fonts (a superscript, a subscript, a
    derivative, etc.): they may become barely legible when brightness and
    contrast are too high.
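  * Since ~scanimage~ overwrites batch files without warning, a small
    pre-flight check can at least catch leftovers from a previous run (a
    minimal sketch, assuming the =%04d-recto.pnm= / =%04d-verso.pnm= batch
    naming used above):
    #+begin_src bash
      # Refuse to start a new batch while pages from a previous run are
      # still around; scanimage --batch would silently overwrite them.
      if ls [0-9][0-9][0-9][0-9]-*.pnm >/dev/null 2>&1; then
          echo "Existing scan files found; move or remove them first." >&2
      else
          echo "Directory clean, safe to scan."
      fi
    #+end_src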
  :Alternate_methods:
  * with ~scanadf~:
    #+begin_src bash :results none
      scanadf --device-name="brother4:bus2;dev1" \
          --mode="True Gray" \
          --resolution=300 \
          --source="Automatic Document Feeder(centrally aligned)" \
          -l 0 -t 0 -x 210 -y 297 \
          --brightness 10 \
          --contrast 20 \
          --output-file "%04d.pnm"
    #+end_src
    - This does pretty much the same as ~scanimage~ batch mode, and lets
      us run a script after each image, but we cannot change the step of
      the file name counter...
  * with ~gscan2pdf~:
    - Even though ~gscan2pdf~ is somewhat constrained in setting up
      post-processing tools, it can be convenient for organizing the
      scanning of a large set: to shuffle facing and reverse sides, and to
      batch crop the images with some visual feedback.
    - The deal breaker in the post-processing area is that ~gscan2pdf~ is
      bound to the hOCR method of interweaving the OCR text and the
      scanned image, and its results are just bad. Certainly much inferior
      to what ~tesseract~ itself is able to achieve with its =pdf= config.
    - And, convenient as it may be, it just hangs on large sets.
  :END:
- Copy the files to a single numbering
  #+begin_src bash
    for f in *-recto.pnm; do cp "$f" "${f/%-recto.pnm/-1up.pnm}" ; done
    for f in *-verso.pnm; do cp "$f" "${f/%-verso.pnm/-1up.pnm}" ; done
  #+end_src
  + *or* rotate (if needed)
    #+begin_src bash
      for f in *-recto.pnm; do pnmflip -cw "$f" > "${f/%-recto.pnm/-1up.pnm}" ; done
      for f in *-verso.pnm; do pnmflip -ccw "$f" > "${f/%-verso.pnm/-1up.pnm}" ; done
    #+end_src
- Check scan
  #+begin_src bash
    img2pdf -o checkscan.pdf *-1up.pnm
  #+end_src
  + check the page *sequence*, to make sure the ADF didn't skip anything.
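  + the sequence check can also be scripted; a sketch, assuming the
    =%04d-1up.pnm= naming and GNU ~seq~:
    #+begin_src bash
      # Compare the page numbers present on disk against a gapless
      # sequence of the same length; any difference means a skipped sheet.
      have=$(ls [0-9][0-9][0-9][0-9]-1up.pnm 2>/dev/null | sed 's/-1up\.pnm//')
      want=$(seq -f '%04g' 1 "$(echo "$have" | grep -c .)")
      if diff <(echo "$have") <(echo "$want") >/dev/null; then
          echo "Page sequence complete."
      else
          echo "Gaps in the page sequence!" >&2
      fi
    #+end_src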
- OCR (tesseract)
  :Sources:
  + [[https://tesseract-ocr.github.io/tessdoc/FAQ.html#how-to-process-multiple-images-in-a-single-run][tessdoc - Process multiple images in a single run]]
  + [[https://tesseract-ocr.github.io/tessdoc/FAQ.html#how-do-i-integrate-original-image-file-and-detected-text-into-pdf][tessdoc - Integrate original image file and detected text into pdf]]
  + https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-545995625
  :END:
  #+begin_src bash
    ls *-1up.pnm > ocrlist.txt
    tesseract ocrlist.txt ocrtext --dpi 300 -l por -c textonly_pdf=1 pdf
  #+end_src
  + we should specify the resolution explicitly, since ~tesseract~ cannot
    retrieve it directly from the =.pnm= format and would have to guess.
  + and don't forget to specify the correct *language*.
- Downscale pnm
  #+begin_src bash
    for f in *-1up.pnm; do
        pnmscale 0.5 "$f" > "${f/%-1up.pnm/-low.pnm}"
    done
  #+end_src
- Convert pnm to tif
  #+begin_src bash
    for f in *-low.pnm; do
        pnmtotiff "$f" > "${f/%.pnm/.tif}"
    done
  #+end_src
  + Another alternative would be to use =pnmtopng= to convert the files to
    =.png= and then bundle them into a =.pdf= using =img2pdf=. Using =zip=
    compression for the tiffs, the file sizes seem to be equivalent, and
    =tiff= appears to be preferred for archival purposes in general.
    (Though the PDF does not embed the actual file, but rather transforms
    the image into an internal format: ~pdfimages~, for example, cannot
    know the image in the PDF was originally a =tiff=, and we can only
    extract one by specifying the format. Hopefully these transformations
    are lossless... but all this means that using either =tiff= or =png=
    here doesn't really make any difference.)
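  + That alternative route would look something like this (a sketch,
    untested in this workflow, assuming the same =-low.pnm= naming as
    above):
    #+begin_src bash
      # Convert each downscaled page to PNG, then bundle them in order;
      # img2pdf keeps the images lossless inside the PDF container.
      for f in *-low.pnm; do
          pnmtopng "$f" > "${f/%.pnm/.png}"
      done
      img2pdf -o ocrimage.pdf *-low.png
    #+end_src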
- Convert tif to pdf
  #+begin_src bash
    tiffcp -c zip *-low.tif ocrimage.tif
    tiff2pdf -z -r o -x 150 -y 150 -o ocrimage.pdf ocrimage.tif
  #+end_src
  + I'm using =zip= compression here; the alternative would be =lzw= (not
    available for ~tiff2pdf~?), but zip seems to produce smaller files
    (ca. 15-20% smaller).
- Check number of pages
  #+begin_src bash
    pdfinfo ocrimage.pdf | grep Pages
    pdfinfo ocrtext.pdf | grep Pages
  #+end_src
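  + The eyeball comparison above can also be automated; a hedged sketch
    (assumes ~pdfinfo~ from poppler-utils and its =Pages:= output line):
    #+begin_src bash
      # Compare the page counts before the underlay step; qpdf would pair
      # pages incorrectly if they differ.
      pages_img=$(pdfinfo ocrimage.pdf | awk '/^Pages:/ {print $2}')
      pages_txt=$(pdfinfo ocrtext.pdf | awk '/^Pages:/ {print $2}')
      if [ "$pages_img" = "$pages_txt" ]; then
          echo "Page counts match: $pages_img"
      else
          echo "Page count mismatch: image=$pages_img text=$pages_txt" >&2
      fi
    #+end_src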
- Underlay the OCR text beneath the page scans in the pdf
  #+begin_src bash
    qpdf ocrimage.pdf --underlay ocrtext.pdf -- ocrboth.pdf
  #+end_src
- Scan covers
  #+begin_src bash
    scanimage --device-name="brother4:bus2;dev1" \
        --format=pnm --mode="24bit Color" \
        --resolution=150 \
        --source="FlatBed" \
        --progress \
        -l 0 -t 0 -x 210 -y 297 \
        --output-file="cover.pnm"
    pnmflip -cw cover.pnm > cover-1up.pnm
    pnmtotiff cover-1up.pnm > cover.tif
    tiff2pdf -z -r o -x 150 -y 150 -o cover.pdf cover.tif
  #+end_src
  + Use *the same crop coordinates* as in the original scan of the regular
    pages.
  + If it exists, do the same for the back cover: =back.pdf=.
- Bundle pages
  #+begin_src bash
    qpdf --empty --pages cover.pdf ocrboth.pdf back.pdf -- final.pdf
  #+end_src
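  + Since there may be no back cover, the bundling can be made conditional
    (a sketch; ~qpdf~ errors out on a missing input file):
    #+begin_src bash
      # Include back.pdf only when a back cover was actually scanned.
      if [ -f back.pdf ]; then
          qpdf --empty --pages cover.pdf ocrboth.pdf back.pdf -- final.pdf
      else
          qpdf --empty --pages cover.pdf ocrboth.pdf -- final.pdf
      fi
    #+end_src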
- Clean (unpaper)
  :Sources:
  + https://github.com/unpaper/unpaper/blob/main/doc/basic-concepts.md
  + https://github.com/unpaper/unpaper/blob/main/doc/image-processing.md
  + https://github.com/unpaper/unpaper/blob/main/doc/file-formats.md
  + https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
  + https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy
  :END:
  + I had started from the premise that pre-processing the scanned images
    would be a necessity and would improve results significantly. However,
    while this may be true for lesser quality scans (e.g. from xerox
    copies), it is *not* the case for scans made from the original books
    with the ADF. In that case, there is no significant gain in OCR
    results, no significant change in file sizes, and no significant
    change in the "sharpness"/"readability" of the images. Despite that,
    ~unpaper~ requires a lot of extra effort, and carries some risk: the
    number of errors is significant, so the results must be checked page
    by page, and, of course, such visual examination is itself error
    prone. Errors abound in deskewing (with some very bad results), are
    significant in centering/mask-scan too and, finally, the noise/gray
    filters risk removing things from the pages we would not want removed.
    All in all, only use ~unpaper~ if really needed.
  + Thus, only for sources which require it:
    #+begin_src bash
      unpaper --layout single --dpi 300 --verbose --overwrite \
          "%04d.pnm" "%04d_unp.pnm"
    #+end_src
  + The filter defaults of unpaper are usually good for a well scanned
    page, such as the ones we should have at this point. One thing that
    may need adjustment is when the spaces between the lines have some
    "shadow" gray areas. Ideally, this is better handled by increasing the
    contrast in the scanning step. But, failing that, we can improve
    things with tighter settings for the gray and blur filters. Settings
    such as =--grayfilter-size 5 --grayfilter-step 2
    --grayfilter-threshold 0.6= and/or =--blurfilter-size 5
    --blurfilter-step 2 --blurfilter-intensity 0.1= do improve this kind
    of problem. The grayfilter is actually the more effective one here,
    provided we are working with a true gray image, of course. The
    noisefilter may also be of use, but so far I perceive little effect
    from it on grayscale images (it is prominent, though, for B&W images).
  + Another situation which may require adjustment is the use of
    "highlighter" and pencil annotations in the original (shame on me...).
    The grayfilter makes ugly dents on them, but it turns out it is not a
    good idea to disable the filter completely. Hence, let it work on the
    border areas and restrict it in the block of text by reducing the
    granularity of the filter, e.g. =--grayfilter-size 100=.
- Check clean step
  #+begin_src bash
    img2pdf -o checkclean.pdf *_unp.pnm
  #+end_src
  + check the visuals of each page, to see if ~unpaper~ did not miss
    anything obvious.