** Digitization of books
:Sources:
- File formats:
+ [[https://www.succeed-project.eu/outputs][Succeed Project - Outputs]] (Support Action Centre of Competence in
Digitisation)
* [[file:~/Gustavo/Library/Outros/Digitalização/Succeed_2014_Recommendations for metadata and data formats for online availability and long-term preservation.pdf][Succeed_Recommendations for data formats for long-term preservation]]
+ https://en.wikipedia.org/wiki/TIFF
:END:
- Scan
+ with ~scanimage~ batch mode:
#+begin_src bash :results none
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--brightness 5 \
--contrast 15 \
--batch="%04d-recto.pnm" \
--batch-increment=2 \
--batch-start=1
#+end_src
* Resolution: 300 (normal), 400-600 (small font). I have done some light
testing, and increasing the resolution to 600 does not appreciably improve
OCR results, but does take a large toll on scan and processing time. I
also did some light testing with a smaller resolution (150), and the only
potential gain from it would be skipping the potentially distortive step
of rescaling the image; but the visual "feel" of both cases is
indistinguishable, and a resolution of 300+ is recommended by ~tesseract~.
* Mode: "True Gray" for best results, both of the OCR and the visual of
the page. "Black & White" is another possibility, in principle, may be
useful for file size reasons and, specially for xerox source
(unfortunately, we cannot set contrast with it). Don't use "Gray[Error
Diffusion]" though, if needed, use "Black & White" and a higher
resolution.
* Do the cropping at scan time, by setting proper scan coordinates. This
is done once per book and should be the same for all pages, so that they
all have the same size.
* Always include the "recto/verso" in the file name, even when not
scanning sideways (and thus, no rotation is needed). It is easier to
recover from ADF mistakes.
* Adjust the values of =batch-increment= (2 or -2) and =batch-start= as
needed (see the verso sketch below). And take care: ~scanimage~ overwrites
existing files without warning, and there's no way to prevent it.
* Brightness/contrast should be set relatively high, so that most of the
inter-word/inter-line gray areas are gone, but not overly so. How much is
needed depends on the color of the paper: brightness in the 0-15 range and
contrast in the 10-30 range usually produce good results. Take care with
very light/small fonts (a superscript, a subscript, a derivative, etc.):
they may become barely legible when brightness and contrast are too high.
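* For instance, after the recto pass above, the verso pass could look like
the sketch below. It assumes the stack is fed back into the ADF so that the
last verso side comes out first; =--batch-start= should then be the highest
even page number of the set (250 is just a placeholder value).
#+begin_src bash :results none
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--brightness 5 \
--contrast 15 \
--batch="%04d-verso.pnm" \
--batch-increment=-2 \
--batch-start=250
#+end_src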
:Alternate_methods:
* with ~scanadf~:
#+begin_src bash :results none
scanadf --device-name="brother4:bus2;dev1" \
--mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
-l 0 -t 0 -x 210 -y 297 \
--brightness 10 \
--contrast 20 \
--output-file "%04d.pnm"
#+end_src
- This does pretty much the same as ~scanimage~ batch mode; we can run
a script after each image, but cannot change the step of the file name
counter...
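- As an illustration of the per-image script, a minimal sketch:
=rotate-page.sh= is a hypothetical helper here, and ~scanadf~ passes it the
image file name as its only argument.
#+begin_src bash :results none
# Hypothetical per-page hook: scanadf runs it once for each scanned image.
cat > rotate-page.sh <<'EOF'
#!/bin/sh
# $1 is the freshly scanned image; write a rotated copy next to it.
pnmflip -cw "$1" > "${1%.pnm}-1up.pnm"
EOF
chmod +x rotate-page.sh
scanadf --device-name="brother4:bus2;dev1" \
--mode="True Gray" \
--resolution=300 \
--source="Automatic Document Feeder(centrally aligned)" \
-l 0 -t 0 -x 210 -y 297 \
--brightness 10 \
--contrast 20 \
--scan-script ./rotate-page.sh \
--output-file "%04d.pnm"
#+end_src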
* with ~gscan2pdf~:
- Even though ~gscan2pdf~ is somewhat constrained in how post-processing
tools can be set up, it can be convenient to organize the scanning of a
large set: to shuffle facing and reverse sides, and to batch crop the
images with some visual feedback.
- The deal breaker in the post-processing area is that ~gscan2pdf~ is
bound to the hOCR method of interweaving the OCR text and the scanned
image, and its results are just bad. Certainly much inferior to what
~tesseract~ itself is able to achieve with its =pdf= config.
- Ah... and so much for convenience: it just hangs on large sets.
:END:
- Copy the files to single numbering
#+begin_src bash
for f in *-recto.pnm; do cp "$f" "${f/%-recto.pnm/-1up.pnm}" ; done
for f in *-verso.pnm; do cp "$f" "${f/%-verso.pnm/-1up.pnm}" ; done
#+end_src
+ *or* rotate (if needed)
#+begin_src bash
for f in *-recto.pnm; do pnmflip -cw "$f" > "${f/%-recto.pnm/-1up.pnm}" ; done
for f in *-verso.pnm; do pnmflip -ccw "$f" > "${f/%-verso.pnm/-1up.pnm}" ; done
#+end_src
- Check scan
#+begin_src bash
img2pdf -o checkscan.pdf *-1up.pnm
#+end_src
+ check the page *sequence*, to see whether the ADF skipped anything.
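+ a cheap complementary check, assuming recto and verso were scanned in
separate passes as above, is to compare the file counts of both sides:
#+begin_src bash
ls *-recto.pnm | wc -l
ls *-verso.pnm | wc -l
#+end_src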
- OCR (tesseract)
:Sources:
+ [[https://tesseract-ocr.github.io/tessdoc/FAQ.html#how-to-process-multiple-images-in-a-single-run][tessdoc - Process multiple images in a single run]]
+ [[https://tesseract-ocr.github.io/tessdoc/FAQ.html#how-do-i-integrate-original-image-file-and-detected-text-into-pdf][tessdoc - Integrate original image file and detected text into pdf]]
+ https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-545995625
:END:
#+begin_src bash
ls *-1up.pnm > ocrlist.txt
tesseract ocrlist.txt ocrtext --dpi 300 -l por -c textonly_pdf=1 pdf
#+end_src
+ we should specify the resolution explicitly, since ~tesseract~ is not able
to retrieve it directly from the =.pnm= format, and would have to guess.
+ and don't forget to specify the correct *language*.
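+ the codes accepted by =-l= are those of the language data actually
installed; ~tesseract~ can list them:
#+begin_src bash
tesseract --list-langs
#+end_src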
- Downscale pnm
#+begin_src bash
for f in *-1up.pnm; do
pnmscale 0.5 "$f" > "${f/%-1up.pnm/-low.pnm}"
done
#+end_src
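+ to confirm the downscale did what was expected, the dimensions of the
resulting files can be inspected with ~pamfile~ (or the older ~pnmfile~)
from netpbm:
#+begin_src bash
pamfile *-low.pnm | head
#+end_src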
- Convert pnm to tif
#+begin_src bash
for f in *-low.pnm; do
pnmtotiff "$f" > "${f/%.pnm/.tif}"
done
#+end_src
+ Another alternative would be to use =pnmtopng= to convert the files to =.png=
and then bundle them into a =.pdf= using =img2pdf=. Using =zip= compression for
the tiffs, the file sizes seem to be equivalent, and =tiff= appears to be
preferred for archival purposes in general. Note, though, that the PDF does
not embed the actual file, but rather transforms the image into an internal
format: ~pdfimages~, for example, cannot know that an image in the PDF was
originally a =tiff=, and we can only extract one by explicitly specifying
the format. Hopefully these transformations are lossless... but all this
means that using either =tiff= or =png= here doesn't really make any
difference.
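+ For reference, that alternative route would be something along these
lines, replacing the tiff steps below while keeping the same output name
(a sketch, reusing the file name pattern established so far):
#+begin_src bash
for f in *-low.pnm; do
pnmtopng "$f" > "${f/%.pnm/.png}"
done
img2pdf -o ocrimage.pdf *-low.png
#+end_src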
- Convert tif to pdf
#+begin_src bash
tiffcp -c zip *-low.tif ocrimage.tif
tiff2pdf -z -r o -x 150 -y 150 -o ocrimage.pdf ocrimage.tif
#+end_src
+ I'm using =zip= compression here; the alternative would be =lzw= (not
available for ~tiff2pdf~?), but zip seems to produce smaller files
(ca. 15-20% smaller).
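+ If in doubt, the two compressions can be compared directly on the
intermediate tiff, since ~tiffcp~ does support =lzw=:
#+begin_src bash
tiffcp -c zip *-low.tif ocrimage-zip.tif
tiffcp -c lzw *-low.tif ocrimage-lzw.tif
ls -lh ocrimage-zip.tif ocrimage-lzw.tif
#+end_src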
- Check number of pages
#+begin_src bash
pdfinfo ocrimage.pdf | grep Pages
pdfinfo ocrtext.pdf | grep Pages
#+end_src
- Underlay OCR to page scans on pdf
#+begin_src bash
qpdf ocrimage.pdf --underlay ocrtext.pdf -- ocrboth.pdf
#+end_src
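+ a quick sanity check that the text layer survived the underlay is to
extract the text of the first page with ~pdftotext~ (from poppler-utils):
#+begin_src bash
pdftotext -f 1 -l 1 ocrboth.pdf - | head
#+end_src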
- Scan covers
#+begin_src bash
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="24bit Color" \
--resolution=150 \
--source="FlatBed" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--output-file="cover.pnm"
pnmflip -cw cover.pnm > cover-1up.pnm
pnmtotiff cover-1up.pnm > cover.tif
tiff2pdf -z -r o -x 150 -y 150 -o cover.pdf cover.tif
#+end_src
+ Use *the same crop coordinates* as the original scan for the regular pages.
+ If it exists, do the same for the back cover: =back.pdf=.
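+ A sketch for the back cover, assuming the same flatbed settings (the
~pnmflip~ direction may need to change, depending on how the back cover is
placed on the flatbed):
#+begin_src bash
scanimage --device-name="brother4:bus2;dev1" \
--format=pnm --mode="24bit Color" \
--resolution=150 \
--source="FlatBed" \
--progress \
-l 0 -t 0 -x 210 -y 297 \
--output-file="back.pnm"
pnmflip -cw back.pnm > back-1up.pnm
pnmtotiff back-1up.pnm > back.tif
tiff2pdf -z -r o -x 150 -y 150 -o back.pdf back.tif
#+end_src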
- Bundle pages
#+begin_src bash
qpdf --empty --pages cover.pdf ocrboth.pdf back.pdf -- final.pdf
#+end_src
- Clean (unpaper)
:Sources:
+ https://github.com/unpaper/unpaper/blob/main/doc/basic-concepts.md
+ https://github.com/unpaper/unpaper/blob/main/doc/image-processing.md
+ https://github.com/unpaper/unpaper/blob/main/doc/file-formats.md
+ https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
+ https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy
:END:
+ I had started from the premise that pre-processing the scanned images
would be a necessity and would improve results significantly. However,
while this may be true for lower quality scans (e.g. from xerox copies),
it is *not* the case for scans of the original books made with the ADF.
There, there is no significant gain in OCR results, no significant change
in file sizes, and no significant change in the "sharpness"/"readability"
of the images either. Despite that, ~unpaper~ requires a lot of extra
effort, and carries some risk: the number of errors is significant, so the
results must be checked page by page and, of course, such a visual
examination is itself error prone. Errors abound in deskewing (with some
very bad results), are significant in centering/mask-scan too and, finally,
the noise/gray filters are to be feared for removing things from the pages
we would not want removed. All in all, only use ~unpaper~ if really needed.
+ Thus, only for sources which require it:
#+begin_src bash
unpaper --layout single --dpi 300 --verbose --overwrite \
"%04d.pnm" "%04d_unp.pnm"
#+end_src
+ The filter defaults of unpaper are usually good for a well scanned page,
such as the ones we should have at this point. One thing that may need
adjustment is when the spaces between the lines have some "shadow" gray
areas. Ideally, this is better handled by increasing the contrast at the
scanning step. But, failing that, we can improve things with tighter
settings for the gray and blur filters (see the combined example at the end
of this list). Settings such as
=--grayfilter-size 5 --grayfilter-step 2 --grayfilter-threshold 0.6= and/or
=--blurfilter-size 5 --blurfilter-step 2 --blurfilter-intensity 0.1= do
improve this kind of problem. The grayfilter is actually more effective
here, provided we are working with a true gray image, of course. The
noisefilter may also be of use, but so far I perceive little effect from
it on grayscale images (it is prominent though for B&W images).
+ Another situation which may require adjustment is the use of "highlighter"
and pencil annotations in the original (shame on me...). The grayfilter
makes ugly dents on them, but it turns out it is not a good idea to
disable the filter completely. Hence, let it work for the border areas
and restrict it for the block of text by reducing the granularity of the
filter, e.g. =--grayfilter-size 100=.
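+ By way of illustration, a run combining the command above with the
tighter gray and blur filter settings mentioned earlier would look like
this (tune the values to the material at hand):
#+begin_src bash
unpaper --layout single --dpi 300 --verbose --overwrite \
--grayfilter-size 5 --grayfilter-step 2 --grayfilter-threshold 0.6 \
--blurfilter-size 5 --blurfilter-step 2 --blurfilter-intensity 0.1 \
"%04d.pnm" "%04d_unp.pnm"
#+end_src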
- Check clean step
#+begin_src bash
img2pdf -o checkclean.pdf *_unp.pnm
#+end_src
+ check the visuals of each page, to see whether ~unpaper~ messed anything
up in some obvious way.