@RobbieClarken
Last active July 6, 2020 12:08
Making an ebook from scanned images
  1. Use Scantailor to clean up the scanned images.

  2. Install the required tools with the following commands:

brew install --with-libtiff leptonica
brew install jbig2enc imagemagick tesseract
gem install iconv rmagick hpricot pdfbeads

Note: imagemagick may require the --build-from-source option.
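Before moving on, it can help to confirm that the installed tools are actually on the PATH. The following is a minimal sketch; the function name missing_tools is made up for this example and is not part of any of the packages above.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any package.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, running missing_tools tesseract pdfbeads prints nothing when both commands are available, and prints the name of each one that is missing otherwise.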

  3. Generate hOCR files with tesseract:
for f in *.tiff
do
  tesseract "$f" "$(basename "$f" .tiff)" hocr
done

Alternatively, you can use Cuneiform:

for f in *.tiff
do
  cuneiform -f hocr -o "$(basename "$f" .tiff).html" "$f"
done
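One thing to check after the OCR step is the extension of the generated files. The sketch below assumes that your Tesseract version writes page.hocr while pdfbeads looks for page.html next to each image; both points are assumptions, so verify them against your installed versions first. The function name hocr_to_html is made up for this example.

```shell
#!/bin/sh
# Copy each NAME.hocr to NAME.html unless NAME.html already exists.
# Assumption: pdfbeads expects .html and tesseract produced .hocr.
hocr_to_html() {
  for f in *.hocr; do
    [ -e "$f" ] || continue            # the glob matched nothing
    html="${f%.hocr}.html"
    [ -e "$html" ] || cp "$f" "$html"  # never overwrite an existing file
  done
}
```

Run hocr_to_html in the directory that holds the TIFF files. Copying rather than renaming keeps the original OCR output in place.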
  4. Create a metadata file (meta.txt in the pdfbeads command below) that looks like:
Title: "How to Produce Beautiful Ebooks"
Author: "R. B. Clarken"
Subject: "Ebooks"
Keywords: "Books, Scanning, Digitising"
  5. Create a table of contents file (toc.txt in the pdfbeads command below), with sub-entries indented under their parent, that looks like:
"1 Introduction"                                                "1"
  "1.1 History"                                                 "1"
  "1.2 Choosing a scanner"                                      "3"
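  6. Determine the --labels parameter that gives the desired page numbering. In the command below, "0:%r;14:%D" labels pages with lowercase Roman numerals starting at the first page (index 0) and switches to Arabic numerals at index 14.

  7. Put it all together with pdfbeads: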

pdfbeads --toc toc.txt --labels "0:%r;14:%D" --meta meta.txt *.tiff > out.pdf
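The two hand-written inputs to this command, the lines of toc.txt and the --labels string, can also be generated. The sketch below assumes that TOC entries are quoted, whitespace-separated fields nested by two-space indentation (as in the sample above) and that label indices are zero-based (as the leading 0 suggests). The function names toc_line and labels_spec are made up for this example.

```shell
#!/bin/sh
# Print one TOC entry: toc_line LEVEL TITLE PAGE
# Each level below the first adds two spaces of indentation.
toc_line() {
  level=$1 title=$2 page=$3
  indent=''
  i=1
  while [ "$i" -lt "$level" ]; do
    indent="$indent  "
    i=$((i + 1))
  done
  printf '%s"%s" "%s"\n' "$indent" "$title" "$page"
}

# Print a --labels spec: Roman numerals for the first N pages,
# Arabic numerals from index N onwards (assumes zero-based indices).
labels_spec() {
  front_pages=$1
  printf '0:%%r;%s:%%D\n' "$front_pages"
}
```

For example, toc_line 2 "1.1 History" 1 prints an indented entry suitable for appending to toc.txt, and labels_spec 14 prints the string used in the command above.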

Using pdftk

Download pdftk from www.pdflabs.com/tools/pdftk-server/.

Merge files

pdftk *.pdf cat output out.pdf
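Note that on a second run in the same directory, *.pdf also matches out.pdf, so the previous result is fed back in as an input. A minimal sketch that skips the target file is shown below; the function name merge_pdfs and the PDFTK override variable are made up for this example.

```shell
#!/bin/sh
# Merge every PDF in the current directory except the output file.
# Usage: merge_pdfs out.pdf
# PDFTK is a hypothetical override for the pdftk binary to run.
merge_pdfs() {
  target=$1
  set --                               # reuse "$@" as the input list
  for f in *.pdf; do
    [ -e "$f" ] || continue            # the glob matched nothing
    [ "$f" = "$target" ] && continue   # never feed the output back in
    set -- "$@" "$f"
  done
  [ "$#" -gt 0 ] || { echo "no input PDFs found" >&2; return 1; }
  "${PDFTK:-pdftk}" "$@" cat output "$target"
}
```

Files are merged in the shell's glob order, the same order the plain *.pdf command would use.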

Update metadata

  1. Create a file called in.info with the following format:
InfoBegin
InfoKey: Author
InfoValue: R. B. Clarken
InfoBegin
InfoKey: Title
InfoValue: How to Produce Beautiful Ebooks
InfoBegin
InfoKey: Subject
InfoValue: Ebooks
InfoBegin
InfoKey: Keywords
InfoValue: Books, Scanning, Digitising
BookmarkBegin
BookmarkTitle: 1 Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: 1.1 History
BookmarkLevel: 2
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: 1.2 Choosing a scanner
BookmarkLevel: 2
BookmarkPageNumber: 3
  2. Run:
pdftk in.pdf update_info_utf8 in.info output out.pdf
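Since every entry in in.info follows the same few-line pattern, the file can be generated rather than typed by hand. This is a minimal sketch; the function names info_entry and bookmark are made up for this example.

```shell
#!/bin/sh
# Print one metadata stanza: info_entry KEY VALUE
info_entry() {
  printf 'InfoBegin\nInfoKey: %s\nInfoValue: %s\n' "$1" "$2"
}

# Print one bookmark stanza: bookmark TITLE LEVEL PAGE
bookmark() {
  printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' "$1" "$2" "$3"
}
```

For example, { info_entry Author "R. B. Clarken"; bookmark "1 Introduction" 1 1; } > in.info writes the first of each kind of entry. To start from a PDF's existing metadata instead, pdftk in.pdf dump_data_utf8 output in.info writes it out in the same format.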

OCR Existing PDF

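# Render each page of in.pdf to a 600 dpi bilevel TIFF (001.tiff, 002.tiff, ...)
gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=tiffg4 -sOutputFile=%03d.tiff in.pdf
# OCR each page image to hOCR
for f in *.tiff; do tesseract "$f" "$(basename "$f" .tiff)" hocr; done
# Combine the page images and hOCR text into a searchable PDF
pdfbeads -o out.pdf *.tiff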