- Use Scantailor to clean up the scanned images.
- Install the required tools with the following commands:
brew install --with-libtiff leptonica
brew install jbig2enc imagemagick tesseract
gem install iconv rmagick hpricot pdfbeads
Note: imagemagick may require the --build-from-source option.
- Generate OCR files with tesseract:
for f in *.tiff
do
tesseract "$f" "$(basename "$f" .tiff)" hocr
done
Alternatively, you can use Cuneiform:
for f in *.tiff
do
cuneiform -f hocr -o "$(basename "$f" .tiff).html" "$f"
done
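Depending on the tesseract version, the hOCR output may be written as NAME.hocr rather than NAME.html. The sketch below assumes that pdfbeads looks for NAME.html next to NAME.tiff (the cuneiform command above writes that name explicitly); check this against your installed versions. The function name hocr_to_html is made up for this example.

```shell
# Copy each NAME.hocr to NAME.html unless NAME.html already exists.
# Assumption: pdfbeads expects the hOCR file to be called NAME.html.
hocr_to_html() {
  dir=${1:-.}
  for f in "$dir"/*.hocr
  do
    [ -e "$f" ] || continue          # glob did not match: nothing to do
    html="${f%.hocr}.html"
    [ -e "$html" ] || cp "$f" "$html"
  done
}

# Usage: run in the directory that holds the scans.
#   hocr_to_html .
```

Existing .html files are left alone, so rerunning it does not overwrite cuneiform output.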
- Create a metadata file that looks like:
Title: "How to Produce Beautiful Ebooks"
Author: "R. B. Clarken"
Subject: "Ebooks"
Keywords: "Books, Scanning, Digitising"
- Create a table of contents file that looks like this (indentation sets the nesting level):
"1 Introduction" "1"
  "1.1 History" "1"
  "1.2 Choosing a scanner" "3"
- Determine the --labels parameter that gives the desired page numbering.
- Put it all together with pdfbeads:
pdfbeads --toc toc.txt --labels "0:%r;14:%D" --meta meta.txt *.tiff > out.pdf
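The label string passed to --labels reads as a semicolon-separated list of ranges, each giving the index of a first page and a numbering format. Assuming the first page has index 0 and that %r and %D select lower-case roman and arabic numerals (worth checking against the pdfbeads documentation), a small helper can build the string from the index of the first main-text page. The name label_spec is made up for this sketch.

```shell
# Build a label spec: roman numerals for the front matter, arabic afterwards.
# $1 = index of the first main-text page (assumed to count from 0).
label_spec() {
  printf '0:%%r;%d:%%D\n' "$1"
}

# Usage:
#   pdfbeads --toc toc.txt --labels "$(label_spec 14)" --meta meta.txt *.tiff > out.pdf
```

With 14 front-matter pages, label_spec 14 yields the same string as in the command above.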
Download pdftk from www.pdflabs.com/tools/pdftk-server/.
To join several PDF files into one:
pdftk *.pdf cat output out.pdf
- Create a file called in.info with the following format:
InfoBegin
InfoKey: Author
InfoValue: R. B. Clarken
InfoBegin
InfoKey: Title
InfoValue: How to Produce Beautiful Ebooks
InfoBegin
InfoKey: Subject
InfoValue: Ebooks
InfoBegin
InfoKey: Keywords
InfoValue: Books, Scanning, Digitising
BookmarkBegin
BookmarkTitle: 1 Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: 1.1 History
BookmarkLevel: 2
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: 1.2 Choosing a scanner
BookmarkLevel: 2
BookmarkPageNumber: 3
- Run:
pdftk in.pdf update_info_utf8 in.info output out.pdf
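Writing in.info by hand gets tedious for a long table of contents. As a rough sketch, the records can be printed by two small shell functions; the names info_entry, bookmark_entry and make_info are invented for this example, and the values are the ones from the sample file.

```shell
# Print one metadata record: key, value.
info_entry() {
  printf 'InfoBegin\nInfoKey: %s\nInfoValue: %s\n' "$1" "$2"
}

# Print one bookmark record: title, level (1 = top level), page number.
bookmark_entry() {
  printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' "$1" "$2" "$3"
}

# Reproduce the sample in.info shown in this section.
make_info() {
  info_entry Author "R. B. Clarken"
  info_entry Title "How to Produce Beautiful Ebooks"
  info_entry Subject "Ebooks"
  info_entry Keywords "Books, Scanning, Digitising"
  bookmark_entry "1 Introduction" 1 1
  bookmark_entry "1.1 History" 2 1
  bookmark_entry "1.2 Choosing a scanner" 2 3
}

# Usage:
#   make_info > in.info
```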
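To add an OCR text layer to an existing PDF, convert it to TIFF images, run OCR, and rebuild the PDF:
gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=tiffg4 -sOutputFile=%03d.tiff in.pdf
for f in *.tiff; do tesseract "$f" "$(basename "$f" .tiff)" hocr; done
pdfbeads -o out.pdf *.tiff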
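These commands fail midway if one of the tools is missing, leaving a directory full of TIFF files. A minimal pre-flight check, assuming the tools are installed under the command names used above (gs, tesseract, pdfbeads); the function name missing_tools is made up for this sketch.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"
  do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage:
#   missing=$(missing_tools gs tesseract pdfbeads)
#   [ -z "$missing" ] || echo "missing tools: $missing"
```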