RobbieClarken/ebook.md

## ebook.md

      
Use Scantailor to clean up the scanned images.


Install the required tools with the following commands:


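```shell
brew install --with-libtiff leptonica
brew install jbig2enc imagemagick tesseract
gem install iconv rmagick hpricot pdfbeads
```

Note: `imagemagick` may require the `--build-from-source` option. Newer Homebrew releases may reject `--with-libtiff`; if so, try plain `brew install leptonica`.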
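Before moving on, it can help to confirm that the tools actually ended up on the `PATH`. The following is a minimal sketch; the function name `missing_tools` is hypothetical, and the two tools checked are simply the ones invoked directly later in this guide.

```shell
# Print the name of every given command that is not found on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example: check the commands used in the steps below.
missing_tools tesseract pdfbeads || echo "Install the tools listed above first." >&2
```

Add `cuneiform`, `pdftk` or `gs` to the argument list if you plan to use those steps.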

Generate OCR files with tesseract:

```shell
for f in *.tiff
do
  tesseract "$f" "$(basename "$f" .tiff)" hocr
done
```

Alternatively, you can use Cuneiform:

```shell
for f in *.tiff
do
  cuneiform -f hocr -o "$(basename "$f" .tiff).html" "$f"
done
```
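One thing to watch for: newer Tesseract releases appear to write `page.hocr` rather than `page.html`, while pdfbeads is generally reported to look for an `.html` file next to each image (which is also what the Cuneiform loop produces). Both points are assumptions about the installed versions, so check which files the OCR step actually created. If they apply, a small helper along these lines (the name `hocr_to_html` is made up) renames the output:

```shell
# Rename every *.hocr file in a directory to *.html,
# leaving a file alone if the .html version already exists.
hocr_to_html() {
  dir=${1:-.}
  for f in "$dir"/*.hocr; do
    [ -e "$f" ] || continue          # the glob matched nothing
    target="${f%.hocr}.html"
    [ -e "$target" ] || mv "$f" "$target"
  done
}

# Usage (current directory): hocr_to_html .
```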

Create a metadata file (`meta.txt` in the pdfbeads command below) that looks like:

```
Title: "How to Produce Beautiful Ebooks"
Author: "R. B. Clarken"
Subject: "Ebooks"
Keywords: "Books, Scanning, Digitising"
```
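If you produce several books, the metadata file can be generated rather than typed. This is only a sketch: `write_meta` is a hypothetical helper, and it assumes none of the values contain a double quote.

```shell
# Write a metadata file in the format shown above.
# Arguments: output path, title, author, subject, keywords.
write_meta() {
  cat > "$1" <<EOF
Title: "$2"
Author: "$3"
Subject: "$4"
Keywords: "$5"
EOF
}

# Usage:
# write_meta meta.txt "How to Produce Beautiful Ebooks" "R. B. Clarken" \
#   "Ebooks" "Books, Scanning, Digitising"
```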


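Create a table of contents file (`toc.txt` in the pdfbeads command below) that looks like the following, with nested entries indented under their parent:

```
"1 Introduction"                                                "1"
  "1.1 History"                                                 "1"
  "1.2 Choosing a scanner"                                      "3"
```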
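A malformed line in the table of contents is easy to miss. The sketch below flags lines that do not match the shape of the example (optional leading spaces, a quoted title, whitespace, a quoted page). That shape is inferred from the sample above rather than from the pdfbeads documentation, and `check_toc` is a hypothetical name.

```shell
# Print "line-number: text" for every non-blank line that is not
# <spaces>"title"<spaces>"page". Exits non-zero if any line is flagged.
check_toc() {
  awk '
    NF && !/^ *"[^"]*" +"[^"]*" *$/ { printf "%d: %s\n", NR, $0; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Usage: check_toc toc.txt && echo "toc.txt looks consistent"
```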


Determine the `--labels` parameter that gives the desired page numbering.
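As a rough guide, the value used in this guide, `0:%r;14:%D`, reads as "lowercase Roman numerals from page index 0, Arabic numerals from page index 14", i.e. 14 pages of front matter. That reading (zero-based indices, `%r` for Roman, `%D` for Arabic) is inferred from the example and should be checked against the pdfbeads documentation. Under that assumption, a hypothetical helper `labels_spec` builds the string from the number of front-matter pages:

```shell
# Build a --labels value: lowercase Roman numerals for the first N
# pages (front matter), Arabic numerals from page index N onwards.
# The spec syntax is an assumption based on the example in this guide.
labels_spec() {
  printf '0:%%r;%d:%%D\n' "$1"
}

labels_spec 14   # prints 0:%r;14:%D
```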


Put it all together with pdfbeads:


```shell
pdfbeads --toc toc.txt --labels "0:%r;14:%D" --meta meta.txt *.tiff > out.pdf
```
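If the resulting PDF has pages without a text layer, the usual cause is a page whose OCR output is missing. A quick pre-flight check, sketched here under the assumption that each hOCR file is named `<page>.html` as in the Cuneiform loop (`pages_without_ocr` is a hypothetical name):

```shell
# List TIFF files in a directory that have no hOCR file beside them.
# Assumes the hOCR file for 001.tiff is called 001.html.
pages_without_ocr() {
  dir=${1:-.}
  for f in "$dir"/*.tiff; do
    [ -e "$f" ] || continue          # the glob matched nothing
    [ -e "${f%.tiff}.html" ] || printf '%s\n' "$f"
  done
}

# Usage: pages_without_ocr .
```

An empty result means every page has a matching file.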
### Using pdftk

Download pdftk from www.pdflabs.com/tools/pdftk-server/.
#### Merge files

```shell
pdftk *.pdf cat output out.pdf
```
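Note that `*.pdf` expands in the shell's sort order, so `10.pdf` is merged before `2.pdf`. If the input files are numbered without leading zeros, padding the names first avoids that. A minimal sketch, with `pad_name` as a hypothetical helper and three digits as an arbitrary width:

```shell
# Print a zero-padded version of a purely numeric file name
# (7.pdf -> 007.pdf); any other name is printed unchanged.
pad_name() {
  case $1 in
    *.*) base=${1%.*}; ext=${1##*.} ;;
    *)   printf '%s\n' "$1"; return ;;
  esac
  case $base in
    ''|*[!0-9]*) printf '%s\n' "$1"; return ;;
  esac
  # Drop leading zeros so printf does not read the number as octal.
  while [ "${#base}" -gt 1 ] && [ "${base#0}" != "$base" ]; do
    base=${base#0}
  done
  printf '%03d.%s\n' "$base" "$ext"
}

# Usage:
# for f in *.pdf; do
#   new=$(pad_name "$f"); [ "$f" = "$new" ] || mv "$f" "$new"
# done
```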

#### Update metadata


Create a file called in.info with the following format:

```
InfoBegin
InfoKey: Author
InfoValue: R. B. Clarken
InfoBegin
InfoKey: Title
InfoValue: How to Produce Beautiful Ebooks
InfoBegin
InfoKey: Subject
InfoValue: Ebooks
InfoBegin
InfoKey: Keywords
InfoValue: Books, Scanning, Digitising
BookmarkBegin
BookmarkTitle: 1 Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: 1.1 History
BookmarkLevel: 2
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: 1.2 Choosing a scanner
BookmarkLevel: 2
BookmarkPageNumber: 3
```
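If a pdfbeads-style table of contents file already exists, the bookmark entries can be derived from it instead of written by hand. The sketch below assumes two spaces of indentation per level and copies the page numbers verbatim; pdftk counts physical pages from 1, so adjust them if the TOC uses printed page numbers. `toc_to_bookmarks` is a hypothetical name.

```shell
# Convert lines of the form  <indent>"title"   "page"  into
# pdftk bookmark records. Level = indent / 2 + 1.
toc_to_bookmarks() {
  awk '
    /"/ {
      indent = match($0, /[^ ]/) - 1     # number of leading spaces
      split($0, part, "\"")              # part[2] = title, part[4] = page
      printf "BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %d\nBookmarkPageNumber: %s\n",
             part[2], indent / 2 + 1, part[4]
    }
  ' "$1"
}

# Usage: toc_to_bookmarks toc.txt >> in.info
```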


Run:

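```shell
pdftk in.pdf update_info_utf8 in.info output out.pdf
```

### OCR Existing PDF

Rasterise the PDF to one TIFF per page with Ghostscript, run OCR on each page, then rebuild the PDF with pdfbeads:

```shell
gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=tiffg4 -sOutputFile=%03d.tiff in.pdf
for f in *.tiff; do tesseract "$f" "$(basename "$f" .tiff)" hocr; done
pdfbeads -o out.pdf *.tiff
```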