tskinn/README.md

## README.md

      
    Caveat
You're not going to get a beautiful EPUB out the other end - if that's what you're looking for, expect to do some manual clean-up work yourself.
Basic order of operations:

Convert your PDF to an OCR-friendly format
OCR that shit into plaintext
Convert that plaintext into your format of choice (in this case, an EPUB)

Tools of the trade:

Ghostscript (our PDF wrangler, or ImageMagick if you prefer)
Tesseract (open source OCR)
Pandoc

Install

These instructions assume you're running macOS with Homebrew

you may already have Ghostscript (try gs -v); if not, brew install ghostscript
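brew install tesseract --all-languages - then go get a snack (newer Homebrew releases dropped the --all-languages option; there, brew install tesseract tesseract-lang should get you the same language data)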
brew install pandoc - this one's a beauty
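
Before moving on, it can save some head-scratching to confirm all three tools are actually on your PATH. A minimal sketch, assuming a POSIX shell; the missing_tools function name is made up for this example:

```shell
#!/bin/sh
# Print the name of every listed command that is not on PATH.
# Returns 0 when everything is present, 1 otherwise.
# (missing_tools is a hypothetical helper, not part of any of these tools.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Call it as missing_tools gs tesseract pandoc - no output means you're good to go.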

Convert it

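We're going to convert our PDF into a TIFF. Keep in mind this is really only useful for PDFs that consist solely of scanned page images. If your PDF already contains real (selectable) text, rasterizing it and running OCR over the result will only lose accuracy, so you might want to avoid outputting to a raster image.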
gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif myscan.pdf -c quit
Notes:

the -r flag controls the resolution in DPI (300x300 here)
-sDEVICE=tiffg4 outputs 1-bit black & white (Group 4 fax compression), which will suffice for our EPUB needs

You should really man gs, though.
Read it

Tesseract is going to read our image and spit out text - it's glorious.
tesseract -l eng mybook.tif mybook
(-l eng tells Tesseract that mybook.tif contains English text; the last argument is the output base name, so the result lands in mybook.txt)
Massage it

This is the fun / awful part (depending on your personality). The output text won't contain any structure, and an ebook is anything but unstructured.
Pandoc is quite awesome at converting Markdown to EPUB (among many other formats), so I'd stick with Markdown as your source format. Basically you'll want to skim your mybook.txt file and throw a # in front of chapter headers, remove extraneous text (e.g. page headers & footers), and add in any relevant images (pandoc resolves image paths relative to the directory you run it from, so run it next to your txt file, and it bundles the images into the EPUB!).
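
If the book is long, a rough first pass with awk can take some of the drudgery out of this. The sketch below is only that - a sketch. It assumes chapter headings sit alone on a line and look like "CHAPTER 1" or "Chapter Two", and that page numbers sit alone on a line too; your scan will almost certainly differ, so adjust the patterns and eyeball the result. The clean_ocr_text name is made up for this example.

```shell
#!/bin/sh
# Rough first-pass cleanup of OCR text: reads stdin, writes Markdown to stdout.
# Assumed (hypothetical) conventions: a heading is a line holding only
# "CHAPTER <word>" or "Chapter <word>", and a page number is a line of digits.
clean_ocr_text() {
    awk '
        { gsub(/\f/, "") }                      # drop form feeds (page breaks)
        /^[ \t]*[0-9]+[ \t]*$/ { next }         # skip bare page-number lines
        /^[ \t]*(CHAPTER|Chapter)[ \t]+[A-Za-z0-9]+[ \t]*$/ {
            sub(/^[ \t]+/, "")                  # trim leading whitespace
            print "# " $0                       # mark as a top-level heading
            next
        }
        { print }                               # everything else passes through
    '
}
```

Something like clean_ocr_text < mybook.txt > mybook-clean.txt keeps the raw OCR output around, so you can diff the two and catch anything the patterns ate by mistake.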
Then:
pandoc mybook.txt -o mybook.epub
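
If you end up doing this for more than one book, the commands above wrap up nicely into a couple of shell functions. This is a sketch under assumptions, not a polished tool: the run, scan_to_text and text_to_epub names and the DRY_RUN variable are invented here, and the flags are simply the ones used in this README. The two halves are kept separate on purpose, since the manual clean-up happens in between.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1 (handy for checking first).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Steps 1 and 2: PDF -> TIFF -> plain text (writes <base>.tif and <base>.txt).
scan_to_text() {
    pdf=$1     # input scan, e.g. myscan.pdf
    base=$2    # output base name, e.g. mybook
    run gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH \
        -sOutputFile="$base.tif" "$pdf" -c quit &&
    run tesseract -l eng "$base.tif" "$base"
}

# Step 3: edited text -> EPUB (run this after cleaning up <base>.txt by hand).
text_to_epub() {
    base=$1
    run pandoc "$base.txt" -o "$base.epub"
}
```

Try DRY_RUN=1 scan_to_text myscan.pdf mybook first to see exactly what would run, then drop the DRY_RUN=1 to do it for real.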