@tskinn
Forked from sterlingwes/README.md
Created August 5, 2020 04:43
Converting a scanned PDF to EPUB ebook (or other format)

Caveat

You're not going to get a beautiful EPUB out the other end - if that's what you're looking for, expect to do some manual clean-up work yourself.

Basic order of operations:

  • Convert your PDF to an OCR-friendly format
  • OCR that shit into plaintext
  • Convert that plaintext into your format of choice (in this case, an EPUB)

Tools of the trade:

  • Ghostscript (our PDF wrangler, or ImageMagick if you prefer)
  • Tesseract (open source OCR)
  • Pandoc

Install

These instructions assume you're running macOS with Homebrew installed.

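  • you may already have Ghostscript (try gs -v); if not, brew install ghostscript
  • brew install tesseract, plus brew install tesseract-lang if you need languages other than English (the --all-languages option is gone from current Homebrew) - then go get a snack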
  • brew install pandoc - this one's a beauty
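
Before moving on, it's worth confirming all three tools are actually on your PATH. A minimal sketch; missing_tools is a made-up helper name, and it only checks that the commands exist, not which versions you have:

```shell
# Hypothetical helper: print the name of every listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only when the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: prints nothing when everything is installed.
# missing_tools gs tesseract pandoc
```

Anything it prints still needs a brew install.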

Convert it

We're going to translate our PDF into a TIFF. Keep in mind this is really only useful for PDFs that consist solely of images. If your PDF already contains a text layer, you're better off extracting that text directly than rasterizing it and OCRing it back.

gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif myscan.pdf -c quit

Notes:

  • the -r flag sets the output resolution in DPI (300 horizontal by 300 vertical here)
  • -sDEVICE=tiffg4 outputs 1-bit black & white (Group 4 fax compression), which will suffice for our EPUB needs

You should really man gs, though.
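
If you expect to fiddle with the resolution or the output device, you can wrap the command in a small function. This is only a sketch: pdf_to_tiff and the DRY_RUN variable are names invented here, and the defaults simply mirror the command above. Run gs -h to see which devices your build supports (tiffgray, for instance, gives 8-bit grayscale if it's listed).

```shell
# Hypothetical wrapper around the gs call above.
# Arguments: input PDF, output TIFF, DPI (default 300), Ghostscript device (default tiffg4).
# With DRY_RUN=1 it prints the command (without shell quoting) instead of running it.
pdf_to_tiff() {
  # Rebuild the positional parameters as the full gs command line.
  set -- gs -q -dNOPAUSE -dBATCH "-r${3:-300}x${3:-300}" \
    "-sDEVICE=${4:-tiffg4}" "-sOutputFile=$2" "$1" -c quit
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage:
# pdf_to_tiff myscan.pdf mybook.tif               # same as the command above
# pdf_to_tiff myscan.pdf mybook.tif 400 tiffgray  # higher DPI, grayscale
```

The dry-run mode is handy for checking what will be executed before committing to a long conversion.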

Read it

Tesseract is going to read our image and spit out text - it's glorious.

tesseract -l eng mybook.tif mybook

(-l eng tells Tesseract that mybook.tif contains English text; the last argument is the output base name, so the result lands in mybook.txt)
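
If your scan mixes languages, the -l option accepts several language codes joined with +, and tesseract --list-langs shows which ones are installed. A tiny sketch for building that argument; lang_arg is a made-up helper name, not part of Tesseract:

```shell
# Hypothetical helper: join language codes with "+" for tesseract's -l option.
lang_arg() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the IFS change from leaking out.
  ( IFS=+; printf '%s\n' "$*" )
}

# Usage (assumes the fra language data is installed):
# tesseract -l "$(lang_arg eng fra)" mybook.tif mybook
```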

Massage it

This is the fun / awful part (depending on your personality). The OCR output will not contain any structure, and unstructured is exactly what an ebook is not.

Pandoc is quite awesome at converting Markdown to EPUB (among many other formats), so Markdown is the markup I'd stick with. Basically you'll want to skim your mybook.txt file and throw a # in front of chapter headers, remove extraneous text (i.e. page headers & footers), and add in any relevant images (pandoc resolves image paths relative to the directory you run it from, unless you pass --resource-path, and puts them in the EPUB!).
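
A first pass of that clean-up can be automated, though you'll still want to read through the result. The sketch below makes some loud assumptions about your scan: chapter headings start with "Chapter" or "CHAPTER" followed by a number, page footers are lines holding nothing but a page number, and any form-feed characters in the OCR output are page breaks you don't need. clean_ocr_text is a made-up name; adjust the patterns to your book.

```shell
# Hypothetical first-pass cleanup; reads OCR text on stdin, writes Markdown-ish text to stdout.
clean_ocr_text() {
  # 1. drop form-feed characters (assumed to be page separators)
  # 2. delete lines that contain only digits (assumed to be page numbers)
  # 3. put "# " in front of lines that look like chapter headings
  tr -d '\f' |
    sed -E \
      -e '/^[[:space:]]*[0-9]+[[:space:]]*$/d' \
      -e 's/^[[:space:]]*((Chapter|CHAPTER)[[:space:]]+[0-9]+.*)$/# \1/'
}

# Usage: write to a new file so the raw OCR output stays untouched.
# clean_ocr_text < mybook.txt > mybook-clean.txt
```

Running headers (the book or chapter title repeated on every page) are too book-specific to guess at here; those are easiest to strip with a pattern of your own or by hand.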

Then:

pandoc mybook.txt -o mybook.epub
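
Your ebook reader will want a title and author to show. Pandoc's Markdown reader understands a title block (lines starting with % at the very top of the input), so one option is to prepend it on the way in. A sketch only; with_title_block is a made-up helper and the title and author below are placeholders:

```shell
# Hypothetical helper: emit a pandoc title block, then pass stdin through unchanged.
with_title_block() {
  # $1 = title, $2 = author; the blank line ends the title block
  printf '%% %s\n%% %s\n\n' "$1" "$2"
  cat
}

# Usage: pandoc reads Markdown from stdin when no input file is given.
# with_title_block "My Book" "Jane Doe" < mybook.txt | pandoc -f markdown -o mybook.epub
```

You could just as well type the two % lines at the top of mybook.txt by hand; the helper only saves you from editing the file.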
