@sterlingwes
Created January 31, 2016 02:05
Converting a scanned PDF to EPUB ebook (or other format)

Caveat

You're not going to get a beautiful EPUB out the other end - if that's what you're looking for, expect to do some manual clean-up work yourself.

Basic order of operations:

  • Convert your PDF to an OCR-friendly format
  • OCR that shit into plaintext
  • Convert that plaintext into your format of choice (in this case, an EPUB)

Tools of the trade:

  • Ghostscript (our PDF wrangler, or ImageMagick if you prefer)
  • Tesseract (open source OCR)
  • Pandoc

Install

These instructions assume you're running macOS with Homebrew installed.

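  • you may already have Ghostscript (try gs -v); if not, brew install ghostscript
  • brew install tesseract - this ships English data only; older Homebrew versions accepted --all-languages, but that option has since been removed, so add brew install tesseract-lang if you need other languages - then go get a snack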
  • brew install pandoc - this one's a beauty
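
Before going further, it can help to confirm all three tools are actually on your PATH. A minimal sketch: the helper name missing_tools is made up for this example, and it only checks that each command exists, not which version you have.

```shell
#!/bin/sh
# Hypothetical helper: print each command that cannot be found on PATH.
# Prints one missing command per line; prints nothing if all are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The three tools this guide relies on.
missing=$(missing_tools gs tesseract pandoc)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names print on a single line.
  echo "Not installed yet:" $missing
else
  echo "All tools found."
fi
```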

Convert it

We're going to convert our PDF into a multi-page TIFF. Keep in mind this is really only useful for PDFs that consist solely of scanned images. If your PDF already contains selectable text, rasterizing it throws that text away, so you're better off extracting the text directly and skipping OCR.

gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif myscan.pdf -c quit

Notes:

  • the -r flag controls the output resolution in DPI (300x300 here)
  • -sDEVICE=tiffg4 outputs 1-bit black & white (Group 4 fax compression), which will suffice for our EPUB needs

You should really man gs, though.
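
If you expect to retry with different settings, the command above can be wrapped in a small function. This is only a sketch: the name pdf_to_tiff and its argument order are invented here, and tiffgray (8-bit grayscale) is mentioned just as one alternative Ghostscript device you might try if black & white loses too much detail.

```shell
#!/bin/sh
# Hypothetical wrapper around the gs command shown above.
# Usage: pdf_to_tiff INPUT.pdf OUTPUT.tif [DPI] [DEVICE]
# DPI defaults to 300 and DEVICE to tiffg4, matching the original command.
pdf_to_tiff() {
  pdf=$1
  out=$2
  dpi=${3:-300}
  device=${4:-tiffg4}
  # -r300 sets both horizontal and vertical resolution, same as -r300x300.
  gs -q -r"$dpi" -dNOPAUSE -dBATCH -sDEVICE="$device" \
    -sOutputFile="$out" "$pdf"
}

# Examples (not run here):
#   pdf_to_tiff myscan.pdf mybook.tif
#   pdf_to_tiff myscan.pdf mybook.tif 400 tiffgray
```

Higher DPI tends to help OCR on small print, at the cost of a larger TIFF and a slower Tesseract run.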

Read it

Tesseract is going to read our image and spit out text - it's glorious.

tesseract -l eng mybook.tif mybook

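(-l eng tells Tesseract that mybook.tif contains English text; the last argument is the output name, and Tesseract appends .txt, so you end up with mybook.txt)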
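
For books in more than one language, Tesseract accepts several language codes joined with a plus sign, and tesseract --list-langs shows which ones are installed. A small sketch, where join_langs is a made-up helper and eng and fra are just example codes:

```shell
#!/bin/sh
# Hypothetical helper: join language codes with "+" for tesseract's -l option.
# The function body runs in a subshell so changing IFS doesn't leak out.
join_langs() (
  IFS=+
  printf '%s\n' "$*"
)

# Examples (not run here):
#   tesseract --list-langs                                    # see what is installed
#   tesseract -l "$(join_langs eng fra)" mybook.tif mybook    # English + French
```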

Massage it

This is the fun / awful part (depending on your personality). The OCR output won't contain any structure (no chapters, no headings), and structure is exactly what an ebook needs.

Pandoc is quite awesome at converting Markdown to EPUB, and it handles many other formats too, but I'd stick with Markdown. Basically you'll want to skim your mybook.txt file and throw a # in front of chapter headers, remove extraneous text (i.e. page headers & footers), and add in any relevant images (by default pandoc looks for images relative to the directory you run it from, so run it next to your txt file, and it bundles them into the EPUB!).
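
Some of that clean-up can be given a head start with sed. The sketch below assumes two things about your OCR output that may not hold for your book: page numbers sit alone on a line, and chapter headers start with "Chapter" or "CHAPTER" followed by a number or word. The function name clean_ocr is made up, and you'll still want to read through the result by hand.

```shell
#!/bin/sh
# Hypothetical first-pass cleaner; reads OCR text on stdin, writes to stdout.
#   1. Deletes lines that contain only digits (assumed to be page numbers).
#   2. Prefixes lines that start with "Chapter <something>" with "# ".
clean_ocr() {
  sed -E \
    -e '/^[[:space:]]*[0-9]+[[:space:]]*$/d' \
    -e 's/^[[:space:]]*((Chapter|CHAPTER)[[:space:]]+[[:alnum:]]+.*)$/# \1/'
}

# Example (not run here): write to a new file so the raw OCR text is kept.
#   clean_ocr < mybook.txt > mybook-clean.txt
```

Note that a paragraph which happens to begin with "Chapter 3 covers..." would also get a # under these rules, which is one more reason to skim the output.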

Then:

pandoc mybook.txt -o mybook.epub
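
If you want a title and a table of contents in the result, pandoc has options for both. A hedged sketch: make_epub is an invented wrapper, the title is a placeholder, and -f markdown simply spells out the input format rather than leaving pandoc to guess it from the .txt extension.

```shell
#!/bin/sh
# Hypothetical wrapper: build an EPUB next to the source file.
# Usage: make_epub SOURCE.txt "Book Title"
make_epub() {
  src=$1
  title=$2
  # ${src%.*} strips the extension, so mybook.txt becomes mybook.epub.
  pandoc -f markdown "$src" -o "${src%.*}.epub" \
    --metadata title="$title" --toc
}

# Example (not run here):
#   make_epub mybook.txt "My Book"
```

The --toc table of contents is built from your headings, so it is only as good as the # markers you added in the previous step.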

@malikbenkirane

malikbenkirane commented Apr 23, 2023

brew install tesseract --all-languages

The --all-languages flag is no longer supported by brew.

https://tesseract-ocr.github.io/tessdoc/Installation.html#homebrew

==> Caveats
==> tesseract
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.

So, depending on your needs:

brew install tesseract tesseract-lang

@grvm

grvm commented May 26, 2023

For Arch Linux & its derivatives, this is how you can install the needed utilities. The commands remain the same.

sudo pacman -S ghostscript tesseract tesseract-data-eng pandoc-cli

tesseract-data-eng adds the English language data. These are all the available language packages:

tesseract-data-afr       tesseract-data-eus       tesseract-data-khm       tesseract-data-sin
tesseract-data-amh       tesseract-data-fao       tesseract-data-kir       tesseract-data-slk
tesseract-data-ara       tesseract-data-fas       tesseract-data-kmr       tesseract-data-slk_frak
tesseract-data-asm       tesseract-data-fil       tesseract-data-kor       tesseract-data-slv
tesseract-data-aze       tesseract-data-fin       tesseract-data-kor_vert  tesseract-data-snd
tesseract-data-aze_cyrl  tesseract-data-fra       tesseract-data-lao       tesseract-data-spa
tesseract-data-bel       tesseract-data-frk       tesseract-data-lat       tesseract-data-spa_old
tesseract-data-ben       tesseract-data-frm       tesseract-data-lav       tesseract-data-sqi
tesseract-data-bod       tesseract-data-fry       tesseract-data-lit       tesseract-data-srp
tesseract-data-bos       tesseract-data-gla       tesseract-data-ltz       tesseract-data-srp_latn
tesseract-data-bre       tesseract-data-gle       tesseract-data-mal       tesseract-data-sun
tesseract-data-bul       tesseract-data-glg       tesseract-data-mar       tesseract-data-swa
tesseract-data-cat       tesseract-data-grc       tesseract-data-mkd       tesseract-data-swe
tesseract-data-ceb       tesseract-data-guj       tesseract-data-mlt       tesseract-data-syr
tesseract-data-ces       tesseract-data-hat       tesseract-data-mon       tesseract-data-tam
tesseract-data-chi_sim   tesseract-data-heb       tesseract-data-mri       tesseract-data-tat
tesseract-data-chi_tra   tesseract-data-hin       tesseract-data-msa       tesseract-data-tel
tesseract-data-chr       tesseract-data-hrv       tesseract-data-mya       tesseract-data-tgk
tesseract-data-cos       tesseract-data-hun       tesseract-data-nep       tesseract-data-tgl
tesseract-data-cym       tesseract-data-hye       tesseract-data-nld       tesseract-data-tha
tesseract-data-dan       tesseract-data-iku       tesseract-data-nor       tesseract-data-tir
tesseract-data-dan_frak  tesseract-data-ind       tesseract-data-oci       tesseract-data-ton
tesseract-data-deu       tesseract-data-isl       tesseract-data-ori       tesseract-data-tur
tesseract-data-deu_frak  tesseract-data-ita       tesseract-data-osd       tesseract-data-uig
tesseract-data-div       tesseract-data-ita_old   tesseract-data-pan       tesseract-data-ukr
tesseract-data-dzo       tesseract-data-jav       tesseract-data-pol       tesseract-data-urd
tesseract-data-ell       tesseract-data-jpn       tesseract-data-por       tesseract-data-uzb
tesseract-data-eng       tesseract-data-jpn_vert  tesseract-data-pus       tesseract-data-uzb_cyrl
tesseract-data-enm       tesseract-data-kan       tesseract-data-que       tesseract-data-vie
tesseract-data-epo       tesseract-data-kat       tesseract-data-ron       tesseract-data-yid
tesseract-data-equ       tesseract-data-kat_old   tesseract-data-rus       tesseract-data-yor
tesseract-data-est       tesseract-data-kaz       tesseract-data-san
