@Chubek
Created March 29, 2024 14:29
This is, hands-down, the best way to convert PDFs to EPUB (or any other format)

This document describes several shell pipelines for converting PDF files to any format.

I'm not sure if it's true for everyone, but my e-reader sucks at displaying PDF, which is, in all reality, a giant executable file (we'll discuss this soon). There are also dozens of other reasons one may wish to convert a PDF to a better 'text format'. Let's say you wanna put it up on your website, feed it to a mathematical optimization model, feed it to a script, etc.

Before you read this document: yes, I know there is a utility, nay, dozens, that convert PDFs directly to text (like pdftotext). I ALSO know that there are millions, if not BILLIONS, of crappy web services that serve you malware on a platter alongside converting the files. So let's not talk about them! It's about "owning" your software, so read this!

What are PDF Files?

This is not meant to be a description or history of PDF files; you can consult Sahih Al-Bukhari for that. But to give context, and pretext, and subtext to this document:

PDF files are the spiritual successor to Adobe PostScript. I believe they started their life as 'executable, portable PostScript' files, hence the 'P' standing for 'Portable'. But over the years, as the format was standardized, first by Adobe itself and later by ISO (which currently issues the standard as ISO 32000; the latest version is PDF 2.0, AFAIK), it became a thing of its own making.

PDF files are explosive. Just watch this video:

https://www.youtube.com/watch?v=54XYqsf4JEY

It's very long. But worth the watch. So now that you are fully based & anti-pdf-pilled, let me explain how you can maintain your PDF archive using its daddy, PostScript.

What are PostScript Files?

PostScript was created in 1982 by Adobe, and it's the only worthwhile thing to come out of Adobe, a diamond in a sea of corporate shit. PDF is a binary format. It 'is' a language, mind you, but it 'is' a binary format: it has an ISO standard, a registered MIME type (application/pdf), and a magic signature.

I will use new-c.pdf as the first example in this document. Let's try this out.

$ wget https://c9x.me/compile/bib/new-c.pdf
$ file new-c.pdf
=>
new-c.pdf: PDF document, version 1.4, 12 pages

So PDF is kinda like a 'binary document'. But PostScript files are plain text files. They are regular, and most often ASCII, text files. And you can convert your PDF files to PostScript as easy as 1-2-3, as I explain below.
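If you want to see the 'magic signature' for yourself, here is a minimal sketch. It only assumes that PDF files begin with the bytes %PDF and PostScript files with %!PS. The function name sniff_format is made up for this example, and head -c is assumed to be available (it is on GNU and BSD systems).

```shell
# sniff_format FILE
# Hypothetical helper: guess the format from the first four bytes only.
sniff_format() {
    sniff_magic=$(head -c 4 "$1" 2>/dev/null)
    case "$sniff_magic" in
        '%PDF') echo pdf ;;
        '%!PS') echo postscript ;;
        *)      echo unknown ;;
    esac
}
```

Running `sniff_format new-c.pdf` should agree with what `file` reports, just with far less detail.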

To me, PostScript is far more 'portable' than PDF. So I don't get why the hell PDF is the 'portable' format?

PostScript is a stack language. The only 'current' PostScript interpreter today is Ghostscript. But we won't need it, at least not 'directly'. We'll use other utilities, some of which invoke it indirectly.

Converting PDF to PostScript

There are two utilities that convert PDF to PostScript. One is pdf2ps (a wrapper script shipped with Ghostscript), the other is pdftops (part of Xpdf and Poppler). Sometimes one works better than the other, so be vigilant and try out both.

When you invoke pdf2ps or pdftops, a .ps file of the same name is created in that directory, unless you specify the output file.

$ pdftops new-c.pdf
$ file new-c.ps
=>
new-c.ps: PostScript document text conforming DSC level 3.0, Level 2

Let's not delve into 'level 3.0' and 'Level 2' here. I don't know much about it, but AFAIK the first is the version of the Document Structuring Conventions (DSC) and the second is the PostScript language level, of which there are several. I don't care at this moment. The few times that I needed to directly interface with PS, I used libps. You seriously don't need to know PS at all to use it when libps exists. Of course, libps works one way around, not the other way around.
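The 'try out both' advice can be automated. This is just a sketch: the function name pdf_to_ps and its messages are invented here, and it assumes both tools accept an input file followed by an output file, which is how their man pages describe them.

```shell
# pdf_to_ps INPUT.pdf [OUTPUT.ps]
# Hypothetical helper: try pdftops first, fall back to pdf2ps.
pdf_to_ps() {
    p2p_in=$1
    # Default output name: same path with .pdf swapped for .ps
    p2p_out=${2:-${p2p_in%.pdf}.ps}
    for p2p_tool in pdftops pdf2ps; do
        # Skip tools that are not installed
        command -v "$p2p_tool" >/dev/null 2>&1 || continue
        if "$p2p_tool" "$p2p_in" "$p2p_out"; then
            echo "$p2p_tool: wrote $p2p_out"
            return 0
        fi
    done
    echo "pdf_to_ps: could not convert $p2p_in" >&2
    return 1
}
```

A successful exit status does not mean the output looks good, so it is still worth opening the .ps file and comparing what each tool produces.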

Converting the PostScript files to Plain Text Files

I know two utilities that do this. One is ps2ascii and the other one is pstotext. For some reason, the latter has never worked for me. So let's just use ps2ascii.

$ ps2ascii new-c.ps > new-c.ascii
$ file new-c.ascii
=>
new-c.ascii: Nim source code, Unicode text, UTF-8 text, with CRLF line terminators
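Notice that `file` reports CRLF line terminators. If those bother you, a small wrapper can strip the carriage returns on the way out. This is my own addition, not something the pipeline needs; ps_to_text is a made-up name, and it relies on ps2ascii writing to standard output when given only an input file.

```shell
# ps_to_text INPUT.ps
# Hypothetical helper: extract text and normalize CRLF to LF.
ps_to_text() {
    ps2ascii "$1" | tr -d '\r'
}
```

Usage would be `ps_to_text new-c.ps > new-c.ascii`, same as before but with Unix line endings.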

Converting the Plain Text Files to Other Formats with pandoc

Now, let's move on from ancient formats to modern formats. What if I want to convert this ASCII file to, say, Markdown, or HTML?

Well, this is why, at the beginning, I said pdftotext is useless: it does not keep the formatting, at least not by default. This method does keep the formatting of the text. If you look at new-c.ascii, you'll see that the indentation and some of the formatting are all there.

pandoc, the pansexual document vore, can easily convert plain text to dozens of formats. I won't discuss pandoc itself here, so just install it if you have not; one simply could not live without it.

Let's say we wish to convert new-c.ascii to new-c.md:

$ pandoc -f markdown -t commonmark -o new-c.md new-c.ascii

And that is basically it.
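Put together, the three steps fit in one function. Treat this as a sketch under assumptions: pdf_to_md is a name I made up, the intermediate .ps goes to a temporary file created with mktemp, and pandoc reads the text as Markdown (its default input format).

```shell
# pdf_to_md INPUT.pdf [OUTPUT.md]
# Hypothetical helper: PDF -> PostScript -> plain text -> CommonMark.
pdf_to_md() {
    ptm_in=$1
    ptm_out=${2:-${ptm_in%.pdf}.md}
    ptm_ps=$(mktemp) || return 1
    # Only run the text and pandoc stages if the PostScript step worked
    pdftops "$ptm_in" "$ptm_ps" &&
        ps2ascii "$ptm_ps" |
        pandoc -f markdown -t commonmark -o "$ptm_out"
    ptm_status=$?
    # Always clean up the intermediate PostScript file
    rm -f "$ptm_ps"
    return "$ptm_status"
}
```

With this defined, `pdf_to_md new-c.pdf` should leave a new-c.md next to the PDF.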

Options, Mix-and-Matching Formats

You can basically skip pandoc if you wish. You can easily convert PS to many formats directly. I have utilities for converting PS to HTML, LaTeX, and so on.

You can also convert PS to, say, HTML or LaTeX, then use pandoc to make Markdown.

So these are just possibilities.
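Since the title promises EPUB, here is one such possibility as a sketch. It leans on pandoc picking the output format from the file extension (.epub, .html, .tex and so on). The name text_to is made up, and setting the title from the file name is my own assumption, there only because pandoc complains about EPUBs without a title.

```shell
# text_to INPUT EXT
# Hypothetical helper: convert a text file to INPUT-with-new-extension,
# letting pandoc infer the output format from EXT (e.g. epub, html, tex).
text_to() {
    tt_in=$1
    tt_ext=$2
    tt_base=${tt_in%.*}
    tt_out=$tt_base.$tt_ext
    pandoc -f markdown -M title="$(basename "$tt_base")" \
        -o "$tt_out" "$tt_in" &&
        echo "$tt_out"
}
```

So `text_to new-c.ascii epub` would write new-c.epub, and `text_to new-c.ascii html` would write new-c.html.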

This was just a short document on the capabilities of PostScript. Enjoy.
