@rauschma
Last active July 22, 2021 13:15

Using Pandoc to publish a book in multiple file formats: experiences and wishes

Pandoc was essential for publishing my book “JavaScript for impatient programmers”. The book exists in several versions:

  • Printable PDF (for a print-on-demand book on Amazon)
  • Screen PDF
  • Multi-page HTML
  • EPUB
  • MOBI

The homepage of “JavaScript for impatient programmers” contains previews of all artifacts.

In this document, I describe some of the challenges I’ve encountered while working on the book.

Limitations of HTML output

HTML output is missing several important features. All of them are supported when using LaTeX via Pandoc:

I wrote Lua filters to work around these limitations (excluding frontmatter), but it wasn’t easy.

Multi-file output

For HTML, I needed Pandoc to produce multiple files (kind of like the internals of the EPUBs that it produces).

The easiest workaround was to generate a single long HTML file and split it up, while updating cross-file links so that they also include filenames.
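A minimal sketch of such a splitting step, in Python rather than the tooling actually used for the book. It assumes that every chapter starts with an `<h1>` element, that Pandoc has given headings and link targets `id` attributes, and that only the body content is passed in (wrapping each piece in a page template is left out). The function name `split_html` and the `index.html` / `chapterN.html` fallback names are made up for illustration.

```python
import re

H1_OPEN = re.compile(r"<h1\b")
H1_ID = re.compile(r'<h1\b[^>]*?\sid="([^"]+)"')
ANY_ID = re.compile(r'\sid="([^"]+)"')
FRAGMENT_LINK = re.compile(r'href="#([^"]+)"')


def split_html(body: str) -> dict[str, str]:
    """Split `body` before every <h1> and make cross-file links include filenames."""
    starts = [m.start() for m in H1_OPEN.finditer(body)]
    pieces = []
    if not starts or starts[0] > 0:
        # Content before the first <h1> (or everything, if there is no <h1>).
        pieces.append(("index.html", body[: starts[0]] if starts else body))
    for number, (begin, end) in enumerate(zip(starts, starts[1:] + [len(body)]), 1):
        chunk = body[begin:end]
        match = H1_ID.match(chunk)
        # Name the file after the chapter heading's id, if it has one.
        name = match.group(1) if match else f"chapter{number}"
        pieces.append((f"{name}.html", chunk))

    # Record which file each id ended up in.
    owner = {}
    for filename, chunk in pieces:
        for element_id in ANY_ID.findall(chunk):
            owner[element_id] = filename

    def relink(filename: str, chunk: str) -> str:
        def fix(match: re.Match) -> str:
            target = owner.get(match.group(1), filename)
            # Links within the same file keep their bare "#id" form.
            prefix = "" if target == filename else target
            return f'href="{prefix}#{match.group(1)}"'

        return FRAGMENT_LINK.sub(fix, chunk)

    return {filename: relink(filename, chunk) for filename, chunk in pieces}
```

Regular expressions are enough here only because the HTML is machine-generated and regular; for arbitrary HTML, a real parser would be the safer choice.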

Related:

--extract-media paths: optionally relative to output (vs. relative to input)?

Problem

  • On one hand, we need to be in the same directory as the content, so that \includepdf{} works with relative paths.

    • An important additional benefit of relative paths: \includepdf{} can’t handle spaces in paths (which are common in absolute paths on macOS). That’s a bug that’s new in the latest version of XeLaTeX: https://github.com/ho-tex/oberdiek/issues/31
    • Alas, Pandoc’s intermediate LaTeX output also has to sit next to the content. That’s a weakness of LaTeX, not of Pandoc.
  • On the other hand, the path passed to --extract-media assumes that we are inside the output directory:

    • pandoc --standalone -o ../out/chapter.html --extract-media=../out/img chapter.md
    • Input: ![](img/diagram.svg)
    • Actual output: <img src="../out/img/08830082d8bd4c323da4ec4f51fb2a20b2dcaae7.svg" />
    • Desired output: <img src="img/08830082d8bd4c323da4ec4f51fb2a20b2dcaae7.svg" />
proj/
  content/
    chapter.md
    img/
      diagram.svg
  out/
    chapter.html
    img/
      08830082d8bd4c323da4ec4f51fb2a20b2dcaae7.svg

The workaround that I have used

  • Choose a long unique name for the extracted media directory.
  • Search-and-replace in the produced HTML output and copy the extracted directory into the output location.
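The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used for the book: it assumes that Pandoc was run from the content directory with `--extract-media` set to just the unique directory name, so that this name appears verbatim in the generated HTML. The function name `relocate_media` and the default target name `img` are hypothetical.

```python
import shutil
from pathlib import Path


def relocate_media(html_file: Path, extracted_dir: Path, target_name: str = "img") -> None:
    """Point `html_file` at `target_name/` and copy the extracted media next to it."""
    html = html_file.read_text(encoding="utf-8")
    # The directory name is long and unique, so a plain string replacement
    # is unlikely to touch anything other than the media paths.
    html = html.replace(f"{extracted_dir.name}/", f"{target_name}/")
    html_file.write_text(html, encoding="utf-8")
    # Copy the extracted files into the output location (requires Python 3.8+).
    shutil.copytree(extracted_dir, html_file.parent / target_name, dirs_exist_ok=True)
```

With the directory layout shown earlier, `html_file` would be `../out/chapter.html` and `extracted_dir` the uniquely named directory inside `content/`.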

How Pandoc could be changed to fix this problem

Introduce a different mode for --extract-media where paths to extracted files are relative to the output file. This different mode could be switched on via:

  • An option that otherwise does the same as --extract-media, but computes paths differently: --extract-media-relative-to-output
  • A separate option for specifying how --extract-media computes its paths:
    • --extract-media-mode=relative-to-input
    • --extract-media-mode=relative-to-output

If other options work similarly to --extract-media, it may make sense to introduce an option that works for all of them (instead of just for --extract-media).
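To make the proposed “relative to output” mode concrete, here is a small Python sketch of the path computation it would need. It is only an illustration of the idea, not Pandoc code; the function name `src_relative_to_output` is made up, and both arguments are assumed to be POSIX-style paths relative to the same working directory.

```python
import posixpath


def src_relative_to_output(extracted_file: str, output_file: str) -> str:
    """Return the path of `extracted_file` as seen from the directory of `output_file`."""
    output_dir = posixpath.dirname(output_file) or "."
    return posixpath.relpath(extracted_file, start=output_dir)
```

For the example above, `src_relative_to_output("../out/img/08830082d8bd4c323da4ec4f51fb2a20b2dcaae7.svg", "../out/chapter.html")` yields the desired `img/08830082d8bd4c323da4ec4f51fb2a20b2dcaae7.svg`.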

Related:

Working with the filter API

Wishes:

  • At the moment, filters visit inlines in a separate pass. (This is a known problem and is being worked on.)
  • Should Pandoc ever support cross-format numbering of headings, filters would benefit from having access to the numbers of headings.

I’ve found Lua difficult to work with (tables and output are frustratingly limited, etc.). I originally wanted to publish my Lua filters, but they don’t feel robust enough for that. My plan is to eventually rewrite the filters in Haskell, Rust, or TypeScript and then publish them.

Various other wishes

  • The filter pandoc-crossref is important for supporting LaTeX’s floating images and tables for all output formats. It allows you to refer to them elegantly. It’d be great if this functionality could be built into Pandoc.

  • For images, I’m making a distinction:

    • Bitmap graphics (same across all file formats): .jpg, .png
    • Vector graphics (format-specific): no filename extension. The filename extension is then specified via --default-image-extension:
      • PDF: .pdf
      • EPUB, HTML: .svg
      • MOBI (via intermediate EPUB): .jpg

    Minor inconvenience: When previewing the Markdown in an editor, you don’t see the vector graphics. I’m not sure how to best fix this. Maybe with a mapping of image extensions: --image-extension-replace=svg/pdf (i.e., use .svg in Markdown input, but .pdf in PDF output).
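The following Python sketch shows how such an extension mapping could resolve image paths per output format. It covers both the current setup (vector graphics without a filename extension) and the wished-for one (`.svg` in the Markdown input). It is an illustration only: the function `resolve_image` and the format keys are hypothetical, and `--image-extension-replace` is a proposal, not an existing Pandoc option.

```python
import posixpath

# Format-specific extension for vector graphics, as listed above.
VECTOR_EXTENSION = {"pdf": ".pdf", "epub": ".svg", "html": ".svg", "mobi": ".jpg"}
# Bitmap graphics are the same across all output formats.
BITMAP_EXTENSIONS = {".jpg", ".png"}


def resolve_image(path: str, target: str) -> str:
    """Return the image path to use for the output format `target`."""
    stem, extension = posixpath.splitext(path)
    if extension.lower() in BITMAP_EXTENSIONS:
        return path
    if extension in ("", ".svg"):
        # No extension: current setup. ".svg": the proposed mapping.
        return stem + VECTOR_EXTENSION[target]
    return path
```

With this mapping, `img/diagram.svg` would stay previewable in a Markdown editor and still become `img/diagram.pdf` in PDF output.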

Additional filters that I wrote

  1. A filter that converts links to page numbers (use case: print PDF):

    • Input: This phenomenon is called [_hoisting_](#hoisting).
    • Output: This phenomenon is called hoisting.
    • Print (no link, page number via LaTeX): This phenomenon is called hoisting (page 392).
  2. References that mention the section number and section title:

  3. Inserting breaks into inline code (to fix overflow problems in LaTeX):

    `Desc•.[[Con•fig•urable]]`{.break}
  4. Linking to inline IDs doesn’t work in LaTeX. Workaround supported by filter:

    [Hoisting]{#hoisting .texlabel} is an important term in this context.

    UPDATE: fixed in master

  5. Information boxes (“tip”, “warning”, etc.). Examples: https://exploringjs.com/impatient-js/ch_faq-book.html#notations-and-conventions
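As an illustration of the first filter in this list, here is a rough Python sketch that works on Pandoc’s JSON representation of a document. It is not the Lua filter used for the book. It assumes that the link target (`#hoisting`) matches a `\label` in the generated LaTeX, which holds for headings but may not hold for other targets; the function names are made up.

```python
def is_internal_link(node) -> bool:
    """True for a Pandoc JSON `Link` inline whose URL is a fragment such as `#hoisting`."""
    return (
        isinstance(node, dict)
        and node.get("t") == "Link"
        and node["c"][2][0].startswith("#")
    )


def links_to_page_refs(node):
    """Replace internal links with their text followed by a LaTeX page reference."""
    if isinstance(node, list):
        result = []
        for item in node:
            if is_internal_link(item):
                _attr, inlines, (url, _title) = item["c"]
                # Keep the link text (including emphasis), drop the link itself.
                result.extend(links_to_page_refs(inlines))
                result.append({"t": "Space"})
                # Assumption: the fragment without "#" is a valid LaTeX label.
                latex = "(page~\\pageref{" + url[1:] + "})"
                result.append({"t": "RawInline", "c": ["latex", latex]})
            else:
                result.append(links_to_page_refs(item))
        return result
    if isinstance(node, dict):
        return {key: links_to_page_refs(value) for key, value in node.items()}
    return node
```

To use this as a JSON filter, a script would read the document from standard input with `json.load(sys.stdin)`, pass it through `links_to_page_refs`, and write the result to standard output with `json.dump`. Pandoc’s `--filter` option runs such a script.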

Conclusion

In general, I loved working with Pandoc. Its filters, in particular, make it a flexible and powerful tool. It’s impressive how well they work.

The following features helped with creating the print PDF:

  • Black & white syntax highlighting
  • The option to convert links into footnotes

Further reading
