My Pandoc Markdown-PDF Workflow for Routine, Not Especially Technical, Writing

by Danny Quah, Aug 2020 (revised Jan 2022)

TL;DR: I write technical articles in LaTeX. But shorter, non-technical writings are easier to do in Markdown. How do I produce PDF from Markdown documents? Answer: provide YAML information in the Markdown; run Pandoc (typically through a Makefile or Atom's Markdown Preview Enhanced). To make all this work, some adjustment is needed in Pandoc options and template files.

Pandoc is a filter that takes a document written in one format and produces a version of that same document in a different format. I use Pandoc primarily to transform Markdown documents to PDF, but I also draw on it to convert Word or ODT documents to Markdown, and vice versa.

The official Pandoc documentation is voluminous. So, as a matter of logic, the knowledge needed to generate PDF from Markdown, to whatever degree of control the user desires, is already out there somewhere. But a user just beginning might not find a good starting point and, without the ability to produce something useful quickly to show for their efforts, can lose the incentive to discover more, experiment, and improve. This writeup provides such a starting point for the beginner.

Pandoc Basics

To do its job, Pandoc has many options available from the command line. Using those can be as easy as:

pandoc --read=odt --write=markdown oldarchive.odt -o mydocument.md

In this example, the input is the Open Document format file oldarchive.odt, and the output is the Markdown document mydocument.md. Alternatively, I might have given the instruction:

pandoc --read=markdown --write=docx oldarchive.md -o mydocument.docx

where now the input is the Markdown file oldarchive.md and the output is the Word document mydocument.docx.

The conversion will never be perfect, but in many cases the result provides a fine starting point for further fine-tuning.
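When there is a whole folder of old files to convert, the same call can be wrapped in a loop. What follows is a minimal sketch, assuming a POSIX shell with pandoc on the PATH; the function name odt_to_markdown and its directory argument are made up for illustration.

```shell
# Convert every .odt file in a directory to Markdown, one pandoc call per file.
# odt_to_markdown is an illustrative name, not something pandoc provides.
odt_to_markdown() (
  dir="${1:-.}"
  for src in "$dir"/*.odt; do
    # When no .odt file exists the glob stays literal, so skip it.
    [ -e "$src" ] || continue
    # Same options as the single-file example; output sits next to the input.
    pandoc --read=odt --write=markdown "$src" -o "${src%.odt}.md"
  done
)
```

Calling odt_to_markdown ~/archive (a placeholder path) would leave a .md file beside each .odt file there. Each result still deserves a quick look, since the conversion is a starting point rather than a finished document.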

Underlying Attributes

With conversion into odt or docx files, what you see when you render or display the file on-screen is also approximately what you will see when you print the file into PDF or hardcopy. In many Pandoc applications, that is all the user wants. The user, or their collaborators, will thereafter seek to make further changes only by editing the generated odt or docx file.

I myself prefer to work with or edit Markdown but people I work with insist on using Word. So I convert my Markdown file to docx, and thereafter that latter document is what we hand back and forth in our editing.

When the output is PDF, however, the result is no longer to be edited. Or, more accurately, to change the output file, users do not operate on the output file directly. Instead, a user will go back to the input file, make the alterations there, and then re-execute Pandoc to produce a new PDF. The PDF is literally just for show.

(To be clear, current technology does allow editing of a PDF file, either with special tools or by converting it to docx and then editing that. This, however, is different from what I mean by going back to edit the input file and then re-generating the PDF.)

This workflow with a PDF endpoint will be familiar to TeX users. With Pandoc, however, it is not just TeX, LaTeX, or their relatives that can provide input for beautifully structured PDFs. Markdown documents can as well. Since Markdown is lighter-weight than TeX, it will not, of course, do everything that the latter can. But for many routine, not especially technical, documents, Markdown is a fine alternative to TeX as an input format. Behind the scenes, the Markdown document is first translated into TeX (or LaTeX) code, which is then fed into TeX (or LaTeX) itself. This intermediate step is invisible to the user.

(A purist who insists on writing everything in TeX might remember that LaTeX stands in relation to TeX much as I've described for Markdown in relation to LaTeX or TeX. My view is that for much routine, not especially technical, writing, Markdown is already perfectly serviceable. Pandoc for writers can be viewed as akin to lex and yacc for programmers. Who wants to code their own lexical analyzer in C from scratch every single time?)

Here, however, is both opportunity and potential pitfall. LaTeX and TeX are not translators that operate character by character or paragraph by paragraph. Instead, they work through structure: this is what allows changing a single option in LaTeX to alter the entire look of the document. But how does structural information get conveyed from Markdown to LaTeX? Where in a Markdown document is encoded, in a single place, the information that, say, every second-level heading is followed by a medium skip (\medskip in LaTeX)? Or that all figures should float but be placed towards the top of the page closest to where they are first referenced?

This last feature might be observed to be empirically true in a specific document, but was that the intention of the author? Or did the figures just happen to come out that way? How can the author convey the information on what they intend here, in a way that Pandoc and thereafter LaTeX can use?

The answer is two-fold: first, through YAML information; second, through template files.

Markdown to PDF via LaTeX

For generating PDF, Pandoc will, behind the scenes, call on LaTeX or a related program. Pandoc can do all this invisibly, but those programs obviously need to be installed on your system. In this flow, the Markdown document conveys the desired structure of the PDF output through directives written not in Markdown itself but in another language that a Markdown file is able to carry.

Any Markdown document can begin with YAML information, i.e., a section that starts and ends with three dashes in sequence on a line by themselves. Between those beginning and ending three dashes, individual lines contain key-value pairs that provide structural information on the document. Thus, for instance, a simple YAML header might be:

fileName: Pandoc-2020.08.md
# Last-edited: Sun 2020.08.09.1841 -- Danny Quah (me@DannyQuah.com)
Type: Notes
Tags: Software
# Created: Sun 2020.08.09.1517 -- Danny Quah (me@DannyQuah.com)

(preceded by and ending with the three-dash lines, of course). Like Python, YAML takes whitespace indentation to be significant, so don't try to prettify your file by introducing extraneous white spaces at the beginning of a YAML line. Comments are introduced by the # symbol, and are ignored by the processor.

Markdown rendering ignores both #-introduced comments and YAML key-value pairs. Indeed, on most systems, all YAML information is ignored by Markdown. If you have a file containing just the above lines in, say, the file file.md, opening this file with a Markdown previewer typically shows no content to display. Typora will open file.md and display the YAML header, but in non-editable form. GitHub does similarly. (So, to edit YAML, you'll need to open the file with a text editor like Vim or Atom.)

But if Markdown ignores YAML, what is the point? Here is what's critical: the YAML is used when Pandoc and LaTeX come together to generate PDF from Markdown. Pandoc reads the YAML, translates the key-value pairs into LaTeX directives by way of its template, and then ships everything off to LaTeX, which now has all the input it needs to structure the output.
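One way to watch this translation happen is to ask Pandoc for the intermediate LaTeX and search it for values taken from the header. This is a minimal sketch under a few assumptions: the file names demo.md and demo.tex are placeholders, and the exact LaTeX lines that turn up will depend on the Pandoc version and template in use. It also shows the three-dash lines in place.

```shell
# Write a small Markdown file whose YAML header sits between two --- lines.
# demo.md and demo.tex are placeholder names chosen for this sketch.
cat > demo.md <<'EOF'
---
title: Readable Title for My Article
fontsize: 12pt
documentclass: article
---

Some body text.
EOF

if command -v pandoc >/dev/null 2>&1; then
  # --standalone asks for a complete LaTeX document, preamble included.
  pandoc --standalone --read=markdown --write=latex demo.md -o demo.tex
  # List the lines of the generated LaTeX that carry the header values.
  grep -n -e '12pt' -e 'article' -e 'Readable Title' demo.tex
else
  echo "pandoc not found; skipping the conversion" >&2
fi
```

Changing fontsize in demo.md and re-running should change the matching line in demo.tex, which is the structural, set-it-in-one-place control described above.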

Thus, in many of my Markdown files destined for PDF output, the YAML header contains:

## Front Matter
title: Readable Title for My Article
author:
  - name: Danny Quah
    affiliation: Lee Kuan Yew School of Public Policy, NUS
    email: D.Quah@nus.edu.sg
    number: 1
  - name: My Coauthor
    affiliation: Economics Department, NUS
    email: ecsdqlsh@nus.edu.sg
    number: 2
date: June 2020
# abstract:
# keywords:
# thanks:

## Formatting
fontsize: 12pt
# mainfont: "gentium" # See https://fonts.google.com/ for fonts
# sansfont: "Raleway"
# monofont: "IBM Plex Mono"
mathfont: ccmath
# fontfamily: concrete | gentium | libertine
# documentclass: article | scrartcl
fontfamily: concrete
documentclass: article
classoption:
 - notitlepage
 - onecolumn
fontenc: T1
geometry:
 - a4paper
 - top=35mm
 - left=30mm
 - heightrounded
header-includes:
 - |
  ```{=latex}
  \usepackage{amsmath,amsfonts,euscript,tikz,fancyhdr,float}
  \floatplacement{figure}{H}
  \```
pagestyle: headings

(All of this goes somewhere between the beginning and ending three-dash --- lines. Also, the line right after \floatplacement{figure}{H} should contain just three indented backquotes, but I'm having trouble getting GitHub's Markdown processor to render it that way rather than as the premature end of my code block. Here, I've written that sequence with a leading backslash instead; that backslash needs to be removed in production code.)

This is almost all it takes to generate sensible-looking PDF from my Markdown document. The problem that remains is the author key-value pair. The \maketitle command in LaTeX understands only author, not separate affiliation and email keys. Thus, the author information as I have written it above, with affiliation and email given explicitly, will fail with the standard Pandoc latex template. For that template, the YAML key-value pair instead needs to be

author:
  - Danny Quah `\\\\`{=latex} Lee Kuan Yew School of Public Policy, NUS `\\\\`{=latex} D.Quah@nus.edu.sg

To add a second author, write:

author:
  - Danny Quah `\\\\`{=latex} Lee Kuan Yew School of Public Policy, NUS `\\\\`{=latex} D.Quah@nus.edu.sg
  - My Coauthor `\\\\`{=latex} Economics Department, NUS `\\\\`{=latex} ecsdqlsh@nus.edu.sg

This works for me.

It can be neater, however, to separate out affiliation and email information explicitly, as in the YAML above. To use that, the latex template that Pandoc uses will need to be modified.

First, generate the default latex template that will subsequently be changed:

pandoc -D latex > mytemplate.tex

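I put mytemplate.tex in ~/.pandoc/templates/. (Recent Pandoc versions default to ~/.local/share/pandoc/templates/ and use ~/.pandoc/ only when that newer directory does not exist.) Templates in this personal folder are found by name alone when given to the Pandoc option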

--template=mytemplate.tex
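The two steps so far, creating the folder and dumping the default template into it, can be collected into one small helper. This is only a sketch: install_template is a made-up name, and the default target is the ~/.pandoc/templates location used in this writeup, so pass a different directory if your Pandoc looks elsewhere.

```shell
# install_template is an illustrative helper, not a pandoc feature.
# Its optional argument is the templates directory to write into.
install_template() (
  dest="${1:-$HOME/.pandoc/templates}"
  mkdir -p "$dest"
  # -D prints pandoc's built-in default template for the given output format.
  pandoc -D latex > "$dest/mytemplate.tex" &&
    printf 'wrote %s\n' "$dest/mytemplate.tex"
)
```

Running install_template with no argument writes ~/.pandoc/templates/mytemplate.tex, ready for editing. Note that it overwrites any mytemplate.tex already there, so run it before making changes, not after.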

Now open up mytemplate.tex in a text editor and change the statement (or recognisable statement block) from:

\author{$for(author)$$author$$sep$ \and $endfor$}

to

$if(author)$
    \usepackage{authblk}
    $for(author)$
        $if(author.name)$
            $if(author.number)$
                \author[$author.number$]{$author.name$}
            $else$
                \author[]{$author.name$}
            $endif$
            $if(author.affiliation)$
                $if(author.email)$
                    \affil{$author.affiliation$ \thanks{$author.email$}}
                $else$
                    \affil{$author.affiliation$}
                $endif$
            $endif$
        $else$
            \author{$author$}
        $endif$
    $endfor$
$endif$

As you can see, the replacement code refers to affiliation and email, as in the YAML header, even though standard LaTeX has no such fields. What makes this work is that the replacement code also loads the package authblk (in its second line), which then properly places the affiliation and email values when the LaTeX \maketitle instruction is invoked.

If you decide to use the compact, single-line author entries and leave the latex template unchanged, then

pandoc --standalone --read=markdown --write=pdf --pdf-engine=pdflatex myinput.md -o myinput.pdf

produces the desired myinput.pdf. If, however, you decide to use the more explicit and structured affiliation and email YAML header together with the modified mytemplate.tex, then use

pandoc --standalone --read=markdown --write=pdf --template=mytemplate.tex --pdf-engine=pdflatex myinput.md -o myinput.pdf

instead, i.e., add the explicit new --template to your Pandoc call.

If you want to inspect the LaTeX code that's produced along the way, you can undertake this production in two steps:

pandoc --standalone --read=markdown --write=latex+raw_tex myinput.md -o myinput.tex
pdflatex myinput.tex &>/dev/null

adding in --template=mytemplate.tex as needed in the pandoc call.
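Since the PDF is regenerated every time the Markdown changes, it can help to wrap the call so that it runs only when needed, much as a Makefile rule would. This is a minimal sketch and not my actual Makefile: build_pdf is a made-up name, it assumes the input file name ends in .md, and it relies on the .pdf output extension to tell Pandoc that PDF is wanted.

```shell
# build_pdf is an illustrative helper: rebuild the PDF only when the
# Markdown source is newer than the PDF that already exists.
build_pdf() (
  src="$1"
  out="${src%.md}.pdf"
  if [ -e "$out" ] && [ ! "$src" -nt "$out" ]; then
    printf '%s is up to date\n' "$out"
  else
    # The .pdf extension on the output file selects PDF production.
    pandoc --standalone --read=markdown --pdf-engine=pdflatex "$src" -o "$out"
  fi
)
```

Calling build_pdf myinput.md twice in a row should run Pandoc the first time and report that myinput.pdf is up to date the second time. To use the modified template, add --template=mytemplate.tex to the pandoc line inside the function.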

References

https://maehr.github.io/academic-pandoc-template/
https://learnbyexample.github.io/tutorial/ebook-generation/customizing-pandoc/

https://pandoc.org/MANUAL.html#extension-pandoc_title_block
https://uoftcoders.github.io/studyGroup/lessons/misc/pandoc-intro/lesson/
https://opensource.com/article/18/9/pandoc-research-paper
https://en.wikibooks.org/wiki/LaTeX/Title_Creation
