My Pandoc Markdown-PDF Workflow for Routine, Not Especially Technical Writing


by Danny Quah Aug 2020

TL;DR: I write technical articles in LaTeX. But shorter, non-technical writings are easier to do in Markdown. How do I produce PDF from Markdown documents? Answer: provide YAML information in the Markdown; run Pandoc (typically through a Makefile or Atom's Markdown Preview Enhanced). To make all this work, some adjustment is needed in Pandoc options and template files.

Pandoc is a filter that takes a written document in its given format, and produces a version of that same document in yet a different format. I use Pandoc primarily to transform Markdown documents to PDF, but I also draw on Pandoc to convert Word or ODT documents to Markdown. Or vice versa.

Available official Pandoc documentation is voluminous. So as a matter of logic the knowledge to generate PDF from Markdown, to the user's desired degree of control, is already extant, out there somewhere. But a user just beginning might not find a good starting point, and without the ability to produce something useful quickly to show for their efforts, that user can lose the incentive to discover more, experiment, and improve. This writeup provides such a starting point for the beginner.

Pandoc Basics

To do its job, Pandoc has many options available from the command line. Using those can be as easy as just:

$ pandoc --read=odt --write=markdown oldarchive.odt -o mydocument.md

In this example, the input is the Open Document format file oldarchive.odt and the output is the Markdown document mydocument.md. Alternatively, I might have wanted the instruction:

$ pandoc --read=markdown --write=docx oldarchive.md -o mydocument.docx

where now input is the Markdown file oldarchive.md and output the Word document mydocument.docx.

The conversion will never be perfect, but in many cases the result provides a fine starting point for further fine-tuning.
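As a rough sketch of that fine-tuning loop, the following round trip converts a small Markdown file to Word and back, then shows what changed along the way. The directory pandoc-demo and the file names are made up for illustration, and the conversion only runs if pandoc is found on the PATH.

```shell
# Hypothetical scratch directory and file names, for illustration only.
mkdir -p pandoc-demo
cat > pandoc-demo/note.md <<'EOF'
# A short note

Some *emphasised* text and a list:

- first item
- second item
EOF

if command -v pandoc >/dev/null 2>&1; then
    # Markdown to Word, then Word back to Markdown.
    pandoc --read=markdown --write=docx pandoc-demo/note.md -o pandoc-demo/note.docx
    pandoc --read=docx --write=markdown pandoc-demo/note.docx -o pandoc-demo/roundtrip.md
    # The two Markdown files will usually differ in small ways; diff shows where.
    diff pandoc-demo/note.md pandoc-demo/roundtrip.md || true
else
    echo "pandoc not found; skipping conversion" >&2
fi
```

Whatever diff reports is the part that would need tidying by hand.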

Underlying Attributes

In the two examples just given, what you see when you render or display the input file, whether on-screen or on paper, is also approximately what you will see when you display the output file. In many Pandoc applications, that is all the user wants. The user, or their collaborators, will thereafter seek to make further changes only by editing the output file.

I myself prefer to edit Markdown directly but the people I work with insist on using Word. So I convert my Markdown file to docx, and thereafter that latter document is what we hand back and forth in our editing.

When the output is PDF, however, the result is no longer to be edited. Or, more accurately, to change the output file, users do not operate on the output file directly. Instead, a user will go back to the input file, make the alterations there, and then re-execute Pandoc to produce a new PDF. The PDF is just for show.
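The edit-then-regenerate cycle can be sketched as a small shell snippet that rebuilds the PDF only when it is missing or older than its Markdown source. This is an illustration under assumptions, not my actual Makefile: needs_rebuild is a hypothetical helper name, the file names are invented, and the -nt test is a common shell extension (bash, dash, ksh) rather than strict POSIX.

```shell
# needs_rebuild SOURCE OUTPUT: succeed when OUTPUT is missing or older than SOURCE.
# (Hypothetical helper, not part of Pandoc.)
needs_rebuild() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

src=pandoc-demo/article.md   # hypothetical file names
pdf=pandoc-demo/article.pdf
mkdir -p pandoc-demo
[ -e "$src" ] || printf '# Draft\n\nText to be edited.\n' > "$src"

if needs_rebuild "$src" "$pdf"; then
    if command -v pandoc >/dev/null 2>&1 && command -v pdflatex >/dev/null 2>&1; then
        # All edits happen in $src; the PDF is simply regenerated.
        pandoc --standalone "$src" --pdf-engine=pdflatex -o "$pdf"
    else
        echo "pandoc or pdflatex not found; skipping build" >&2
    fi
else
    echo "$pdf is up to date"
fi
```

A Makefile rule expresses the same dependency more compactly; the shell version just makes the logic explicit.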

This workflow having the PDF endpoint will be familiar to TeX users. With Pandoc, however, it is not just TeX or its relatives that can provide input for beautifully structured PDFs. Markdown documents can provide that input as well. Since Markdown is lighter-weight than TeX, it will not of course do everything that the latter can. However, for many routine, not especially technical documents, Markdown provides a fine input engine as an alternative to TeX. The actual process is one where the Markdown document gets translated into LaTeX code, and then that result is fed into LaTeX itself.

(A purist who insists on writing everything in TeX might remember that LaTeX stands in relation to TeX much as I've described for Markdown, except that LaTeX is higher up in the hierarchy and thus closer to TeX. But, again, for much routine, not especially technical writing, Markdown is already perfectly serviceable. Pandoc can be viewed as akin to lex and yacc for programmers. Who wants to code their own lexical analyzer in C from scratch every single time?)

Here, however, is both opportunity and potential pitfall. LaTeX and TeX are not translators that operate character by character or paragraph by paragraph. Instead, they work through structure. But how does structural information get conveyed from Markdown to LaTeX? Where in a Markdown document is encoded, in a single place, the information that, perhaps, every second-level heading is followed by a \medskip? Or that all figures should float but be placed towards the top of the page closest to where they are first referenced?

This last feature might be observed to be empirically true in a specific document, but was that the intention of the author? Or did the figures just happen to come out that way? How can the author convey the information on what they intend here, in a way that Pandoc and thereafter LaTeX can use?

The answer is two-fold: first, through YAML information; second, through template files.

Markdown to PDF via LaTeX

For generating PDF, Pandoc will, behind the scenes, call on LaTeX or a related program. Pandoc can, to a great extent, do all this invisibly, but obviously those programs need to be installed somewhere on your system.

Any Markdown document can begin with YAML information, i.e., a section that starts and ends with three dashes in sequence on a line by themselves. In between those, individual lines contain key-value pairs that provide structural information on the document. Thus, a simple YAML header might be:

fileName: Pandoc-2020.08.md
# Last-edited: Sun 2020.08.09.1841 -- Danny Quah (me@DannyQuah.com)
Type: Notes
Tags: Software
# Created: Sun 2020.08.09.1517 -- Danny Quah (me@DannyQuah.com)

(preceded and followed by the three-dash lines, of course). Like Python, YAML treats whitespace indentation as significant. Comments are introduced by the # symbol and are ignored by the processor.
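For concreteness, here is a sketch that writes out a complete file with the three-dash lines in place and then prints back only the YAML block. The file name header-only.md and the body sentence are invented for the example; the awk one-liner is just one way of picking out the lines between the first and second dash lines.

```shell
mkdir -p pandoc-demo
cat > pandoc-demo/header-only.md <<'EOF'
---
fileName: Pandoc-2020.08.md
# Last-edited: Sun 2020.08.09.1841 -- Danny Quah (me@DannyQuah.com)
Type: Notes
Tags: Software
---

Body text starts here, after the closing three dashes.
EOF

# Print everything between the first and second '---' lines, i.e. the YAML block.
awk '/^---$/ { n++; next } n == 1 { print }' pandoc-demo/header-only.md
```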

As with the #-introduced comments, however, as far as Markdown rendering is concerned, the YAML key-value pairs too are ignored. Indeed, on most systems, all YAML information is ignored by Markdown. If you save just the above lines as, say, file.md, opening that file with a Markdown previewer typically shows nothing to display. Typora will open file.md and display the YAML header, but in non-editable form; GitHub does the same. (So, to edit YAML, you'll need to open the file with a text editor like Vim or Atom or similar.)

But if YAML is ignored, what is the point of it? Here is what's critical: Pandoc uses the YAML when it generates a PDF document from Markdown. Pandoc reads the YAML information off the Markdown document and substitutes the values into its LaTeX template, so that the key-value pairs reach LaTeX as directives for structuring the document.
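One way to watch this hand-off is to give Pandoc a deliberately tiny template and see the YAML values appear in the LaTeX it writes. The sketch below assumes pandoc is installed; the files meta.md and tiny.tex are hypothetical, and the two-line template stands in for the much longer default one.

```shell
mkdir -p pandoc-demo
cat > pandoc-demo/meta.md <<'EOF'
---
title: Readable Title for My Article
date: June 2020
---

Body.
EOF

# A toy template: Pandoc replaces $title$ and $date$ with the YAML values.
cat > pandoc-demo/tiny.tex <<'EOF'
\title{$title$}
\date{$date$}
EOF

if command -v pandoc >/dev/null 2>&1; then
    # Prints the template with both variables filled in from the YAML header.
    pandoc --read=markdown --write=latex --template=pandoc-demo/tiny.tex pandoc-demo/meta.md
else
    echo "pandoc not found; skipping" >&2
fi
```

The default latex template works the same way, only with many more variables and conditionals.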

Thus, in many of my Markdown files destined for PDF output, the YAML header contains also:

## Front Matter
title: Readable Title for My Article
author:
  - name: Danny Quah
    affiliation: Lee Kuan Yew School of Public Policy, NUS
    email: D.Quah@nus.edu.sg
    number: 1
  - name: My Coauthor
    affiliation: Economics Department, NUS
    email: ecsdqlsh@nus.edu.sg
    number: 2
date: June 2020
# abstract:
# keywords:
# thanks:

## Formatting
fontsize: 12pt
# mainfont: "gentium" # See https://fonts.google.com/ for fonts
# sansfont: "Raleway"
# monofont: "IBM Plex Mono"
mathfont: ccmath
# fontfamily: concrete | gentium | libertine
# documentclass: article | scrartcl
fontfamily: concrete
documentclass: article
classoption:
 - notitlepage
 - onecolumn
fontenc: T1
geometry:
 - a4paper
 - top=35mm
 - left=30mm
 - heightrounded
header-includes:
 - |
  ```{=latex}
  \usepackage{amsmath,amsfonts,euscript,tikz,fancyhdr,float}
  \floatplacement{figure}{H}
  ```
pagestyle: headings

(obviously somewhere between the beginning and ending 3-dash --- lines).

This is almost all it takes to generate a sensible-looking PDF from my Markdown document. The problem that remains is the author key-value pair. The standard Pandoc latex template hands each author entry directly to LaTeX's \author command, and \maketitle knows nothing of affiliation or email keys. Thus the author information as I have written it above, with affiliation and email as explicit sub-keys, will fail on the standard Pandoc latex template. Instead, the YAML key-value pair needs to be

author:
  - Danny Quah `\\\\`{=latex} Lee Kuan Yew School of Public Policy, NUS `\\\\`{=latex} D.Quah@nus.edu.sg

so that to add a second author, write:

author:
  - Danny Quah `\\\\`{=latex} Lee Kuan Yew School of Public Policy, NUS `\\\\`{=latex} D.Quah@nus.edu.sg
  - My Coauthor `\\\\`{=latex} Economics Department, NUS `\\\\`{=latex} ecsdqlsh@nus.edu.sg

This works for me.

It can be neater, however, to separate out affiliation and email information explicitly, as in the YAML above. To use that then, the latex template that Pandoc uses will need to be modified.

First, generate the default latex template that will subsequently be changed:

$ pandoc -D latex > mytemplate.tex

I put mytemplate.tex in ~/.pandoc/templates/ (newer Pandoc versions look first in ~/.local/share/pandoc/templates/ and use ~/.pandoc only if that directory does not exist), as that is the personal folder searched by the Pandoc option

--template=mytemplate.tex

Now open up mytemplate.tex in a text editor and change the statement (or recognisable statement block) from:

\author{$for(author)$$author$$sep$ \and $endfor$}

to

$if(author)$
    \usepackage{authblk}
    $for(author)$
        $if(author.name)$
            $if(author.number)$
                \author[$author.number$]{$author.name$}
            $else$
                \author[]{$author.name$}
            $endif$
            $if(author.affiliation)$
                $if(author.email)$
                    \affil{$author.affiliation$ \thanks{$author.email$}}
                $else$
                    \affil{$author.affiliation$}
                $endif$
            $endif$
        $else$
            \author{$author$}
        $endif$
    $endfor$
$endif$

As you can see, the replacement code contains reference to affiliation and email, as in the YAML header, but which is not generally available in LaTeX. What makes this work is that the replacement code also loads in the package authblk (in its second line), that will then properly situate the affiliation and email key-values when the LaTeX \maketitle instruction is invoked.
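To see what LaTeX such a loop emits before committing to a full template edit, the author logic can be tried in isolation. The sketch below is a cut-down illustration, not the full replacement block: it assumes every author has name, affiliation and number keys, and the files authors.md and authors-only.tex are hypothetical.

```shell
mkdir -p pandoc-demo
cat > pandoc-demo/authors.md <<'EOF'
---
author:
  - name: Danny Quah
    affiliation: Lee Kuan Yew School of Public Policy, NUS
    number: 1
  - name: My Coauthor
    affiliation: Economics Department, NUS
    number: 2
---
EOF

# Cut-down template holding only a simplified author loop (no conditionals).
cat > pandoc-demo/authors-only.tex <<'EOF'
$for(author)$
\author[$author.number$]{$author.name$}
\affil{$author.affiliation$}
$endfor$
EOF

if command -v pandoc >/dev/null 2>&1; then
    # Should emit one \author[...]{...} line and one \affil{...} line per author.
    pandoc --read=markdown --write=latex --template=pandoc-demo/authors-only.tex pandoc-demo/authors.md
else
    echo "pandoc not found; skipping" >&2
fi
```

Those \author and \affil lines are what authblk picks up when \maketitle runs.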

If you decide to use the compact, single-line author entries and leave the latex template unchanged, then

$ pandoc --standalone --read=markdown --write=pdf --pdf-engine=pdflatex myinput.md -o myinput.pdf

produces the desired myinput.pdf. If, however, you decide to use the more explicit and structured affiliation and email YAML header and the modified mytemplate.tex then use

$ pandoc --standalone --read=markdown --write=pdf --template=mytemplate.tex --pdf-engine=pdflatex myinput.md -o myinput.pdf

instead, i.e., add the explicit new --template to your Pandoc call.

If you want to inspect the LaTeX code that's produced along the way, you can undertake this production in two steps:

$ pandoc --standalone --read=markdown --write=latex+raw_tex myinput.md -o myinput.tex
$ pdflatex myinput.tex &>/dev/null

adding in --template=mytemplate.tex as needed in the pandoc call.
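Sending all pdflatex output to /dev/null also hides any errors. A variant of the two-step build, sketched here under assumptions, keeps the output in a log file and prints its tail if the LaTeX run fails. The file names are hypothetical; -interaction=nonstopmode and -halt-on-error are standard pdflatex options that stop it from waiting for input at the first error.

```shell
src=pandoc-demo/twostep.md   # hypothetical file name
mkdir -p pandoc-demo
[ -e "$src" ] || printf '%s\n' '---' 'title: Two-Step Build' '---' '' 'Body text.' > "$src"
tex="${src%.md}.tex"         # pandoc-demo/twostep.tex
log="${src%.md}.build.log"   # pandoc-demo/twostep.build.log

if command -v pandoc >/dev/null 2>&1 && command -v pdflatex >/dev/null 2>&1; then
    # Step 1: Markdown to standalone LaTeX, which can be inspected or edited.
    pandoc --standalone --read=markdown --write=latex "$src" -o "$tex"
    # Step 2: LaTeX to PDF, run beside the source so auxiliary files stay together.
    ( cd "$(dirname "$tex")" &&
      pdflatex -interaction=nonstopmode -halt-on-error "$(basename "$tex")" \
          > "$(basename "$log")" 2>&1 ) || tail -n 20 "$log"
else
    echo "pandoc or pdflatex not found; skipping build" >&2
fi
```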


