@tef
Last active August 29, 2019 23:38

Trip Report: Writing a CommonMark Parser

Resources

Breakdown/Costs

  • 2500 lines, CommonMark specific code

    • 1300 lines, Base commonmark grammar

    • 600 lines, HTML tag handling (under commonmark rules)

    • 200 lines, Tree rewriting

      • Filling in backreferences/link definitions
      • Deactivating Links that have nested links inside
      • Stripping images/links inside image text
      • Tight/Loose List detection
      • Emphasis handling
    • 350 lines, HTML output

  • 2000 lines, Support code

    • 1000 lines, Grammar Definition Library
    • 1000 lines, Parser-Generator

Pain points

Everything? Is that an answer?

Tabs

In truth, this only matters because of indented code blocks. Tabs make them a little worse. You have to track column values to expand tabs into their correct space values, and be able to capture a tab that spans grammar rules.

If the tab stop is 8, a rule expecting whitespace(4), whitespace(4) has to split a single tab's width across two matches.
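The column tracking can be sketched as follows. This is a minimal, hypothetical helper, not the post's actual code; it defaults to CommonMark's tab stop of 4:

```python
def expand_tabs(line, tab_stop=4):
    """Expand tabs to spaces, tracking the current column: a tab's
    width depends on where it starts, not a fixed count of spaces."""
    out, col = [], 0
    for ch in line:
        if ch == "\t":
            width = tab_stop - (col % tab_stop)  # advance to next stop
            out.append(" " * width)
            col += width
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

Note that the parser can't simply expand all tabs up front, because a grammar rule may consume only part of a tab's width; the leftover columns have to be handed to the next rule as spaces.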

Indented blocks

Indented code blocks make everything terrible. They are introduced by a four-space indent, which means certain rules have to read ahead and then hand n-4 of those characters back as content before moving on.
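The stripping step can be sketched like this, as a toy line-based helper (not the post's code): exactly four columns of indent are consumed, and any extra whitespace stays in the code block's content.

```python
def strip_code_indent(line, indent=4):
    """Consume up to `indent` leading spaces; anything beyond them,
    including extra spaces, is the code block's content."""
    col = 0
    while col < indent and col < len(line) and line[col] == " ":
        col += 1
    if col < indent:
        return None          # not indented enough for a code block
    return line[col:]
```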

Blank Lines

Every block type treats blank lines differently; lists becoming loose or tight is one example among many.
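The tight/loose decision can be sketched over raw item text. This is a toy rule, not the actual detection pass:

```python
def list_is_loose(items):
    """items: the raw text of each list item, trailing newlines kept.
    Toy rule: a list is loose if a blank line separates any two
    blocks, either inside an item or between consecutive items."""
    for i, item in enumerate(items):
        if "\n\n" in item.rstrip("\n"):
            return True      # blank line inside an item
        if i < len(items) - 1 and item.endswith("\n\n"):
            return True      # blank line between items
    return False
```

On output, a loose list wraps each item's text in paragraph tags; a tight list doesn't.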

Contextual Newlines

Knowing when and where you can read a newline (that is, continue on to the next line) depends on which block construct the current paragraph is nested inside.

Setext Headers

The === underline needs a > marker if the paragraph is inside a blockquote, but the lines between the first and the last need not have one.

Blockquotes

A > can be missing on continuation lines, which is convenient, but leads to weird things:

>> a
> b

is one paragraph inside two blockquotes.

HTML

Seven cases, each with somewhat unique rules.

Images

Can nest, can contain links inside alt-text. These get formatted differently.

Links

Can't nest. Have to be parsed as if they nest. Can turn into text if not defined elsewhere. [foo][bar][baz] can be parsed in two different ways; [foo [foo]] isn't a link with a link inside, but [foo [foo]()][foo] is.

Life would be easier if the outermost link was preserved, but alas, no.
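Since the inner link wins, the cleanup can be sketched as a tree walk. The dict node shape here is made up for illustration, not the post's actual tree:

```python
def deactivate_outer_links(node):
    """Walk a toy parse tree of {"kind": ..., "children": [...]}.
    The inner link wins: an outer link that contains another link is
    flattened back to plain bracketed text. Returns True if the
    subtree contains a link."""
    has_link = False
    for child in node.get("children", []):
        if deactivate_outer_links(child):
            has_link = True
    if node["kind"] == "link":
        if has_link:
            node["kind"] = "text"  # hypothetical "flattened" kind
        return True
    return has_link
```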

Emphasis

Can't be parsed beyond tokenization. Highly context sensitive. Matching depends on whether a link is active, and on whether the left and right flanking runs have a certain length or sum of lengths. Christ on a bike.

Parsing with PEGs

The usual commonmark strategy is this:

  • tokenization / or regular expression matching
  • a recursive descent parser that tracks things like columns and partial tabs
  • a block parser, handling things like setext headers, producing mostly tokens
  • line-at-a-time parsing, choosing which DOM node to add to before parsing the line; tight/loose list recognition is handled here
  • extracting backreferences
  • resolving links (which ones can nest), handling alt text in images
  • handling emphasis matching

Instead I have:

  • A PEG Grammar that produces a node tree.
    • Links do not allow links defined inside
    • Links and images reparse the matched text as a paragraph, to handle missing link definitions
    • Indentation and continuation lines handled with an indent() grammar rule
    • Emphasis markers matched and counted
    • Atx headers counted

Then postprocessing over the parse tree:

  • Finding backreferences
  • Populating Links, Flattening inactive links
  • Handling Emphasis
  • Handling loose/tight lists
  • Converting to HTML

Indentation

indented { rules ... indent ... rules }

A parser rule that takes some nested rules, along with an indent rule and a dedent rule.

Calling indent matches the indent rules in order, if any.

Calling partial_indent matches the indent rules in order; if an indent rule fails but the dedent rule doesn't match, it continues as usual. This is how I handle continuation lines in the PEG rules.
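The two calls can be sketched as prefix matchers over the stack of open blocks. A toy version, which omits the dedent rule:

```python
def make_indent(prefixes):
    """prefixes: one marker per open block, e.g. ["> ", "    "]."""
    def indent(line):
        # match every open block's prefix in order, or fail
        for p in prefixes:
            if not line.startswith(p):
                return None
            line = line[len(p):]
        return line

    def partial_indent(line):
        # match as many prefixes as possible; a line that drops
        # trailing markers lazily continues the open paragraph
        for p in prefixes:
            if not line.startswith(p):
                break
            line = line[len(p):]
        return line

    return indent, partial_indent
```

With prefixes ["> ", "> "], indent("> b") fails but partial_indent("> b") returns "b", matching the doubled-blockquote example above.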

It's very much the same approach as a line-at-a-time parser, but backtracking instead of precisely matching which block can continue. Amenable to memoization.

Links

Parallel parsing: if one rule matches, reparse it with another rule, and store the result of both. I parse the link, and the link as if it were paragraph text, and resolve the ambiguity later.
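As a sketch, with tiny hypothetical combinators (not the post's grammar library):

```python
def lit(s):
    """A minimal PEG rule: match the literal s at pos."""
    def rule(text, pos):
        if text.startswith(s, pos):
            return pos + len(s), s
        return None
    return rule

def parallel(rule_a, rule_b):
    """If rule_a matches, reparse the same span with rule_b and keep
    both trees; the ambiguity is resolved in a later pass."""
    def rule(text, pos):
        a = rule_a(text, pos)
        if a is None:
            return None
        end, tree_a = a
        b = rule_b(text[pos:end], 0)   # reparse the matched span
        tree_b = b[1] if b is not None else None
        return end, ("both", tree_a, tree_b)
    return rule
```

For links, rule_a would be the link rule and rule_b the paragraph-text rule; if the link's definition never turns up, the paragraph parse is the one that survives.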

Emphasis

Same as everyone else, made simpler by not having to handle links at the same time. Still frustrating.
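A drastically simplified sketch of the delimiter matching, pairing runs of * against a stack; it omits the flanking checks and the multiple-of-3 rule that make the real algorithm painful, so it disagrees with CommonMark on harder inputs:

```python
import re

def emphasize(text):
    """Toy delimiter matcher: pairs '*' runs into <em>/<strong>."""
    parts = re.split(r"(\*+)", text)
    out = []      # finished output fragments
    stack = []    # (index of opener's slot in out, unused run length)
    for part in parts:
        if not part.startswith("*"):
            out.append(part)
            continue
        closing = len(part)
        while closing and stack:
            pos, avail = stack[-1]
            use = 2 if closing >= 2 and avail >= 2 else 1
            tag = "strong" if use == 2 else "em"
            inner = "".join(out[pos + 1:])   # everything since opener
            del out[pos + 1:]
            out[pos] += "<%s>%s</%s>" % (tag, inner, tag)
            closing -= use
            avail -= use
            if avail:
                stack[-1] = (pos, avail)
            else:
                stack.pop()
        if closing:
            out.append("")                   # slot for the new opener
            stack.append((len(out) - 1, closing))
    for pos, avail in stack:                 # unmatched openers: literal
        out[pos] += "*" * avail
    return "".join(out)
```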

PEG shenanigans

I put in variables and named captures, and a lot of the complexity became a bit more explicit, but manageable: backreferences, or counting operators that extract text or numbers, which can be passed down into later rules.

I added relative offsets to lookaheads, too.
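For example, a counted capture lets the length of an opening delimiter run flow into the closing match. A regex-based toy equivalent, not the grammar DSL itself:

```python
import re

def match_code_span(text):
    """Bind n = length of the opening backtick run, then require a
    closing run of the same length; n is the counted capture."""
    opening = re.match(r"`+", text)
    if opening is None:
        return None
    n = opening.end()
    close = text.find("`" * n, n)
    if close == -1:
        return None
    return text[n:close]   # the span's contents
```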

Wrapup

The things that would make markdown significantly easier to parse have been covered before. To echo others: indented code blocks make everything else worse. Links are another case, as [foo][bar][baz] gets disambiguated after parse time.
