@tef
Last active August 29, 2019 23:38

Trip Report: Writing a CommonMark Parser

Resources

Breakdown/Costs

  • 2500 lines, CommonMark specific code

    • 1300 lines, Base commonmark grammar

    • 600 lines, HTML tag handling (under commonmark rules)

    • 200 lines, Tree rewriting

      • Filling in backreferences/link definitions
      • Deactivating Links that have nested links inside
      • Stripping images/links inside image text
      • Tight/Loose List detection
      • Emphasis handling
    • 350 lines, HTML output

  • 2000 lines, Support code

    • 1000 lines, Grammar Definition Library
    • 1000 lines, Parser-Generator

Pain points

Everything? Is that an answer?

Tabs

In truth, this only matters because of indented code blocks. Tabs make them a little worse. You have to track column values to expand tabs into their correct space values, and be able to capture a tab that spans grammar rules.

If the tab stop is 8, a rule expecting whitespace(4), whitespace(4) has to split a single tab's width across two matches.
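The column tracking can be sketched as follows. This is a minimal, hypothetical helper, not the post's actual code; it defaults to CommonMark's tab stop of 4:

```python
def expand_tabs(line, tab_stop=4):
    """Expand tabs to spaces, tracking the current column: a tab's
    width depends on where it starts, not a fixed count of spaces."""
    out, col = [], 0
    for ch in line:
        if ch == "\t":
            width = tab_stop - (col % tab_stop)  # advance to next stop
            out.append(" " * width)
            col += width
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

Note that the parser can't simply expand all tabs up front, because a grammar rule may consume only part of a tab's width; the leftover columns have to be handed to the next rule as spaces.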

Indented blocks

Indented code blocks make everything terrible. They are introduced by a four-space indent, which means certain rules have to read ahead and then hand n-4 of those characters back as content before moving on.
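The stripping step can be sketched like this, as a toy line-based helper (not the post's code): exactly four columns of indent are consumed, and any extra whitespace stays in the code block's content.

```python
def strip_code_indent(line, indent=4):
    """Consume up to `indent` leading spaces; anything beyond them,
    including extra spaces, is the code block's content."""
    col = 0
    while col < indent and col < len(line) and line[col] == " ":
        col += 1
    if col < indent:
        return None          # not indented enough for a code block
    return line[col:]
```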

Blank Lines

Every block type treats blank lines differently; lists becoming loose or tight is one example among many.
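The tight/loose decision can be sketched over raw item text. This is a toy rule, not the actual detection pass:

```python
def list_is_loose(items):
    """items: the raw text of each list item, trailing newlines kept.
    Toy rule: a list is loose if a blank line separates any two
    blocks, either inside an item or between consecutive items."""
    for i, item in enumerate(items):
        if "\n\n" in item.rstrip("\n"):
            return True      # blank line inside an item
        if i < len(items) - 1 and item.endswith("\n\n"):
            return True      # blank line between items
    return False
```

On output, a loose list wraps each item's text in paragraph tags; a tight list doesn't.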

Contextual Newlines

Knowing when and where you can read a newline (that is, continue on to the next line) depends on which block construct the current paragraph is nested inside.

Setext Headers

The === underline needs a > marker if the paragraph is inside a blockquote, but the lines between the first and the last need not have one.

Blockquotes

A > can be missing on continuation lines, which is convenient, but leads to weird things:

>> a
> b

is one paragraph inside two blockquotes.

HTML

Seven cases, each with somewhat unique rules.

Images

Can nest, can contain links inside alt-text. These get formatted differently.

Links

Can't nest. Have to be parsed as if they nest. Can turn into text if not defined elsewhere. [foo][bar][baz] can be parsed in two different ways; [foo [foo]] isn't a link with a link inside, but [foo [foo]()][foo] is.

Life would be easier if the outermost link was preserved, but alas, no.
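Since the inner link wins, the cleanup can be sketched as a tree walk. The dict node shape here is made up for illustration, not the post's actual tree:

```python
def deactivate_outer_links(node):
    """Walk a toy parse tree of {"kind": ..., "children": [...]}.
    The inner link wins: an outer link that contains another link is
    flattened back to plain bracketed text. Returns True if the
    subtree contains a link."""
    has_link = False
    for child in node.get("children", []):
        if deactivate_outer_links(child):
            has_link = True
    if node["kind"] == "link":
        if has_link:
            node["kind"] = "text"  # hypothetical "flattened" kind
        return True
    return has_link
```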

Emphasis

Can't be parsed beyond tokenization. Highly context sensitive. Matching depends on whether a link is active, and on whether the left and right flanking runs have a certain length or sum of lengths. Christ on a bike.

Parsing with PEGs

The usual commonmark strategy is this:

  • tokenization / or regular expression matching
  • a recursive descent parser that tracks things like columns and partial tabs
  • a block parser, handling things like setext headers, producing mostly tokens
  • line-at-a-time parsing, choosing which DOM node to add to before parsing the line; tight/loose list recognition is handled here
  • extracting backreferences
  • resolving links (which ones can nest), handling alt text in images
  • handling emphasis matching

Instead I have:

  • A PEG Grammar that produces a node tree.
    • Links do not allow links defined inside
    • Links and images reparse the matched text as a paragraph, to handle missing link definitions
    • Indentation and continuation lines handled with an indent() grammar rule
    • Emphasis markers matched and counted
    • Atx headers counted

Then postprocessing over the parse tree:

  • Finding backreferences
  • Populating Links, Flattening inactive links
  • Handling Emphasis
  • Handling loose/tight lists
  • Converting to HTML

Indentation

indented { rules ... indent ... rules }

A parser rule that takes some nested rules, along with an indent rule and a dedent rule.

Calling indent matches the indent rules in order, if any.

Calling partial_indent matches the indent rules in order; if an indent rule fails but the dedent rule doesn't match, it continues as usual. This is how I handle continuation lines in the PEG rules.
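The two calls can be sketched as prefix matchers over the stack of open blocks. A toy version, which omits the dedent rule:

```python
def make_indent(prefixes):
    """prefixes: one marker per open block, e.g. ["> ", "    "]."""
    def indent(line):
        # match every open block's prefix in order, or fail
        for p in prefixes:
            if not line.startswith(p):
                return None
            line = line[len(p):]
        return line

    def partial_indent(line):
        # match as many prefixes as possible; a line that drops
        # trailing markers lazily continues the open paragraph
        for p in prefixes:
            if not line.startswith(p):
                break
            line = line[len(p):]
        return line

    return indent, partial_indent
```

With prefixes ["> ", "> "], indent("> b") fails but partial_indent("> b") returns "b", matching the doubled-blockquote example above.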

It's very much the same approach as a line-at-a-time parser, but backtracking instead of precisely matching which block can continue. Amenable to memoization.

Links

Parallel parsing: if one rule matches, reparse it with another rule, and store the result of both. I parse the link, and the link as if it were paragraph text, and resolve the ambiguity later.
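As a sketch, with tiny hypothetical combinators (not the post's grammar library):

```python
def lit(s):
    """A minimal PEG rule: match the literal s at pos."""
    def rule(text, pos):
        if text.startswith(s, pos):
            return pos + len(s), s
        return None
    return rule

def parallel(rule_a, rule_b):
    """If rule_a matches, reparse the same span with rule_b and keep
    both trees; the ambiguity is resolved in a later pass."""
    def rule(text, pos):
        a = rule_a(text, pos)
        if a is None:
            return None
        end, tree_a = a
        b = rule_b(text[pos:end], 0)   # reparse the matched span
        tree_b = b[1] if b is not None else None
        return end, ("both", tree_a, tree_b)
    return rule
```

For links, rule_a would be the link rule and rule_b the paragraph-text rule; if the link's definition never turns up, the paragraph parse is the one that survives.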

Emphasis

Same as everyone else, made simpler by not having to handle links at the same time. Still frustrating.
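A drastically simplified sketch of the delimiter matching, pairing runs of * against a stack; it omits the flanking checks and the multiple-of-3 rule that make the real algorithm painful, so it disagrees with CommonMark on harder inputs:

```python
import re

def emphasize(text):
    """Toy delimiter matcher: pairs '*' runs into <em>/<strong>."""
    parts = re.split(r"(\*+)", text)
    out = []      # finished output fragments
    stack = []    # (index of opener's slot in out, unused run length)
    for part in parts:
        if not part.startswith("*"):
            out.append(part)
            continue
        closing = len(part)
        while closing and stack:
            pos, avail = stack[-1]
            use = 2 if closing >= 2 and avail >= 2 else 1
            tag = "strong" if use == 2 else "em"
            inner = "".join(out[pos + 1:])   # everything since opener
            del out[pos + 1:]
            out[pos] += "<%s>%s</%s>" % (tag, inner, tag)
            closing -= use
            avail -= use
            if avail:
                stack[-1] = (pos, avail)
            else:
                stack.pop()
        if closing:
            out.append("")                   # slot for the new opener
            stack.append((len(out) - 1, closing))
    for pos, avail in stack:                 # unmatched openers: literal
        out[pos] += "*" * avail
    return "".join(out)
```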

PEG shenanigans

I put in variables and named captures, and a lot of the complexity became a bit more explicit, but manageable: backreferences, or counting operators that extract text or numbers, which can be passed down into later rules.

I added relative offsets to lookaheads, too.
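For example, a counted capture lets the length of an opening delimiter run flow into the closing match. A regex-based toy equivalent, not the grammar DSL itself:

```python
import re

def match_code_span(text):
    """Bind n = length of the opening backtick run, then require a
    closing run of the same length; n is the counted capture."""
    opening = re.match(r"`+", text)
    if opening is None:
        return None
    n = opening.end()
    close = text.find("`" * n, n)
    if close == -1:
        return None
    return text[n:close]   # the span's contents
```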

Wrapup

The things that would make markdown significantly easier to parse have been covered before. To echo others: indented code blocks make everything else worse. Links are another case, as [foo][bar][baz] gets disambiguated after parse time.
