- CommonMark Spec 0.29: https://spec.commonmark.org/0.29/
- My toy parser implementation: https://github.com/tef/toyparser2019/tree/master/toyparser/commonmark
- 2500 lines, CommonMark-specific code
  - 1300 lines, base CommonMark grammar
  - 600 lines, HTML tag handling (under CommonMark rules)
  - 200 lines, tree rewriting
    - Filling in backreferences/link definitions
    - Deactivating links that have nested links inside
    - Stripping images/links inside image text
    - Tight/loose list detection
    - Emphasis handling
  - 350 lines, HTML output
- 2000 lines, support code
  - 1000 lines, grammar definition library
  - 1000 lines, parser-generator
Everything? Is that an answer?
In truth, tabs only matter because of indented code blocks, and they make those a little worse. You have to track column values to expand tabs to their correct widths, and be able to capture a tab that spans grammar rules. If a tab stop were 8 columns wide, a single tab might have to satisfy whitespace(4), whitespace(4): two rules each consuming part of the same character.
Indented code blocks make everything terrible: they hinge on a four-space indent, which means certain rules have to read ahead and then read forwards n-4 characters before moving on.
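A minimal sketch of the column bookkeeping this implies; the helper names are hypothetical, not anything from the toyparser code, and CommonMark itself treats tabs as having a tab stop of 4:

```python
def advance(column, ch, tabstop=4):
    """Return the column after consuming one character: a tab jumps to
    the next multiple of the tab stop, so a single tab is worth anywhere
    from 1 to `tabstop` columns depending on where it starts."""
    if ch == "\t":
        return column + (tabstop - column % tabstop)
    return column + 1


def match_indent(line, width, column=0, tabstop=4):
    """Consume up to `width` columns of leading whitespace, returning
    (characters consumed, columns consumed). A tab that straddles the
    boundary overshoots, and the caller has to remember the leftover
    columns, which is why one tab can end up split across grammar rules."""
    pos, col = 0, column
    while pos < len(line) and line[pos] in " \t" and col - column < width:
        col = advance(col, line[pos], tabstop)
        pos += 1
    return pos, col - column
```

For example, `match_indent("\tcode", 4)` reports four columns from a single character, while the same tab starting at column 2 is only worth two.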
Every block type treats blank lines differently. Lists become loose or tight, amongst other examples.
Knowing when and where you can read a newline and continue on to the next line depends on which block construct the current paragraph is nested inside.
The `===` underlining a setext heading needs a `>` in front of it if the paragraph is inside a blockquote, but the lines in between the first and the last need not have one.
A `>` can be missing on continuation lines, which is convenient, but leads to weird things: `>> a` followed on the next line by `> b` is one paragraph inside two blockquotes.
HTML blocks come in seven cases, each with somewhat unique rules.
Images can nest, and can contain links inside their alt text; these get formatted differently.
Links can't nest, but have to be parsed as if they do, and a reference can turn into plain text if it isn't defined elsewhere. `[foo][bar][baz]` can be parsed in two different ways, `[foo [foo]]` isn't a link with a link inside, but `[foo [foo]()][foo]` is. Life would be easier if the outermost link were preserved, but alas, no.
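A rough sketch of the bookkeeping that implies, with hypothetical helpers rather than the spec's exact algorithm: once an inner bracket pair becomes a link, every opening bracket outside it is deactivated, so the enclosing pair can only ever be text.

```python
def resolve_brackets(tokens, parses_as_link):
    """tokens is a flat list of '[', ']' and text items;
    parses_as_link(open_index, close_index) is assumed to check whether
    that span forms a valid inline or reference link."""
    openers = []                      # unmatched '[' as [index, still_active]
    links = []
    for i, tok in enumerate(tokens):
        if tok == "[":
            openers.append([i, True])
        elif tok == "]" and openers:
            start, active = openers.pop()
            if active and parses_as_link(start, i):
                links.append((start, i))
                # links can't nest: once this one exists, any '[' still
                # open outside it can no longer start a link
                for opener in openers:
                    opener[1] = False
    return links
```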
Emphasis can't be parsed beyond tokenization; it's highly context sensitive. Whether delimiters match depends on whether a link is active or not, and on whether the left and right flanking runs have a certain length, or sum of lengths. Christ on a bike.
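The length condition alone is strange enough to be worth writing down. A sketch of just that check, taken from the spec's emphasis rules (the multiple-of-three condition), with flanking detection left out:

```python
def lengths_allow_match(opener_len, closer_len,
                        opener_can_close, closer_can_open):
    """If either delimiter run could both open and close emphasis, the
    two runs may only pair up when the sum of their lengths is not a
    multiple of 3, unless both lengths are themselves multiples of 3."""
    if opener_can_close or closer_can_open:
        if (opener_len + closer_len) % 3 == 0:
            return opener_len % 3 == 0 and closer_len % 3 == 0
    return True
```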
The usual commonmark strategy is this:
- tokenization, or regular expression matching
- a recursive descent parser that tracks things like columns and partial tabs
- a block parser, handling things like setext headers, mostly emitting tokens
- line-at-a-time parsing, choosing which DOM node to add to before parsing the line; tight/loose list recognition is handled here
- extracting backreferences
- resolving links (which ones can nest), handling alt text in images
- handling emphasis matching
Instead I have:
- A PEG Grammar that produces a node tree.
- Links do not allow links defined inside
- Links and images reparse the text they matched as a paragraph, to handle missing link definitions
- Indentation and continuation lines handled with an `indent()` grammar rule
- Emphasis markers matched and counted
- ATX headers counted
Then postprocessing over the parse tree:
- Finding backreferences
- Populating links, flattening inactive links (see the sketch after this list)
- Handling emphasis
- Handling loose/tight lists
- Converting to HTML
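The backreference and link-population passes look roughly like this over a hypothetical node shape; the names below are illustrative, not the toyparser AST:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def fill_backreferences(root):
    """Collect every link reference definition in the tree, then resolve
    reference-style links against them; anything that doesn't resolve is
    flattened back into plain text."""
    defs = {}

    def collect(node):
        if node.name == "link_def":
            # the first definition of a label wins; labels match case-insensitively
            defs.setdefault(node.attrs["label"].casefold(), node.attrs)
        for child in node.children:
            collect(child)

    def resolve(node):
        for child in node.children:
            resolve(child)
        if node.name == "reference_link":
            target = defs.get(node.attrs["label"].casefold())
            if target is not None:
                node.name = "link"
                node.attrs["url"] = target["url"]
            else:
                node.name = "text"    # undefined reference: rendered literally

    collect(root)
    resolve(root)
```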
`indented { rules .... indent .... rules }`
A parser rule that takes some nested rules, along with an indent and a dedent rule.
Calling `indent` matches the indent rules in order, if any. Calling `partial_indent` matches the indent rules in order; if an indent rule fails but the dedent rule doesn't match, it continues as usual. This is how I handle continuation lines in the PEG rules.
It's very much the same approach as a line-at-a-time parser, but backtracking instead of precisely matching which block can continue. Amenable to memoization.
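A minimal sketch of the idea, with rules as plain `(text, pos) -> new position or None` functions standing in for the grammar DSL; this is not the toyparser API:

```python
class ParserState:
    def __init__(self, text):
        self.text = text
        self.indents = []                # stack of (indent_rule, dedent_rule)

    def indent(self, pos):
        """At the start of a line, every enclosing block's indent rule
        must match, outermost first."""
        for indent_rule, _dedent_rule in self.indents:
            pos = indent_rule(self.text, pos)
            if pos is None:
                return None
        return pos

    def partial_indent(self, pos):
        """Like indent(), but a failed indent rule is tolerated as a lazy
        continuation, unless the dedent rule matches and says the block
        has really ended."""
        for indent_rule, dedent_rule in self.indents:
            new_pos = indent_rule(self.text, pos)
            if new_pos is not None:
                pos = new_pos
            elif dedent_rule(self.text, pos) is not None:
                return None
        return pos


def blockquote_indent(text, pos):
    """Example indent rule: a '>' marker with an optional trailing space."""
    if pos < len(text) and text[pos] == ">":
        pos += 1
        if pos < len(text) and text[pos] == " ":
            pos += 1
        return pos
    return None
```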
Parallel parsing: if one rule matches, reparse the same text with another rule and store both results. I parse the link, and the same span as if it were paragraph text, and resolve the ambiguity later.
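In combinator terms, something like the following, where `link_rule` and `inline_rule` are hypothetical rules returning `(end, node)` or `None`:

```python
def parse_maybe_link(text, pos, link_rule, inline_rule):
    """If the link rule matches, reparse exactly the same span with the
    plain inline rule too, and keep both parses under one ambiguity node.
    A later pass keeps the link parse if its reference resolves and no
    inner link deactivates it, and falls back to the text parse otherwise."""
    match = link_rule(text, pos)
    if match is None:
        return None
    end, link_node = match
    _, text_node = inline_rule(text[pos:end], 0)
    return end, ("maybe_link", link_node, text_node)
```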
Emphasis handling is much the same as everyone else's, made simpler by not having to handle links at the same time. Still frustrating.
I put in variables and named captures, and a lot of the complexity became a bit more explicit, but manageable: backreferences, and counting operators that extract text or numbers and pass them down into later rules. I added relative offsets to lookaheads, too.
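A counted capture is easiest to see with fenced code blocks: the length of the opening fence is captured and fed into the rule for the closing fence, which must be at least as long. This is plain Python rather than the grammar's counting operators, and handles backtick fences only:

```python
FENCE_CHAR = "`"

def parse_fenced_block(lines, i):
    """Match a fenced code block starting at lines[i]. The opening fence
    length is the captured count; the closing fence must consist only of
    fence characters and be at least that long."""
    opening = lines[i].strip()
    fence_len = len(opening) - len(opening.lstrip(FENCE_CHAR))
    if fence_len < 3:
        return None
    body = []
    for j in range(i + 1, len(lines)):
        closing = lines[j].strip()
        if closing and set(closing) == {FENCE_CHAR} and len(closing) >= fence_len:
            return j + 1, "\n".join(body)     # index after the fence, body text
        body.append(lines[j])
    return len(lines), "\n".join(body)        # unterminated: runs to the end
```

The same trick covers counting the `#` markers of an ATX heading.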
Things that would make markdown significantly easier to parse have been covered before. To echo others, indented code blocks make everything else worse. Links are another case, as `[foo][bar][baz]` gets disambiguated after parse time.