

Blog 2020/6/18

# A metagrammar for a parser generator

Part of writing a parser generator is specifying the grammar used to describe grammars, i.e. the grammar-grammar or metagrammar.

Just as ABNF is defined using ABNF, so too will I use my metagrammar to describe my metagrammar.

## But first, an example grammar

Let's consider the grammar for a trivial lisp-like language.

Here is a classic recursive fibonacci implementation in such a language:

``````
(define fib
  (lambda (n)
    (if (lessthan n 2)
      n
      (sum
        (fib (sum n -1))
        (fib (sum n -2))))))
``````

The token definitions for such a language might be:

``````
SYMBOL
[a-z]+
NUMBER
-?[0-9]+
OPAREN
\(
CPAREN
\)
WSPACE
\s+
``````

and the grammar used to define this language might be:

``````
           program = symbolicExpression+
symbolicExpression = list | atom
              list = OPAREN symbolicExpression* CPAREN
              atom = SYMBOL | NUMBER
``````
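Token definitions like the ones above can be driven by a simple first-match regex lexer. Here is a minimal sketch of that idea in Python; the `TOKEN_DEFS` table and the `tokenize` function are my own illustration, not part of mkparser.py:

```python
import re

# Hypothetical token table mirroring the definitions above:
# (name, regex) pairs, tried in order at each position.
TOKEN_DEFS = [
    ('SYMBOL', r'[a-z]+'),
    ('NUMBER', r'-?[0-9]+'),
    ('OPAREN', r'\('),
    ('CPAREN', r'\)'),
    ('WSPACE', r'\s+'),
]

def tokenize(text):
    """Yield (token_type, lexeme) pairs; the first matching pattern wins."""
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_DEFS:
            m = re.match(pattern, text[pos:])
            if m:
                yield (name, m.group(0))
                pos += m.end()
                break
        else:
            raise SyntaxError('no token matches at position %d' % pos)

# Whitespace tokens are discarded before parsing.
tokens = [t for t in tokenize('(sum n -1)') if t[0] != 'WSPACE']
```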

# The metagrammar

``````
# The mkparser.py meta-grammar.

# A grammar is at least one production rule.
grammar = productionRule+

# A production rule is a non-terminal, '=', and an expr, terminated by ';'.
productionRule = NONTERMINAL EQUALS expr SEMICOLON

# An expression is a sequence, a choice, or a unit.
expr = sequence | choice | unit

# A sequence is at least two (choice or unit).
sequence = (choice|unit) (choice|unit)+

# A choice is at least two units, delimited by '|'.
choice = unit ( BAR unit )+

# A unit is a group or atom, with an optional arity operator.
unit = zeroOrOne | zeroOrMore | oneOrMore | group | atom
zeroOrOne = (group|atom) QMARK
zeroOrMore = (group|atom) ASTERISK
oneOrMore = (group|atom) PLUS

# A group is an expr enclosed in parentheses.
group = OPAREN expr CPAREN

# An atom is the name of a production rule or the name of a token type.
atom = NONTERMINAL | TERMINAL
``````

and the token definitions are:

``````
#pragma discard WSPACE COMMENT
WSPACE
\s+
NONTERMINAL
[a-z][a-zA-Z0-9_]*
TERMINAL
[A-Z][A-Z0-9_]*
EQUALS
=
QMARK
\?
ASTERISK
\*
PLUS
\+
OPAREN
\(
CPAREN
\)
BAR
\|
COMMENT
#.*
``````

# Detailed description

## Components

A grammar consists of:

• production rule names
• token type names
• operators
• whitespace and comments (both ignored)

## Production rules

A grammar consists of one or more production rules.

Production rules are named using camelCase, with a lower-case leading character:

`[a-z][a-zA-Z0-9_]*`

This metagrammar uses `=` to define a production rule:

`foo = bar`

Definition symbols commonly used by other grammars include `:`, `:=` and `::=`.

## All terminals must be tokens

Note that quoted literal characters are not allowed. All terminals must be tokens, created by the lexer.

Token types are indicated in ALLCAPS:

`[A-Z][A-Z0-9_]*`
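Because of these two naming rules, terminals and non-terminals can be distinguished by spelling alone. A quick check, using the two regexes above (the `classify` helper is mine, not mkparser.py's):

```python
import re

# Regexes copied verbatim from the naming rules above.
NONTERMINAL_RE = re.compile(r'[a-z][a-zA-Z0-9_]*$')
TERMINAL_RE = re.compile(r'[A-Z][A-Z0-9_]*$')

def classify(name):
    """Classify a grammar name by its leading character and spelling."""
    if TERMINAL_RE.match(name):
        return 'TERMINAL'
    if NONTERMINAL_RE.match(name):
        return 'NONTERMINAL'
    return None  # e.g. 'Foo' is neither: mixed case with an upper-case lead
```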

## Grouping

As with many other grammar notations, `(` and `)` are used to indicate grouping.

## Sequence

Sequence is indicated by simply listing several whitespace-delimited components:

`ay = bee cee dee`

Other grammars sometimes use `,` to indicate sequence.
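A sequence succeeds only if each element succeeds in turn at the advancing input position. A minimal combinator-style sketch (the `match` and `sequence` helpers are my illustration, not mkparser.py's generated code):

```python
def match(token_type):
    # Succeed on one token of the given type; return (node, next_pos) or None.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == token_type:
            return (tokens[pos], pos + 1)
        return None
    return parse

def sequence(*parsers):
    # Run each parser in order at the advancing position; fail if any fails.
    def parse(tokens, pos):
        nodes = []
        for p in parsers:
            result = p(tokens, pos)
            if result is None:
                return None
            node, pos = result
            nodes.append(node)
        return (nodes, pos)
    return parse

# e.g. the two fixed elements of 'list = OPAREN symbolicExpression* CPAREN':
pair = sequence(match('OPAREN'), match('CPAREN'))
```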

## Alternation / choice

Choice is indicated using `|`:

`foo = bar | bin | baz`

Note, however, that this operator indicates ordered choice à la PEG. Thus, the resulting parser does not implement backtracking, and there is no possibility of ambiguity.
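Ordered choice means the alternatives are tried left to right at the current position, and the first one that succeeds wins; later alternatives are never revisited. A toy sketch of the idea (combinator names are mine, not mkparser.py's):

```python
def match(token_type):
    # Succeed on one token of the given type; return (node, next_pos) or None.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == token_type:
            return (tokens[pos], pos + 1)
        return None
    return parse

def ordered_choice(*parsers):
    # PEG-style '|': try each alternative in order; first success wins.
    def parse(tokens, pos):
        for p in parsers:
            result = p(tokens, pos)
            if result is not None:
                return result
        return None
    return parse

# e.g. 'atom = SYMBOL | NUMBER':
atom = ordered_choice(match('SYMBOL'), match('NUMBER'))
```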

Note also that `|` only applies to its nearest neighbors. That is, this:

`foo = one two | three four`

is equivalent to:

`foo = one ( two | three ) four`

and does not mean:

`foo = ( one two ) | ( three four )`

Note also that alternation does not "see" an arity operator as a separate component; rather, it "sees" the unit formed by applying the arity operator to its operand.

That is to say, this:

`foo = one+ | two* | three?`

is equivalent to:

`foo = ( one + ) | ( two * ) | ( three ? )`

and does not mean:

`foo = one ( + | two ) ( * | three ) ?`

## Arity operators (repetition)

This metagrammar borrows the arity operators from regex notation and PEG:

• `?`: zero or one (a.k.a. optional)
• `*`: zero or more
• `+`: one or more

These can be used with terminals, non-terminals, and groups:

`foo = NUMBER? expression* (SYMBOL NUMBER)+`

Note that, like PEG, these operators are greedy.

Other grammars use `[]`, `{}`, etc. to indicate repetition.

After tokenization, all whitespace tokens and comment tokens are discarded before the tokens of the grammar are read by the parser generator.

Thus, `foo = bar+` and `foo = bar +` are equivalent.
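A greedy `+` simply applies its operand repeatedly until it fails, keeping everything it consumed and requiring at least one match. A minimal sketch (the `match` and `one_or_more` helpers are my illustration, not mkparser.py's generated code):

```python
def match(token_type):
    # Succeed on one token of the given type; return (node, next_pos) or None.
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == token_type:
            return (tokens[pos], pos + 1)
        return None
    return parse

def one_or_more(parser):
    # Greedy '+': repeat until failure; require at least one repetition.
    def parse(tokens, pos):
        nodes = []
        result = parser(tokens, pos)
        while result is not None:
            node, pos = result
            nodes.append(node)
            result = parser(tokens, pos)
        if nodes:
            return (nodes, pos)
        return None
    return parse

numbers = one_or_more(match('NUMBER'))
toks = [('NUMBER', '1'), ('NUMBER', '2'), ('SYMBOL', 'x')]
```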

## Automatic semicolon insertion

You may have noticed that none of the grammars I've shown use semicolons to delimit each production rule, and yet production rules are defined as:

`productionRule = NONTERMINAL EQUALS expr SEMICOLON`

So what gives?

It turns out it is easy to perform automatic semicolon insertion for these grammars.

After lexing the grammar, I run the JSON tokens through this middleware script to automatically insert the `SEMICOLON` tokens:

```
#!/usr/bin/env python

# Automatic semicolon insertion for mkparser.py grammars.

# Transformation:
# NONTERMINAL EQUALS  ->  SEMICOLON NONTERMINAL EQUALS
# (except for the first production rule)

import sys
import json

format_obj = None
use_fast_format = None
nonterminal_toktype_index = None
equals_toktype_index = None
semicolon_toktype_index = None

def toktype_is_nonterminal(token):
    if use_fast_format:
        return token[0] == nonterminal_toktype_index
    else:
        return token['token_type'] == 'NONTERMINAL'

def toktype_is_equals(token):
    if use_fast_format:
        return token[0] == equals_toktype_index
    else:
        return token['token_type'] == 'EQUALS'

def semicolon_token():
    if use_fast_format:
        return [semicolon_toktype_index, ';']
    else:
        return {'type': 'token', 'token_type': 'SEMICOLON', 'text': ';'}

def insert_semicolons(tokens):
    # Emit the first token unconditionally, so that no semicolon is
    # inserted in front of the first production rule.
    new_tokens = [tokens[0]]
    previous_token = tokens[1]
    for token in tokens[2:]:
        if toktype_is_nonterminal(previous_token) and toktype_is_equals(token):
            new_tokens.append(semicolon_token())
            new_tokens.append(previous_token)
            previous_token = token
            continue
        new_tokens.append(previous_token)
        previous_token = token
    # Emit the final token and the semicolon terminating the last rule.
    new_tokens.append(previous_token)
    new_tokens.append(semicolon_token())
    return new_tokens

if __name__ == '__main__':
    js_obj = json.loads(sys.stdin.read())
    format_obj = js_obj[0]
    use_fast_format = format_obj['format'] == 'fast'
    if use_fast_format:
        nonterminal_toktype_index = format_obj['token_types'].index('NONTERMINAL')
        equals_toktype_index = format_obj['token_types'].index('EQUALS')
        format_obj['token_types'].append('SEMICOLON')
        semicolon_toktype_index = format_obj['token_types'].index('SEMICOLON')
    tokens = js_obj[1]
    tokens = insert_semicolons(tokens)
    js_obj = [format_obj, tokens]
    output = json.dumps(js_obj)
    sys.stdout.write(output)
    if not output.endswith('\n'):
        sys.stdout.write('\n')
```