Skip to content

Instantly share code, notes, and snippets.

@cellularmitosis
Last active June 18, 2020 20:42
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save cellularmitosis/9fb64bb25174d328290f9144e13c1c82 to your computer and use it in GitHub Desktop.
Save cellularmitosis/9fb64bb25174d328290f9144e13c1c82 to your computer and use it in GitHub Desktop.
A spec for JSON output from lexers and parsers

Blog 2020/6/17

<- previous | index | next ->

A spec for JSON output from lexers and parsers

It seems difficult to find lexers and parsers which simply spit out JSON.

Because of this (and because it is fun), I've decided to work on a general-purpose lexer generator and parser generator. The generated lexers will spit out a list of tokens in JSON, and the generated parsers will spit out a parse tree in JSON.

Part of this process is creating a spec describing those JSON formats. That's what this blog post is about!

Table of contents:

Lexer output formats

The top-level object, in all cases, is an array.

The array contains two items:

  • A format object describing which format is being used
  • The tokens

The format object

The format object is identified by a type key:

{"type": "format"}

The format object contains a format key, the value of which is one of:

  • tokens
  • token-lines
  • fast-tokens
  • fast-token-lines

e.g. {"type": "format", "format": "tokens"}

For the tokens and token-lines formats, each token is an object.

For the fast-tokens and fast-token-lines formats, each token is an array.

Token objects

A token object is identified by a type key:

{"type": "token"}

A token also has a token_type key and a text key. e.g.

{"type": "token", "token_type": "NUMBER", "text": "1.618"}

The tokens format

For the tokens format, the second item in the top-level array is an array of tokens.

Here is a complete example of the tokens JSON format:

[
    {"type": "format", "format": "tokens"},
    [
        {"type": "token", "token_type": "SYMBOL", "text": "pi"},
        {"type": "token", "token_type": "WSPACE", "text": " "},
        {"type": "token", "token_type": "NUMBER", "text": "3.14159"}
    ]
]

The above is a possible tokenization of the input string

pi 3.14159

using the lexical grammar

SYMBOL
[a-z]+
NUMBER
[0-9]+(\.[0-9]+)?
WSPACE
\s+

The token-lines format

The token-lines format differs in that the second item in the top-level array is a two-dimentional array, i.e. a list of lines of tokens.

For the input file:

pi 3.14159
phi 1.618

a possible tokenization (in token-lines format) might be:

[
    {"type": "format", "format": "token-lines"},
    [
        [
            {"type": "token", "token_type": "SYMBOL", "text": "pi"},
            {"type": "token", "token_type": "WSPACE", "text": " "},
            {"type": "token", "token_type": "NUMBER", "text": "3.14159"}
            {"type": "token", "token_type": "WSPACE", "text": "\n"},
        ],
        [
            {"type": "token", "token_type": "SYMBOL", "text": "phi"},
            {"type": "token", "token_type": "WSPACE", "text": " "},
            {"type": "token", "token_type": "NUMBER", "text": "1.618"}
            {"type": "token", "token_type": "WSPACE", "text": "\n"},
        ]
    ]
]

Fast tokens

Fast tokens are arrays rather than objects.

The first item in the array is a numeric index into a list of token types, and the second item is the token's text. e.g.

[3, "foo"]

The fast-tokens format

The format object for the fast-tokens format also contains token_types array:

{"type": "format", "format": "fast-tokens", "token_types": ["SYMBOL", "NUMBER", "STRING"]}

The tokens format example from above would look like this in fast-tokens format:

[
    {"type": "format", "format": "tokens", "token_types": ["SYMBOL", "NUMBER", "WSPACE"]},
    [
        [0, "pi"],
        [2, " "],
        [1, "3.14159"]
    ]
]

The fast-token-lines format

The fast-token-lines format similarly structures the tokens in a list of lines.

The token-lines example from above would look like this in fast-token-lines format:

[
    {"type": "format", "format": "fast-token-lines", "token_types": ["SYMBOL", "NUMBER", "WSPACE"]},
    [
        [
            [0, "pi"],
            [2, " "],
            [1, "3.14159"],
            [2, "\n"]
        ],
        [
            [0, "phi"],
            [2, " "],
            [1, "1.618"],
            [2, "\n"]
        ]
    ]
]


Parser output formats

As with the lexer output, the top-level object is an array.

The array contains two items:

  • A format object describing which format is being used
  • The root-level parse tree node

Parse trees contain non-terminal objects and tokens.

The format object

The format object is identified by a type key:

{"type": "format"}

The format object contains a format key, the value of which is one of:

  • parsetree
  • fast-parsetree

e.g. {"type": "format", "format": "parsetree"}

For the parsetree format, non-terminals and tokens are objects.

For the fast-parsetree format, non-terminals and tokens are arrays.

Non-terminal objects

The inner nodes of the parse tree (the non-terminals) are identified by a type key:

{"type": "nonterminal"}

A non-terminal object also has a nonterminal_type key and a subnodes key, e.g.:

{"type": "nonterminal", "nonterminal_type": "expression", "subnodes": [...]}

The subnodes key has a value which is an array of nested non-terminal objects, thus creating a tree structure.

The parsetree format

For the parsetree format, the second item in the top-level array is the root-level parse tree object, in parsetree format.

Here is an complete example of the parsetree JSON format:

[
    {"type": "format", "format": "parsetree"},
    {"type": "nonterminal", "nonterminal_type": "expression", "subnodes": [
        {"type": "token", "token_type": "NUMBER", "text": "1"},
        {"type": "nonterminal", "nonterminal_type": "operator", "subnodes": [
            {"type": "token", "token_type": "PLUS", "text": "+"}
        ]},
        {"type": "token", "token_type": "NUMBER", "text": "1"}
    ]}
]

The above is a possible parse of the input string

1+1

using the parsing grammar

expression = NUMBER operator NUMBER ;
operator = PLUS | MINUS ;

and the lexical grammar

NUMBER
[0-9]+
PLUS
\+
MINUS
-

Fast non-terminals

Fast non-terminals are arrays rather than objects.

Each array contains three items:

  • A numeric index (which is always 0) into the list of parse tree types (which is ["nonterminal", "token"])
  • A numeric index into the list of non-terminal types (e.g. ["expression", "operator", ...])
  • An array of subnodes

e.g. [0, 3, [...]]

Fast terminals (tokens)

Parse tree fast tokens are the same as lexer fast tokens, with the addition of a prepended index (which is always 1) into the list of parse tree types (which is ["nonterminal", "token"]).

Thus, the fast lexer token [0, "phi"] becomes [1, 0, "phi"] in fast-parsetree format.

The fast-parsetree format

The format object for the fast-parsetree format is similar, but with the addition of several type arrays:

  • types, which is always ["nonterminal", "token"]
  • nonterminal_types, which is the list of non-terminal types, e.g. ["program", "expression", "atom"]
  • token_types, which is the same as above, e.g. ["SYMBOL", "NUMBER", "WSPACE"]

The parsetree format example from above would look like this in fast-parsetree format:

[
    {"type": "format", "format": "fast-parsetree",
        "types": ["nonterminal", "token"],
        "nonterminal_types": ["expression", "operator"],
        "token_types": ["NUMBER", "PLUS", "MINUS"]
    },
    [0, 0, [
        [1, 0, "1"],
        [0, 1, [
            [1, 1, "+"]
        ]],
        [1, 0, "1"]
    ]]
]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment