cellularmitosis/README.md

## README.md

      
    Raw
  

              README.md
            
          
    Blog 2020/6/17
<- previous |
index |
next ->
A spec for JSON output from lexers and parsers

It seems difficult to find lexers and parsers which simply spit out JSON.
Because of this (and because it is fun), I've decided to work on a general-purpose
lexer generator and parser generator.  The generated lexers will spit out a
list of tokens in JSON, and the generated parsers will spit out a parse tree in JSON.
Part of this process is creating a spec describing those JSON formats.
That's what this blog post is about!
Table of contents:

Lexer output formats

The format object
Token objects
The tokens format
The token-lines format
Fast tokens
The fast-tokens format
The fast-token-lines format


Parser output formats

The format object
Non-terminal objects
The parsetree format
Fast non-terminals
Fast terminals (tokens)
The fast-parsetree format


Lexer output formats

The top-level object, in all cases, is an array.
The array contains two items:

A format object describing which format is being used
The tokens

The format object

The format object is identified by a type key:
{"type": "format"}
The format object contains a format key, the value of which is one of:

tokens
token-lines
fast-tokens
fast-token-lines

e.g. {"type": "format", "format": "tokens"}
For the tokens and token-lines formats, each token is an object.
For the fast-tokens and fast-token-lines formats, each token is an array.
Token objects

A token object is identified by a type key:
{"type": "token"}
A token also has a token_type key and a text key. e.g.
{"type": "token", "token_type": "NUMBER", "text": "1.618"}
The tokens format

For the tokens format, the second item in the top-level array is an array of tokens.
Here is a complete example of the tokens JSON format:
[
    {"type": "format", "format": "tokens"},
    [
        {"type": "token", "token_type": "SYMBOL", "text": "pi"},
        {"type": "token", "token_type": "WSPACE", "text": " "},
        {"type": "token", "token_type": "NUMBER", "text": "3.14159"}
    ]
]

The above is a possible tokenization of the input string
pi 3.14159
using the lexical grammar
SYMBOL
[a-z]+
NUMBER
[0-9]+(\.[0-9]+)?
WSPACE
\s+

The token-lines format

The token-lines format differs in that the second item in the top-level array
is a two-dimentional array, i.e. a list of lines of tokens.
For the input file:
pi 3.14159
phi 1.618

a possible tokenization (in token-lines format) might be:
[
    {"type": "format", "format": "token-lines"},
    [
        [
            {"type": "token", "token_type": "SYMBOL", "text": "pi"},
            {"type": "token", "token_type": "WSPACE", "text": " "},
            {"type": "token", "token_type": "NUMBER", "text": "3.14159"}
            {"type": "token", "token_type": "WSPACE", "text": "\n"},
        ],
        [
            {"type": "token", "token_type": "SYMBOL", "text": "phi"},
            {"type": "token", "token_type": "WSPACE", "text": " "},
            {"type": "token", "token_type": "NUMBER", "text": "1.618"}
            {"type": "token", "token_type": "WSPACE", "text": "\n"},
        ]
    ]
]

Fast tokens

Fast tokens are arrays rather than objects.
The first item in the array is a numeric index into a list of token types,
and the second item is the token's text.  e.g.
[3, "foo"]
The fast-tokens format

The format object for the fast-tokens format also contains token_types array:
{"type": "format", "format": "fast-tokens", "token_types": ["SYMBOL", "NUMBER", "STRING"]}
The tokens format example from above would look like this in fast-tokens format:
[
    {"type": "format", "format": "tokens", "token_types": ["SYMBOL", "NUMBER", "WSPACE"]},
    [
        [0, "pi"],
        [2, " "],
        [1, "3.14159"]
    ]
]

The fast-token-lines format

The fast-token-lines format similarly structures the tokens in a list of lines.
The token-lines example from above would look like this in fast-token-lines format:
[
    {"type": "format", "format": "fast-token-lines", "token_types": ["SYMBOL", "NUMBER", "WSPACE"]},
    [
        [
            [0, "pi"],
            [2, " "],
            [1, "3.14159"],
            [2, "\n"]
        ],
        [
            [0, "phi"],
            [2, " "],
            [1, "1.618"],
            [2, "\n"]
        ]
    ]
]


Parser output formats

As with the lexer output, the top-level object is an array.
The array contains two items:

A format object describing which format is being used
The root-level parse tree node

Parse trees contain non-terminal objects and tokens.
The format object

The format object is identified by a type key:
{"type": "format"}
The format object contains a format key, the value of which is one of:

parsetree
fast-parsetree

e.g. {"type": "format", "format": "parsetree"}
For the parsetree format, non-terminals and tokens are objects.
For the fast-parsetree format, non-terminals and tokens are arrays.
Non-terminal objects

The inner nodes of the parse tree (the non-terminals) are identified by a type key:
{"type": "nonterminal"}
A non-terminal object also has a nonterminal_type key and a subnodes key, e.g.:
{"type": "nonterminal", "nonterminal_type": "expression", "subnodes": [...]}
The subnodes key has a value which is an array of nested non-terminal objects,
thus creating a tree structure.
The parsetree format

For the parsetree format, the second item in the top-level array
is the root-level parse tree object, in parsetree format.
Here is an complete example of the parsetree JSON format:
[
    {"type": "format", "format": "parsetree"},
    {"type": "nonterminal", "nonterminal_type": "expression", "subnodes": [
        {"type": "token", "token_type": "NUMBER", "text": "1"},
        {"type": "nonterminal", "nonterminal_type": "operator", "subnodes": [
            {"type": "token", "token_type": "PLUS", "text": "+"}
        ]},
        {"type": "token", "token_type": "NUMBER", "text": "1"}
    ]}
]

The above is a possible parse of the input string
1+1
using the parsing grammar
expression = NUMBER operator NUMBER ;
operator = PLUS | MINUS ;

and the lexical grammar
NUMBER
[0-9]+
PLUS
\+
MINUS
-

Fast non-terminals

Fast non-terminals are arrays rather than objects.
Each array contains three items:

A numeric index (which is always 0) into the list of parse tree types (which is ["nonterminal", "token"])
A numeric index into the list of non-terminal types (e.g. ["expression", "operator", ...])
An array of subnodes

e.g. [0, 3, [...]]
Fast terminals (tokens)

Parse tree fast tokens are the same as lexer fast tokens,
with the addition of a prepended index (which is always 1)
into the list of parse tree types (which is ["nonterminal", "token"]).
Thus, the fast lexer token [0, "phi"] becomes [1, 0, "phi"] in fast-parsetree format.
The fast-parsetree format

The format object for the fast-parsetree format is similar, but with the addition of several type arrays:

types, which is always ["nonterminal", "token"]
nonterminal_types, which is the list of non-terminal types, e.g. ["program", "expression", "atom"]
token_types, which is the same as above, e.g. ["SYMBOL", "NUMBER", "WSPACE"]

The parsetree format example from above would look like this in fast-parsetree format:
[
    {"type": "format", "format": "fast-parsetree",
        "types": ["nonterminal", "token"],
        "nonterminal_types": ["expression", "operator"],
        "token_types": ["NUMBER", "PLUS", "MINUS"]
    },
    [0, 0, [
        [1, 0, "1"],
        [0, 1, [
            [1, 1, "+"]
        ]],
        [1, 0, "1"]
    ]]
]