Blog 2020/6/17
<- previous | index | next ->
It seems difficult to find lexers and parsers which simply spit out JSON.
Because of this (and because it is fun), I've decided to work on a general-purpose lexer generator and parser generator. The generated lexers will spit out a list of tokens in JSON, and the generated parsers will spit out a parse tree in JSON.
Part of this process is creating a spec describing those JSON formats. That's what this blog post is about!
Table of contents:
The top-level object, in all cases, is an array.
The array contains two items:
- A format object describing which format is being used
- The tokens
The format object is identified by a type
key:
{"type": "format"}
The format object contains a format
key, the value of which is one of:
tokens
token-lines
fast-tokens
fast-token-lines
e.g. {"type": "format", "format": "tokens"}
For the tokens
and token-lines
formats, each token is an object.
For the fast-tokens
and fast-token-lines
formats, each token is an array.
A token object is identified by a type
key:
{"type": "token"}
A token also has a token_type
key and a text
key. e.g.
{"type": "token", "token_type": "NUMBER", "text": "1.618"}
For the tokens
format, the second item in the top-level array is an array of tokens.
Here is a complete example of the tokens
JSON format:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token_type": "SYMBOL", "text": "pi"},
{"type": "token", "token_type": "WSPACE", "text": " "},
{"type": "token", "token_type": "NUMBER", "text": "3.14159"}
]
]
The above is a possible tokenization of the input string
pi 3.14159
using the lexical grammar
SYMBOL
[a-z]+
NUMBER
[0-9]+(\.[0-9]+)?
WSPACE
\s+
The token-lines
format differs in that the second item in the top-level array
is a two-dimentional array, i.e. a list of lines of tokens.
For the input file:
pi 3.14159
phi 1.618
a possible tokenization (in token-lines
format) might be:
[
{"type": "format", "format": "token-lines"},
[
[
{"type": "token", "token_type": "SYMBOL", "text": "pi"},
{"type": "token", "token_type": "WSPACE", "text": " "},
{"type": "token", "token_type": "NUMBER", "text": "3.14159"}
{"type": "token", "token_type": "WSPACE", "text": "\n"},
],
[
{"type": "token", "token_type": "SYMBOL", "text": "phi"},
{"type": "token", "token_type": "WSPACE", "text": " "},
{"type": "token", "token_type": "NUMBER", "text": "1.618"}
{"type": "token", "token_type": "WSPACE", "text": "\n"},
]
]
]
Fast tokens are arrays rather than objects.
The first item in the array is a numeric index into a list of token types, and the second item is the token's text. e.g.
[3, "foo"]
The format
object for the fast-tokens
format also contains token_types
array:
{"type": "format", "format": "fast-tokens", "token_types": ["SYMBOL", "NUMBER", "STRING"]}
The tokens
format example from above would look like this in fast-tokens
format:
[
{"type": "format", "format": "tokens", "token_types": ["SYMBOL", "NUMBER", "WSPACE"]},
[
[0, "pi"],
[2, " "],
[1, "3.14159"]
]
]
The fast-token-lines
format similarly structures the tokens in a list of lines.
The token-lines
example from above would look like this in fast-token-lines
format:
[
{"type": "format", "format": "fast-token-lines", "token_types": ["SYMBOL", "NUMBER", "WSPACE"]},
[
[
[0, "pi"],
[2, " "],
[1, "3.14159"],
[2, "\n"]
],
[
[0, "phi"],
[2, " "],
[1, "1.618"],
[2, "\n"]
]
]
]
As with the lexer output, the top-level object is an array.
The array contains two items:
- A format object describing which format is being used
- The root-level parse tree node
Parse trees contain non-terminal objects and tokens.
The format object is identified by a type
key:
{"type": "format"}
The format object contains a format
key, the value of which is one of:
parsetree
fast-parsetree
e.g. {"type": "format", "format": "parsetree"}
For the parsetree
format, non-terminals and tokens are objects.
For the fast-parsetree
format, non-terminals and tokens are arrays.
The inner nodes of the parse tree (the non-terminals) are identified by a type
key:
{"type": "nonterminal"}
A non-terminal object also has a nonterminal_type
key and a subnodes
key, e.g.:
{"type": "nonterminal", "nonterminal_type": "expression", "subnodes": [...]}
The subnodes
key has a value which is an array of nested non-terminal objects,
thus creating a tree structure.
For the parsetree
format, the second item in the top-level array
is the root-level parse tree object, in parsetree
format.
Here is an complete example of the parsetree
JSON format:
[
{"type": "format", "format": "parsetree"},
{"type": "nonterminal", "nonterminal_type": "expression", "subnodes": [
{"type": "token", "token_type": "NUMBER", "text": "1"},
{"type": "nonterminal", "nonterminal_type": "operator", "subnodes": [
{"type": "token", "token_type": "PLUS", "text": "+"}
]},
{"type": "token", "token_type": "NUMBER", "text": "1"}
]}
]
The above is a possible parse of the input string
1+1
using the parsing grammar
expression = NUMBER operator NUMBER ;
operator = PLUS | MINUS ;
and the lexical grammar
NUMBER
[0-9]+
PLUS
\+
MINUS
-
Fast non-terminals are arrays rather than objects.
Each array contains three items:
- A numeric index (which is always
0
) into the list of parse tree types (which is["nonterminal", "token"]
) - A numeric index into the list of non-terminal types (e.g.
["expression", "operator", ...]
) - An array of subnodes
e.g. [0, 3, [...]]
Parse tree fast tokens are the same as lexer fast tokens,
with the addition of a prepended index (which is always 1
)
into the list of parse tree types (which is ["nonterminal", "token"]
).
Thus, the fast lexer token [0, "phi"]
becomes [1, 0, "phi"]
in fast-parsetree
format.
The format
object for the fast-parsetree
format is similar, but with the addition of several type arrays:
types
, which is always["nonterminal", "token"]
nonterminal_types
, which is the list of non-terminal types, e.g.["program", "expression", "atom"]
token_types
, which is the same as above, e.g.["SYMBOL", "NUMBER", "WSPACE"]
The parsetree
format example from above would look like this in fast-parsetree
format:
[
{"type": "format", "format": "fast-parsetree",
"types": ["nonterminal", "token"],
"nonterminal_types": ["expression", "operator"],
"token_types": ["NUMBER", "PLUS", "MINUS"]
},
[0, 0, [
[1, 0, "1"],
[0, 1, [
[1, 1, "+"]
]],
[1, 0, "1"]
]]
]