Blog 2020/6/12
<- previous | index | next ->
In a previous post, I introduced a lexer generator.
In this post I'll describe a few changes:
- output formats: tokens, fast, tokens-lines and fast-lines
- pragmas: line-oriented, discard and eof
- lexical grammars now support empty lines and comments
I typed up a descriptive prose spec of mklexer.py, but I hated it. Instead I'll describe it by example.
If this is your token definitions file:
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
and this is your input:
pi=3.14159
then the lexer JSON output (in the default tokens format) will be:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"}
]
]
The top-level structure is still an array, but now the first item is a format object and the second item is the list of tokens.
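A downstream tool can consume this output with just a few lines of Python. The read_tokens helper below is hypothetical (it is not part of mklexer.py), but it shows the shape of the data:

```python
import json

def read_tokens(json_text):
    """Turn 'tokens'-format JSON output into (type, text) pairs."""
    fmt, tokens = json.loads(json_text)
    assert fmt["format"] == "tokens"
    return [(tok["token-type"], tok["text"]) for tok in tokens]

output = """
[
  {"type": "format", "format": "tokens"},
  [
    {"type": "token", "token-type": "SYMBOL", "text": "pi"},
    {"type": "token", "token-type": "ASSIGN", "text": "="},
    {"type": "token", "token-type": "NUMBER", "text": "3.14159"}
  ]
]
"""
print(read_tokens(output))
# [('SYMBOL', 'pi'), ('ASSIGN', '='), ('NUMBER', '3.14159')]
```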
If the lexer is invoked with the --fast command-line option, the above example would have this output:
[
{"type": "format", "format": "fast", "token-types": ["ASSIGN", "NUMBER", "SYMBOL"]},
[
[2, "pi"],
[0, "="],
[1, "3.14159"]
]
]
The format object now contains a token-types list of token type names. The tokens are now tuples, where the first item is a numeric index into the token-types list, and the second item is the token text.
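Expanding the fast format back into named tokens is a one-line lookup over the token-types list. Again, a hypothetical helper, not part of mklexer.py:

```python
import json

def read_fast(json_text):
    """Expand 'fast'-format output back into (type-name, text) pairs
    by looking each numeric index up in the token-types list."""
    fmt, tokens = json.loads(json_text)
    names = fmt["token-types"]
    return [(names[index], text) for index, text in tokens]

output = """
[
  {"type": "format", "format": "fast", "token-types": ["ASSIGN", "NUMBER", "SYMBOL"]},
  [
    [2, "pi"],
    [0, "="],
    [1, "3.14159"]
  ]
]
"""
print(read_fast(output))
# [('SYMBOL', 'pi'), ('ASSIGN', '='), ('NUMBER', '3.14159')]
```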
There are two additional formats, tokens-lines and fast-lines, which break the list of tokens up into individual lines. These formats are activated using the line-oriented pragma.
Lexical grammar:
#pragma line-oriented
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
NEWLINE
\n
Input text:
pi=3.14159
phi=1.618
or, more specifically:
pi=3.14159\nphi=1.618
JSON output in tokens-lines format:
[
{"type": "format", "format": "tokens-lines"},
[
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"}
],
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
]
]
]
and in fast-lines format:
[
{"type": "format", "format": "fast-lines", "token-types": ["ASSIGN", "NUMBER", "SYMBOL", "NEWLINE"]},
[
[
[2, "pi"],
[0, "="],
[1, "3.14159"]
],
[
[2, "phi"],
[0, "="],
[1, "1.618"]
]
]
]
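Decoding fast-lines is the same index lookup as fast, applied to each line's sublist. A hypothetical sketch:

```python
import json

def read_fast_lines(json_text):
    """Expand 'fast-lines' output into a list of lines, where each
    line is a list of (type-name, text) pairs."""
    fmt, lines = json.loads(json_text)
    names = fmt["token-types"]
    return [[(names[i], text) for i, text in line] for line in lines]

output = """
[
  {"type": "format", "format": "fast-lines", "token-types": ["ASSIGN", "NUMBER", "SYMBOL", "NEWLINE"]},
  [
    [[2, "pi"], [0, "="], [1, "3.14159"]],
    [[2, "phi"], [0, "="], [1, "1.618"]]
  ]
]
"""
print(read_fast_lines(output))
```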
The discard pragma lists token types which will be automatically discarded from the output.
Lexical grammar:
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
WSPACE
\s+
COMMENT
;.*
Input text:
phi = 1.618 ; the golden ratio
Output without the discard pragma:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "WSPACE", "text": " "},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "WSPACE", "text": " "},
{"type": "token", "token-type": "NUMBER", "text": "1.618"},
{"type": "token", "token-type": "WSPACE", "text": " "},
{"type": "token", "token-type": "COMMENT", "text": "; the golden ratio"}
]
]
Now, we use the discard pragma to get rid of whitespace and comments:
#pragma discard WSPACE COMMENT
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
WSPACE
\s+
COMMENT
;.*
and our output becomes:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
]
]
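Internally, discard can be as simple as not emitting matched tokens whose type is in the discard set. Here's a hypothetical sketch of that idea, using the rules from this example with first-match-wins semantics (mklexer.py's actual implementation may differ):

```python
import re

# Rules are tried in order at the current position; the first one
# that matches wins. Matches whose type is in DISCARD are consumed
# but never emitted.
RULES = [
    ("ASSIGN", re.compile(r"=")),
    ("NUMBER", re.compile(r"-?[0-9](\.[0-9]+)?")),
    ("SYMBOL", re.compile(r"[a-z]+")),
    ("WSPACE", re.compile(r"\s+")),
    ("COMMENT", re.compile(r";.*")),
]
DISCARD = {"WSPACE", "COMMENT"}

def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex in RULES:
            m = regex.match(text, pos)
            if m:
                if name not in DISCARD:
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return tokens

print(lex("phi = 1.618 ; the golden ratio"))
# [('SYMBOL', 'phi'), ('ASSIGN', '='), ('NUMBER', '1.618')]
```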
The lexical grammar now supports comments and empty lines.
Here's the grammar from our previous example, but with some spacing and comments:
# A lexical grammar for trivial assignment statements.
# our parser doesn't care about whitespace and comments
#pragma discard WSPACE COMMENT
# the assignment operator
ASSIGN
=
# integer and floating-point numbers
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
WSPACE
\s+
# comments extend to the end of the current line
COMMENT
;.*
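The grammar file format itself is simple enough to parse in a few lines. The sketch below is a reconstruction from the examples in this post, not mklexer.py's actual parser: #pragma lines set pragmas, other # lines are comments, blank lines are skipped, and the remaining lines come in token-name / regex pairs:

```python
def parse_grammar(text):
    """Parse a lexical grammar file into (pragmas, rules).

    Note: in this sketch a regex that itself begins with '#'
    would be misread as a comment.
    """
    pragmas = {}
    rules = []
    pending_name = None
    for line in text.splitlines():
        if line.startswith("#pragma"):
            # e.g. "#pragma discard WSPACE COMMENT"
            name, *args = line[len("#pragma"):].split()
            pragmas[name] = args
        elif line.startswith("#") or not line.strip():
            continue  # comment or blank line
        elif pending_name is None:
            pending_name = line  # token type name
        else:
            rules.append((pending_name, line))  # regex for that name
            pending_name = None
    return pragmas, rules

GRAMMAR = r"""
# integer and floating-point numbers
#pragma discard WSPACE
NUMBER
-?[0-9](\.[0-9]+)?
WSPACE
\s+
"""
print(parse_grammar(GRAMMAR))
```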
The eof pragma will append an 'EOF' token.
Lexical grammar:
#pragma eof
ASSIGN
=
NUMBER
-?[0-9](\.[0-9]+)?
SYMBOL
[a-z]+
Input text:
pi=3.14159
tokens:
[
{"type": "format", "format": "tokens"},
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"},
{"type": "token", "token-type": "EOF", "text": ""}
]
]
For the --fast format, it also adds EOF to token-types:
[
{"type": "format", "format": "fast", "token-types": ["ASSIGN", "NUMBER", "SYMBOL", "EOF"]},
[
[2, "pi"],
[0, "="],
[1, "3.14159"],
[3, ""]
]
]
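An EOF token lets a parser inspect the current token unconditionally, with no separate end-of-list check. A hypothetical parser sketch for the assignment example above:

```python
def parse_assignment(tokens):
    """Parse [SYMBOL, ASSIGN, NUMBER, EOF] into a (name, value) pair.
    Because the token list always ends with EOF, expect() can read
    tokens[pos] at every step without a bounds check."""
    pos = 0

    def expect(token_type):
        nonlocal pos
        name, text = tokens[pos]
        if name != token_type:
            raise SyntaxError("expected %s, got %s" % (token_type, name))
        pos += 1
        return text

    sym = expect("SYMBOL")
    expect("ASSIGN")
    num = expect("NUMBER")
    expect("EOF")
    return (sym, float(num))

print(parse_assignment(
    [("SYMBOL", "pi"), ("ASSIGN", "="), ("NUMBER", "3.14159"), ("EOF", "")]))
# ('pi', 3.14159)
```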
For line-oriented output, the EOF token will always appear on its own line. The earlier tokens-lines example would look like:
[
{"type": "format", "format": "tokens-lines"},
[
[
{"type": "token", "token-type": "SYMBOL", "text": "pi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "3.14159"}
],
[
{"type": "token", "token-type": "SYMBOL", "text": "phi"},
{"type": "token", "token-type": "ASSIGN", "text": "="},
{"type": "token", "token-type": "NUMBER", "text": "1.618"}
],
[
{"type": "token", "token-type": "EOF", "text": ""}
]
]
]