Blog 2019/9/7
Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.
We'll start by writing a lexer which can recognize integers and Lisp-like lists of integers, e.g. 42, (1 2 3), (1 2 (11 12) 3 4), etc.
Declare your token definitions in a text file, where each pair of lines names a token type and defines the regex pattern used to recognize that type of token.
We'll need to recognize integers, opening parentheses, closing parentheses, and whitespace.
tokendefs.txt:
INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+
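Each pattern is an ordinary Python regular expression. Two details matter here, and a quick throwaway check (a sketch, not part of the lexer itself) confirms both: a literal paren must be escaped as \( because a bare ( opens a group, and match() only matches at the very start of a string:

import re

# A literal paren must be escaped, since '(' normally opens a group:
assert re.match(r'\(', "(1 2 3)").group() == "("

# match() is anchored at the start of the string:
assert re.match(r'\d+', "42 ").group() == "42"

# \d+ finds nothing here, because the string starts with a space:
assert re.match(r'\d+', " 17") is None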
Write a Python function to load in the token definitions:
lex.py:
def load_tokendefs(fpath):
    # Read (token name, compiled regex) pairs from the definitions file.
    # Assumes the file holds an even number of lines: name, pattern, name, pattern, ...
    import re
    tdefs = []
    with open(fpath) as f:
        for line in f:
            token_name = line.rstrip('\n')
            # The pattern for this token is on the following line.
            line2 = next(f)
            pattern = line2.rstrip('\n')
            regex = re.compile(pattern)
            pair = (token_name, regex)
            tdefs.append(pair)
    return tdefs
Try it out:
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> import pprint
>>> pprint.pprint(tdefs)
[('INT', <_sre.SRE_Pattern object at 0x10a88a458>),
('OPAREN', <_sre.SRE_Pattern object at 0x10a7ed1c8>),
('CPAREN', <_sre.SRE_Pattern object at 0x10a7ed3e8>),
('WSPACE', <_sre.SRE_Pattern object at 0x10a88a4f0>)]
>>>
A regex-based lexer loops over its table of regexes until it finds one which matches at the start of the remaining input, then emits a token and repeats on the rest. If it reaches the end of its table without finding a match, it raises an exception.
Add a function next_token to lex.py which takes a table of token definitions and some input text, and returns the next token found in the text (we'll define a "token" as a tuple of the token name and the matched text).
def next_token(tdefs, text):
    # Return the first (token_name, matched_text) whose regex matches
    # at the start of the text.
    for (token_name, regex) in tdefs:
        m = regex.match(text)
        if m is None:
            continue
        matched_text = m.group()
        token = (token_name, matched_text)
        return token
    raise Exception("Don't know how to lex '%s'" % text[:16])
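One consequence of this first-match policy is that the order of entries in tokendefs.txt matters. Our current patterns never overlap, but consider a hypothetical SYMBOL token (not part of this step's definitions): an earlier pattern would shadow a later one.

import re
from lex import next_token

# Hypothetical: \w+ also matches digits, so a SYMBOL entry listed
# before INT would claim "42" for itself.
tdefs = [("SYMBOL", re.compile(r'\w+')),
         ("INT", re.compile(r'\d+'))]
assert next_token(tdefs, "42 ") == ("SYMBOL", "42")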
Also write a test, and try it out:
def _test_next_token():
    tdefs = load_tokendefs("tokendefs.txt")
    token = next_token(tdefs, "42 ")
    assert token == ("INT", "42")
    token = next_token(tdefs, " 17")
    assert token == ("WSPACE", " ")
    token = next_token(tdefs, "(1 2 3)")
    assert token == ("OPAREN", "(")
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test_next_token()
>>>
We didn't see any exceptions, so the test worked.
We can also try out the exception by feeding the lexer some bad input:
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> next_token(tdefs, "foo")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lex.py", line 23, in next_token
raise Exception("Don't know how to lex '%s'" % text[:16])
Exception: Don't know how to lex 'foo'
Now write a function tokenize which tokenizes an entire string by repeatedly calling next_token.
def tokenize(tdefs, text):
    # Lex the entire input by repeatedly taking the next token off the front.
    tokens = []
    while len(text) > 0:
        token = next_token(tdefs, text)
        tokens.append(token)
        matched_text = token[1]
        text = text[len(matched_text):]
    return tokens
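One design note: text = text[len(matched_text):] copies the remaining input after every token, which is quadratic on large inputs. That's fine for this series, but for the curious, here is an index-based sketch (not used in later steps) which relies on the fact that a compiled pattern's match() accepts a start position:

def tokenize_at(tdefs, text):
    # Same behavior as tokenize, but advances an index instead of slicing.
    tokens = []
    pos = 0
    while pos < len(text):
        for (token_name, regex) in tdefs:
            m = regex.match(text, pos)  # try to match starting at pos
            if m is not None:
                tokens.append((token_name, m.group()))
                pos = m.end()
                break
        else:
            raise Exception("Don't know how to lex '%s'" % text[pos:pos+16])
    return tokens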
Try it out:
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> tokens = tokenize(tdefs, "(1 2 3)")
>>> import pprint
>>> pprint.pprint(tokens)
[('OPAREN', '('),
('INT', '1'),
('WSPACE', ' '),
('INT', '2'),
('WSPACE', ' '),
('INT', '3'),
('CPAREN', ')')]
If it runs into something it can't tokenize, it raises the exception from next_token:
>>> tokenize(tdefs, "(1 2 zebra)")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lex.py", line 37, in tokenize
token = next_token(tdefs, text)
File "lex.py", line 23, in next_token
raise Exception("Don't know how to lex '%s'" % text[:16])
Exception: Don't know how to lex 'zebra)'
Now write up a test:
def _test_tokenize():
    tdefs = load_tokendefs("tokendefs.txt")
    tokens = tokenize(tdefs, "(1 42 65536)")
    assert tokens == [
        ("OPAREN", "("),
        ("INT", "1"),
        ("WSPACE", " "),
        ("INT", "42"),
        ("WSPACE", " "),
        ("INT", "65536"),
        ("CPAREN", ")")
    ]
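We could pin down the failure mode too. Here's a sketch of such a test (if you add it to lex.py, remember to call it from _test below):

def _test_tokenize_raises():
    tdefs = load_tokendefs("tokendefs.txt")
    raised = False
    try:
        tokenize(tdefs, "(1 2 zebra)")
    except Exception:
        raised = True
    assert raised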
Add a function to run all of the tests:
def _test():
    _test_next_token()
    _test_tokenize()
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test()
>>>
Add a __main__ implementation which will tokenize a file, or tokenize from stdin.
if __name__ == "__main__":
    tdefs = load_tokendefs("tokendefs.txt")
    import sys
    if len(sys.argv) > 1:
        # Tokenize the file named on the command line.
        fname = sys.argv[-1]
        text = open(fname).read()
    else:
        # No filename given: tokenize stdin.
        text = sys.stdin.read()
    tokens = tokenize(tdefs, text)
    import pprint
    pprint.pprint(tokens)
$ echo '(1 2 3)' | python lex.py
[('OPAREN', '('),
('INT', '1'),
('WSPACE', ' '),
('INT', '2'),
('WSPACE', ' '),
('INT', '3'),
('CPAREN', ')'),
('WSPACE', '\n')]
$ echo '(1 2 3)' > foo.txt
$ python lex.py foo.txt
[('OPAREN', '('),
('INT', '1'),
('WSPACE', ' '),
('INT', '2'),
('WSPACE', ' '),
('INT', '3'),
('CPAREN', ')'),
('WSPACE', '\n')]
Continued in Step 1b.