Blog 2019/9/7

<- previous | index | next ->

A Lisp-like interpreter (in Python), step 1: how a regex-based lexer (tokenizer) works

Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.

Step 1: How a regex-based lexer (tokenizer) works

We'll start by writing a lexer which can recognize integers and Lisp-like lists of integers, e.g. 42, (1 2 3), (1 2 (11 12) 3 4), etc.

Declare your token definitions in a text file, where each pair of lines names a token type and defines the regex pattern used to recognize that type of token.

We'll need to recognize integers, opening parentheses, closing parentheses, and whitespace.

tokendefs.txt:

INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+

Note that the parentheses in the OPAREN and CPAREN patterns have to be escaped with a backslash, because ( and ) are special characters in regex syntax.

Write a Python function to load the token definitions:

lex.py:

def load_tokendefs(fpath):
    import re
    tdefs = []
    with open(fpath) as f:
        for line in f:
            # Each token definition is a pair of lines: the token name,
            # then the regex pattern which recognizes that kind of token.
            token_name = line.rstrip('\n')
            line2 = next(f)
            pattern = line2.rstrip('\n')
            regex = re.compile(pattern)
            pair = (token_name, regex)
            tdefs.append(pair)
    return tdefs

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> import pprint
>>> pprint.pprint(tdefs)
[('INT', <_sre.SRE_Pattern object at 0x10a88a458>),
 ('OPAREN', <_sre.SRE_Pattern object at 0x10a7ed1c8>),
 ('CPAREN', <_sre.SRE_Pattern object at 0x10a7ed3e8>),
 ('WSPACE', <_sre.SRE_Pattern object at 0x10a88a4f0>)]
>>>

Read the next token

A regex-based lexer loops over its table of regexes until it finds one which matches the start of the remaining input, emits a token for that match, and repeats. If it reaches the end of its table without finding a match, it raises an exception.
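
The reason matching against the start of the input works is that Python's re "match" method only looks for a match at the very beginning of a string (unlike "search", which scans ahead for a match anywhere). A quick illustration (this snippet is just an aside, it isn't part of lex.py):

import re
regex = re.compile(r'\d+')
print(regex.match("42 17").group())  # 42 -- the match is anchored at the start of the string
print(regex.match("x42"))            # None -- no match at position 0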

Add a function next_token to lex.py which takes a table of token definitions and some input text, and returns the next token found at the start of that text (we'll define a "token" as a tuple of the token name and the matched text).

def next_token(tdefs, text):
    for (token_name, regex) in tdefs:
        # regex.match() only matches at the very start of text.
        m = regex.match(text)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    # None of the token definitions matched.
    raise Exception("Don't know how to lex '%s'" % text[:16])
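
Since next_token returns the first definition which matches, the order of the entries in tokendefs.txt matters whenever two patterns could match the same text. Our four patterns don't overlap, so order doesn't matter here, but as a hypothetical illustration (FLOAT is not part of this step's tokendefs.txt), a float pattern listed after INT would never win, because INT would claim the leading digits first (assuming from lex import * in the interpreter):

import re
# A hypothetical, mis-ordered table: INT listed before FLOAT.
tdefs = [
    ("INT", re.compile(r'\d+')),
    ("FLOAT", re.compile(r'\d+\.\d+')),
]
print(next_token(tdefs, "3.14"))  # ('INT', '3') -- FLOAT never gets a chance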

Also write a test, and try it out:

def _test_next_token():
    tdefs = load_tokendefs("tokendefs.txt")
    token = next_token(tdefs, "42 ")
    assert token == ("INT", "42")
    token = next_token(tdefs, " 17")
    assert token == ("WSPACE", " ")
    token = next_token(tdefs, "(1 2 3)")
    assert token == ("OPAREN", "(")

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test_next_token()
>>> 

We didn't see any exceptions, so the test worked.

We can also trigger the exception by feeding the lexer some bad input:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> next_token(tdefs, "foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lex.py", line 23, in next_token
    raise Exception("Don't know how to lex '%s'" % text[:16])
Exception: Don't know how to lex 'foo'

Tokenize an entire string

Now write a function tokenize which tokenizes an entire string by repeatedly calling next_token.

def tokenize(tdefs, text):
    tokens = []
    while len(text) > 0:
        token = next_token(tdefs, text)
        tokens.append(token)
        # Chop the matched text off of the front of the input and continue.
        # Every pattern matches at least one character, so we always make progress.
        matched_text = token[1]
        text = text[len(matched_text):]
    return tokens

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> tokens = tokenize(tdefs, "(1 2 3)")
>>> import pprint
>>> pprint.pprint(tokens)
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')')]

If it runs into something it can't tokenize, it will raise the exception:

>>> tokenize(tdefs, "(1 2 zebra)")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lex.py", line 37, in tokenize
    token = next_token(tdefs, text)
  File "lex.py", line 23, in next_token
    raise Exception("Don't know how to lex '%s'" % text[:16])
Exception: Don't know how to lex 'zebra)'

Now write up a test:

def _test_tokenize():
    tdefs = load_tokendefs("tokendefs.txt")
    tokens = tokenize(tdefs, "(1 42   65536)")
    assert tokens == [
        ("OPAREN", "("),
        ("INT", "1"),
        ("WSPACE", " "),
        ("INT", "42"),
        ("WSPACE", "   "),
        ("INT", "65536"),
        ("CPAREN", ")")
    ]

Add a function to run all of the tests:

def _test():
    _test_next_token()
    _test_tokenize()

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test()
>>> 

Make it runnable

Add a __main__ block which tokenizes a file given on the command line, or else tokenizes stdin.

if __name__ == "__main__":
    tdefs = load_tokendefs("tokendefs.txt")
    import sys
    if len(sys.argv) > 1:
        # Tokenize the file named on the command line.
        fname = sys.argv[-1]
        text = open(fname).read()
    else:
        # No filename given, so tokenize stdin.
        text = sys.stdin.read()
    tokens = tokenize(tdefs, text)
    import pprint
    pprint.pprint(tokens)

$ echo '(1 2 3)' | python lex.py
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')'),
 ('WSPACE', '\n')]
$ echo '(1 2 3)' > foo.txt
$ python lex.py foo.txt
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')'),
 ('WSPACE', '\n')]
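
Notice that the WSPACE tokens show up in the output. Whitespace is needed to separate the integers while lexing, but a parser generally won't care about it, so a later step will probably want to discard those tokens. A minimal sketch of such a filter (the name strip_whitespace is made up here; the following steps may handle this differently):

def strip_whitespace(tokens):
    # Drop the WSPACE tokens, keeping everything else in order.
    return [token for token in tokens if token[0] != "WSPACE"]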

Continued in Step 1b.
