Blog 2019/9/7

<- previous | index | next ->

A Lisp-like interpreter (in Python), step 1: how a regex-based lexer (tokenizer) works

Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.

Step 1: How a regex-based lexer (tokenizer) works

We'll start by writing a lexer which can recognize integers and Lisp-like lists of integers, e.g. 42, (1 2 3), (1 2 (11 12) 3 4), etc.

Declare your token definitions in a text file, where each pair of lines names a token type and defines the regex pattern used to recognize that type of token.

We'll need to recognize integers, opening parentheses, closing parentheses, and whitespace.

tokendefs.txt:

INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+

Note that the parentheses in the OPAREN and CPAREN patterns have to be escaped with a backslash, because ( and ) are special characters in regex syntax.

Write a Python function to load the token definitions:

lex.py:

def load_tokendefs(fpath):
    import re
    tdefs = []
    with open(fpath) as f:
        for line in f:
            # Each token definition is a pair of lines: the token name,
            # then the regex pattern which recognizes that kind of token.
            token_name = line.rstrip('\n')
            line2 = next(f)
            pattern = line2.rstrip('\n')
            regex = re.compile(pattern)
            pair = (token_name, regex)
            tdefs.append(pair)
    return tdefs

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> import pprint
>>> pprint.pprint(tdefs)
[('INT', <_sre.SRE_Pattern object at 0x10a88a458>),
 ('OPAREN', <_sre.SRE_Pattern object at 0x10a7ed1c8>),
 ('CPAREN', <_sre.SRE_Pattern object at 0x10a7ed3e8>),
 ('WSPACE', <_sre.SRE_Pattern object at 0x10a88a4f0>)]
>>>

Read the next token

A regex-based lexer loops over its table of regexes until it finds one which matches the start of the remaining input, emits a token for that match, and repeats. If it reaches the end of its table without finding a match, it raises an exception.
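
The reason matching against the start of the input works is that Python's re "match" method only looks for a match at the very beginning of a string (unlike "search", which scans ahead for a match anywhere). A quick illustration (this snippet is just an aside, it isn't part of lex.py):

import re
regex = re.compile(r'\d+')
print(regex.match("42 17").group())  # 42 -- the match is anchored at the start of the string
print(regex.match("x42"))            # None -- no match at position 0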

Add a function next_token to lex.py which takes a table of token definitions and some input text, and returns the next token found at the start of that text (we'll define a "token" as a tuple of the token name and the matched text).

def next_token(tdefs, text):
    for (token_name, regex) in tdefs:
        # regex.match() only matches at the very start of text.
        m = regex.match(text)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    # None of the token definitions matched.
    raise Exception("Don't know how to lex '%s'" % text[:16])
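
Since next_token returns the first definition which matches, the order of the entries in tokendefs.txt matters whenever two patterns could match the same text. Our four patterns don't overlap, so order doesn't matter here, but as a hypothetical illustration (FLOAT is not part of this step's tokendefs.txt), a float pattern listed after INT would never win, because INT would claim the leading digits first (assuming from lex import * in the interpreter):

import re
# A hypothetical, mis-ordered table: INT listed before FLOAT.
tdefs = [
    ("INT", re.compile(r'\d+')),
    ("FLOAT", re.compile(r'\d+\.\d+')),
]
print(next_token(tdefs, "3.14"))  # ('INT', '3') -- FLOAT never gets a chance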

Also write a test, and try it out:

def _test_next_token():
    tdefs = load_tokendefs("tokendefs.txt")
    token = next_token(tdefs, "42 ")
    assert token == ("INT", "42")
    token = next_token(tdefs, " 17")
    assert token == ("WSPACE", " ")
    token = next_token(tdefs, "(1 2 3)")
    assert token == ("OPAREN", "(")

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test_next_token()
>>> 

We didn't see any exceptions, so the test worked.

We can also trigger the exception by feeding the lexer some bad input:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> next_token(tdefs, "foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lex.py", line 23, in next_token
    raise Exception("Don't know how to lex '%s'" % text[:16])
Exception: Don't know how to lex 'foo'

Tokenize an entire string

Now write a function tokenize which tokenizes an entire string by repeatedly calling next_token.

def tokenize(tdefs, text):
    tokens = []
    while len(text) > 0:
        token = next_token(tdefs, text)
        tokens.append(token)
        # Chop the matched text off of the front of the input and continue.
        # Every pattern matches at least one character, so we always make progress.
        matched_text = token[1]
        text = text[len(matched_text):]
    return tokens

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lex import *
>>> tdefs = load_tokendefs("tokendefs.txt")
>>> tokens = tokenize(tdefs, "(1 2 3)")
>>> import pprint
>>> pprint.pprint(tokens)
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')')]

If it runs into something it can't tokenize, it will raise the exception:

>>> tokenize(tdefs, "(1 2 zebra)")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lex.py", line 37, in tokenize
    token = next_token(tdefs, text)
  File "lex.py", line 23, in next_token
    raise Exception("Don't know how to lex '%s'" % text[:16])
Exception: Don't know how to lex 'zebra)'

Now write up a test:

def _test_tokenize():
    tdefs = load_tokendefs("tokendefs.txt")
    tokens = tokenize(tdefs, "(1 42   65536)")
    assert tokens == [
        ("OPAREN", "("),
        ("INT", "1"),
        ("WSPACE", " "),
        ("INT", "42"),
        ("WSPACE", "   "),
        ("INT", "65536"),
        ("CPAREN", ")")
    ]

Add a function to run all of the tests:

def _test():
    _test_next_token()
    _test_tokenize()

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test()
>>> 

Make it runnable

Add a __main__ block which tokenizes a file given on the command line, or else tokenizes stdin.

if __name__ == "__main__":
    tdefs = load_tokendefs("tokendefs.txt")
    import sys
    if len(sys.argv) > 1:
        # Tokenize the file named on the command line.
        fname = sys.argv[-1]
        text = open(fname).read()
    else:
        # No filename given, so tokenize stdin.
        text = sys.stdin.read()
    tokens = tokenize(tdefs, text)
    import pprint
    pprint.pprint(tokens)

$ echo '(1 2 3)' | python lex.py
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')'),
 ('WSPACE', '\n')]
$ echo '(1 2 3)' > foo.txt
$ python lex.py foo.txt
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')'),
 ('WSPACE', '\n')]
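
Notice that the WSPACE tokens show up in the output. Whitespace is needed to separate the integers while lexing, but a parser generally won't care about it, so a later step will probably want to discard those tokens. A minimal sketch of such a filter (the name strip_whitespace is made up here; the following steps may handle this differently):

def strip_whitespace(tokens):
    # Drop the WSPACE tokens, keeping everything else in order.
    return [token for token in tokens if token[0] != "WSPACE"]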

Continued in Step 1b.
