A Lisp-like interpreter (in Python), step 1b: a performance tweak
Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.
Step 1b: A performance tweak
Our implementation from step 1 actually has a performance issue, on this line of the tokenize() function:

```python
text = text[len(matched_text):]
```
Here, we are advancing the tokenizer by slicing off the length of the matched text. However, this creates a copy of the remaining text.
Consider an input file consisting of 1,000 OPARENs. Our lexer would:
- match a paren,
- advance by making a copy of the remaining 999 bytes of input,
- match a paren,
- advance by making a copy of the remaining 998 bytes of input,
- match a paren,
- advance by making a copy of the remaining 997 bytes of input,
- and so on...
All of this string copying will slow things down when processing large input files.
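To see the cost concretely, here's a small benchmark sketch (not from the original lexer; the two helper functions are invented for illustration) comparing the two ways of advancing through the input:

```python
import timeit

def advance_by_slicing(text):
    # step 1's approach: each advance copies the remainder of the string
    count = 0
    while len(text) > 0:
        text = text[1:]  # O(n) copy per step
        count += 1
    return count

def advance_by_offset(text):
    # step 1b's approach: move an index into the same string, never copy
    count, offset = 0, 0
    while (len(text) - offset) > 0:
        offset += 1  # O(1) per step
        count += 1
    return count

text = "(" * 20000
t_slice = timeit.timeit(lambda: advance_by_slicing(text), number=1)
t_offset = timeit.timeit(lambda: advance_by_offset(text), number=1)
print("slicing: %.4fs, offset: %.4fs" % (t_slice, t_offset))
```

The slicing version does quadratic work overall (roughly n²/2 bytes copied for n tokens), while the offset version does linear work.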
We can solve this performance problem by making a few tweaks to our code.
The key is that the match() method of Python's compiled regex objects accepts a position (pos) as an optional second argument, telling it where in the string to begin matching. This means that rather than making all of these string copies, we can keep a single string and track our offset into it.
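A quick illustration of that parameter (using a throwaway pattern, not the lexer's token definitions):

```python
import re

paren = re.compile(r"\(")
text = "(((("

# with no pos argument, match() starts at the beginning of the string
m = paren.match(text)
print(m.start(), m.group())  # matches at position 0

# passing pos=2 starts the match attempt at index 2 of the same
# string -- no slicing, no copy
m = paren.match(text, 2)
print(m.start(), m.group())  # matches at position 2
```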
As Armin Ronacher says:
This is immensely useful for building lexers because you can continue to use the special ^ symbol to indicate the beginning of a line or entire string. We just need to increase the index to match further. It also means we do not have to slice up the string ourselves which saves a ton of memory allocations and string copying in the process...
Here's the gist of the changes:
```diff
$ diff -urN step1/lex.py step1b/lex.py
--- step1/lex.py	2019-09-07 20:19:24.000000000 -0500
+++ step1b/lex.py	2019-09-07 20:15:07.000000000 -0500
@@ -11,9 +11,9 @@
         tdefs.append(pair)
     return tdefs
 
-def next_token(tdefs, text):
+def next_token(tdefs, text, offset=0):
     for (token_name, regex) in tdefs:
-        m = regex.match(text)
+        m = regex.match(text, offset)
         if m is None:
             continue
         else:
@@ -33,11 +33,12 @@
 
 def tokenize(tdefs, text):
     tokens = []
-    while len(text) > 0:
-        token = next_token(tdefs, text)
+    offset = 0
+    while (len(text) - offset) > 0:
+        token = next_token(tdefs, text, offset)
         tokens.append(token)
         matched_text = token[1]
-        text = text[len(matched_text):]
+        offset += len(matched_text)
     return tokens
 
 def _test_tokenize():
```
Update next_token() to accept an offset
We'll add an offset parameter to next_token(), and pass that offset along to the compiled pattern's match() method. We'll give offset a default value of zero, making it an optional parameter.
```python
def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])
```
Update tokenize() to track an offset instead of using string slicing
Track the offset by adding the length of the matched text to it after each token is generated, and update the
while predicate to take this into account:
```python
def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens
```
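Putting the two functions together, the offset-based lexer produces the same tokens as before. A self-contained sketch (the token definitions here are made up for illustration; step 1 builds them from a token table):

```python
import re

# hypothetical token definitions; in the real lexer these come from
# step 1's table of (name, pattern) pairs
tdefs = [
    ("OPAREN", re.compile(r"\(")),
    ("CPAREN", re.compile(r"\)")),
    ("SYMBOL", re.compile(r"[a-z]+")),
    ("WS", re.compile(r"\s+")),
]

def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])

def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens

print(tokenize(tdefs, "(foo bar)"))
# → [('OPAREN', '('), ('SYMBOL', 'foo'), ('WS', ' '),
#    ('SYMBOL', 'bar'), ('CPAREN', ')')]
```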
Re-run the tests
```
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test()
>>>
```
All good :)
Continued in Step 2.