A Lisp-like interpreter (in Python), step 1b: a performance tweak
Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.
Step 1b: A performance tweak
Our implementation from step 1 actually has a performance issue, on this line of the tokenize() function:

```python
text = text[len(matched_text):]
```
Here, we are advancing the tokenizer by slicing off the length of the matched text. However, this creates a copy of the remaining text.
Consider an input file consisting of 1,000 OPARENs. Our lexer would:
- match a paren,
- advance by making a copy of the remaining 999 bytes of input,
- match a paren,
- advance by making a copy of the remaining 998 bytes of input,
- match a paren,
- advance by making a copy of the remaining 997 bytes of input,
- and so on...
All of this string copying will slow things down when processing large input files.
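To see the cost concretely, here's a small benchmark sketch (not from the original lexer; the two helper functions are invented for illustration) comparing the two ways of advancing through the input:

```python
import timeit

def advance_by_slicing(text):
    # step 1's approach: each advance copies the remainder of the string
    count = 0
    while len(text) > 0:
        text = text[1:]  # O(n) copy per step
        count += 1
    return count

def advance_by_offset(text):
    # step 1b's approach: move an index into the same string, never copy
    count, offset = 0, 0
    while (len(text) - offset) > 0:
        offset += 1  # O(1) per step
        count += 1
    return count

text = "(" * 20000
t_slice = timeit.timeit(lambda: advance_by_slicing(text), number=1)
t_offset = timeit.timeit(lambda: advance_by_offset(text), number=1)
print("slicing: %.4fs, offset: %.4fs" % (t_slice, t_offset))
```

The slicing version does quadratic work overall (roughly n²/2 bytes copied for n tokens), while the offset version does linear work.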
We can solve this performance problem by making a few tweaks to our code.
The key is that the match() method of Python's compiled regex objects accepts a position (pos) as an optional second argument, telling it where in the string to begin matching. This means that rather than making all of these string copies, we can keep a single string and track our offset into it.
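A quick illustration of that parameter (using a throwaway pattern, not the lexer's token definitions):

```python
import re

paren = re.compile(r"\(")
text = "(((("

# with no pos argument, match() starts at the beginning of the string
m = paren.match(text)
print(m.start(), m.group())  # matches at position 0

# passing pos=2 starts the match attempt at index 2 of the same
# string -- no slicing, no copy
m = paren.match(text, 2)
print(m.start(), m.group())  # matches at position 2
```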
As Armin Ronacher says:
This is immensely useful for building lexers because you can continue to use the special ^ symbol to indicate the beginning of a line or entire string. We just need to increase the index to match further. It also means we do not have to slice up the string ourselves which saves a ton of memory allocations and string copying in the process...
Here's the gist of the changes:
```diff
$ diff -urN step1/lex.py step1b/lex.py
--- step1/lex.py	2019-09-07 20:19:24.000000000 -0500
+++ step1b/lex.py	2019-09-07 20:15:07.000000000 -0500
@@ -11,9 +11,9 @@
         tdefs.append(pair)
     return tdefs
 
-def next_token(tdefs, text):
+def next_token(tdefs, text, offset=0):
     for (token_name, regex) in tdefs:
-        m = regex.match(text)
+        m = regex.match(text, offset)
         if m is None:
             continue
         else:
@@ -33,11 +33,12 @@
 
 def tokenize(tdefs, text):
     tokens = []
-    while len(text) > 0:
-        token = next_token(tdefs, text)
+    offset = 0
+    while (len(text) - offset) > 0:
+        token = next_token(tdefs, text, offset)
         tokens.append(token)
         matched_text = token[1]
-        text = text[len(matched_text):]
+        offset += len(matched_text)
     return tokens
 
 def _test_tokenize():
```
Update next_token() to accept an offset
We'll add an offset parameter to next_token(), and pass that offset along to the compiled pattern's match() method. We'll give offset a default value of zero, making it an optional parameter.
```python
def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])
```
Update tokenize() to track an offset instead of using string slicing
Track the offset by adding the length of the matched text to it after each token is generated, and update the
while predicate to take this into account:
```python
def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens
```
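Putting the two functions together, the offset-based lexer produces the same tokens as before. A self-contained sketch (the token definitions here are made up for illustration; step 1 builds them from a token table):

```python
import re

# hypothetical token definitions; in the real lexer these come from
# step 1's table of (name, pattern) pairs
tdefs = [
    ("OPAREN", re.compile(r"\(")),
    ("CPAREN", re.compile(r"\)")),
    ("SYMBOL", re.compile(r"[a-z]+")),
    ("WS", re.compile(r"\s+")),
]

def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])

def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens

print(tokenize(tdefs, "(foo bar)"))
# → [('OPAREN', '('), ('SYMBOL', 'foo'), ('WS', ' '),
#    ('SYMBOL', 'bar'), ('CPAREN', ')')]
```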
Re-run the tests
```
$ python
Python 2.7.16 (default, Mar 4 2019, 09:02:22)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lex
>>> lex._test()
>>>
```
All good :)
Continued in Step 2.