@cellularmitosis
Last active April 15, 2020 03:13
Blog 2019/9/7

<- previous | index | next ->

A Lisp-like interpreter (in Python), step 2: recursive descent parsing

Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.

Step 2: Recursive descent parsing

Time to parse our tokens into an AST (Abstract Syntax Tree)!

We'll start out with a very simple grammar:

SEXPR = INT | LIST
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN

(A SEXPR is an s-expression, short for symbolic expression.)

The above bit of EBNF says:

  • a SEXPR is made up of either an INT or a LIST
  • a LIST is either an empty list, or a populated list, where
    • an empty list is an OPAREN followed by a CPAREN, and
    • a populated list is an OPAREN, a SEXPR, zero or more pairs of WSPACE and SEXPR, followed by a CPAREN.

The parsing technique we will use is called recursive descent.

In this technique, each of the production rules and terminals in the EBNF grammar is implemented as a function, and these functions may call each other recursively.

Because these functions will call each other recursively, they need to share the same function signature. All of these functions will:

  • take one argument: the list of remaining tokens
  • return the resulting AST node and the list of remaining tokens, or (None, None) if the function did not match.

The general pattern will be something like the following pseudocode:

def parse_x(tokens):
    if able_to_parse:
        return (ast, remaining_tokens)
    else:
        return (None, None)
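As a concrete (hypothetical) illustration of this convention, here is a parser which isn't part of our grammar: it matches any single token, whatever its type.

```python
# Hypothetical example (not one of our grammar's parsers): a parser
# which consumes any one token, whatever its type.
def parse_any_token(tokens):
    if len(tokens) == 0:
        return (None, None)          # no match: nothing left to consume
    return (tokens[0], tokens[1:])   # match: AST node plus leftover tokens

print(parse_any_token([("INT", "42"), ("WSPACE", " ")]))
# -> (('INT', '42'), [('WSPACE', ' ')])
print(parse_any_token([]))
# -> (None, None)
```

Every parser we write below will follow this same shape: inspect the front of the token list, and either consume from it or signal "no match".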

Start by writing some stubs:

parse.py:

from lex import load_tokendefs, tokenize

def parse_int(tokens):
    return (None, None)

def parse_oparen(tokens):
    return (None, None)

def parse_cparen(tokens):
    return (None, None)

def parse_wspace(tokens):
    return (None, None)

def parse_list(tokens):
    return (None, None)

def parse_sexpr(tokens):
    return (None, None)

(Note: do not name this file parser.py, as that will conflict with Python's parser module.)

Parsing terminals

We'll start with the easy part: the terminals. We just need to check if the token type is correct.

Here's an implementation of parse_int:

def parse_int(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if token[0] == "INT":
        return (token, tokens[1:])
    else:
        return (None, None)

If the next token is an INT, return an AST node (here, we'll use the token itself as an AST node) along with the tokens which are left over after we consume the INT token. Otherwise, return (None, None).

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> from parse import *
>>> tokens = [("INT", "42")]
>>> parse_int(tokens)
(('INT', '42'), [])

For clarity, let's extract out a helper function, is_toktype:

def is_toktype(toktype, token):
    return token[0] == toktype

This makes parse_int a bit clearer:

def parse_int(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("INT", token):
        return (token, tokens[1:])
    else:
        return (None, None)

Let's implement parse_oparen and parse_cparen:

def parse_oparen(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("OPAREN", token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_cparen(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("CPAREN", token):
        return (token, tokens[1:])
    else:
        return (None, None)

Hmm, these terminal-parsing functions are nearly identical. Let's factor out a helper function, parse_terminal:

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_int(tokens):
    return parse_terminal(tokens, "INT")

def parse_oparen(tokens):
    return parse_terminal(tokens, "OPAREN")

def parse_cparen(tokens):
    return parse_terminal(tokens, "CPAREN")

def parse_wspace(tokens):
    return parse_terminal(tokens, "WSPACE")

The above refactor will save us some work in the future if we need to add more terminal parsing functions or change their implementation.
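For example, if a later step were to add a (hypothetical) SYMBOL token type, its terminal parser would cost just one line. A sketch, reusing the helpers defined above:

```python
def is_toktype(toktype, token):
    return token[0] == toktype

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

# Hypothetical: a SYMBOL terminal (not in this step's grammar) is a
# one-liner once parse_terminal exists.
def parse_symbol(tokens):
    return parse_terminal(tokens, "SYMBOL")

print(parse_symbol([("SYMBOL", "foo"), ("CPAREN", ")")]))
# -> (('SYMBOL', 'foo'), [('CPAREN', ')')])
```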

Let's add a test:

def _test_parse_int():
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_int(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_int()
>>> 

Parsing non-terminals

Implementing production rules involves coming up with a coding technique for each of the features of EBNF:

  • concatenation: FOO BAR BAZ
    • all of parse_foo, parse_bar and parse_baz must succeed in-order
  • alternatives: FOO | BAR | BAZ
    • just one of parse_foo, parse_bar, parse_baz must succeed
  • repetition: FOO { WSPACE FOO }
    • parse_foo followed by a while loop of calls to parse_wspace and parse_foo
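These three techniques can also be expressed as generic helpers sharing our (tokens) -> (ast, remaining) signature. This is an optional generalization, sketched here for illustration only; the parser in this article spells each rule out by hand instead.

```python
def alt(*parsers):
    # alternatives (FOO | BAR): return the first matching parser's result.
    def parse(tokens):
        for p in parsers:
            (ast, remaining) = p(tokens)
            if ast is not None:
                return (ast, remaining)
        return (None, None)
    return parse

def seq(*parsers):
    # concatenation (FOO BAR): every parser must match, in order.
    def parse(tokens):
        asts = []
        remaining = tokens
        for p in parsers:
            (ast, remaining) = p(remaining)
            if ast is None:
                return (None, None)
            asts.append(ast)
        return (asts, remaining)
    return parse

def terminal(toktype):
    # a terminal parser, parameterized by token type.
    def parse(tokens):
        if len(tokens) > 0 and tokens[0][0] == toktype:
            return (tokens[0], tokens[1:])
        return (None, None)
    return parse

# e.g. the empty-list rule "OPAREN CPAREN" becomes:
parse_empty_list = seq(terminal("OPAREN"), terminal("CPAREN"))
print(parse_empty_list([("OPAREN", "("), ("CPAREN", ")")]))
# -> ([('OPAREN', '('), ('CPAREN', ')')], [])
```

(Repetition falls out as a while loop around a seq, which is exactly what parse_list does below.)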

First, we'll implement the simpler of the two production-rules: SEXPR = INT | LIST

def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        (ast, remaining) = parse_list(tokens)
        if ast is not None:
            return (ast, remaining)
        else:
            return (None, None)

This is pretty simple:

  • try to parse an int
  • if that worked, return the AST node and the remaining tokens
  • otherwise, try to parse a list
  • if that worked, return the AST node and the remaining tokens
  • otherwise, return (None, None)

However, the final if/else clause is a bit redundant. If ast fails the "is not None" check, then ast is None, because parse_list returned (None, None). So in either case, we want to return exactly what parse_list returned to us, i.e.

    else:
        (ast, remaining) = parse_list(tokens)
        if ast is not None:
            return (ast, remaining)
        else:
            return (ast, remaining)

This means we can replace the entire else clause by delegating to parse_list:

def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)

(Note that because we are delegating to one of either parse_int or parse_list, we won't actually see a "SEXPR" node anywhere in the resulting AST.)

Parsing lists

The other non-terminal function is a bit of a doozy.

Recall our grammar:

LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN

Here's the implementation:

def parse_list(tokens):
    # we have to have at least two tokens.
    if len(tokens) < 2:
        return (None, None)
    # the first token must be an OPAREN.
    (ast, remaining) = parse_oparen(tokens)
    if ast is None:
        return (None, None)
    # try to match the empty list.
    (ast, remaining2) = parse_cparen(remaining)
    if ast is not None:
        ast2 = ("LIST", [])
        return (ast2, remaining2)
    # try to match a populated list.
    (sexpr, remaining) = parse_sexpr(remaining)
    if sexpr is None:
        return (None, None)
    children = [sexpr]
    while len(remaining) > 0:
        # CPAREN signals the end of the list.
        (ast, remaining2) = parse_cparen(remaining)
        if ast is not None:
            remaining = remaining2
            break
        if len(remaining) == 1:
            # only one token left and it wasn't a CPAREN.  bad parse.
            return (None, None)
        # otherwise, there should be a space and another sexpr.
        (ast, remaining) = parse_wspace(remaining)
        if ast is None:
            return (None, None)
        (ast, remaining) = parse_sexpr(remaining)
        if ast is None:
            return (None, None)
        children.append(ast)
    ast = ("LIST", children)
    return (ast, remaining)

See if you can follow the code by reading the comments.

The AST node which will be returned will be ("LIST", children), where children is a list of child AST nodes.

(Note that while constructing the LIST AST node, we discard any WSPACE tokens.)

Here are some examples to illustrate what LIST nodes look like:

()        -> ("LIST", [])
(1)       -> ("LIST", [("INT", "1")])
(1 2)     -> ("LIST", [("INT", "1"), ("INT", "2")])
(())      -> ("LIST", [("LIST", [])])
(1 (2 3)) -> ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])
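One way to sanity-check these shapes is to "unparse" an AST node back into source text. This little round-trip helper is just for illustration; it isn't part of the article's parser:

```python
def unparse(ast):
    # turn an AST node back into source text.
    (nodetype, value) = ast
    if nodetype == "INT":
        return value                 # an INT node carries its lexeme
    elif nodetype == "LIST":
        # a LIST node carries a list of child AST nodes.
        return "(" + " ".join(unparse(child) for child in value) + ")"
    else:
        raise ValueError("unknown AST node type: %s" % nodetype)

print(unparse(("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])))
# -> (1 (2 3))
```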

Let's write a test:

def _test_parse_list():
    # parse a list from "() "
    tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [])
    assert remaining == [("WSPACE", " ")]
    # parse a list from "(42) "
    tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "42")])
    assert remaining == [("WSPACE", " ")]
    # parse "(1 2) "
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
    # parse "(()) "
    tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("LIST", [])])
    assert remaining == [("WSPACE", " ")]

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_list()
>>> 

and now that we have implemented parse_list, we can also run (and test) parse_sexpr:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from parse import *
>>> tokens = [("INT", 1), ("WSPACE", " ")]
>>> parse_sexpr(tokens)
(('INT', 1), [('WSPACE', ' ')])
>>> 

and add a test:

def _test_parse_sexpr():
    # parse an int
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
    # parse a list
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_sexpr()
>>> 

Wrapping up

We'll wrap up the parser by implementing a single function which serves as its interface: parse:

def parse(tokens):
    (ast, remaining_tokens) = parse_sexpr(tokens)
    if ast is None:
        raise Exception("Failed to parse! %s" % tokens)
    if len(remaining_tokens) > 0:
        raise Exception("Leftover tokens! %s" % remaining_tokens)
    return ast

For now, this is pretty simple: try to parse a SEXPR and ensure there aren't any leftover tokens (which would indicate a failed parse).

Write a test:

def _test_parse():
    tdefs = load_tokendefs("tokendefs.txt")
    ast = parse(tokenize(tdefs, "1"))
    assert ast == ("INT", "1")
    ast = parse(tokenize(tdefs, "()"))
    assert ast == ("LIST", [])
    ast = parse(tokenize(tdefs, "(1 2 3)"))
    assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
    ast = parse(tokenize(tdefs, "(1 (2 3))"))
    assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])

and write a function which runs all of the tests:

def _test():
    _test_parse_int()
    _test_parse_list()
    _test_parse_sexpr()
    _test_parse()

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test()
>>> 

and finish off with a __main__ implementation:

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = open(sys.argv[-1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    import pprint
    pprint.pprint(parse(tokenize(tdefs, text)))

We can now compare the behavior of lex.py and parse.py:

$ echo -n '(1 2 3)' | python lex.py
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')')]
$ echo -n '(1 2 3)' | python parse.py
('LIST', [('INT', '1'), ('INT', '2'), ('INT', '3')])

Continued in Step 3.

lex.py:

def load_tokendefs(fpath):
    import re
    tdefs = []
    with open(fpath) as f:
        for line in f:
            token_name = line.rstrip('\n')
            line2 = f.next()
            pattern = line2.rstrip('\n')
            regex = re.compile(pattern)
            pair = (token_name, regex)
            tdefs.append(pair)
    return tdefs

def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])

def _test_next_token():
    tdefs = load_tokendefs("tokendefs.txt")
    token = next_token(tdefs, "42 ")
    assert token == ("INT", "42")
    token = next_token(tdefs, " 17")
    assert token == ("WSPACE", " ")
    token = next_token(tdefs, "(1 2 3)")
    assert token == ("OPAREN", "(")

def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens

def _test_tokenize():
    tdefs = load_tokendefs("tokendefs.txt")
    tokens = tokenize(tdefs, "(1 42 65536)")
    assert tokens == [
        ("OPAREN", "("),
        ("INT", "1"),
        ("WSPACE", " "),
        ("INT", "42"),
        ("WSPACE", " "),
        ("INT", "65536"),
        ("CPAREN", ")")
    ]

def _test():
    _test_next_token()
    _test_tokenize()

if __name__ == "__main__":
    tdefs = load_tokendefs("tokendefs.txt")
    import sys
    if len(sys.argv) > 1:
        fname = sys.argv[-1]
        text = open(fname).read()
    else:
        text = sys.stdin.read()
    tokens = tokenize(tdefs, text)
    import pprint
    pprint.pprint(tokens)
parse.py:

from lex import load_tokendefs, tokenize

# our grammar:
"""
SEXPR = INT | LIST
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN
"""

def is_toktype(toktype, token):
    return token[0] == toktype

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_int(tokens):
    return parse_terminal(tokens, "INT")

def parse_wspace(tokens):
    return parse_terminal(tokens, "WSPACE")

def parse_oparen(tokens):
    return parse_terminal(tokens, "OPAREN")

def parse_cparen(tokens):
    return parse_terminal(tokens, "CPAREN")

def _test_parse_int():
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_int(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]

def parse_list(tokens):
    # we have to have at least two tokens.
    if len(tokens) < 2:
        return (None, None)
    # the first token must be an OPAREN.
    (ast, remaining) = parse_oparen(tokens)
    if ast is None:
        return (None, None)
    # try to match the empty list.
    (ast, remaining2) = parse_cparen(remaining)
    if ast is not None:
        ast2 = ("LIST", [])
        return (ast2, remaining2)
    # try to match a populated list.
    (sexpr, remaining) = parse_sexpr(remaining)
    if sexpr is None:
        return (None, None)
    children = [sexpr]
    while len(remaining) > 0:
        # CPAREN signals the end of the list.
        (ast, remaining2) = parse_cparen(remaining)
        if ast is not None:
            remaining = remaining2
            break
        if len(remaining) == 1:
            # only one token left and it wasn't a CPAREN.  bad parse.
            return (None, None)
        # otherwise, there should be a space and another sexpr.
        (ast, remaining) = parse_wspace(remaining)
        if ast is None:
            return (None, None)
        (ast, remaining) = parse_sexpr(remaining)
        if ast is None:
            return (None, None)
        children.append(ast)
    ast = ("LIST", children)
    return (ast, remaining)

def _test_parse_list():
    # parse a list from "() "
    tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [])
    assert remaining == [("WSPACE", " ")]
    # parse a list from "(42) "
    tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "42")])
    assert remaining == [("WSPACE", " ")]
    # parse "(1 2) "
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
    # parse "(()) "
    tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("LIST", [])])
    assert remaining == [("WSPACE", " ")]

def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)

def _test_parse_sexpr():
    # parse an int
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
    # parse a list
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]

def parse(tokens):
    (ast, remaining_tokens) = parse_sexpr(tokens)
    if ast is None:
        raise Exception("Failed to parse! %s" % tokens)
    if len(remaining_tokens) > 0:
        raise Exception("Leftover tokens! %s" % remaining_tokens)
    return ast

def _test_parse():
    tdefs = load_tokendefs("tokendefs.txt")
    ast = parse(tokenize(tdefs, "1"))
    assert ast == ("INT", "1")
    ast = parse(tokenize(tdefs, "()"))
    assert ast == ("LIST", [])
    ast = parse(tokenize(tdefs, "(1 2 3)"))
    assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
    ast = parse(tokenize(tdefs, "(1 (2 3))"))
    assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])

def _test():
    _test_parse_int()
    _test_parse_list()
    _test_parse_sexpr()
    _test_parse()

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = open(sys.argv[-1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    import pprint
    pprint.pprint(parse(tokenize(tdefs, text)))
tokendefs.txt:

INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+