@cellularmitosis
Last active April 15, 2020 03:13
Blog 2019/9/7

<- previous | index | next ->

A Lisp-like interpreter (in Python), step 2: recursive descent parsing

Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.

Step 2: Recursive descent parsing

Time to parse our tokens into an AST (Abstract Syntax Tree)!

We'll start out with a very simple grammar:

SEXPR = INT | LIST
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN

(A SEXPR is an s-expression, short for symbolic expression.)

The above bit of EBNF says:

  • a SEXPR is made up of either an INT or a LIST
  • a LIST is either an empty list, or a populated list, where
    • an empty list is an OPAREN followed by a CPAREN, and
    • a populated list is an OPAREN, a SEXPR, zero or more pairs of WSPACE and SEXPR, followed by a CPAREN.

The parsing technique we will use is called recursive descent.

In this technique, each of the production rules and terminals in the EBNF grammar is implemented as a function, and these functions may call each other recursively.

Because these functions will call each other recursively, they need to share the same function signature. All of these functions will:

  • take one argument: the list of remaining tokens
  • return the resulting AST node and the list of remaining tokens, or (None, None) if the function did not match.

The general pattern will be something like the following pseudocode:

def parse_x(tokens):
    if able_to_parse:
        return (ast, remaining_tokens)
    else:
        return (None, None)
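As a concrete (hypothetical) illustration of this convention, here is a parser which isn't part of our grammar: it matches any single token, whatever its type.

```python
# Hypothetical example (not one of our grammar's parsers): a parser
# which consumes any one token, whatever its type.
def parse_any_token(tokens):
    if len(tokens) == 0:
        return (None, None)          # no match: nothing left to consume
    return (tokens[0], tokens[1:])   # match: AST node plus leftover tokens

print(parse_any_token([("INT", "42"), ("WSPACE", " ")]))
# -> (('INT', '42'), [('WSPACE', ' ')])
print(parse_any_token([]))
# -> (None, None)
```

Every parser we write below will follow this same shape: inspect the front of the token list, and either consume from it or signal "no match".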

Start by writing some stubs:

parse.py:

from lex import load_tokendefs, tokenize

def parse_int(tokens):
    return (None, None)

def parse_oparen(tokens):
    return (None, None)

def parse_cparen(tokens):
    return (None, None)

def parse_wspace(tokens):
    return (None, None)

def parse_list(tokens):
    return (None, None)

def parse_sexpr(tokens):
    return (None, None)

(Note: do not name this file parser.py, as that will conflict with Python's parser module.)

Parsing terminals

We'll start with the easy part: the terminals. We just need to check if the token type is correct.

Here's an implementation of parse_int:

def parse_int(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if token[0] == "INT":
        return (token, tokens[1:])
    else:
        return (None, None)

If the next token is an INT, return an AST node (here, we'll use the token itself as an AST node) along with the tokens which are left over after we consume the INT token. Otherwise, return (None, None).

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> from parse import *
>>> tokens = [("INT", "42")]
>>> parse_int(tokens)
(('INT', '42'), [])

For clarity, let's extract out a helper function, is_toktype:

def is_toktype(toktype, token):
    return token[0] == toktype

This makes parse_int a bit clearer:

def parse_int(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("INT", token):
        return (token, tokens[1:])
    else:
        return (None, None)

Let's implement parse_oparen and parse_cparen:

def parse_oparen(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("OPAREN", token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_cparen(tokens):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype("CPAREN", token):
        return (token, tokens[1:])
    else:
        return (None, None)

Hmm, these terminal-parsing functions are nearly identical. Let's factor out a helper function, parse_terminal:

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_int(tokens):
    return parse_terminal(tokens, "INT")

def parse_oparen(tokens):
    return parse_terminal(tokens, "OPAREN")

def parse_cparen(tokens):
    return parse_terminal(tokens, "CPAREN")

def parse_wspace(tokens):
    return parse_terminal(tokens, "WSPACE")

The above refactor will save us some work in the future if we need to add more terminal parsing functions or change their implementation.
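For example, if a later step were to add a (hypothetical) SYMBOL token type, its terminal parser would cost just one line. A sketch, reusing the helpers defined above:

```python
def is_toktype(toktype, token):
    return token[0] == toktype

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

# Hypothetical: a SYMBOL terminal (not in this step's grammar) is a
# one-liner once parse_terminal exists.
def parse_symbol(tokens):
    return parse_terminal(tokens, "SYMBOL")

print(parse_symbol([("SYMBOL", "foo"), ("CPAREN", ")")]))
# -> (('SYMBOL', 'foo'), [('CPAREN', ')')])
```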

Let's add a test:

def _test_parse_int():
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_int(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_int()
>>> 

Parsing non-terminals

Implementing production rules involves coming up with a coding technique for each of the features of EBNF:

  • concatenation: FOO BAR BAZ
    • all of parse_foo, parse_bar and parse_baz must succeed in-order
  • alternatives: FOO | BAR | BAZ
    • just one of parse_foo, parse_bar, parse_baz must succeed
  • repetition: FOO { WSPACE FOO }
    • parse_foo followed by a while loop of calls to parse_wspace and parse_foo
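These three techniques can also be expressed as generic helpers sharing our (tokens) -> (ast, remaining) signature. This is an optional generalization, sketched here for illustration only; the parser in this article spells each rule out by hand instead.

```python
def alt(*parsers):
    # alternatives (FOO | BAR): return the first matching parser's result.
    def parse(tokens):
        for p in parsers:
            (ast, remaining) = p(tokens)
            if ast is not None:
                return (ast, remaining)
        return (None, None)
    return parse

def seq(*parsers):
    # concatenation (FOO BAR): every parser must match, in order.
    def parse(tokens):
        asts = []
        remaining = tokens
        for p in parsers:
            (ast, remaining) = p(remaining)
            if ast is None:
                return (None, None)
            asts.append(ast)
        return (asts, remaining)
    return parse

def terminal(toktype):
    # a terminal parser, parameterized by token type.
    def parse(tokens):
        if len(tokens) > 0 and tokens[0][0] == toktype:
            return (tokens[0], tokens[1:])
        return (None, None)
    return parse

# e.g. the empty-list rule "OPAREN CPAREN" becomes:
parse_empty_list = seq(terminal("OPAREN"), terminal("CPAREN"))
print(parse_empty_list([("OPAREN", "("), ("CPAREN", ")")]))
# -> ([('OPAREN', '('), ('CPAREN', ')')], [])
```

(Repetition falls out as a while loop around a seq, which is exactly what parse_list does below.)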

First, we'll implement the simpler of the two production-rules: SEXPR = INT | LIST

def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        (ast, remaining) = parse_list(tokens)
        if ast is not None:
            return (ast, remaining)
        else:
            return (None, None)

This is pretty simple:

  • try to parse an int
  • if that worked, return the AST node and the remaining tokens
  • otherwise, try to parse a list
  • if that worked, return the AST node and the remaining tokens
  • otherwise, return (None, None)

However, the final if/else clause is a bit redundant. If ast fails the "is not None" check, then ast is None, because parse_list returned (None, None). So in either case, we want to return exactly what parse_list returned to us, i.e.

    else:
        (ast, remaining) = parse_list(tokens)
        if ast is not None:
            return (ast, remaining)
        else:
            return (ast, remaining)

This means we can replace the entire else clause by delegating to parse_list:

def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)

(Note that because we are delegating to one of either parse_int or parse_list, we won't actually see a "SEXPR" node anywhere in the resulting AST.)

Parsing lists

The other non-terminal function is a bit of a doozy.

Recall our grammar:

LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN

Here's the implementation:

def parse_list(tokens):
    # we have to have at least two tokens.
    if len(tokens) < 2:
        return (None, None)
    # the first token must be an OPAREN.
    (ast, remaining) = parse_oparen(tokens)
    if ast is None:
        return (None, None)
    # try to match the empty list.
    (ast, remaining2) = parse_cparen(remaining)
    if ast is not None:
        ast2 = ("LIST", [])
        return (ast2, remaining2)
    # try to match a populated list.
    (sexpr, remaining) = parse_sexpr(remaining)
    if sexpr is None:
        return (None, None)
    children = [sexpr]
    while len(remaining) > 0:
        # CPAREN signals the end of the list.
        (ast, remaining2) = parse_cparen(remaining)
        if ast is not None:
            remaining = remaining2
            break
        if len(remaining) == 1:
            # only one token left and it wasn't a CPAREN.  bad parse.
            return (None, None)
        # otherwise, there should be a space and another sexpr.
        (ast, remaining) = parse_wspace(remaining)
        if ast is None:
            return (None, None)
        (ast, remaining) = parse_sexpr(remaining)
        if ast is None:
            return (None, None)
        children.append(ast)
    ast = ("LIST", children)
    return (ast, remaining)

See if you can follow the code by reading the comments.

The AST node which will be returned will be ("LIST", children), where children is a list of child AST nodes.

(Note that while constructing the LIST AST node, we discard any WSPACE tokens.)

Here are some examples to illustrate what LIST nodes look like:

()        -> ("LIST", [])
(1)       -> ("LIST", [("INT", "1")])
(1 2)     -> ("LIST", [("INT", "1"), ("INT", "2")])
(())      -> ("LIST", [("LIST", [])])
(1 (2 3)) -> ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])
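One way to sanity-check these shapes is to "unparse" an AST node back into source text. This little round-trip helper is just for illustration; it isn't part of the article's parser:

```python
def unparse(ast):
    # turn an AST node back into source text.
    (nodetype, value) = ast
    if nodetype == "INT":
        return value                 # an INT node carries its lexeme
    elif nodetype == "LIST":
        # a LIST node carries a list of child AST nodes.
        return "(" + " ".join(unparse(child) for child in value) + ")"
    else:
        raise ValueError("unknown AST node type: %s" % nodetype)

print(unparse(("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])))
# -> (1 (2 3))
```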

Let's write a test:

def _test_parse_list():
    # parse a list from "() "
    tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [])
    assert remaining == [("WSPACE", " ")]
    # parse a list from "(42) "
    tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "42")])
    assert remaining == [("WSPACE", " ")]
    # parse "(1 2) "
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
    # parse "(()) "
    tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("LIST", [])])
    assert remaining == [("WSPACE", " ")]

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_list()
>>> 

and now that we have implemented parse_list, we can also run (and test) parse_sexpr:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from parse import *
>>> tokens = [("INT", 1), ("WSPACE", " ")]
>>> parse_sexpr(tokens)
(('INT', 1), [('WSPACE', ' ')])
>>> 

and add a test:

def _test_parse_sexpr():
    # parse an int
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
    # parse a list
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test_parse_sexpr()
>>> 

Wrapping up

We'll wrap up the parser by implementing a single function which serves as its interface: parse:

def parse(tokens):
    (ast, remaining_tokens) = parse_sexpr(tokens)
    if ast is None:
        raise Exception("Failed to parse! %s" % tokens)
    if len(remaining_tokens) > 0:
        raise Exception("Leftover tokens! %s" % remaining_tokens)
    return ast

For now, this is pretty simple: try to parse a SEXPR and ensure there aren't any leftover tokens (which would indicate a failed parse).

Write a test:

def _test_parse():
    tdefs = load_tokendefs("tokendefs.txt")
    ast = parse(tokenize(tdefs, "1"))
    assert ast == ("INT", "1")
    ast = parse(tokenize(tdefs, "()"))
    assert ast == ("LIST", [])
    ast = parse(tokenize(tdefs, "(1 2 3)"))
    assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
    ast = parse(tokenize(tdefs, "(1 (2 3))"))
    assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])

and write a function which runs all of the tests:

def _test():
    _test_parse_int()
    _test_parse_list()
    _test_parse_sexpr()
    _test_parse()

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse
>>> parse._test()
>>> 

and finish off with a __main__ implementation:

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = open(sys.argv[-1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    import pprint
    pprint.pprint(parse(tokenize(tdefs, text)))

We can now compare the behavior of lex.py and parse.py:

$ echo -n '(1 2 3)' | python lex.py
[('OPAREN', '('),
 ('INT', '1'),
 ('WSPACE', ' '),
 ('INT', '2'),
 ('WSPACE', ' '),
 ('INT', '3'),
 ('CPAREN', ')')]
$ echo -n '(1 2 3)' | python parse.py
('LIST', [('INT', '1'), ('INT', '2'), ('INT', '3')])

Continued in Step 3.

lex.py:

def load_tokendefs(fpath):
    import re
    tdefs = []
    with open(fpath) as f:
        for line in f:
            token_name = line.rstrip('\n')
            line2 = f.next()
            pattern = line2.rstrip('\n')
            regex = re.compile(pattern)
            pair = (token_name, regex)
            tdefs.append(pair)
    return tdefs

def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])

def _test_next_token():
    tdefs = load_tokendefs("tokendefs.txt")
    token = next_token(tdefs, "42 ")
    assert token == ("INT", "42")
    token = next_token(tdefs, " 17")
    assert token == ("WSPACE", " ")
    token = next_token(tdefs, "(1 2 3)")
    assert token == ("OPAREN", "(")

def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens

def _test_tokenize():
    tdefs = load_tokendefs("tokendefs.txt")
    tokens = tokenize(tdefs, "(1 42 65536)")
    assert tokens == [
        ("OPAREN", "("),
        ("INT", "1"),
        ("WSPACE", " "),
        ("INT", "42"),
        ("WSPACE", " "),
        ("INT", "65536"),
        ("CPAREN", ")")
    ]

def _test():
    _test_next_token()
    _test_tokenize()

if __name__ == "__main__":
    tdefs = load_tokendefs("tokendefs.txt")
    import sys
    if len(sys.argv) > 1:
        fname = sys.argv[-1]
        text = open(fname).read()
    else:
        text = sys.stdin.read()
    tokens = tokenize(tdefs, text)
    import pprint
    pprint.pprint(tokens)
parse.py:

from lex import load_tokendefs, tokenize

# our grammar:
"""
SEXPR = INT | LIST
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN
"""

def is_toktype(toktype, token):
    return token[0] == toktype

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_int(tokens):
    return parse_terminal(tokens, "INT")

def parse_wspace(tokens):
    return parse_terminal(tokens, "WSPACE")

def parse_oparen(tokens):
    return parse_terminal(tokens, "OPAREN")

def parse_cparen(tokens):
    return parse_terminal(tokens, "CPAREN")

def _test_parse_int():
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_int(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]

def parse_list(tokens):
    # we have to have at least two tokens.
    if len(tokens) < 2:
        return (None, None)
    # the first token must be an OPAREN.
    (ast, remaining) = parse_oparen(tokens)
    if ast is None:
        return (None, None)
    # try to match the empty list.
    (ast, remaining2) = parse_cparen(remaining)
    if ast is not None:
        ast2 = ("LIST", [])
        return (ast2, remaining2)
    # try to match a populated list.
    (sexpr, remaining) = parse_sexpr(remaining)
    if sexpr is None:
        return (None, None)
    children = [sexpr]
    while len(remaining) > 0:
        # CPAREN signals the end of the list.
        (ast, remaining2) = parse_cparen(remaining)
        if ast is not None:
            remaining = remaining2
            break
        if len(remaining) == 1:
            # only one token left and it wasn't a CPAREN.  bad parse.
            return (None, None)
        # otherwise, there should be a space and another sexpr.
        (ast, remaining) = parse_wspace(remaining)
        if ast is None:
            return (None, None)
        (ast, remaining) = parse_sexpr(remaining)
        if ast is None:
            return (None, None)
        children.append(ast)
    ast = ("LIST", children)
    return (ast, remaining)

def _test_parse_list():
    # parse a list from "() "
    tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [])
    assert remaining == [("WSPACE", " ")]
    # parse a list from "(42) "
    tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "42")])
    assert remaining == [("WSPACE", " ")]
    # parse "(1 2) "
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
    # parse "(()) "
    tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("LIST", [])])
    assert remaining == [("WSPACE", " ")]

def parse_sexpr(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)

def _test_parse_sexpr():
    # parse an int
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
    # parse a list
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]

def parse(tokens):
    (ast, remaining_tokens) = parse_sexpr(tokens)
    if ast is None:
        raise Exception("Failed to parse! %s" % tokens)
    if len(remaining_tokens) > 0:
        raise Exception("Leftover tokens! %s" % remaining_tokens)
    return ast

def _test_parse():
    tdefs = load_tokendefs("tokendefs.txt")
    ast = parse(tokenize(tdefs, "1"))
    assert ast == ("INT", "1")
    ast = parse(tokenize(tdefs, "()"))
    assert ast == ("LIST", [])
    ast = parse(tokenize(tdefs, "(1 2 3)"))
    assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
    ast = parse(tokenize(tdefs, "(1 (2 3))"))
    assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])

def _test():
    _test_parse_int()
    _test_parse_list()
    _test_parse_sexpr()
    _test_parse()

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = open(sys.argv[-1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    import pprint
    pprint.pprint(parse(tokenize(tdefs, text)))
tokendefs.txt:

INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+