cellularmitosis/README.md

## README.md

      
    Raw
  

              README.md
            
          
    Blog 2019/9/7
<- previous |
index |
next ->
A Lisp-like interpreter (in Python), step 3: evaluation

Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.

previous: Step 2: Recursive descent parsing

Step 3: Evaluation


The core of any Lisp-like interpreter is the eval function.
As a concept, eval is over 60 years old,
first seen in McCarthy's
original Lisp implementation.
Paul Graham
wrote a great article[ps][pdf][markdown]
about the significance of eval.
eval takes an AST and an environment and returns a value.
However, Python already has an eval function, so we will use the name eval2.  Start with a stub:
eval.py
def eval2(ast, env):
  return None
Lisp-like languages follow the convention that a list represents a function call, where the first item in the list is a function (the operator) and the rest of the list items are the arguments to that function.  E.g. (sum 1 2 3) evaluates to 6.
More particularly, sum is a symbol.  Symbols evaluate to their corresponding value from the environment.  The environment is like a lookup table which maps symbols to values.  In this case, the value of sum is a function pointer.
1, 2 and 3 are literals, which evaluate to themselves.
In a traditional Lisp-like language implementation, eval determines the value of the operand and arguments, then hands them to a function named apply, which executes the function.
(again, Python already has an apply, so we will use the name apply2.)
Literals evaluate to themselves

But we are getting ahead of ourselves.  First, let's support evaluating literals:
def is_int(ast):
    return ast[0] == "INT"

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    else:
        raise Exception("Don't know how to evaluate %s" % ast)
Try it out:
$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from eval import *
>>> env = {}
>>> ast = ("INT", "42")
>>> eval2(ast, env)
42

Cool, let's skip to the punchline and add a __main__ implementation so that we can run this bad boy:
if __name__ == "__main__":
    import sys
    from lex import load_tokendefs, tokenize
    from parse import parse
    if len(sys.argv) > 1:
        text = open(sys.argv[1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    env = {}
    print eval2(parse(tokenize(tdefs, text)), env)
$ echo -n 1 | python eval.py
1

Woot!
Symbols evaluate to their corresponding value from the environment

Next, let's create a function sum2 (Python already defines a sum function):
import operator

def sum2(*args):
    return reduce(operator.add, args)
Try it out:
$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from eval import *
>>> sum2(1, 2, 3)
6

Next, crack open tokendefs.txt and add a SYMBOL token type.  We'll start off with something very basic: a symbol is a group of lowercase letters:
SYMBOL
[a-z]+

Let's check that it works with our lexer:
$ echo 'sum' | python lex.py
[('SYMBOL', 'sum'), ('WSPACE', '\n')]

Next, we need to update our parser to handle symbols.  We'll make a slight tweak to our grammar:
SEXPR = ATOM | LIST
ATOM  = INT | SYMBOL
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN

We'll need to add another terminal parser for SYMBOL:
def parse_symbol(tokens):
    return parse_terminal(tokens, "SYMBOL")
We need to create a parse_atom function for the rule ATOM = INT | SYMBOL.  This uses alternation, so its implementation will look very similar to parse_sexpr:
def parse_atom(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_symbol(tokens)
And speaking of parse_sexpr, we need to update that as well:
def parse_sexpr(tokens):
    (ast, remaining) = parse_atom(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)
Now let's try it out:
$ echo -n '(sum 1 2 3)' | python parse.py
('LIST', [('SYMBOL', 'sum'), ('INT', '1'), ('INT', '2'), ('INT', '3')])

Fantastico!
Now, we'll need to update eval2 to handle symbol lookup:
def is_symbol(ast):
    return ast[0] == "SYMBOL"

def lookup(ast, env):
    symbol_name = ast[1]
    return env[symbol_name]

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    elif is_symbol(ast):
        return lookup(ast, env)
    else:
        raise Exception("Don't know how to evaluate %s" % ast)
In the eval.py __main__ implementation, replace env = {} with:
    env = { "sum": sum2 }
Now let's see if we can eval a symbol:
$ echo -n 'sum' | python eval.py
<function sum2 at 0x102b40398>

Rock and roll!  We are really close now.
Now we just need to add an elif is_list() clause to eval2:
def is_list(ast):
    return ast[0] == "LIST"

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    elif is_symbol(ast):
        return lookup(ast, env)
    elif is_list(ast):
        children = ast[1]
        first, rest = children[0], children[1:]
        operand = eval2(first, env)
        args = [eval2(arg, env) for arg in rest]
        return apply2(operand, args)
    else:
        raise Exception("Don't know how to evaluate %s" % ast)

def apply2(op, args):
    return op(*args)
Now let's eval (sum 1 2 3):
$ echo -n '(sum 1 2 3)' | python eval.py
6

KABLAMMO MOTHERFUCKER!!!  You just wrote an interpreter!
Next time, we'll look at defining new entries in the environment.

  
## eval.py
import operator

def sum2(*args):
    return reduce(operator.add, args)

def is_int(ast):
    return ast[0] == "INT"

def is_symbol(ast):
    return ast[0] == "SYMBOL"

def is_list(ast):
    return ast[0] == "LIST"

def lookup(ast, env):
    symbol_name = ast[1]
    return env[symbol_name]

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    elif is_symbol(ast):
        return lookup(ast, env)
    elif is_list(ast):
        children = ast[1]
        first, rest = children[0], children[1:]
        operand = eval2(first, env)
        args = [eval2(arg, env) for arg in rest]
        return apply2(operand, args)
    else:
        raise Exception("Don't know how to evaluate %s" % ast)

def apply2(op, args):
    return op(*args)


if __name__ == "__main__":
    import sys
    from lex import load_tokendefs, tokenize
    from parse import parse
    if len(sys.argv) > 1:
        text = open(sys.argv[1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    env = {
        "sum": sum2
    }
    print eval2(parse(tokenize(tdefs, text)), env)

## lex.py
def load_tokendefs(fpath):
    import re
    tdefs = []
    with open(fpath) as f:
        for line in f:
            token_name = line.rstrip('\n')
            line2 = f.next()
            pattern = line2.rstrip('\n')
            regex = re.compile(pattern)
            pair = (token_name, regex)
            tdefs.append(pair)
    return tdefs

def next_token(tdefs, text, offset=0):
    for (token_name, regex) in tdefs:
        m = regex.match(text, offset)
        if m is None:
            continue
        else:
            matched_text = m.group()
            token = (token_name, matched_text)
            return token
    raise Exception("Don't know how to lex '%s'" % text[:16])

def _test_next_token():
    tdefs = load_tokendefs("tokendefs.txt")
    token = next_token(tdefs, "42 ")
    assert token == ("INT", "42")
    token = next_token(tdefs, " 17")
    assert token == ("WSPACE", " ")
    token = next_token(tdefs, "(1 2 3)")
    assert token == ("OPAREN", "(")

def tokenize(tdefs, text):
    tokens = []
    offset = 0
    while (len(text) - offset) > 0:
        token = next_token(tdefs, text, offset)
        tokens.append(token)
        matched_text = token[1]
        offset += len(matched_text)
    return tokens

def _test_tokenize():
    tdefs = load_tokendefs("tokendefs.txt")
    tokens = tokenize(tdefs, "(1 42   65536)")
    assert tokens == [
        ("OPAREN", "("),
        ("INT", "1"),
        ("WSPACE", " "),
        ("INT", "42"),
        ("WSPACE", "   "),
        ("INT", "65536"),
        ("CPAREN", ")")
    ]

def _test():
    _test_next_token()
    _test_tokenize()


if __name__ == "__main__":
    tdefs = load_tokendefs("tokendefs.txt")
    import sys
    if len(sys.argv) > 1:
        fname = sys.argv[-1]
        text = open(fname).read()
    else:
        text = sys.stdin.read()
    tokens = tokenize(tdefs, text)
    import pprint
    pprint.pprint(tokens)

## parse.py
from lex import load_tokendefs, tokenize

# our grammar:
"""
SEXPR = ATOM | LIST
ATOM  = INT | SYMBOL
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN
"""

def is_toktype(toktype, token):
    return token[0] == toktype

def parse_terminal(tokens, toktype):
    if len(tokens) == 0:
        return (None, None)
    token = tokens[0]
    if is_toktype(toktype, token):
        return (token, tokens[1:])
    else:
        return (None, None)

def parse_int(tokens):
    return parse_terminal(tokens, "INT")

def parse_wspace(tokens):
    return parse_terminal(tokens, "WSPACE")

def parse_oparen(tokens):
    return parse_terminal(tokens, "OPAREN")

def parse_cparen(tokens):
    return parse_terminal(tokens, "CPAREN")

def parse_symbol(tokens):
    return parse_terminal(tokens, "SYMBOL")

def _test_parse_int():
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_int(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]

def parse_list(tokens):
    # we have to have at least two tokens.
    if len(tokens) < 2:
        return (None, None)
    # the first token must be an OPAREN.
    (ast, remaining) = parse_oparen(tokens)
    if ast is None:
        return (None, None)
    # try to match the empty list.
    (ast, remaining2) = parse_cparen(remaining)
    if ast is not None:
        ast2 = ("LIST", [])
        return (ast2, remaining2)
    # try to match a populated list.
    (sexpr, remaining) = parse_sexpr(remaining)
    if sexpr is None:
        return (None, None)
    children = [sexpr]
    while len(remaining) > 0:
        # CPAREN signals the end of the list.
        (ast, remaining2) = parse_cparen(remaining)
        if ast is not None:
            remaining = remaining2
            break
        if len(remaining) == 1:
            # only one token left and it wasn't a CPAREN.  bad parse.
            return (None, None)
        # otherwise, there should be a space and another sexpr.
        (ast, remaining) = parse_wspace(remaining)
        if ast is None:
            return (None, None)
        (ast, remaining) = parse_sexpr(remaining)
        if ast is None:
            return (None, None)
        children.append(ast)
    ast = ("LIST", children)
    return (ast, remaining)

def _test_parse_list():
    # parse a list from "() "
    tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [])
    assert remaining == [("WSPACE", " ")]
    # parse a list from "(42) "
    tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "42")])
    assert remaining == [("WSPACE", " ")]
    # parse "(1 2) "
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]
    # parse "(()) "
    tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("LIST", [])])
    assert remaining == [("WSPACE", " ")]

def parse_sexpr(tokens):
    (ast, remaining) = parse_atom(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)

def parse_atom(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_symbol(tokens)

def _test_parse_sexpr():
    # parse an int
    tokens = [("INT", 1), ("WSPACE", " ")]
    (ast, remaining) = parse_sexpr(tokens)
    assert ast == ("INT", 1)
    assert remaining == [("WSPACE", " ")]
    # parse a list
    tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
    (ast, remaining) = parse_list(tokens)
    assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
    assert remaining == [("WSPACE", " ")]

def parse(tokens):
    (ast, remaining_tokens) = parse_sexpr(tokens)
    if len(remaining_tokens) > 0:
        raise Exception("Leftover tokens! %s" % remaining_tokens)
    return ast

def _test_parse():
    tdefs = load_tokendefs("tokendefs.txt")
    ast = parse(tokenize(tdefs, "1"))
    assert ast == ("INT", "1")
    ast = parse(tokenize(tdefs, "()"))
    assert ast == ("LIST", [])
    ast = parse(tokenize(tdefs, "(1 2 3)"))
    assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
    ast = parse(tokenize(tdefs, "(1 (2 3))"))
    assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])

def _test():
    _test_parse_int()
    _test_parse_list()
    _test_parse_sexpr()
    _test_parse()


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = open(sys.argv[-1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    import pprint
    pprint.pprint(parse(tokenize(tdefs, text)))

## tokendefs.txt
INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+
SYMBOL
[a-z]+
	import operator

	def sum2(*args):
	return reduce(operator.add, args)

	def is_int(ast):
	return ast[0] == "INT"

	def is_symbol(ast):
	return ast[0] == "SYMBOL"

	def is_list(ast):
	return ast[0] == "LIST"

	def lookup(ast, env):
	symbol_name = ast[1]
	return env[symbol_name]

	def eval2(ast, env):
	if is_int(ast):
	return int(ast[1])
	elif is_symbol(ast):
	return lookup(ast, env)
	elif is_list(ast):
	children = ast[1]
	first, rest = children[0], children[1:]
	operand = eval2(first, env)
	args = [eval2(arg, env) for arg in rest]
	return apply2(operand, args)
	else:
	raise Exception("Don't know how to evaluate %s" % ast)

	def apply2(op, args):
	return op(*args)


	if __name__ == "__main__":
	import sys
	from lex import load_tokendefs, tokenize
	from parse import parse
	if len(sys.argv) > 1:
	text = open(sys.argv[1]).read()
	else:
	text = sys.stdin.read()
	tdefs = load_tokendefs("tokendefs.txt")
	env = {
	"sum": sum2
	}
	print eval2(parse(tokenize(tdefs, text)), env)
	def load_tokendefs(fpath):
	import re
	tdefs = []
	with open(fpath) as f:
	for line in f:
	token_name = line.rstrip('\n')
	line2 = f.next()
	pattern = line2.rstrip('\n')
	regex = re.compile(pattern)
	pair = (token_name, regex)
	tdefs.append(pair)
	return tdefs

	def next_token(tdefs, text, offset=0):
	for (token_name, regex) in tdefs:
	m = regex.match(text, offset)
	if m is None:
	continue
	else:
	matched_text = m.group()
	token = (token_name, matched_text)
	return token
	raise Exception("Don't know how to lex '%s'" % text[:16])

	def _test_next_token():
	tdefs = load_tokendefs("tokendefs.txt")
	token = next_token(tdefs, "42 ")
	assert token == ("INT", "42")
	token = next_token(tdefs, " 17")
	assert token == ("WSPACE", " ")
	token = next_token(tdefs, "(1 2 3)")
	assert token == ("OPAREN", "(")

	def tokenize(tdefs, text):
	tokens = []
	offset = 0
	while (len(text) - offset) > 0:
	token = next_token(tdefs, text, offset)
	tokens.append(token)
	matched_text = token[1]
	offset += len(matched_text)
	return tokens

	def _test_tokenize():
	tdefs = load_tokendefs("tokendefs.txt")
	tokens = tokenize(tdefs, "(1 42 65536)")
	assert tokens == [
	("OPAREN", "("),
	("INT", "1"),
	("WSPACE", " "),
	("INT", "42"),
	("WSPACE", " "),
	("INT", "65536"),
	("CPAREN", ")")
	]

	def _test():
	_test_next_token()
	_test_tokenize()


	if __name__ == "__main__":
	tdefs = load_tokendefs("tokendefs.txt")
	import sys
	if len(sys.argv) > 1:
	fname = sys.argv[-1]
	text = open(fname).read()
	else:
	text = sys.stdin.read()
	tokens = tokenize(tdefs, text)
	import pprint
	pprint.pprint(tokens)
	from lex import load_tokendefs, tokenize

	# our grammar:
	"""
	SEXPR = ATOM \| LIST
	ATOM = INT \| SYMBOL
	LIST = OPAREN CPAREN
	\| OPAREN SEXPR { WSPACE SEXPR } CPAREN
	"""

	def is_toktype(toktype, token):
	return token[0] == toktype

	def parse_terminal(tokens, toktype):
	if len(tokens) == 0:
	return (None, None)
	token = tokens[0]
	if is_toktype(toktype, token):
	return (token, tokens[1:])
	else:
	return (None, None)

	def parse_int(tokens):
	return parse_terminal(tokens, "INT")

	def parse_wspace(tokens):
	return parse_terminal(tokens, "WSPACE")

	def parse_oparen(tokens):
	return parse_terminal(tokens, "OPAREN")

	def parse_cparen(tokens):
	return parse_terminal(tokens, "CPAREN")

	def parse_symbol(tokens):
	return parse_terminal(tokens, "SYMBOL")

	def _test_parse_int():
	tokens = [("INT", 1), ("WSPACE", " ")]
	(ast, remaining) = parse_int(tokens)
	assert ast == ("INT", 1)
	assert remaining == [("WSPACE", " ")]

	def parse_list(tokens):
	# we have to have at least two tokens.
	if len(tokens) < 2:
	return (None, None)
	# the first token must be an OPAREN.
	(ast, remaining) = parse_oparen(tokens)
	if ast is None:
	return (None, None)
	# try to match the empty list.
	(ast, remaining2) = parse_cparen(remaining)
	if ast is not None:
	ast2 = ("LIST", [])
	return (ast2, remaining2)
	# try to match a populated list.
	(sexpr, remaining) = parse_sexpr(remaining)
	if sexpr is None:
	return (None, None)
	children = [sexpr]
	while len(remaining) > 0:
	# CPAREN signals the end of the list.
	(ast, remaining2) = parse_cparen(remaining)
	if ast is not None:
	remaining = remaining2
	break
	if len(remaining) == 1:
	# only one token left and it wasn't a CPAREN. bad parse.
	return (None, None)
	# otherwise, there should be a space and another sexpr.
	(ast, remaining) = parse_wspace(remaining)
	if ast is None:
	return (None, None)
	(ast, remaining) = parse_sexpr(remaining)
	if ast is None:
	return (None, None)
	children.append(ast)
	ast = ("LIST", children)
	return (ast, remaining)

	def _test_parse_list():
	# parse a list from "() "
	tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
	(ast, remaining) = parse_list(tokens)
	assert ast == ("LIST", [])
	assert remaining == [("WSPACE", " ")]
	# parse a list from "(42) "
	tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
	(ast, remaining) = parse_list(tokens)
	assert ast == ("LIST", [("INT", "42")])
	assert remaining == [("WSPACE", " ")]
	# parse "(1 2) "
	tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
	(ast, remaining) = parse_list(tokens)
	assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
	assert remaining == [("WSPACE", " ")]
	# parse "(()) "
	tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
	(ast, remaining) = parse_list(tokens)
	assert ast == ("LIST", [("LIST", [])])
	assert remaining == [("WSPACE", " ")]

	def parse_sexpr(tokens):
	(ast, remaining) = parse_atom(tokens)
	if ast is not None:
	return (ast, remaining)
	else:
	return parse_list(tokens)

	def parse_atom(tokens):
	(ast, remaining) = parse_int(tokens)
	if ast is not None:
	return (ast, remaining)
	else:
	return parse_symbol(tokens)

	def _test_parse_sexpr():
	# parse an int
	tokens = [("INT", 1), ("WSPACE", " ")]
	(ast, remaining) = parse_sexpr(tokens)
	assert ast == ("INT", 1)
	assert remaining == [("WSPACE", " ")]
	# parse a list
	tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
	(ast, remaining) = parse_list(tokens)
	assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
	assert remaining == [("WSPACE", " ")]

	def parse(tokens):
	(ast, remaining_tokens) = parse_sexpr(tokens)
	if len(remaining_tokens) > 0:
	raise Exception("Leftover tokens! %s" % remaining_tokens)
	return ast

	def _test_parse():
	tdefs = load_tokendefs("tokendefs.txt")
	ast = parse(tokenize(tdefs, "1"))
	assert ast == ("INT", "1")
	ast = parse(tokenize(tdefs, "()"))
	assert ast == ("LIST", [])
	ast = parse(tokenize(tdefs, "(1 2 3)"))
	assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
	ast = parse(tokenize(tdefs, "(1 (2 3))"))
	assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])

	def _test():
	_test_parse_int()
	_test_parse_list()
	_test_parse_sexpr()
	_test_parse()


	if __name__ == "__main__":
	import sys
	if len(sys.argv) > 1:
	text = open(sys.argv[-1]).read()
	else:
	text = sys.stdin.read()
	tdefs = load_tokendefs("tokendefs.txt")
	import pprint
	pprint.pprint(parse(tokenize(tdefs, text)))