Skip to content

Instantly share code, notes, and snippets.

@cellularmitosis
Last active April 15, 2020 03:14
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save cellularmitosis/a46efbf32544cde3a1d156e4903e051f to your computer and use it in GitHub Desktop.
Save cellularmitosis/a46efbf32544cde3a1d156e4903e051f to your computer and use it in GitHub Desktop.
A Lisp-like interpreter (in Python), step 3: evaluation

Blog 2019/9/7

<- previous | index | next ->

A Lisp-like interpreter (in Python), step 3: evaluation

Let's learn how to implement a Lisp or Scheme-like language by writing a series of increasingly capable interpreters.

Step 3: Evaluation

sicp

The core of any Lisp-like interpreter is the eval function. As a concept, eval is over 60 years old, first seen in McCarthy's original Lisp implementation. Paul Graham wrote a great article[ps][pdf][markdown] about the significance of eval.

eval takes an AST and an environment and returns a value.

However, Python already has an eval function, so we will use the name eval2. Start with a stub:

eval.py

def eval2(ast, env):
  return None

Lisp-like languages follow the convention that a list represents a function call, where the first item in the list is a function (the operator) and the rest of the list items are the arguments to that function. E.g. (sum 1 2 3) evaluates to 6.

More particularly, sum is a symbol. Symbols evaluate to their corresponding value from the environment. The environment is like a lookup table which maps symbols to values. In this case, the value of sum is a function pointer.

1, 2 and 3 are literals, which evaluate to themselves.

In a traditional Lisp-like language implementation, eval determines the value of the operand and arguments, then hands them to a function named apply, which executes the function.

(again, Python already has an apply, so we will use the name apply2.)

Literals evaluate to themselves

But we are getting ahead of ourselves. First, let's support evaluating literals:

def is_int(ast):
    return ast[0] == "INT"

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    else:
        raise Exception("Don't know how to evaluate %s" % ast)

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from eval import *
>>> env = {}
>>> ast = ("INT", "42")
>>> eval2(ast, env)
42

Cool, let's skip to the punchline and add a __main__ implementation so that we can run this bad boy:

if __name__ == "__main__":
    import sys
    from lex import load_tokendefs, tokenize
    from parse import parse
    if len(sys.argv) > 1:
        text = open(sys.argv[1]).read()
    else:
        text = sys.stdin.read()
    tdefs = load_tokendefs("tokendefs.txt")
    env = {}
    print eval2(parse(tokenize(tdefs, text)), env)
$ echo -n 1 | python eval.py
1

Woot!

Symbols evaluate to their corresponding value from the environment

Next, let's create a function sum2 (Python already defines a sum function):

import operator

def sum2(*args):
    return reduce(operator.add, args)

Try it out:

$ python
Python 2.7.16 (default, Mar  4 2019, 09:02:22) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from eval import *
>>> sum2(1, 2, 3)
6

Next, crack open tokendefs.txt and add a SYMBOL token type. We'll start off with something very basic: a symbol is a group of lowercase letters:

SYMBOL
[a-z]+

Let's check that it works with our lexer:

$ echo 'sum' | python lex.py
[('SYMBOL', 'sum'), ('WSPACE', '\n')]

Next, we need to update our parser to handle symbols. We'll make a slight tweak to our grammar:

SEXPR = ATOM | LIST
ATOM  = INT | SYMBOL
LIST  = OPAREN CPAREN
      | OPAREN SEXPR { WSPACE SEXPR } CPAREN

We'll need to add another terminal parser for SYMBOL:

def parse_symbol(tokens):
    return parse_terminal(tokens, "SYMBOL")

We need to create a parse_atom function for the rule ATOM = INT | SYMBOL. This uses alternation, so its implementation will look very similar to parse_sexpr:

def parse_atom(tokens):
    (ast, remaining) = parse_int(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_symbol(tokens)

And speaking of parse_sexpr, we need to update that as well:

def parse_sexpr(tokens):
    (ast, remaining) = parse_atom(tokens)
    if ast is not None:
        return (ast, remaining)
    else:
        return parse_list(tokens)

Now let's try it out:

$ echo -n '(sum 1 2 3)' | python parse.py
('LIST', [('SYMBOL', 'sum'), ('INT', '1'), ('INT', '2'), ('INT', '3')])

Fantastico!

Now, we'll need to update eval2 to handle symbol lookup:

def is_symbol(ast):
    return ast[0] == "SYMBOL"

def lookup(ast, env):
    symbol_name = ast[1]
    return env[symbol_name]

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    elif is_symbol(ast):
        return lookup(ast, env)
    else:
        raise Exception("Don't know how to evaluate %s" % ast)

In the eval.py __main__ implementation, replace env = {} with:

    env = { "sum": sum2 }

Now let's see if we can eval a symbol:

$ echo -n 'sum' | python eval.py
<function sum2 at 0x102b40398>

Rock and roll! We are really close now.

Now we just need to add an elif is_list() clause to eval2:

def is_list(ast):
    return ast[0] == "LIST"

def eval2(ast, env):
    if is_int(ast):
        return int(ast[1])
    elif is_symbol(ast):
        return lookup(ast, env)
    elif is_list(ast):
        children = ast[1]
        first, rest = children[0], children[1:]
        operand = eval2(first, env)
        args = [eval2(arg, env) for arg in rest]
        return apply2(operand, args)
    else:
        raise Exception("Don't know how to evaluate %s" % ast)

def apply2(op, args):
    return op(*args)

Now let's eval (sum 1 2 3):

$ echo -n '(sum 1 2 3)' | python eval.py
6

KABLAMMO MOTHERFUCKER!!! You just wrote an interpreter!

Next time, we'll look at defining new entries in the environment.

import operator
def sum2(*args):
return reduce(operator.add, args)
def is_int(ast):
return ast[0] == "INT"
def is_symbol(ast):
return ast[0] == "SYMBOL"
def is_list(ast):
return ast[0] == "LIST"
def lookup(ast, env):
symbol_name = ast[1]
return env[symbol_name]
def eval2(ast, env):
if is_int(ast):
return int(ast[1])
elif is_symbol(ast):
return lookup(ast, env)
elif is_list(ast):
children = ast[1]
first, rest = children[0], children[1:]
operand = eval2(first, env)
args = [eval2(arg, env) for arg in rest]
return apply2(operand, args)
else:
raise Exception("Don't know how to evaluate %s" % ast)
def apply2(op, args):
return op(*args)
if __name__ == "__main__":
import sys
from lex import load_tokendefs, tokenize
from parse import parse
if len(sys.argv) > 1:
text = open(sys.argv[1]).read()
else:
text = sys.stdin.read()
tdefs = load_tokendefs("tokendefs.txt")
env = {
"sum": sum2
}
print eval2(parse(tokenize(tdefs, text)), env)
def load_tokendefs(fpath):
import re
tdefs = []
with open(fpath) as f:
for line in f:
token_name = line.rstrip('\n')
line2 = f.next()
pattern = line2.rstrip('\n')
regex = re.compile(pattern)
pair = (token_name, regex)
tdefs.append(pair)
return tdefs
def next_token(tdefs, text, offset=0):
for (token_name, regex) in tdefs:
m = regex.match(text, offset)
if m is None:
continue
else:
matched_text = m.group()
token = (token_name, matched_text)
return token
raise Exception("Don't know how to lex '%s'" % text[:16])
def _test_next_token():
tdefs = load_tokendefs("tokendefs.txt")
token = next_token(tdefs, "42 ")
assert token == ("INT", "42")
token = next_token(tdefs, " 17")
assert token == ("WSPACE", " ")
token = next_token(tdefs, "(1 2 3)")
assert token == ("OPAREN", "(")
def tokenize(tdefs, text):
tokens = []
offset = 0
while (len(text) - offset) > 0:
token = next_token(tdefs, text, offset)
tokens.append(token)
matched_text = token[1]
offset += len(matched_text)
return tokens
def _test_tokenize():
tdefs = load_tokendefs("tokendefs.txt")
tokens = tokenize(tdefs, "(1 42 65536)")
assert tokens == [
("OPAREN", "("),
("INT", "1"),
("WSPACE", " "),
("INT", "42"),
("WSPACE", " "),
("INT", "65536"),
("CPAREN", ")")
]
def _test():
_test_next_token()
_test_tokenize()
if __name__ == "__main__":
tdefs = load_tokendefs("tokendefs.txt")
import sys
if len(sys.argv) > 1:
fname = sys.argv[-1]
text = open(fname).read()
else:
text = sys.stdin.read()
tokens = tokenize(tdefs, text)
import pprint
pprint.pprint(tokens)
from lex import load_tokendefs, tokenize
# our grammar:
"""
SEXPR = ATOM | LIST
ATOM = INT | SYMBOL
LIST = OPAREN CPAREN
| OPAREN SEXPR { WSPACE SEXPR } CPAREN
"""
def is_toktype(toktype, token):
return token[0] == toktype
def parse_terminal(tokens, toktype):
if len(tokens) == 0:
return (None, None)
token = tokens[0]
if is_toktype(toktype, token):
return (token, tokens[1:])
else:
return (None, None)
def parse_int(tokens):
return parse_terminal(tokens, "INT")
def parse_wspace(tokens):
return parse_terminal(tokens, "WSPACE")
def parse_oparen(tokens):
return parse_terminal(tokens, "OPAREN")
def parse_cparen(tokens):
return parse_terminal(tokens, "CPAREN")
def parse_symbol(tokens):
return parse_terminal(tokens, "SYMBOL")
def _test_parse_int():
tokens = [("INT", 1), ("WSPACE", " ")]
(ast, remaining) = parse_int(tokens)
assert ast == ("INT", 1)
assert remaining == [("WSPACE", " ")]
def parse_list(tokens):
# we have to have at least two tokens.
if len(tokens) < 2:
return (None, None)
# the first token must be an OPAREN.
(ast, remaining) = parse_oparen(tokens)
if ast is None:
return (None, None)
# try to match the empty list.
(ast, remaining2) = parse_cparen(remaining)
if ast is not None:
ast2 = ("LIST", [])
return (ast2, remaining2)
# try to match a populated list.
(sexpr, remaining) = parse_sexpr(remaining)
if sexpr is None:
return (None, None)
children = [sexpr]
while len(remaining) > 0:
# CPAREN signals the end of the list.
(ast, remaining2) = parse_cparen(remaining)
if ast is not None:
remaining = remaining2
break
if len(remaining) == 1:
# only one token left and it wasn't a CPAREN. bad parse.
return (None, None)
# otherwise, there should be a space and another sexpr.
(ast, remaining) = parse_wspace(remaining)
if ast is None:
return (None, None)
(ast, remaining) = parse_sexpr(remaining)
if ast is None:
return (None, None)
children.append(ast)
ast = ("LIST", children)
return (ast, remaining)
def _test_parse_list():
# parse a list from "() "
tokens = [("OPAREN", "("), ("CPAREN", ")"), ("WSPACE", " ")]
(ast, remaining) = parse_list(tokens)
assert ast == ("LIST", [])
assert remaining == [("WSPACE", " ")]
# parse a list from "(42) "
tokens = [("OPAREN", "("), ("INT", "42"), ("CPAREN", ")"), ("WSPACE", " ")]
(ast, remaining) = parse_list(tokens)
assert ast == ("LIST", [("INT", "42")])
assert remaining == [("WSPACE", " ")]
# parse "(1 2) "
tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
(ast, remaining) = parse_list(tokens)
assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
assert remaining == [("WSPACE", " ")]
# parse "(()) "
tokens = [("OPAREN", "("), ("OPAREN", "("), ("CPAREN", ")"), ("CPAREN", ")"), ("WSPACE", " ")]
(ast, remaining) = parse_list(tokens)
assert ast == ("LIST", [("LIST", [])])
assert remaining == [("WSPACE", " ")]
def parse_sexpr(tokens):
(ast, remaining) = parse_atom(tokens)
if ast is not None:
return (ast, remaining)
else:
return parse_list(tokens)
def parse_atom(tokens):
(ast, remaining) = parse_int(tokens)
if ast is not None:
return (ast, remaining)
else:
return parse_symbol(tokens)
def _test_parse_sexpr():
# parse an int
tokens = [("INT", 1), ("WSPACE", " ")]
(ast, remaining) = parse_sexpr(tokens)
assert ast == ("INT", 1)
assert remaining == [("WSPACE", " ")]
# parse a list
tokens = [("OPAREN", "("), ("INT", "1"), ("WSPACE", " "), ("INT", "2"), ("CPAREN", ")"), ("WSPACE", " ")]
(ast, remaining) = parse_list(tokens)
assert ast == ("LIST", [("INT", "1"), ("INT", "2")])
assert remaining == [("WSPACE", " ")]
def parse(tokens):
(ast, remaining_tokens) = parse_sexpr(tokens)
if len(remaining_tokens) > 0:
raise Exception("Leftover tokens! %s" % remaining_tokens)
return ast
def _test_parse():
tdefs = load_tokendefs("tokendefs.txt")
ast = parse(tokenize(tdefs, "1"))
assert ast == ("INT", "1")
ast = parse(tokenize(tdefs, "()"))
assert ast == ("LIST", [])
ast = parse(tokenize(tdefs, "(1 2 3)"))
assert ast == ("LIST", [("INT", "1"), ("INT", "2"), ("INT", "3")])
ast = parse(tokenize(tdefs, "(1 (2 3))"))
assert ast == ("LIST", [("INT", "1"), ("LIST", [("INT", "2"), ("INT", "3")])])
def _test():
_test_parse_int()
_test_parse_list()
_test_parse_sexpr()
_test_parse()
if __name__ == "__main__":
import sys
if len(sys.argv) > 1:
text = open(sys.argv[-1]).read()
else:
text = sys.stdin.read()
tdefs = load_tokendefs("tokendefs.txt")
import pprint
pprint.pprint(parse(tokenize(tdefs, text)))
INT
\d+
OPAREN
\(
CPAREN
\)
WSPACE
\s+
SYMBOL
[a-z]+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment