KayEss/lisp.cpp

## lisp.cpp
/// # Build me a LISP

// Read this nicely formatted at https://kirit.com/Build%20me%20a%20LISP

/**
Normally programming language blog posts get started with grammars and syntax
and parsing and all sorts of stuff like that. I can't be bothered with all of
that, so instead, let's start with the (maybe more interesting aspect) of the
actual evaluation of the code. Here we'll build an evaluation engine for LISP
like languages in less than 50 lines of C++ -- [and literate C++ at
that!](https://gist.github.com/KayEss/45a2e88675832228f57e2d598afc02ae)

Lisp uses something called a "cons cell" to store each part of both the program
and the data. Let's just say we have something called a cell for now, and we'll
come back to defining what it is later on.
 */
struct cell;

/**
## S-Expressions

A Lisp program is actually just something called an s-expression. It's a fancy
term for a list of cells, but there is a special way to interpret these
s-expressions. We'll come back to that later on.

For now, we just need to know that s-expressions describe our LISP program, and
to run a LISP program we simply evaluate the s-expression.
 */
#include <vector>
using sexpr = std::vector<cell>;

/**
If you read the [Wikipedia page on
s-expressions](https://en.wikipedia.org/wiki/S-expression) you'll see that
they're usually described in terms of a tree of cells. We're using a vector
instead, but the two are equivalent.

Walking down the right branch of the cell is the same as walking to the next
cell in the vector, and walking down the left branch of the cell is the same
as nesting an s-expression in a cell.


## Cells

Now we can talk about what a cell is. Our simple LISP will only have a single
data type, strings, so our cells will either contain a string or an
s-expression.
 */
#include <string>
#include <variant>
struct cell {
    std::variant<std::string, sexpr> content;
    /**
    We do need to be able to make cells, and they can be made from either
    strings or s-expressions.
     */
    cell(std::string s) : content{std::move(s)} {}
    cell(sexpr s) : content{std::move(s)} {}
};
/**
## Evaluation

There are three different sorts of data structure that we have so far:

1. Strings
2. S-expressions
3. Cells

Because running our programs is going to be a matter of evaluating these
structures, we'll need to consider what "evaluation" means for each of these.

First of all, strings are just themselves.
 */
std::string eval(std::string s) { return s; }

/**
The s-expressions are a bit more interesting, but for now we just have to
promise to implement them so that we can start with the simpler cells (and
because we have to implement the evaluation of the s-expressions in terms of
evaluating cells)...
 */
std::string eval(sexpr const &);

/**
Evaluating a cell is pretty simple. It's either going to be a string or a
s-expression, so depending on what we find in the cell content we can just
forward to evaluating whichever one we find.
 */
std::string eval(cell const &c) {
    return std::visit([](auto const &c) { return eval(c); }, c.content);
}

/**
## The standard library

Although an s-expression is just a vector of cells, the first item in it is
special. To evaluate an s-expression we take the first item in it to be the name
of a function that is to be executed, and the remaining s-expression cells are
then taken as that function's arguments (some of which may be s-expressions that
require evaluation themselves).

So if you're used to writing your programs like this:

```python
foo(bar, baz)
```

The s-expression form would look more like:

```lisp
(foo bar baz)
```

We're going to implement our library in C++, and we're going to look functions
up by name. Because the s-expression can have any number of arguments, it makes
sense to evaluate them based on a couple of iterators (`marker`s), one to the
first argument and one to the end of the s-expression.
 */
using marker = sexpr::const_iterator;

/**
Our built in library functions take the two iterators representing their
arguments and return a string which is the result of running them.
 */
#include <functional>
using builtin = std::function<auto(marker, marker)->std::string>;

/**
We now need a simple data structure that will hold these for us and allow
look-ups by name.
 */
#include <map>
std::map<std::string, builtin> library;

/**
## Evaluating S-Expressions

The first item in an s-expression is a function name that we're going to call.
This means that an empty s-expression is an error.
 */
#include <stdexcept>
std::string eval(sexpr const &s) {
    if (s.begin() == s.end()) {
        throw std::domain_error("Empty sexpr");
        /**
        An alternative design would be to say that an empty s-expression
        evaluates to an empty string, but this is good for us.


        Now, assuming our s-expression contains a first item, we need to fetch
        the function name. Remember that our s-expression is comprised of cells
        (which can contain strings *or* s-expressions) so we need to turn our
        cell into a string. Luckily we already have a function that does that:
        `eval`.
         */
    } else {
        auto const fname = eval(*s.begin());
        /**
        Using `eval` here (rather than just fetching the string out of the cell
        directly) has some important consequences to what this language can do.

        Although we've said so far that the first item in an s-expression has to
        be the function name, by evaluating the cell to fetch that string we can
        actually have an s-expression as the first cell! And evaluating that
        s-expression gives the function name that we will run. Don't worry about
        this for now, there will be an example at the end.

        Now that we have a function name we only need to look it up and call it,
        and of course take into account the possibility that it might not be
        found.
         */
        if (auto const fp = library.find(fname); fp != library.end()) {
            return fp->second(++s.begin(), s.end());
        } else {
            throw std::invalid_argument{fname + " not in the library"};
        }
    }
}
/**
And that's actually it for the whole evaluation engine. Luckily we have some
lines left over to try it out with.

## Running a program
 */
#include <iostream>
int main() {
    try {
        /**
        As with all programming languages we have to make sure to say hello to
        the whole world.

        ```lisp
        (cat "Hello" "world!")
        ```

        Because we didn't bother with a parser we have to write the data
        structure for the program directly into the C++ source code. Thankfully
        it's not too hard.
         */
        using namespace std::string_literals;
        auto const prog = sexpr{"cat"s, "Hello "s, "world!"s};

        /**
        Of course we didn't put anything into our library yet, so let's make
        sure that we do at least have an implementation of `cat` we can call.
         */
        library["cat"] = [](marker pos, marker end) {
            std::string ret;
            while (pos != end) ret += eval(*pos++);
            return ret;
        };
        /**
        Now we can run our program and show the result.
         */
        std::cout << eval(prog) << std::endl;
        /**
        [If you grab this C++ file you can compile and run it using something
        like (the file is available as a Gist on
        Github)](https://gist.github.com/KayEss/45a2e88675832228f57e2d598afc02ae):

        ```bash
        clang++ lisp.cpp -std=c++17 -o lisp
        ./lisp
        ```

        And you'll see some output:

            Hello world!


        ## Errors...

        The very last thing we should do is to print out some sort of error
        message if a run-time error came up.
         */
    } catch (std::exception &e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
}

/**
## More fun

Remember there was the bit about the evaluation of the lead term in the
s-expression? Because the first term can itself be an s-expression you can write
a program to generate the function name you want to call.

If you do try this on your machine, try to change the program to the following
and see what happens.

```cpp
auto const prog = sexpr{
    sexpr{"cat"s, "c"s, "a"s, "t"s},
        "Hello "s, "world!"s};
```

When you run this it will still print "Hello world!" because the s-expression as
the first term gets evaluated and produces the name `cat` which then gets called
to produce the output.


### Order fun

Remember the type we had for `builtin` took `marker`s to the s-expressions of
the arguments? This means that the argument values are evaluated inside of the
library function definition. Look again at the loop in our `cat` function:

```cpp
while (pos != end) ret += eval(*pos++);
```

It would also be a reasonable choice to move that evaluation outside of the
library function and instead have our `builtin` type look like this (a function
that takes a vector of strings and produces a string):

```cpp
using builtin = std::function<auto(std::vector<std::string>)->std::string>;
```

This difference in where the expressions are evaluated is called [*Evaluation
Strategy*](https://en.wikipedia.org/wiki/Evaluation_strategy) and the choice we
made above is called *normal-order*. This is of course the opposite of what
you're used to from pretty much every other programming language, which normally
use *applicative-order*.

More commonly, normal-order is referred to as [*lazy
evaluation*](https://en.wikipedia.org/wiki/Lazy_evaluation) and
applicative-order as [*eager
evaluation*](https://en.wikipedia.org/wiki/Eager_evaluation).

### `if` isn't (an eager) function

In LISP terms our `builtin`s are all [*special
forms*](http://www.lispworks.com/documentation/HyperSpec/Body/03_ababa.htm).
This is important in order to be able to correctly implement *forms* like `if`.

Let's say we want to implement `if` so that it can be used something like this:

```lisp
(if (condition) (true-case) (false-case)) -> string
```

We will always need to evaluate the `condition` expression, and, depending on
what happens, we are for sure going to return either the `true-case` or the
`false-case`. But more importantly we only want to **evaluate** one of them.

If we have to implement `if` in a language that only has *eager evaluation* then
we will evaluate both the `true-case` and the `false-case` even if we only
return the result of one of them.

But with *lazy evaluation* we can save ourselves some processing, and this
control of the evaluation that we give to the library functions is critical if
you want to take this toy language and turn it into something real.

C++ itself is an eager language and this is why we cannot short-circuit in
user-defined implementations of `&&` and `||`, something that can cause us some
pain. [But there is
hope](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0927r2.pdf).
 */
	/// # Build me a LISP

	// Read this nicely formatted at https://kirit.com/Build%20me%20a%20LISP

	/**
	Normally programming language blog posts get started with grammars and syntax
	and parsing and all sorts of stuff like that. I can't be bothered with all of
	that, so instead, let's start with the (maybe more interesting aspect) of the
	actual evaluation of the code. Here we'll build an evaluation engine for LISP
	like languages in less than 50 lines of C++ -- [and literate C++ at
	that!](https://gist.github.com/KayEss/45a2e88675832228f57e2d598afc02ae)

	Lisp uses something called a "cons cell" to store each part of both the program
	and the data. Let's just say we have something called a cell for now, and we'll
	come back to defining what it is later on.
	*/
	struct cell;

	/**
	## S-Expressions

	A Lisp program is actually just something called an s-expression. It's a fancy
	term for a list of cells, but there is a special way to interpret these
	s-expressions. We'll come back to that later on.

	For now, we just need to know that s-expressions describe our LISP program, and
	to run a LISP program we simply evaluate the s-expression.
	*/
	#include <vector>
	using sexpr = std::vector<cell>;

	/**
	If you read the [Wikipedia page on
	s-expressions](https://en.wikipedia.org/wiki/S-expression) you'll see that
	they're usually described in terms of a tree of cells. We're using a vector
	instead, but the two are equivalent.

	Walking down the right branch of the cell is the same as walking to the next
	cell in the vector, and walking down the left branch of the cell is the same
	as nesting an s-expression in a cell.


	## Cells

	Now we can talk about what a cell is. Our simple LISP will only have a single
	data type, strings, so our cells will either contain a string or an
	s-expression.
	*/
	#include <string>
	#include <variant>
	struct cell {
	std::variant<std::string, sexpr> content;
	/**
	We do need to be able to make cells, and they can be made from either
	strings or s-expressions.
	*/
	cell(std::string s) : content{std::move(s)} {}
	cell(sexpr s) : content{std::move(s)} {}
	};
	/**
	## Evaluation

	There are three different sorts of data structure that we have so far:

	1. Strings
	2. S-expressions
	3. Cells

	Because running our programs is going to be a matter of evaluating these
	structures, we'll need to consider what "evaluation" means for each of these.

	First of all, strings are just themselves.
	*/
	std::string eval(std::string s) { return s; }

	/**
	The s-expressions are a bit more interesting, but for now we just have to
	promise to implement them so that we can start with the simpler cells (and
	because we have to implement the evaluation of the s-expressions in terms of
	evaluating cells)...
	*/
	std::string eval(sexpr const &);

	/**
	Evaluating a cell is pretty simple. It's either going to be a string or a
	s-expression, so depending on what we find in the cell content we can just
	forward to evaluating whichever one we find.
	*/
	std::string eval(cell const &c) {
	return std::visit([](auto const &c) { return eval(c); }, c.content);
	}

	/**
	## The standard library

	Although an s-expression is just a vector of cells, the first item in it is
	special. To evaluate an s-expression we take the first item in it to be the name
	of a function that is to be executed, and the remaining s-expression cells are
	then taken as that function's arguments (some of which may be s-expressions that
	require evaluation themselves).

	So if you're used to writing your programs like this:

	```python
	foo(bar, baz)
	```

	The s-expression form would look more like:

	```lisp
	(foo bar baz)
	```

	We're going to implement our library in C++, and we're going to look functions
	up by name. Because the s-expression can have any number of arguments, it makes
	sense to evaluate them based on a couple of iterators (`marker`s), one to the
	first argument and one to the end of the s-expression.
	*/
	using marker = sexpr::const_iterator;

	/**
	Our built in library functions take the two iterators representing their
	arguments and return a string which is the result of running them.
	*/
	#include <functional>
	using builtin = std::function<auto(marker, marker)->std::string>;

	/**
	We now need a simple data structure that will hold these for us and allow
	look-ups by name.
	*/
	#include <map>
	std::map<std::string, builtin> library;

	/**
	## Evaluating S-Expressions

	The first item in an s-expression is a function name that we're going to call.
	This means that an empty s-expression is an error.
	*/
	#include <stdexcept>
	std::string eval(sexpr const &s) {
	if (s.begin() == s.end()) {
	throw std::domain_error("Empty sexpr");
	/**
	An alternative design would be to say that an empty s-expression
	evaluates to an empty string, but this is good for us.


	Now, assuming our s-expression contains a first item, we need to fetch
	the function name. Remember that our s-expression is comprised of cells
	(which can contain strings or s-expressions) so we need to turn our
	cell into a string. Luckily we already have a function that does that:
	`eval`.
	*/
	} else {
	auto const fname = eval(*s.begin());
	/**
	Using `eval` here (rather than just fetching the string out of the cell
	directly) has some important consequences to what this language can do.

	Although we've said so far that the first item in an s-expression has to
	be the function name, by evaluating the cell to fetch that string we can
	actually have an s-expression as the first cell! And evaluating that
	s-expression gives the function name that we will run. Don't worry about
	this for now, there will be an example at the end.

	Now that we have a function name we only need to look it up and call it,
	and of course take into account the possibility that it might not be
	found.
	*/
	if (auto const fp = library.find(fname); fp != library.end()) {
	return fp->second(++s.begin(), s.end());
	} else {
	throw std::invalid_argument{fname + " not in the library"};
	}
	}
	}
	/**
	And that's actually it for the whole evaluation engine. Luckily we have some
	lines left over to try it out with.

	## Running a program
	*/
	#include <iostream>
	int main() {
	try {
	/**
	As with all programming languages we have to make sure to say hello to
	the whole world.

	```lisp
	(cat "Hello" "world!")
	```

	Because we didn't bother with a parser we have to write the data
	structure for the program directly into the C++ source code. Thankfully
	it's not too hard.
	*/
	using namespace std::string_literals;
	auto const prog = sexpr{"cat"s, "Hello "s, "world!"s};

	/**
	Of course we didn't put anything into our library yet, so let's make
	sure that we do at least have an implementation of `cat` we can call.
	*/
	library["cat"] = [](marker pos, marker end) {
	std::string ret;
	while (pos != end) ret += eval(*pos++);
	return ret;
	};
	/**
	Now we can run our program and show the result.
	*/
	std::cout << eval(prog) << std::endl;
	/**
	[If you grab this C++ file you can compile and run it using something
	like (the file is available as a Gist on
	Github)](https://gist.github.com/KayEss/45a2e88675832228f57e2d598afc02ae):

	```bash
	clang++ lisp.cpp -std=c++17 -o lisp
	./lisp
	```

	And you'll see some output:

	Hello world!


	## Errors...

	The very last thing we should do is to print out some sort of error
	message if a run-time error came up.
	*/
	} catch (std::exception &e) {
	std::cerr << "Error: " << e.what() << std::endl;
	return 1;
	}
	}

	/**
	## More fun

	Remember there was the bit about the evaluation of the lead term in the
	s-expression? Because the first term can itself be an s-expression you can write
	a program to generate the function name you want to call.

	If you do try this on your machine, try to change the program to the following
	and see what happens.

	```cpp
	auto const prog = sexpr{
	sexpr{"cat"s, "c"s, "a"s, "t"s},
	"Hello "s, "world!"s};
	```

	When you run this it will still print "Hello world!" because the s-expression as
	the first term gets evaluated and produces the name `cat` which then gets called
	to produce the output.


	### Order fun

	Remember the type we had for `builtin` took `marker`s to the s-expressions of
	the arguments? This means that the argument values are evaluated inside of the
	library function definition. Look again at the loop in our `cat` function:

	```cpp
	while (pos != end) ret += eval(*pos++);
	```

	It would also be a reasonable choice to move that evaluation outside of the
	library function and instead have our `builtin` type look like this (a function
	that takes a vector of strings and produces a string):

	```cpp
	using builtin = std::function<auto(std::vector<std::string>)->std::string>;
	```

	This difference in where the expressions are evaluated is called [*Evaluation
	Strategy*](https://en.wikipedia.org/wiki/Evaluation_strategy) and the choice we
	made above is called normal-order. This is of course the opposite of what
	you're used to from pretty much every other programming language, which normally
	use applicative-order.

	More commonly, normal-order is referred to as [*lazy
	evaluation*](https://en.wikipedia.org/wiki/Lazy_evaluation) and
	applicative-order as [*eager
	evaluation*](https://en.wikipedia.org/wiki/Eager_evaluation).

	### `if` isn't (an eager) function

	In LISP terms our `builtin`s are all [*special
	forms*](http://www.lispworks.com/documentation/HyperSpec/Body/03_ababa.htm).
	This is important in order to be able to correctly implement forms like `if`.

	Let's say we want to implement `if` so that it can be used something like this:

	```lisp
	(if (condition) (true-case) (false-case)) -> string
	```

	We will always need to evaluate the `condition` expression, and, depending on
	what happens, we are for sure going to return either the `true-case` or the
	`false-case`. But more importantly we only want to evaluate one of them.

	If we have to implement `if` in a language that only has eager evaluation then
	we will evaluate both the `true-case` and the `false-case` even if we only
	return the result of one of them.

	But with lazy evaluation we can save ourselves some processing, and this
	control of the evaluation that we give to the library functions is critical if
	you want to take this toy language and turn it into something real.

	C++ itself is an eager language and this is why we cannot short-circuit in
	user-defined implementations of `&&` and `\|\|`, something that can cause us some
	pain. [But there is
	hope](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0927r2.pdf).
	*/