jlouis/gist:85007bfc4f7b8c75ad8e Secret

# On Erlang, State and Crashes

There are two things which are ubiquitous in Erlang:

 * A process has an internal state.
 * When the process crashes, that internal state is gone.

These two facts pose some problems for new Erlang programmers. If my
state is gone, what should I do then? The short answer to the
question is that some other process must have the state and provide
the backup, but this is hardly a fulfilling answer: It is turtles all
the way down. Now, that other process might die, and then *another*
beast of a process must have the state. And this observation continues
ad infinitum. So what is the Erlang programmer to do? This is my
attempt at answering the question.

## State Classification

The internal state of an Erlang process can naturally be
classified. First, state has different value. State related to the
current computation residing on the stack may not be important at all
after a process crash. It crashed for a reason and chances are that the
exact same state will bring down the process again with the same
error. The same observation might apply to some internal state: It is
like a scratchpad or a blackboard: when the next lecture starts, it
can be erased because it has served its purpose.

Next is static state. If a process is governing a TCP/IP connection
that process should probably connect to the same TCP/IP Address/Port
pair if it crashes and is restarted. We call that kind of data
configuration or *static* data. It is there, but it is not meant to
change over the course of the application, or only change rarely.

Finally, our crude classification of state has *dynamic* data. This class is the
data we generate over the course of the running program, get from user
input, create because other programs communicate with us and so
on. The class can be split into two major components: State we can
compute from other data and state we cannot compute. The computable
state is somewhat less of a problem. We can basically just recompute
it after a crash, so the real problem is the other kind of
user/program-supplied information.

In other words, we have three major kinds of state: *scratchpad,
static and dynamic*.

## The Error Kernel

Erlang programs have a concept called the *error kernel*. The kernel
is the part of the program which *must* be correct for the program to
operate correctly. Good Erlang design begins with identifying the error kernel
of the system: What part *must* not fail or it will bring down the
whole system? Once you have the kernel identified, you seek to make it
minimal. Whenever the kernel is about to do an operation which is
dangerous and might crash, you "outsource" that computation to another
process, a dumb worker. If it crashes and is killed, nothing
really bad has happened - since the kernel keeps going.
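
A minimal sketch of that outsourcing idea, assuming a hypothetical module `kernel_sketch` and an integer division standing in for the dangerous operation. The caller plays the role of the kernel: it hands the work to a monitored throwaway process and only ever sees the outcome.

```erlang
-module(kernel_sketch).
-export([risky_div/2]).

%% Run a computation that may crash inside a throwaway worker.
%% The calling process (standing in for the error kernel) never
%% executes the dangerous code itself.
risky_div(A, B) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, A div B} end),
    receive
        {Ref, Result} ->
            %% Worker answered; drop the monitor and any pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Worker died; the caller is still alive and decides what to do.
            {error, Reason}
    end.
```

Dividing by zero kills only the worker. The caller gets an `{error, Reason}` tuple and keeps going.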

Identifying the kernel plugs the "turtles all the way down" hole. As
soon as the kernel is hit, we assume correctness. But since the kernel is
small, the *trusted computing base* of our program is likewise small. We
only need to trust a small part of the program, and that part is also
fairly simple.

A visualization is this: A program is a patchwork of small
squares. Some of the squares are red, and these are the "error
kernel". Most (naively implemented) imperative programs are mostly
red, save for a few squares. These are the squares where exceptions
are handled explicitly and the error is correctly mitigated. The
kernel is thus fairly large. In contrast, robustness-aware Erlang
programs have few red squares - most of the patchwork is white. It is
a design-goal to get as few red squares as possible. It is achieved by
delegating dangerous work to the white areas so a crash does not
affect the kernel.

## Handling the state classes

Each class must be handled differently. First there is the
scratchpad/blackboard class. If a process crashes, the class is
interesting because it contains the stack trace and usually the data
which tells a story - namely *how* and *why* the process crashed. We
usually export this data via SASL's error logger, so we can look at a
crash report and understand what went wrong. After all, the internal
state is gone after the crash report is done and logged.

Next, there is the static class. The simplest thing is to have another
process feed in the static data. This can be done by, among others,
the supervisor, by asking an ETS table, by asking GProc (if you use
gproc in your system), by asking another process or by discovery
through the call `application:get_env/2`. It is important to note just
how static the data is. Some of the approaches give you few options for
changing the data easily, whereas others give you a lot of freedom but
require more of the process in turn.
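
As a rough sketch of two of these options combined: the process below takes its static data from the arguments its supervisor passes in, and falls back to the application environment. The module name `conn_sketch`, the application name `my_app` and the keys `host` and `port` are assumptions made up for the example. It uses `application:get_env/3`, the variant that takes a default value.

```erlang
-module(conn_sketch).
-behaviour(gen_server).
-export([start_link/1, endpoint/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Args is a proplist handed in by whoever starts us (typically the
%% supervisor), so the same static data is fed in again on every restart.
start_link(Args) ->
    gen_server:start_link(?MODULE, Args, []).

endpoint(Pid) ->
    gen_server:call(Pid, endpoint).

init(Args) ->
    Host = lookup(host, Args, "localhost"),
    Port = lookup(port, Args, 8080),
    {ok, #{host => Host, port => Port}}.

handle_call(endpoint, _From, #{host := Host, port := Port} = State) ->
    {reply, {Host, Port}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Start arguments win; otherwise ask the application environment
%% of the (hypothetical) application my_app; otherwise use Default.
lookup(Key, Args, Default) ->
    case proplists:get_value(Key, Args) of
        undefined -> application:get_env(my_app, Key, Default);
        Value -> Value
    end.
```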

Finally, the fully dynamic data is the nasty culprit. If you can
recompute the data, you are lucky. As an example from my etorrent
application, each peer has a dynamic table of what parts of a torrent
file the given peer has. So the controlling process has an internal
table of this information. But if we crash and reconnect to the peer,
the peer will, by virtue of the BitTorrent protocol, send us this
information again. So that information is hardly worth keeping around. Other
times, you can simply recalculate the information when your process
restarts, and that is almost never a problem either.
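
To illustrate the recomputable case, here is a small sketch that is not etorrent's actual code. The module `have_sketch` and the fun `FetchPieces` are hypothetical; the fun stands in for whatever delivers the data again after a restart, such as the peer resending what it has.

```erlang
-module(have_sketch).
-export([start/1, has_piece/2]).

%% FetchPieces is a zero-arity fun returning a list of piece indexes.
%% It runs on every (re)start, so the table is never saved anywhere.
start(FetchPieces) ->
    spawn(fun() -> loop(sets:from_list(FetchPieces())) end).

has_piece(Pid, Index) ->
    Ref = make_ref(),
    Pid ! {has_piece, self(), Ref, Index},
    receive
        {Ref, Answer} -> Answer
    after 1000 ->
        timeout
    end.

loop(Have) ->
    receive
        {has_piece, From, Ref, Index} ->
            From ! {Ref, sets:is_element(Index, Have)},
            loop(Have)
    end.
```

If the process dies, starting it again with the same fun rebuilds the table from scratch.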

So what about the user supplied data? This is where the error kernel
comes in. You need to *protect* data which you cannot
reconstruct. You protect it by shoving it into the error kernel and
keeping some simple state maintenance processes there to handle the
state. A word of warning though: If your state is corrupted, it means
that processes basing their work on the state will do something
wrong. To mitigate this, it is important to make some general sanity
checking of your data. Make it a priority to check your data for
invariants if you find them. And don't blindly trust non-error-kernel
parts of the system.
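
A sketch of what such a state maintenance process might look like. The module `keeper_sketch` is hypothetical, and so is the invariant (values must be non-negative integers). The point is only that the keeper does nothing dangerous itself and refuses data that breaks the invariant.

```erlang
-module(keeper_sketch).
-behaviour(gen_server).
-export([start_link/0, store/3, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

store(Pid, Key, Value) ->
    gen_server:call(Pid, {store, Key, Value}).

fetch(Pid, Key) ->
    gen_server:call(Pid, {fetch, Key}).

init([]) ->
    {ok, #{}}.

%% Assumed invariant for this sketch: values are non-negative integers.
%% Anything else is rejected and the stored state is left untouched.
handle_call({store, Key, Value}, _From, State)
  when is_integer(Value), Value >= 0 ->
    {reply, ok, State#{Key => Value}};
handle_call({store, _Key, _Value}, _From, State) ->
    {reply, {error, invalid}, State};
handle_call({fetch, Key}, _From, State) ->
    {reply, maps:find(Key, State), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

A buggy worker that sends garbage gets `{error, invalid}` back, and the protected data stays as it was.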

If a process crashes, you should definitely think about how much of its
internal state you want to recycle. If you recycle everything, you risk
hitting the exact same bug again and crashing. Rather, there may be a
benefit to only recycling parts of the internal state.
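
One way to picture partial recycling, assuming the old state was saved somewhere as a map and that the keys `host` and `port` are its static part (both are assumptions for the example):

```erlang
-module(recycle_sketch).
-export([recycle/1]).

%% Keep only the static configuration from a saved state snapshot.
%% Everything else (buffers, counters, half-parsed input) is dropped,
%% since it may be exactly what triggered the crash.
recycle(OldState) ->
    maps:with([host, port], OldState).
```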

## The next step: Onion-layered Error kernels

The next logical step up is to recognize that the error kernel is not
discrete. You want to regard the error kernel as an onion. Whenever
you peel off a layer, you get a step closer to the trusted computing
base of the application. Then your system design is to push down state
maintenance to the outermost layer in the onion where it still makes
sense. This in effect protects one part of the application from
others. In Etorrent, we can download multiple torrent files at the
same time. If one such torrent download fails, there is no reason it
should affect the other torrent files. We can add a layer to the
onion: Some state which is local to the torrent is kept in a separate
supervisor tree - to mitigate the error if that part fails.
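
The layering can be sketched with nested supervisors. This is not Etorrent's real supervision tree; the module `onion_sketch`, its function names and the restart limits are all made up for the illustration. A top supervisor starts one small supervisor per torrent, and each of those owns that torrent's worker.

```erlang
-module(onion_sketch).
-behaviour(supervisor).
-export([start_link/0, add_torrent/2, start_torrent_sup/1, start_worker/1]).
-export([init/1]).

%% Inner layer of the onion: one supervisor for the whole program.
start_link() ->
    supervisor:start_link(?MODULE, top).

%% Add an outer layer: a supervisor dedicated to a single torrent.
add_torrent(Top, Id) ->
    supervisor:start_child(Top, [Id]).

start_torrent_sup(Id) ->
    supervisor:start_link(?MODULE, {torrent, Id}).

start_worker(Id) ->
    {ok, proc_lib:spawn_link(fun() -> worker_loop(Id) end)}.

init(top) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 5, period => 10},
    Child = #{id => torrent_sup,
              start => {?MODULE, start_torrent_sup, []},
              type => supervisor},
    {ok, {SupFlags, [Child]}};
init({torrent, Id}) ->
    %% Everything local to one torrent lives and dies together here.
    SupFlags = #{strategy => one_for_all, intensity => 3, period => 10},
    Child = #{id => worker,
              start => {?MODULE, start_worker, [Id]}},
    {ok, {SupFlags, [Child]}}.

%% Stand-in worker: sending it the atom crash simulates a bug.
worker_loop(Id) ->
    receive
        crash -> exit({crashed, Id});
        _ -> worker_loop(Id)
    end.
```

A crash in one torrent's worker is handled by that torrent's own supervisor. The other torrents and the top supervisor are not touched unless the restart limits of the outer layer are exceeded.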

The net effect is program robustness: A bug in the program will
suddenly need perseverance. It has to penetrate several layers in the
onion before it can take the full program down. And if the Erlang
system is well designed, even the most grave bugs can only penetrate
so far before the stopping power of the onion layers brings it to a
halt.

Furthermore, it underpins a mantra of Erlang programs: Small bugs have
small impact. They won't even penetrate the first layer. And they will
hardly be a scratch in the fabric of computing.

(Aside: Good computer security engineering uses the same onion-layered
model. There are strong similarities between protecting a computer
system against a well-armed intruder and protecting a program against
an aggressive, persistent, dangerous and maiming bug. End of Aside)