# On Erlang, State and Crashes

There are two things which are ubiquitous in Erlang:

* A process has an internal state.
* When the process crashes, that internal state is gone.

These two facts pose some problems for new Erlang programmers. If my
state is gone, then what should I do? The short answer to the
question is that some other process must have the state and provide
the backup, but this is hardly a fulfilling answer: It is turtles all
the way down. Now, that other process might die, and then *another*
beast of a process must have the state. And this observation continues
ad infinitum. So what is the Erlang programmer to do? This is my
attempt at answering the question.
## State Classification
The internal state of an Erlang process can naturally be
classified. First, not all state is equally valuable. State related to
the current computation residing on the stack may not be important at
all after a process crash. It crashed for a reason, and chances are
that the exact same state will bring down the process again with the
same error. The same observation might apply to some internal state.
It is like a scratchpad or a blackboard: when the next lecture starts,
it can be erased because it has served its purpose.

Next is static state. If a process is governing a TCP/IP connection,
that process should probably connect to the same IP address and port
if it crashes and is restarted. We call that kind of data
configuration or *static* data. It is there, but it is not meant to
change over the course of the application, or only change rarely.

Finally, our crude classification of state has *dynamic* data. This
class is the data we generate over the course of the running program,
get from user input, create because other programs communicate with us
and so on. The class can be split into two major components: state we
can compute from other data and state we cannot compute. The computable
state is somewhat less of a problem. We can basically just recompute
it after a crash, so the real problem is the other kind of dynamic
state: the kind we cannot recompute.

In other words, we have three major kinds of state: *scratchpad,
static and dynamic*.
## The Error Kernel
Erlang programs have a concept called the *error kernel*. The kernel
is the part of the program which *must* be correct for the system to
operate correctly. Good Erlang design begins with identifying the
error kernel of the system: which part *must* not fail, because its
failure will bring down the whole system? Once you have the kernel
identified, you seek to make it minimal. Whenever the kernel is about
to do an operation which is dangerous and might crash, you "outsource"
that computation to another process, a dumb, expendable worker. If it
crashes and is killed, nothing really bad has happened, since the
kernel keeps going.
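
As a minimal sketch of that kind of outsourcing, assume the kernel process wants the result of some risky fun without being linked to its fate. The module name `kernel_sketch` and the function `run_risky/2` are made up for illustration; only `spawn_monitor/1`, `erlang:demonitor/2` and `exit/2` are standard BIFs.

```erlang
-module(kernel_sketch).
-export([run_risky/2]).

%% Run Fun in a separate, monitored (not linked) process, so that a
%% crash inside Fun cannot take the calling process down with it.
%% Returns {ok, Result}, {error, Reason} or {error, timeout}.
run_risky(Fun, Timeout) when is_function(Fun, 0) ->
    %% The worker reports its result through its exit reason, so the
    %% caller only has to wait for a single 'DOWN' message.
    {Pid, Ref} = spawn_monitor(fun() -> exit({done, Fun()}) end),
    receive
        {'DOWN', Ref, process, Pid, {done, Result}} ->
            {ok, Result};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% The worker crashed; the caller is still alive and can
            %% decide what to do about it.
            {error, Reason}
    after Timeout ->
        %% Stop monitoring (and drop a late 'DOWN'), then kill the worker.
        erlang:demonitor(Ref, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```

Calling `kernel_sketch:run_risky(fun() -> erlang:error(boom) end, 1000)` returns an `{error, {boom, Stacktrace}}` tuple to the caller instead of crashing it. In a real system the worker would more likely live under a supervisor; the monitor is just the smallest way to show the idea.
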

Identifying the kernel plugs the "turtles all the way down" hole. As
soon as the kernel is hit, we assume correctness. But since the kernel
is small, the *trusted computing base* of our program is likewise
small. We only need to trust a small part of the program.

A visualization is this: a program is a patchwork of small
squares. Some of the squares are red, and these are the "error
kernel". Most (naively implemented) imperative programs are mostly
red, save for a few white squares: the places where exceptions
are handled explicitly and the error is correctly mitigated. The
kernel is thus fairly large. In contrast, robustness-aware Erlang
programs have few red squares, and most of the patchwork is white. It
is a design goal to get as few red squares as possible. This is
achieved by delegating dangerous work to the white areas so a crash
does not affect the kernel.
## Handling the state classes
Each class must be handled differently. First there is the
scratchpad/blackboard class. If a process crashes, the class is
interesting because it contains the stack trace and usually the data
which tells a story - namely *how* and *why* the process crashed. We
usually export this data via SASL's error logger, so we can look at a
crash report and understand what went wrong. After all, the internal
state is gone after the crash report is done and logged.

Next, there is the static class. The simplest thing is to have another
process feed in the static data. This can be done by, among others,
the supervisor, by asking an ETS table, by asking gproc (if you use
it in your system), by asking another process or by discovery
through the call `application:get_env/2`. It is important to note just
how static the data is. Some of these approaches make it hard to
change the data later, whereas others give you a lot of freedom but
require more of the process in turn.
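
To make the `application:get_env/2` variant concrete, here is a small sketch of a process that re-reads its static data every time it is (re)started. The application name `my_app` and the keys `peer_host` and `peer_port` are assumptions for the example, as is the module itself; a real connection process would go on to call something like `gen_tcp:connect/3` with these values.

```erlang
-module(conn_sketch).
-behaviour(gen_server).

-export([start_link/0, endpoint/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Ask the process which endpoint it was configured with.
endpoint(Pid) ->
    gen_server:call(Pid, endpoint).

init([]) ->
    %% Static data lives outside the process, in the application
    %% environment, so it survives any crash of this process. The
    %% app name and keys are hypothetical.
    {ok, Host} = application:get_env(my_app, peer_host),
    {ok, Port} = application:get_env(my_app, peer_port),
    %% Scratchpad state starts out empty after every restart.
    {ok, #{host => Host, port => Port, scratch => []}}.

handle_call(endpoint, _From, #{host := Host, port := Port} = State) ->
    {reply, {Host, Port}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

If a supervisor restarts this process, `init/1` runs again and picks up the same host and port, while whatever was on the scratchpad is deliberately thrown away.
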

Finally, the fully dynamic data is the nasty culprit. If you can
recompute the data, you are lucky. As an example from my etorrent
application, each peer has a dynamic table of what parts of a torrent
file the given peer has. So the controlling process has an internal
table of this information. But if we crash and reconnect to the peer,
the peer will, by virtue of the BitTorrent protocol, send us this
information again. So that information is hardly worth keeping
around. Other times, you can simply recalculate the information when
your process restarts, and that is almost never a problem either.

So what about the user-supplied data? This is where the error kernel
comes in. You need to *protect* data which you cannot
reconstruct. You protect it by shoving it into the error kernel and
keeping some simple state maintenance processes there to handle the
state. A word of warning though: if your state is corrupted,
processes basing their work on that state will do something
wrong. To mitigate this, it is important to perform some general
sanity checks on your data. If you can identify invariants in your
data, make it a priority to check them. And don't blindly trust
non-error-kernel parts of the system.
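
A state maintenance process of this kind can be very small. The sketch below is one possible shape, not a description of any particular system: the module `keeper_sketch`, its `store/3` and `fetch/2` functions, and the invariant itself (values must be non-negative integers, think of a byte counter) are all invented for illustration.

```erlang
-module(keeper_sketch).
-behaviour(gen_server).

-export([start_link/0, store/3, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

store(Keeper, Key, Value) ->
    gen_server:call(Keeper, {store, Key, Value}).

fetch(Keeper, Key) ->
    gen_server:call(Keeper, {fetch, Key}).

init([]) ->
    {ok, #{}}.

%% The keeper does no risky work of its own: it checks the invariant
%% and either updates the map or refuses. Bad input from a worker is
%% rejected with an error tuple instead of crashing the keeper.
handle_call({store, Key, Value}, _From, State) ->
    case valid(Value) of
        true  -> {reply, ok, State#{Key => Value}};
        false -> {reply, {error, invariant_violated}, State}
    end;
handle_call({fetch, Key}, _From, State) ->
    %% maps:find/2 returns {ok, Value} or error.
    {reply, maps:find(Key, State), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Hypothetical invariant: values are non-negative integers.
valid(Value) ->
    is_integer(Value) andalso Value >= 0.
```

The point of the design is that the keeper stays boring. Anything that parses, computes or talks to the network happens in workers outside the kernel, and the keeper only accepts results that pass its checks.
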

If a process crashes, you should definitely think about how much of
its internal state you want to recycle. If you recycle everything, you
risk hitting the exact same bug again and crashing. Rather, there may
be a benefit to only recycling parts of the internal state.
## The next step: Onion-layered Error kernels
The next logical step up is to recognize that the error kernel is not
discrete. You want to regard the error kernel as an onion. Whenever
you peel off a layer, you get a step closer to the trusted computing
base of the application. Then your system design is to push state
maintenance down to the outermost layer in the onion where it still
makes sense. This in effect protects one part of the application from
others. In etorrent, we can download multiple torrent files at the
same time. If one such torrent download fails, there is no reason it
should affect the other torrent files. We can add a layer to the
onion: some state which is local to the torrent is kept in a separate
supervisor tree, to mitigate the error if that part fails.
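
One way to picture such a layer is a pool supervisor that starts one small supervisor per torrent. The following is a sketch under that assumption and is not taken from etorrent's source; the module name and function names are hypothetical, and the per-torrent supervisor is left without children to keep the example short.

```erlang
-module(torrent_onion_sketch).
-behaviour(supervisor).

-export([start_link/0, start_torrent/2, start_torrent_sup/1]).
-export([init/1]).

%% Outer layer: a pool that owns one supervisor per torrent.
start_link() ->
    supervisor:start_link(?MODULE, pool).

%% Add a new per-torrent layer under the pool. With
%% simple_one_for_one, [TorrentId] is appended to the start arguments.
start_torrent(Pool, TorrentId) ->
    supervisor:start_child(Pool, [TorrentId]).

%% Inner layer: a supervisor for everything local to one torrent.
start_torrent_sup(TorrentId) ->
    supervisor:start_link(?MODULE, {torrent, TorrentId}).

init(pool) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5, period => 60},
    Child = #{id => torrent_sup,
              start => {?MODULE, start_torrent_sup, []},
              restart => transient,
              type => supervisor},
    {ok, {SupFlags, [Child]}};
init({torrent, _TorrentId}) ->
    %% The torrent-local state keeper and the peer workers would be
    %% listed here. If they fail too often, only this torrent's
    %% subtree gives up; the pool and the other torrents keep running.
    {ok, {#{strategy => one_for_all, intensity => 1, period => 60}, []}}.
```

Killing one per-torrent supervisor leaves its siblings and the pool untouched, which is the isolation the onion picture is after.
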

The net effect is program robustness: a bug in the program will
suddenly need perseverance. It has to penetrate several layers in the
onion before it can take the full program down. And if the Erlang
system is well designed, even the most grave bugs can only penetrate
so far before the stopping power of the onion layers brings them to a
halt.

Furthermore, it underpins a mantra of Erlang programs: small bugs have
small impact. They won't even penetrate the first layer. And they will
hardly be a scratch in the fabric of computing.

(Aside: Good computer security engineering uses the same onion-layered
model. There are strong similarities between protecting a computer
system against a well-armed intruder and protecting a program against
an aggressive, persistent, dangerous and maiming bug. End of Aside)