# On Erlang, State and Crashes

There are two things which are ubiquitous in Erlang:

* A process has an internal state.
* When the process crashes, that internal state is gone.

These two facts pose some problems for new Erlang programmers. If my
state is gone, what should I do then? The short answer to the
question is that some other process must have the state and provide
the backup, but this is hardly a fulfilling answer: It is turtles all
the way down. That other process might die too, and then *another*
beast of a process must have the state. And this observation continues
ad infinitum. So what is the Erlang programmer to do? This is my
attempt at answering the question.

## State Classification

The internal state of an Erlang process can naturally be
classified. First, not all state is equally valuable. State related to
the current computation residing on the stack may not be important at
all after a process crash. It crashed for a reason and chances are that
the exact same state will bring down the process again with the same
error. The same observation might apply to some internal state: It is
like a scratchpad or a blackboard: when the next lecture starts, it
can be erased because it has served its purpose.

Next is static state. If a process is governing a TCP/IP connection,
that process should probably connect to the same TCP/IP address/port
pair if it crashes and is restarted. We call that kind of data
configuration or *static* data. It is there, but it is not meant to
change over the course of the application, or only change rarely.

Finally, our crude classification of state has *dynamic* data. This
class is the data we generate over the course of the running program,
get from user input, create because other programs communicate with us
and so on. The class can be split into two major components: state we
can compute from other data and state we cannot compute. The computable
state is somewhat less of a problem. We can basically just recompute
it after a crash, so the real problem is the other kind:
user/program-supplied information.

In other words, we have three major kinds of state: *scratchpad,
static and dynamic*.
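
To make the classification concrete, here is a minimal sketch of a
process state with every field tagged by its class. The module name,
the field names and the idea of a saved copy of the requests are all
hypothetical, chosen only for illustration.

```erlang
-module(state_classes).
-export([new/2, add_request/2, after_crash/2]).

%% Hypothetical state of a process governing one TCP connection.
new(Host, Port) ->
    #{host => Host,            % static: configuration
      port => Port,            % static: configuration
      requests => [],          % dynamic: supplied from outside
      request_count => 0,      % dynamic: computable from requests
      parse_buffer => <<>>}.   % scratchpad: only useful mid-computation

%% Normal operation: record a request and keep the derived count in sync.
add_request(Req, State = #{requests := Reqs, request_count := N}) ->
    State#{requests := [Req | Reqs], request_count := N + 1}.

%% Build the state for a restarted process. Static data is fed in
%% again, the non-computable data must come from a copy held elsewhere
%% (an assumption of this sketch), the computable part is recomputed
%% and the scratchpad simply starts out empty.
after_crash({Host, Port}, SavedRequests) ->
    Fresh = new(Host, Port),
    Fresh#{requests := SavedRequests,
           request_count := length(SavedRequests)}.
```

Only `requests` needs protection here. Everything else is either fed
in again, recomputed or thrown away.
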

## The Error Kernel

Erlang programs have a concept called the *error kernel*. The kernel
is the part of the program which *must* be correct for the program to
operate correctly. Good Erlang design begins with identifying the error
kernel of the system: What part *must* not fail or it will bring down
the whole system? Once you have the kernel identified, you seek to make
it minimal. Whenever the kernel is about to do an operation which is
dangerous and might crash, you "outsource" that computation to another
process, a simple, disposable worker. If it crashes and is killed,
nothing really bad has happened, since the kernel keeps going.

Identifying the kernel plugs the "turtles all the way down" hole. As
soon as the kernel is hit, we assume correctness. But since the kernel
is small, the *trusted computing base* of our program is likewise
small. We only need to trust a small part of the program, and that part
is also fairly simple.
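
A minimal sketch of such outsourcing, assuming the dangerous work can
be wrapped in a fun of no arguments. The module and function names are
hypothetical. The kernel process never runs the fun itself. It only
waits for the monitor message from a throwaway worker.

```erlang
-module(kernel_outsource).
-export([run_risky/2]).

%% Run Fun in a separate, unlinked worker and wait for the outcome.
%% The worker hands back its result through its exit reason, so a
%% single 'DOWN' message covers both success and crash.
run_risky(Fun, Timeout) when is_function(Fun, 0) ->
    {Pid, MRef} = spawn_monitor(fun() -> exit({done, Fun()}) end),
    receive
        {'DOWN', MRef, process, Pid, {done, Result}} ->
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The worker crashed. The caller is still alive and decides
            %% what to do next.
            {error, Reason}
    after Timeout ->
        %% Stop listening first, then get rid of the stuck worker.
        erlang:demonitor(MRef, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```

Calling `kernel_outsource:run_risky(fun() -> erlang:error(boom) end, 1000)`
returns an `{error, {boom, Stacktrace}}` tuple instead of taking the
caller down with it.
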

A visualization is this: A program is a patchwork of small
squares. Some of the squares are red, and these are the "error
kernel". Most (naively implemented) imperative programs are mostly
red, save for a few white squares. The white squares are the places
where exceptions are handled explicitly and the error is correctly
mitigated. The kernel is thus fairly large. In contrast,
robustness-aware Erlang programs have few red squares - most of the
patchwork is white. It is a design goal to get as few red squares as
possible. It is achieved by delegating dangerous work to the white
areas so a crash does not affect the kernel.

## Handling the state classes

Each class must be handled differently. First there is the
scratchpad/blackboard class. If a process crashes, the class is
interesting because it contains the stack trace and usually the data
which tells a story - namely *how* and *why* the process crashed. We
usually export this data via SASL's error logger, so we can look at a
crash report and understand what went wrong. After all, the internal
state is gone after the crash report is done and logged.

Next, there is the static class. The simplest thing is to have another
process feed in the static data. This can be done by, among others,
the supervisor, by asking an ETS table, by asking gproc (if you use
it in your system), by asking another process or by discovery
through the call `application:get_env/2`. It is important to note just
how static the data is. Some of these approaches make the data hard to
change later, while others give you a lot of freedom but require more
of the process in turn.
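
Two of these sources can be combined in a few lines. The sketch below
assumes the supervisor hands the process a property list of start
arguments, and that anything missing is looked up through
`application:get_env/2`. The application name `my_app`, the keys
`host` and `port` and the defaults are hypothetical.

```erlang
-module(static_config).
-export([endpoint/1]).

%% Resolve the static address/port pair for a (re)started process.
%% Args is the property list passed down by the supervisor.
endpoint(Args) ->
    Host = lookup(host, Args, "localhost"),
    Port = lookup(port, Args, 6881),
    {Host, Port}.

%% Prefer a value from the start arguments, then the application
%% environment of the (hypothetical) application my_app, then a
%% built-in default.
lookup(Key, Args, Default) ->
    case proplists:lookup(Key, Args) of
        {Key, Value} ->
            Value;
        none ->
            case application:get_env(my_app, Key) of
                {ok, Value} -> Value;
                undefined -> Default
            end
    end.
```

The start arguments are fixed when the child specification is written,
while the application environment can be changed in a running
system. Which source wins is therefore a statement about how static the
data really is.
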

Finally, the fully dynamic data is the nasty culprit. If you can
recompute the data, you are lucky. As an example from my etorrent
application, each peer has a dynamic table of what parts of a torrent
file the given peer has. So the controlling process has an internal
table of this information. But if we crash and reconnect to the peer,
the peer will, by virtue of the BitTorrent protocol, send us this
information again. So that information is hardly worth keeping
around. Other times, you can simply recalculate the information when
your process restarts, and that is almost never a problem either.

So what about the user-supplied data? This is where the error kernel
comes in. You need to *protect* data which you cannot
reconstruct. You protect it by shoving it into the error kernel and
keeping some simple state maintenance processes there to handle the
state. A word of warning though: If your state is corrupted, it means
that processes basing their work on the state will do something
wrong. To mitigate this, it is important to do some general sanity
checking of your data. Make it a priority to check your data for
invariants if you find them. And don't blindly trust non-error-kernel
parts of the system.
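
Such a state maintenance process can be very small. The following is a
sketch only: the module name, the two keys and their invariants are
hypothetical. It stores values and refuses every update that fails its
check, so a confused worker outside the kernel cannot write garbage
into it.

```erlang
-module(state_keeper).
-behaviour(gen_server).

-export([start_link/0, store/3, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Returns ok, or {error, invalid} if the invariant check fails.
store(Keeper, Key, Value) ->
    gen_server:call(Keeper, {store, Key, Value}).

%% Returns {ok, Value} or error.
fetch(Keeper, Key) ->
    gen_server:call(Keeper, {fetch, Key}).

init([]) ->
    {ok, #{}}.

handle_call({store, Key, Value}, _From, Store) ->
    case valid(Key, Value) of
        true  -> {reply, ok, Store#{Key => Value}};
        false -> {reply, {error, invalid}, Store}
    end;
handle_call({fetch, Key}, _From, Store) ->
    {reply, maps:find(Key, Store), Store}.

handle_cast(_Msg, Store) ->
    {noreply, Store}.

%% Hypothetical invariants. They use only simple checks that cannot
%% themselves crash, since this code runs inside the error kernel.
valid(download_dir, Dir) -> is_list(Dir) andalso Dir =/= "";
valid(upload_limit, N)   -> is_integer(N) andalso N >= 0;
valid(_Key, _Value)      -> false.
```

The process does no real work of its own, which is the point: there is
very little in it that can go wrong.
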

If a process crashes, you should definitely think about how much of its
internal state you want to recycle. If you recycle everything you risk
hitting the exact same bug again and crashing. Rather, there may be a
benefit to only recycling parts of the internal state.

## The next step: Onion-layered Error kernels

The next logical step up is to recognize that the error kernel is not
discrete. You want to regard the error kernel as an onion. Whenever
you peel off a layer, you get a step closer to the trusted computing
base of the application. Then your system design is to push down state
maintenance to the outermost layer in the onion where it still makes
sense. This in effect protects one part of the application from
others. In Etorrent, we can download multiple torrent files at the
same time. If one such torrent download fails, there is no reason it
should affect the other torrent files. We can add a layer to the
onion: Some state which is local to the torrent is kept in a separate
supervisor tree - to mitigate the error if that part fails.
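
The shape of such a tree can be sketched with two supervisor layers in
one module. This is not how Etorrent is actually structured. The module
name, the child ids and the stand-in state process are hypothetical.

```erlang
-module(onion_sup).
-behaviour(supervisor).

-export([start_link/0, start_torrent/2]).
-export([start_torrent_sup/1, start_state/1, init/1]).

%% Outer layer: one pool supervisor for all torrents.
start_link() ->
    supervisor:start_link(?MODULE, pool).

%% Add an inner layer for a single torrent.
start_torrent(Pool, Id) ->
    supervisor:start_child(Pool, [Id]).

start_torrent_sup(Id) ->
    supervisor:start_link(?MODULE, {torrent, Id}).

init(pool) ->
    %% temporary: a torrent supervisor that gives up is dropped, and
    %% its failure never counts against the pool itself.
    Child = #{id => torrent_sup,
              start => {?MODULE, start_torrent_sup, []},
              restart => temporary,
              type => supervisor},
    {ok, {#{strategy => simple_one_for_one}, [Child]}};
init({torrent, Id}) ->
    %% Inner layer: state local to one torrent lives and dies here.
    Child = #{id => state, start => {?MODULE, start_state, [Id]}},
    {ok, {#{strategy => one_for_one, intensity => 1, period => 5},
          [Child]}}.

%% Stand-in for a process holding torrent-local state.
start_state(Id) ->
    {ok, proc_lib:spawn_link(fun() -> state_loop(Id) end)}.

state_loop(Id) ->
    receive
        {get_id, From} -> From ! {id, Id}, state_loop(Id)
    end.
```

If the state process of one torrent keeps crashing, its own supervisor
eventually gives up and that torrent is gone, but the pool and every
other torrent keep running.
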

The net effect is program robustness: A bug in the program will
suddenly need perseverance. It has to penetrate several layers in the
onion before it can take the full program down. And if the Erlang
system is well designed, even the most grave bugs can only penetrate
so far before the stopping power of the onion layers brings them to a
halt.

Furthermore, it underpins a mantra of Erlang programs: Small bugs have
small impact. They won't even penetrate the first layer. And they will
hardly be a scratch in the fabric of computing.

(Aside: Good computer security engineering uses the same onion-layered
model. There are strong similarities between protecting a computer
system against a well-armed intruder and protecting a program against
an aggressive, persistent, dangerous and maiming bug. End of Aside)