pkoch/Errors: Faults, Failures, and Inconsistencies.md

## Errors: Faults, Failures, and Inconsistencies.md

      
    https://gist.github.com/pkoch/ed52f4635a92d1028d511363efee9758
There are a few codebases I've worked on that don't give errors enough consideration, and details about them are readily swept under the carpet. They return None left and right, regardless of what got in their way. This lack of consideration for, and communication of, the nature of the problem makes dealing with it largely impossible.
I consider that, until the end of the last decade, ergonomic error handling in programming languages (and their respective standard libraries) felt mostly like an afterthought. Some of them (/me looks menacingly at PHP and JavaScript) tried to pretend it didn't have to be a thing. Things feel different now, especially when I look at Rust and Swift.
I have a clear idea of how to "do errors right". This is me expressing my opinion.
Terminology

When talking about errors, I often see incoherent terminology, so I'll establish my own. These are all leaky buckets -- I can't really offer formal definitions -- but I need to have some common ground to build an argument, and these seem to paint enough of a picture.
Faults

Something beyond your control is misbehaving.

TCP connection died abruptly
The service you called had an internal error
You can't connect to some server

In all of these, somebody else broke an expectation. It's the kind of problem you expect to retry and, if it doesn't get any better, open a circuit breaker and/or give up and tell a human to take a look.
There's no way to avoid these, and you should have measures in place to react to the situation in an orderly way.
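
The retry-then-escalate reaction could look roughly like this. It's a minimal sketch rather than a full circuit breaker, and `Fault`, `GaveUp`, and `call_with_retries` are hypothetical names I'm making up for illustration:

```rust
/// Hypothetical fault type: something outside our control misbehaved.
#[derive(Debug, Clone, PartialEq)]
pub enum Fault {
    ConnectionReset,
    UpstreamInternalError,
}

/// What the caller gets once the retry budget is spent:
/// enough context to stop and tell a human to take a look.
#[derive(Debug, PartialEq)]
pub struct GaveUp {
    pub attempts: u32,
    pub last_fault: Fault,
}

/// Calls `op` until it succeeds or `max_attempts` calls have faulted.
/// `op` always runs at least once, even if `max_attempts` is 0.
pub fn call_with_retries<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, Fault>,
) -> Result<T, GaveUp> {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match op() {
            Ok(value) => return Ok(value),
            // Budget spent: give up and report the last fault we saw.
            Err(fault) if attempts >= max_attempts => {
                return Err(GaveUp { attempts, last_fault: fault });
            }
            // A real system would wait (back off) before trying again.
            Err(_) => continue,
        }
    }
}
```

The point is only that the give-up path is explicit and carries the last fault with it, instead of collapsing into a bare "nothing came back".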
Failures

You asked your system to do something that it can't do, or that doesn't make sense.

You asked it to open a file, but it's not there
You asked for a user with a certain id, but it doesn't exist
You tried to call pop on an empty stack
You tried to divide by zero
You tried to write to a file handle that was already closed
You tried to add two types that don't go together (something like [1, 2, 3] + {a: 1, b: 2}).

In all of these, the outcome depends on the input. In some cases, it also depends on the current system's state. You are the best person to decide what to do next. Retrying it feels nonsensical. Before you say anything, race conditions are inconsistencies.
These outcomes should all be thought through beforehand and handled well. Not doing so is a programming mistake. There are type systems and testing strategies (like property-based testing) to help you make sure you do.
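
One way a type system helps is by putting the failure in the return type, so the caller can't forget about it. A minimal sketch, assuming a hypothetical `Stack` wrapper and `Failure` enum (neither comes from a real library):

```rust
/// Hypothetical failures: outcomes that depend on the input or on state.
#[derive(Debug, PartialEq)]
pub enum Failure {
    EmptyStack,
    DivisionByZero,
    Overflow,
}

/// Hypothetical stack of integers.
pub struct Stack {
    items: Vec<i64>,
}

impl Stack {
    pub fn new() -> Self {
        Stack { items: Vec::new() }
    }

    pub fn push(&mut self, value: i64) {
        self.items.push(value);
    }

    /// Popping an empty stack is a failure the caller has to decide about.
    pub fn pop(&mut self) -> Result<i64, Failure> {
        self.items.pop().ok_or(Failure::EmptyStack)
    }
}

/// Division with every outcome spelled out in the signature.
pub fn divide(dividend: i64, divisor: i64) -> Result<i64, Failure> {
    // `checked_div` returns None for a zero divisor and for overflow
    // (i64::MIN / -1), so the two cases are told apart explicitly.
    match dividend.checked_div(divisor) {
        Some(quotient) => Ok(quotient),
        None if divisor == 0 => Err(Failure::DivisionByZero),
        None => Err(Failure::Overflow),
    }
}
```

Writing `divide` this way is also what surfaced the overflow case, which is easy to miss when thinking only about division by zero.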
When you're working with distributed systems, Faults have a tendency to start being handled like Failures, since proceeding in their presence is a concern that the programmer starts to care intently about.
Inconsistencies

The truth is disagreeing with itself.

A calculated value in a data structure is wrong: {count: 2, elements: ["a"]}
Some service's response is specified as adhering to a certain schema, but it doesn't (some field is missing, a value is outside its domain, etc.)

Someone before you messed up and produced state they thought was good (otherwise, they would have signaled the error), but you disagree. Retrying is a non-consideration: you didn't try anything to begin with, you just noticed something doesn't add up.
All inconsistencies are bugs. There's an ill-designed program, yours or otherwise. There's no good standard way to continue forward, and the convention is to just stop immediately (i.e., crash).
Depending on your specific case, you might have a strategy available to try to recover: recalculate your structure, call the service again, etc. But these are you going above and beyond in trying to get out of a bad situation. You're not fixing the problem, you're working around the problem. It should be solved at its root. Once again, this starts to look like Failure handling.
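
The `{count: 2, elements: ["a"]}` example can be turned into an explicit check. This is a minimal sketch, assuming a hypothetical `Counted` struct: reads crash on an inconsistency, and `repair` is the above-and-beyond workaround, not a fix for whoever produced the bad state.

```rust
/// Hypothetical structure that caches a count next to its elements.
pub struct Counted {
    pub count: usize,
    pub elements: Vec<String>,
}

impl Counted {
    /// True when the cached count agrees with the elements.
    pub fn is_consistent(&self) -> bool {
        self.count == self.elements.len()
    }

    /// Reads the first element, stopping immediately (panicking)
    /// if the structure disagrees with itself.
    pub fn first(&self) -> Option<&str> {
        assert!(
            self.is_consistent(),
            "inconsistent state: count is {} but there are {} element(s)",
            self.count,
            self.elements.len()
        );
        self.elements.first().map(String::as_str)
    }

    /// Workaround: recompute the derived value from the elements.
    /// The bug that produced the wrong count is still out there.
    pub fn repair(&mut self) {
        self.count = self.elements.len();
    }
}
```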
Representing

Representing an error as a record is a fine choice. This can be a hierarchy of Python classes that derive from Exception, a Rust/Swift enum, etc. Whatever lets you:

Group things together,
name the condition, and
pass a bit more information about it when available.
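
As a rough sketch of those three properties in a Rust enum (the `LookupError` type and its variants are hypothetical, made up for illustration):

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type for a user lookup: the enum groups the
/// conditions, each variant names one, and the fields carry details.
#[derive(Debug, PartialEq)]
pub enum LookupError {
    /// Failure: no user with this id.
    UserNotFound { id: u64 },
    /// Fault: the backing store could not be reached.
    StoreUnavailable { host: String },
}

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LookupError::UserNotFound { id } => {
                write!(f, "no user with id {}", id)
            }
            LookupError::StoreUnavailable { host } => {
                write!(f, "cannot reach user store at {}", host)
            }
        }
    }
}

// Implementing the standard trait lets the type travel as `Box<dyn Error>`.
impl Error for LookupError {}
```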

I find error codes a poor choice. They get the job done, but in a clunky and limited way, and there are easily better options available. They're a gimmicky way of encoding an entry in a well-maintained error table; you might as well just use that table entry's name. If you need to pass additional information about the problem, that's now something you need to tack on, adding to their clumsiness. I'd only use them in space-constrained environments (like embedded devices).
If your errors are meant to stop the program (like exceptions or panics, as opposed to Result<T, E>), they need to carry a stack trace.
Being able to chain exceptions is nice. Python's approach seems to mirror the default practice on the matter. In practice, I've rarely seen chaining add more clarity to the issue than letting the original exception bubble up, so I don't particularly care for it.
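
For reference, this is roughly what chaining looks like in Rust, where the standard `Error::source` method plays the role of Python's `raise ... from ...`. A minimal sketch; `ConfigError` and `parse_port` are hypothetical names:

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical higher-level error that keeps the original cause around.
#[derive(Debug)]
pub struct ConfigError {
    pub key: String,
    pub cause: ParseIntError,
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid value for {}", self.key)
    }
}

impl Error for ConfigError {
    /// Exposes the underlying error so callers can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.cause)
    }
}

/// Parses a port number, wrapping the parse error in a `ConfigError`.
pub fn parse_port(raw: &str) -> Result<u16, ConfigError> {
    raw.parse::<u16>().map_err(|cause| ConfigError {
        key: "port".to_string(),
        cause,
    })
}
```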
What to do about them?

In functions

Given we have an accurate and complete picture of what went wrong, the next thing I'm going to care about is how I can make functions ergonomically call each other.
Result<T, E>-style error passing is OK, but a bit verbose without some help. Rust's ? and its analogs are great mechanisms to overcome this problem. Without help, code becomes riddled with if error: statements, which greatly reduces the signal-to-noise ratio. Go is an obvious example of how to do this wrong.
I don't want this in my life:
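use std::fs::File;
use std::io::{self, Read};

fn read_username_from_file() -> Result<String, io::Error> {
    let f = File::open("hello.txt");

    // Manually unwrap the file handle, returning early on error.
    let mut f = match f {
        Ok(file) => file,
        Err(e) => return Err(e),
    };

    let mut s = String::new();

    // Manually forward the read error as well.
    match f.read_to_string(&mut s) {
        Ok(_) => Ok(s),
        Err(e) => Err(e),
    }
}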

This is more like what I'm looking for:
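use std::fs::File;
use std::io::{self, Read};

fn read_username_from_file() -> Result<String, io::Error> {
    // `?` unwraps the Ok value or returns early with the error.
    let mut f = File::open("hello.txt")?;
    let mut s = String::new();
    f.read_to_string(&mut s)?;
    Ok(s)
}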

This does have a small problem: it's now a bit harder to compose with. Using higher-order functions (like decorators) gets more awkward. If a decorator calls something that fails, we might end up with undesirable nesting.
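// `logs::Error` and `log_something` stand in for some logging layer.
fn log_and_call<T>(f: impl Fn() -> T) -> Result<T, logs::Error> {
    log_something()?;
    Ok(f())
}

// The decorator's own error ends up wrapping the inner function's Result.
let a: Result<Result<String, io::Error>, logs::Error> = log_and_call(read_username_from_file);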
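
One way to avoid the nesting while staying with Result is a combined error type plus `From` conversions, so `?` flattens both layers. This is only a sketch under assumptions: `LogError` is a hypothetical stand-in for the `logs::Error` above, and `CallError` is made up for illustration. It comes at a cost: the decorator now only accepts fallible functions, and every error type has to be enumerated up front.

```rust
use std::io;

/// Hypothetical stand-in for the `logs::Error` used above.
#[derive(Debug)]
pub struct LogError;

/// Hypothetical combined error for the decorator and what it wraps.
#[derive(Debug)]
pub enum CallError {
    Log(LogError),
    Io(io::Error),
}

impl From<LogError> for CallError {
    fn from(e: LogError) -> Self {
        CallError::Log(e)
    }
}

impl From<io::Error> for CallError {
    fn from(e: io::Error) -> Self {
        CallError::Io(e)
    }
}

/// Placeholder logging call; always succeeds in this sketch.
fn log_something() -> Result<(), LogError> {
    Ok(())
}

/// Decorator that returns a single, flat Result.
pub fn log_and_call<T, E>(f: impl Fn() -> Result<T, E>) -> Result<T, CallError>
where
    CallError: From<E>,
{
    log_something()?; // LogError becomes CallError via From
    Ok(f()?) // the inner error becomes CallError via From
}
```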

Exceptions are a pretty decent alternative.
In APIs

In an interactive context (web request, CLI, etc):

Fault: 500, let the caller retry calling you if they want to.
Failure: 400 if it's a problem with user supplied input, 500 when you didn't account for some outcome.
Inconsistency: 500.
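
Spelled out as code, that mapping is a single match. A minimal sketch; the `Category` enum and `http_status` function are hypothetical names, and Failure is split in two because its status depends on who caused it:

```rust
/// The error categories from the terminology above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Category {
    Fault,
    /// Failure caused by user-supplied input.
    FailureBadInput,
    /// Failure from an outcome the program didn't account for.
    FailureUnaccounted,
    Inconsistency,
}

/// Maps a category to the HTTP status an interactive endpoint returns.
pub fn http_status(category: Category) -> u16 {
    match category {
        Category::Fault => 500,
        Category::FailureBadInput => 400,
        Category::FailureUnaccounted => 500,
        Category::Inconsistency => 500,
    }
}
```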

In a background worker context:

Fault: log them and stop; have a sane retry mechanism as close to the start of the execution as possible.
Failure:
Inconsistency:

What not to do about them?

Anything that makes functions hard to compose together. Let's take a look at common offenders.
Retry liberally

Retries should live as close to the top level as possible, as they don't really compose well. Let's look at a common scenario:
$ cat http_controller.py
$ cat my_service.py
$ cat other_service.py
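
As a separate, hypothetical illustration of why nested retries compose badly (the function bodies below are my assumptions, not the contents of the files above): when two layers each retry three times, a dependency that is down gets hit nine times for a single request.

```rust
use std::cell::Cell;

/// Calls `op` up to `attempts` times (at least once), returning the
/// first success or the last error.
pub fn retry<T, E>(attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last = op();
    for _ in 1..attempts {
        if last.is_ok() {
            break;
        }
        last = op();
    }
    last
}

/// Hypothetical lowest layer: always faults, and counts how often it is hit.
pub fn other_service(calls: &Cell<u32>) -> Result<(), &'static str> {
    calls.set(calls.get() + 1);
    Err("connection refused")
}

/// Hypothetical middle layer that retries on its own.
pub fn my_service(calls: &Cell<u32>) -> Result<(), &'static str> {
    retry(3, || other_service(calls))
}

/// Hypothetical top layer that retries as well.
pub fn http_controller(calls: &Cell<u32>) -> Result<(), &'static str> {
    retry(3, || my_service(calls))
}
```

Every extra layer that retries multiplies the load again, which is why I'd keep a single retry at the top.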

Make them unrecoverable

Rust's panic! and Go's panic aren't that great. You can get away without using them, so just prefer not to. These options seem to work well in simpler cases, but you lose the freedom to use them if you're a library. Even if the current function isn't in a library, it might start to get used by different callers that have different ways of dealing with the situation, and one of them might not want to just crash. Both languages have their way of recovering from panics, but these should be used only in plumbing code (i.e., in frameworks), not as a normal way of handling errors. So, I feel this kinda leaves us with them being lame-duck versions of exceptions, which is what they were trying to avoid being to begin with.
Crashing your whole program is even worse. I recall (but can't find it anymore) seeing people complain that using git's code as a lib was a pain because it did perror(); exit(1) at the slightest hiccup.
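
For reference, the plumbing-level recovery in Rust is `std::panic::catch_unwind`. A minimal sketch, assuming a hypothetical `run_job` wrapper that a framework might put around each unit of work (it only works when panics unwind, which is the default):

```rust
use std::panic;

/// Hypothetical framework-level wrapper: runs one job and turns a panic
/// into an error value, so the surrounding worker loop survives.
pub fn run_job<T>(job: impl FnOnce() -> T + panic::UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(job).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

Note how much is lost on the way: the caller gets a message string at best, not a named condition with data attached.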
Return tuples

Go, Erlang, and Elixir get on my nerves over this. They make composing functions terrible.
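
To make the composition complaint concrete, here is a Go-style tuple return transplanted into Rust next to the Result version. It's a sketch with hypothetical function names; the tuple variant is deliberately unidiomatic Rust.

```rust
/// Tuple style: both slots are always present, so every caller has to
/// inspect the error slot by hand before trusting the value.
pub fn parse_tuple(raw: &str) -> (i32, Option<String>) {
    match raw.parse::<i32>() {
        Ok(n) => (n, None),
        Err(e) => (0, Some(e.to_string())),
    }
}

/// Composing in tuple style: unpack, check, repack.
pub fn double_tuple(raw: &str) -> (i32, Option<String>) {
    let (n, err) = parse_tuple(raw);
    if err.is_some() {
        return (0, err);
    }
    (n * 2, None)
}

/// Result style: the value only exists when there is no error.
pub fn parse_result(raw: &str) -> Result<i32, String> {
    raw.parse::<i32>().map_err(|e| e.to_string())
}

/// Composing in Result style: the error path takes care of itself.
pub fn double_result(raw: &str) -> Result<i32, String> {
    parse_result(raw).map(|n| n * 2)
}
```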
References


https://twitter.com/hillelogram/status/1427322216350355459
http://joeduffyblog.com/2016/02/07/the-error-model/
https://docs.swift.org/swift-book/LanguageGuide/ErrorHandling.html
https://github.com/apple/swift/blob/main/docs/ErrorHandlingRationale.rst
https://doc.rust-lang.org/book/ch09-00-error-handling.html