dreynaud/error-handling.md

## error-handling.md

      
    Raw
  

              error-handling.md
            
          
    Error Handling in Practice

My experience is mostly with Java backend services in the cloud, so the advice in this post will almost certainly be biased towards this kind of error handling. But hopefully, some of it will be generally applicable and help you maintain and debug your programs in the long run.
I don't claim that these are universal best practices, but I have found these to be useful as general guidelines. As always, use your best judgment and do things that make sense in your context.
Log the whole thing

In Java, a full exception is:

a specific class (unless someone throws new Exception())
a stack trace
an optional error message
an optional cause (another exception)

When you're investigating an issue, you want all the relevant information you can get your hands on. Which in some cases means having access to everything, including the stack trace.
Logging the message only is almost certainly a bad idea because error messages are optional. Using the exception's toString() method is better because it typically logs both the exception type and the error message. Beware though, you never know who may create an exception class with a less-than-useful toString() override.
    // logs the message only (could be null, no type, stack trace or cause)
    ❌ log.error("error frobulating the gluxotrons, reason: {}", e.message)
    
    // MAY log the type and message (no type, stack trace or cause)
    ❌ log.error("error frobulating the gluxotrons, reason: {}", e.toString())

    // logs ✨everything✨
    🌈 log.error("error frobulating the gluxotrons", e) 
Note: stack traces can be very verbose, but please be deliberate if you decide to omit them from your logs. You don't want an investigation to be delayed because you wanted to save a bit of disk space and screen real estate.
Log only where it makes sense

You don't need to log the exception when you generate one:
SomethingBadException e = new SomethingBadException("terrible things");
log.error("about to throw", e); // ❌ no need to log here
throw e;
When you throw an exception you're passing it some other method for handling, so let them handle it. The same logic applies when rethrowing or wrapping in another exception type:
catch (LowLevelException e) {
    log.error("just caught something!", e); // ❌ no need to log here
    throw new HighLevelException(e);
}
Ideally, you would only log the exception where you handle it or in the top-level catch-all. This way, you avoid duplicate stack traces in logs (which would have poor consequences on metrics and log search).
Don't leak 3rd party exceptions

In your public-facing APIs, be careful not to expose 3rd party exception classes. This couples your interface with something you don't control. This can happen a couple ways:

you're not sure how to handle a checked exception from a dependency, so you bubble it up. Since it's a checked exception, it's now part of your throws clause.
the same thing happens with an unchecked exception, but it is sneakier because you don't need to declare it in your throws clause. This can also happen accidentally if you fail to catch unchecked exceptions from your dependencies.

In both cases, you want to avoid client code having to call you like this:
try {
    yourCoolLibrary();
} catch(SomeLeakedException e) {
    // ❌ oh no, people are now relying on leaked exceptions
}
To avoid this, yourCoolLibrary could instead catch-and-wrap:
/**
 * @throws CoolException is a runtime exception so that we don't force client code to catch it,
 *   but we're not sneaky so we document it here instead
 */
public void yourCoolLibrary() {
    try {
        coolLibraryInternal(); // may throw SomeLeakedException
    } catch(Exception e) {
        // 🌈 every exception is now a CoolException, owned by the cool library
        // handling and logging is up to the caller, we just pass everything along
        throw new CoolException(e);
    }
}
Let's look at an example of a situation I had to debug. After updating dependencies in one of my apps, the code would no longer compile. The issue was that we relied on an internal library with a method like this:
// ❌ JSONException is an implementation detail, it should not leak to your throws clause
public void validateJson(String str) throws org.json.JSONException
It seemed innocuous at the time. After all, we were using org.json internally and a JSONException seemed to make sense for something that validates json. JSONException used to be a checked exception (a subclass of Exception). But in a newer version, it became an unchecked exception instead (a subclass of RuntimeException), breaking code using it. Later still, that library switched to a different json library, causing yet more churn.
Don't drop exceptions

This is the kind of mistake that is one typo away and easy to miss during code review. When you intend to wrap an exception, double-check that you didn't drop it instead:
catch(SomeException e) {
    ❌ throw new OtherException("failed to process the bleep doop");
    🌈 throw new OtherException("failed to process the bleep doop", e);
}
In the second case, the original exception can be accessed along with its message and stack trace. A fresh OtherException without the cause will almost certainly be of no help to anyone. The errorprone static code checker can warn you about this: https://errorprone.info/bugpattern/UnusedException
Capture useful error messages

Don't do this:
if (user == null || user.permissions == null) {
    throw new PermissionException("no user or user does not have permissions"); // ❌
}
Do this:
if (user == null) {
    throw new PermissionException("no user");
}

if (user.permissions == null) {
    throw new PermissionException("User " + user + " does not have permissions"); // 🌈
}
The first case doesn't tell you what happened, so during an incident or investigation, you have to chase both. It's a shame because the information was available but has been lost, and you now have to do the work to recover it. In the second case, there is no doubt about what is happening. It also captures the value of the user object, but be mindful of PII or other sensitive data ending up in logs.
Generally speaking, if an exception is created with a fixed string, it probably fails to capture all the relevant information. Going back to the permissions example, an ideal error message would contain details like user $U with permissions $P can not access resource $R with required permissions $RP.
See also Item 63 in Effective Java 2nd Ed, "Include failure-capture information in detail messages"
Avoid handling exceptions based on implementation details

Sometimes you don't have a choice, but beware of code that looks like this:
catch (SomeException e) {
    if (e.cause instanceof CoolRPCNetworkException) {
        if (e.cause.message.startsWith("Timeout")) {
            // handle timeout
            // ...but will this work? 🤷‍♂
        }
    }
}
This type of code is essentially untestable and relies on very specific versions of very specific libraries... Ultimately, this means that the dependencies used here have shortcomings. They don't provide a clear signal about the underlying error, so the error handling code has to do extra work.
Returning errors to REST API clients

When returning errors to end-users, you can't leak internal error details. Instead, consider returning the following elements:

one or more enum-style error codes that can be treated programmatically on the client side. Note that codes don't have to be numeric ids, something like 'token_expired' reads a lot better than 'error 120236'. For example, Stripe's API makes great use of delightful error codes. These error codes can also be combined to create more structure. For instance, you could return a tuple of error codes with the first one being the error category and the second one the specific error.
a human-readable error message. Be explicit about whether this message can be displayed to the user or is just for logging. If it can make it all the way to the end-user, it would have to be localized!
extra error details that can be used for debugging. This should include at a minimum a request/trace/correlation id that you can search for in your central log aggregator. Having a way to tie a particular client error to a specific set of logs on the backend is a must-have!
if it makes sense, consider sending metadata like an isRetryable flag. It informs client implementations about how they should handle the error.

Exceptions and metrics

It would be nice if we could just slap an exception tag on our error metrics with the whole exception class and message, but we can't because:

tag values are limited in length (on the order of 10-100 characters)
the cardinality of error messages is too high (meaning the tag can have too many unique values)

Instead of the exception message, I sometimes find it useful to use the exception class as a tag value (both short and low cardinality). With Spectator, it would look like this:
registry.counter("requests.errors",
                 "exception", e.getClass().getSimpleName())
        .increment();
Keep in mind that by breaking the flow of control, exceptions can cause metrics-publishing lines of codes to be skipped. Make judicious use of finally blocks to avoid that, or better yet use patterns that already account for exceptions. For example, from the Spectator docs:
public Response handle(Request request) {
    return registry.timer("server.requestLatency").record(() -> handleImpl(request));
}
We pass a lambda function to Timer::record, which will internally deal with exceptions thrown in handleImpl with a finally block.
Don't swallow exceptions

If you are very confident that some exception is indeed perfectly fine to ignore, please at least leave a comment in the catch block explaining why. And be specific with the kind of exception you catch, I don't see how catch(Exception e) {} would ever be ok (both over-catching and swallowing!). Note that logging only is essentially the same as swallowing.
See also:

Google Java Style Guide Items 6.2
Item 65 of Effective Java, "Don't Ignore Exceptions"
[osdi14] for an example of data loss in HBase caused by exception swallowing.

Avoid nested try blocks

They are just impossible to reason about. If you feel the need to nest try/catch blocks, your method is probably doing too many different things. Some things to try instead:

break up long methods into smaller ones that do a single thing
reduce the scope of your try blocks
use a multicatch block

Don't over-catch

Try to avoid catch(Exception e) (and even more so catch(Throwable t)!) if you can. Over-catching has multiple drawbacks:

it provides the reader with very little information about the failure scenario, especially after a very long try block
it may catch more exception types than you think, including types that are not part of the API contract (for instance, InterruptedException needs to be individually handled)
it may cause benign exceptions to be treated as fatal

[osdi14] has an example of over-catching in HDFS:
try {
  namenode.registerDatanode();
} catch (Throwable t) {
  System.exit(-1);
}
There was a glitch in namenode causing a RemoteException to be thrown, which could be handled and retried. Instead, it was treated as fatal by the catch (Throwable t) block, bringing down an entire HDFS cluster.
In a service, you probably want only one of these catch-all handlers at the top-level, and it's probably already there in the framework you use. It's not always possible to avoid in application code, for instance if you call a method annotated with throws Exception (see Don't over-throw).
If you're using catch(Exception e) as a convenience because this is the superclass of all things you have to catch, use a multicatch block instead.
Don't over-throw

If you over-throw (declare a method with throws Exception), you are by definition causing callers of this method to over-catch.
Your thrown exceptions are parts of the public API of a method. A throws Exception method communicates that anything could happen. This is an unreasonable burden to put on every caller of this method.
Do reuse standard exceptions

Some exceptions already carry a lot of meaning, and it's perfectly fine to reuse them in your code. For instance, you don't need a clone of IllegalArgumentException or UnsupportedOperationException when the standard ones would work.
It is however possible to take this too far by reusing exceptions that are too generic, such as RuntimeException. Instantiating and throwing a RuntimeException is a case of over-throwing, which will cause over-catching on the receiving side. If there is no standard exception that makes sense, make your own exception subclass!
See also:

Item 60 of Effective Java, "Favor the use of standard exceptions"
https://errorprone.info/bugpattern/ThrowSpecificExceptions

More in part 2

I feel like there is more to cover, so there might be a part 2 with topics such as:

testing exception handlers
logging at the right level
keeping exceptions exceptional (or what to do about "expected" exceptions)
keeping throws clauses to a minimum (but documenting unchecked exceptions)
log loaves, not breadcrumbs (a pattern to log related things together)

Thanks to Rob Fletcher, Lorin Hochstein and Mark Vulfson for reviewing drafts of this.
Discuss on Twitter