@codemartial
Last active September 22, 2017 08:54
3 Principles of Error Handling

Error Handling 1-2-3

As we build infrastructure components that support a large user base, it becomes critical to provide quick support to end users. A large share of support issues arise simply because end users hit an error they cannot make sense of. For example, an HTTP 500 error with a generic "failed to do " message leads the user to a dead end and causes frustration.

Apart from clear messaging to the end user, it's also important to provide maximum visibility in the error logs, so that someone reading them can determine, even without being familiar with the system, whether the cause of the error is internal or external. This is especially critical in high-utilisation network applications, where the source of an error is usually an external failure rather than a code bug.

As an example of very poor logging, the other day I logged into a service backend hoping to find some trace of user-reported issues in the error logs. After filtering out all the access log cruft, I finally found a HUGE stack trace that gave me ZERO information about what went wrong, except that it was a network I/O error.

Given the importance of being able to root-cause issues rapidly, we need telemetry with a high signal-to-noise ratio in our applications. To that end, I have the following 3 recommendations for error handling:

  1. Never ignore an error. Go makes it very hard to accidentally ignore errors. Make sure you don't develop bad habits (e.g. using _ to receive errors) and always check the error value.
  2. Recover XOR Bubble Up XOR Log. When an error is encountered, you can do 3 things with it: a) take corrective action (e.g. retry a failed operation), b) return the error to the caller if you don't know how to recover from it, or c) log the encountered error. As a rule of thumb, you should do exactly 1 of these 3 things.
  3. Augment with Context (if any). If you need to return the error to the caller from a non-trivial function, don't return the error unmodified. Add context. E.g. https://play.golang.org/p/a-2ZPBIc0o

Addenda

Regarding item #2, it's better to bubble up an unrecoverable error than to log it. Logging should be done at the top-most level, where maximum context is available and it's quite obvious what impact the error had. It's also more likely that a higher-level function has a strategy for recovery than a lower-level one. As an example, a REST API endpoint handler could do the logging, and possibly return a more informative error message to the caller of the API.

Regarding item #3, add context to bridge functional gaps (e.g. a file I/O error leading to an auth failure). Bubble up errors without augmentation where there's no useful information to add, and avoid turning the context into a manually stitched stack trace.

In the example, I've used fmt and errors.New to build errors only for illustrative convenience. You are expected to come up with proper error types. It's also preferable to build context as a composition of typed error values, so that the cause can be programmatically determined by a higher-level function. Here's a rudimentary augmenting error type example: https://play.golang.org/p/ceSpmr5DWI. A more complete and highly recommended package for this purpose is github.com/pkg/errors.

Some people make an argument in favour of logging recoverable errors for informational purposes. I would much prefer to stick with the XOR relationship between the two, instead adding a metric (e.g. ErrorFoo.count) or an analytics data point in such cases. This ensures that the system has enough visibility, while also using the Right Tool for the Job™.
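
A small sketch of counting a recovered error instead of logging it. The standard expvar package is only one possible way to publish such a counter and is my assumption, not something the approach requires; primary, fetch and errFoo are hypothetical.

```go
package main

import (
	"errors"
	"expvar"
	"fmt"
)

// errFoo is a hypothetical recoverable error.
var errFoo = errors.New("foo failed")

// errorFooCount publishes the ErrorFoo.count metric via expvar.
var errorFooCount = expvar.NewInt("ErrorFoo.count")

// primary is a stand-in for an operation that can fail recoverably.
func primary(fail bool) (string, error) {
	if fail {
		return "", errFoo
	}
	return "fresh", nil
}

// fetch recovers by falling back to a cached value. The recovery is
// counted, not logged, so the XOR rule still holds.
func fetch(fail bool) string {
	v, err := primary(fail)
	if err != nil {
		errorFooCount.Add(1)
		return "cached"
	}
	return v
}

func main() {
	fmt.Println(fetch(false), fetch(true), fetch(true))
	fmt.Println("ErrorFoo.count =", errorFooCount.Value())
}
```

The logs stay quiet for errors that had no user-visible impact, while a dashboard or alert on the counter still shows how often the fallback is being used.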

@codemartial

@dvrkps: Thanks for the suggestion. Incorporated :)
