Kriel/article.md Secret

## article.md

      
    Raw
  

              article.md
            
          
    Per Yuan et al., In A Study Of Five Distributed Software Systems, These Three Code Smells Cause 35% Of Catastrophic Outages

I manage software engineering for an application used at tens of thousands of locations across the United States and Canada.
An outage can cost thousands, even millions

So, ocassionally, I'll review research papers and books to get hard data to pass on to my engingeering team to boost code stability and reliability.
Buried in a 2014 paper, presented during the 11th Proceedings of the USENIX OSDI 2014 conference, is a surprising statistic

Researchers Ding Yuan, Yu Luo, et. al, found that...
Across 198 randomly selected catastrophic failures, for five top-tier distributed applications, 35% of catastrophic system failures resulted from just three easily identifiable and correctable code smells

Even better...
These three issues can be detected via code review and static analyzers

Specifically, the paper states that in their review of the 198 randomly-selected catastrophic failures:
"35% of the catastrophic failures are caused by trivial mistakes in error handling logic — ones that simply violate best programming practices; and that can be detected without system specific knowledge."
For reference, the paper is titled: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems".
The distributed applications studied were: Cassandra DB, HBase, Hadoop Distributed File System, Hadoop MapReduce, and Redis.
All three of these issues are related to CATCH blocks in try-catch statements

In their research, they found that for these well-tested, distributed, applications '92% of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software".
This implies that these distributed systems were very well tested -- e.g. better than industry average, yet still most of these outages came from poor handling of failure modes.
Anecdotally, I've seen two of these issues first-hand in software systems I oversee.
But this research paper helps me, and hopefully you, see that we can anticipate these errors aggressively. We simply need our teams to keep an eye out for these code smells.
Below are the three code smells found in the paper

Code Smell #1: The log-only error handler

The first case is having an error handler that logs the error, but does nothing else.
25% of catastrophic failures were caused by this case alone.
Whether this is done unintentionally or on purpose, Yuan et al. found that this type of catch-block can cause non-fatal issues to fester and culminate in application collapse.
For example, the act of catching an error for the purpose of logging it could unintentionally swallow the error. Hiding it from error mitigation code further up the stack.
The solution

Developers should take a hard look at all non-fatal error logging to see if:

It was a mistake to log only, and if the developer meant to re-throw the exception
If the developer is taking an easier route -- by simply logging the issue -- rather than taking advantage of simple proactive steps for runt-time mitigation

If a catch block is entirely empty, the above suggestions apply as well.
As with any code smell, an error handler that does nothing but log is not always wrong.
But it should cause your spidey-sense to tingle.
Code Smell #2: Re-throwing an exception on an overly generic exception handler

On the opposite end of the spectrum is the error handler that's too broad.
This issue caused 8% of avoidable catastrophic failures.
For example code like the below:
try {
    ...
}
catch(Exception ex) { //catching the base class Exception will catch ALL exception types that inherit from Exception
    ...
    throw;
}

Such handlers abort an entire application when perhaps only one of many possible exception types are truly fatal.
For example, imagine that certain file handling code can encounter the below exceptions:

The disk is full
The disk is busy
The file is locked

Most likely, only one or two of those exceptions is truly fatal.
Also, each of those exceptions should have different error handling logic.
However, if a developer is handling ALL exception types with a single catch-block and especially one that naively re-throws, non-fatal issues will trigger catastrophic failures.
Or, an easily recoverable problem -- such as the disk being busy -- could be treated with the same panic as a disk being permanently unwritable because it is full.
The solution

Check to see if the developer has truly thought through the nuances of the different exception types a block of code may trigger.
Then, check that each exception type is handled appropriately.
Some exceptions may indeed require a re-throw.
But, can some exception types be handled gracefully?
As always, exception handling should be as specific as possible.
Code reviews and static code analysis can easily spot this danger before the overly generic error handler makes it to production.
Code Smell #3: The 'To-Do' Error Handler

Only 2% of failures were caused by this issue, but it is still worth mentioning.
As a developer builds a large feature, he may add comments like "FIXME" or "TODO" to remind himself to come back and address an issue.
And admittedly, especially when following an Agile methodology, some production code may ship with "FIXME" and "TODO" still in place.
However, when statements like "FIXME" and "TODO" are in an exception catch-block, BE VIGILANT!
It could be that the programmer forgot to go back and add error handling.
Or, the developer may truly believe he can come back to this error handling later.
As you would expect, Yuan et al. found that these sorts of comments in error handling blocks become explosive landmines in production.
The solution

During code reviews, or via static analysis, scan the codebase for TODO-like comments in error handlers.
Vigilantly verify that these are not an oversight or improper postponemnt by the developer.
Again, it's probably too strict of a rule to say that code must ship with zero TODOs.
But, when TODOs are found in error handling code, you should be especially concerned.
A final bonus thought

Beyond the three above mentioned smells, the study had another alarming statistic: in 23% of catastrophic failures, error handling logic for a non-fatal error was fundamentally flawed. These flaws were so evident that basic statement coverage testing or more thorough code reviews would have easily caught them.
So again, having error handlers in our code is excellent.
But...
Beyond simply looking for the presence of error handlers during a code review

You'll want review error handlers closely to ensure the error handling code itself is correct and robust.
Conclusion

Yuan, et al.'s research gives us three powerful code review guidelines that we can use to make sure our developers ship robust, reliable code to production.
Adding exception handling to our software is good

But...
To truly make our systems robust, our code reviews should pay close attention to the contents of our exception handlers

Even better, use a good static code analyzer that helps you find these three code smells.
Per Yuan et al., making these three or four simple changes to your code reviews and static analsyses could reduce your number of catastrophic production outages by up to 35%!
Please put these tips into action NOW to help your team release software that is more reliable and robust!
Failure to monitor these types of code smells will leave you with software that is much less reliable than it could be and a higher occurence of incidents!
Also, please let me know how these tips have helped you! I love hearing how the information I share helps you and your team hit your goals!