Foreword

In order to get the most out of a post-mortem process it is imperative that it be run as inclusively as possible and in a blameless fashion. Always assume non-malicious intent, because the root cause of a problem is often simply a person making a mistake. The purpose of a post-mortem is for everyone involved to grow and to learn from each other's mistakes in a safe environment, which can't happen if blame is being assigned.

Running an effective post-mortem

A post-mortem is broken up into three phases:

  1. Understanding WHAT went wrong and its impact on the business.
  2. Understanding WHY things went wrong and most importantly the root cause.
  3. Determining HOW to avoid the root cause in the future.

Phase 1 - What went wrong?

The goal of this phase is to get everyone in the room on the same page about what happened so that they can have an informed discussion about it. Usually it's led by a single person describing in detail what went wrong with the system and what impact it had on the business.

This phase will often also include a timeline of events as they unfolded. The timeline itself is usually not essential to understanding what went wrong, but it often helps during the root cause analysis performed prior to the post-mortem, as well as during the discussion in the subsequent phases of the post-mortem.

It's also important to quantify the impact to the business, to help everyone in the room understand the severity of the problem. This is usually done with metrics like:

  • Duration of problem
  • Impact of the problem (% of requests, % of users, % of data corrupted, etc.)
  • Cleanup required

Example

On Tuesday June 30th the front end load balancer started returning HTTP 504 responses to approximately 33% of incoming requests. This resulted in many of the public facing pages on the website not rendering at all, or only partially rendering. The impact was that nearly every internal user was unable to perform their job in this application during the degradation.
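Numbers like the 33% above should be reproducible rather than estimated from memory. As a rough illustration, the sketch below computes a timeout rate and outage window from load balancer access logs; the line format, field order, and sample timestamps are assumptions made up for the example, not a real log format.

```python
# Sketch: derive impact metrics from load balancer access logs.
# Assumes a hypothetical "<ISO timestamp> <status> <path>" line format;
# a real log pipeline would need its own parsing.
from datetime import datetime

def impact_metrics(log_lines):
    total = timeouts = 0
    first_timeout = last_timeout = None
    for line in log_lines:
        timestamp, status, _path = line.split(" ", 2)
        total += 1
        if status == "504":
            timeouts += 1
            ts = datetime.fromisoformat(timestamp)
            first_timeout = first_timeout or ts
            last_timeout = ts
    return {
        "requests": total,
        "timeouts": timeouts,
        "timeout_rate": timeouts / total if total else 0.0,
        "outage_window": last_timeout - first_timeout if first_timeout else None,
    }

if __name__ == "__main__":
    sample = [
        "2020-06-30T09:00:01 200 /home",
        "2020-06-30T09:00:02 504 /home",
        "2020-06-30T09:12:45 504 /reports",
    ]
    print(impact_metrics(sample))
```

In practice these figures would come from whatever logging or metrics pipeline already exists; the point is only that the impact presented in phase 1 should be backed by data anyone in the room can check.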

Phase 2 - Why did it go wrong?

The goal of this phase is to present the results of the root cause analysis of the problem that was performed prior to the post-mortem. Usually it's led by a single engineer with intimate knowledge of the systems that were identified to be at fault during the root cause analysis.

This phase is the presenter's opportunity to get the engineers in the room on the same page about what specifically failed with the technology and caused the problem. It's incredibly important during this phase to describe the logical consequences of the problem so that people in the room can distinguish between the actual problem and an effect caused by the problem. In complex systems it's not uncommon for problems to cascade, so being able to present and understand the single root cause is critical in having an effective post-mortem.

Example

It was identified that one of the three front end servers had a process running that was consuming nearly 100% of the CPU on the machine. This in turn caused incoming requests to not be handled quickly enough, and the load balancer in front of the front end servers timed the requests out after 10 seconds.

This process was only present on one of the three front end servers and was later determined to have been kicked off by a daily cron job that started running just before the outage window. It was also determined via the commit history that the cron job had last been changed two weeks prior, on June 15th, and that the change added considerably more work to the cron job.
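For an investigation like this, a quick per-process CPU snapshot is usually the first piece of evidence. A minimal sketch, assuming the third-party psutil package is available (any process-inspection tool would work just as well):

```python
# Sketch: snapshot per-process CPU usage to spot a runaway process.
# Uses the third-party psutil package (pip install psutil).
import time
import psutil

def top_cpu_processes(count=5, sample_seconds=1.0):
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)        # prime the per-process counters
        except psutil.Error:
            pass
    time.sleep(sample_seconds)         # measure over a short window
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.info["pid"], p.info["name"]))
        except psutil.Error:
            pass                        # process exited during the sample
    return sorted(usage, reverse=True)[:count]

if __name__ == "__main__":
    for cpu, pid, name in top_cpu_processes():
        print(f"{cpu:6.1f}%  pid={pid}  {name}")
```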

Phase 3 - How to prevent it from ever happening again?

This phase is the entire reason for having the post-mortem and is where the room should spend the majority of its time. Usually it's led by a single person who acts as a moderator for the members of the room. The goal of this phase is to determine what actions can be taken as a group to prevent the determined root cause from happening again, thus avoiding the previously described impact and cleanup. Additionally, teams will often include actions that help them be more aware of the system as a whole so that problems are caught sooner. During this phase participation from the entire room is essential.

It is very important to recognize that there are generally two types of changes that can be enacted in order to avoid problems: technology oriented changes and process oriented changes.

Technology oriented changes are changes that impact the way the software and its environment behave. These are things like fixing the underlying bug that caused the problem, adding automated test cases, adding an additional monitor and corresponding alert, and so on.

Process oriented changes are changes that impact the way that the team works with each other and the software. These are things like implementing code reviews, adding steps to a deployment runbook, introducing maintenance windows, and so on.

Typically the way this phase works is that the moderator keeps the discussion focused on these two types of changes and enumerates them on a whiteboard or in a shared document of some sort. A framework that I've seen work successfully as a moderator is to start with process changes and then, for each suggested process change, link one or more technical changes to it. Once complete, the moderator will often take each suggested change and convert it into an executable action item to be prioritized in subsequent development cycles.
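One lightweight way to capture that whiteboard output is as structured records, so that each process change, its linked technical changes, and the resulting action items survive the meeting and can be prioritized later. A minimal sketch; the field names and the sample entry (taken from the example below) are illustrative, not a prescribed format.

```python
# Sketch: record post-mortem outcomes so they can be converted into
# prioritized action items after the meeting. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionItem:
    description: str
    owner: str = "unassigned"
    done: bool = False

@dataclass
class ProcessChange:
    description: str
    linked_technical_changes: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

# One of the process changes from the example below, with a linked
# technical change and a hypothetical action item attached to it.
deploy_more_often = ProcessChange(
    description="Deploy more often, don't allow multiple changes to stack up.",
    linked_technical_changes=[
        "Retire manual deployment and implement an automated deployment pipeline.",
    ],
    action_items=[ActionItem("Evaluate options for an automated deployment pipeline")],
)
```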

Example

Process changes

  • Split cron jobs up into multiple, smaller jobs.
  • Update the code review process to consider CPU impact.
  • Deploy more often, don't allow multiple changes to stack up.

Technical changes

  • Move cron-like tasks to their own server so they don't impact users.
  • Add CPU monitoring and alerting where impactful.
  • Add load balancer monitoring and alerting for when timeouts occur.
  • Retire manual deployment and implement an automated deployment pipeline.
  • Eliminate cron-jobs entirely (long term goal).
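As a concrete illustration of the monitoring and alerting items above, a check along these lines could run on a schedule against recent load balancer data. The 5% threshold and the source of the recent status codes are assumptions made for the sketch, not something prescribed by the post-mortem.

```python
# Sketch: a minimal load balancer timeout alert. The threshold and the
# source of recent status codes are assumptions; a real system would pull
# them from its metrics or log pipeline.
from typing import Iterable

TIMEOUT_RATE_THRESHOLD = 0.05  # assumed: alert if >5% of recent requests time out

def should_alert(recent_statuses: Iterable[int],
                 threshold: float = TIMEOUT_RATE_THRESHOLD) -> bool:
    statuses = list(recent_statuses)
    if not statuses:
        return False
    timeout_rate = sum(1 for s in statuses if s == 504) / len(statuses)
    return timeout_rate > threshold

if __name__ == "__main__":
    recent = [200, 200, 504, 200, 504, 504, 200, 200, 200, 200]
    if should_alert(recent):
        print("ALERT: load balancer timeout rate above threshold")
```

The same pattern applies to the CPU alerting item: sample the metric, compare it against a threshold, and notify someone when it stays elevated.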