Foreword

In order to get the most out of a post-mortem process it is imperative that it be run as inclusively as possible and in a blameless fashion. Always assume non-malicious intent, because the root cause of a problem is often simply a person making a mistake. The purpose of a post-mortem is for everyone involved to grow and to learn from each other's mistakes in a safe environment, which can't happen if blame is being assigned.

Running an effective post-mortem

A post-mortem is broken up into three phases:

  1. Understanding WHAT went wrong and its impact on the business.
  2. Understanding WHY things went wrong and most importantly the root cause.
  3. Determining HOW to avoid the root cause in the future.

Phase 1 - What went wrong?

The goal of this phase is to get everyone in the room on the same page about what happened so that they can have an informed discussion about it. Usually it's led by a single person describing in detail what went wrong with the system and what impact it had on the business.

This phase will often also include a timeline of events as they unfolded. The timeline itself is usually not essential to understanding what went wrong, but it often helps during the root cause analysis performed prior to the post-mortem, as well as during the discussion in the subsequent phases of the post-mortem.

It's also important to quantify the impact to the business, to help everyone in the room understand the severity of the problem. This is usually done with metrics like:

  • Duration of problem
  • Impact of the problem (% of requests, % of users, % of data corrupted, etc.)
  • Cleanup required

Example

On Tuesday June 30th the front end load balancer started returning HTTP 504 responses to approximately 33% of incoming requests. This resulted in many of the public facing pages on the website not rendering at all, or only partially rendering. The impact was that nearly every internal user was unable to perform their job in this application during the degradation.
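Numbers like the 33% above should be reproducible rather than estimated from memory. As a rough illustration, the sketch below computes a timeout rate and outage window from load balancer access logs; the line format, field order, and sample timestamps are assumptions made up for the example, not a real log format.

```python
# Sketch: derive impact metrics from load balancer access logs.
# Assumes a hypothetical "<ISO timestamp> <status> <path>" line format;
# a real log pipeline would need its own parsing.
from datetime import datetime

def impact_metrics(log_lines):
    total = timeouts = 0
    first_timeout = last_timeout = None
    for line in log_lines:
        timestamp, status, _path = line.split(" ", 2)
        total += 1
        if status == "504":
            timeouts += 1
            ts = datetime.fromisoformat(timestamp)
            first_timeout = first_timeout or ts
            last_timeout = ts
    return {
        "requests": total,
        "timeouts": timeouts,
        "timeout_rate": timeouts / total if total else 0.0,
        "outage_window": last_timeout - first_timeout if first_timeout else None,
    }

if __name__ == "__main__":
    sample = [
        "2020-06-30T09:00:01 200 /home",
        "2020-06-30T09:00:02 504 /home",
        "2020-06-30T09:12:45 504 /reports",
    ]
    print(impact_metrics(sample))
```

In practice these figures would come from whatever logging or metrics pipeline already exists; the point is only that the impact presented in phase 1 should be backed by data anyone in the room can check.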

Phase 2 - Why did it go wrong?

The goal of this phase is to present the results of the root cause analysis of the problem that was performed prior to the post-mortem. Usually it's led by a single engineer with intimate knowledge of the systems that were identified to be at fault during the root cause analysis.

This phase is the presenter's opportunity to get the engineers in the room on the same page about what specifically failed with the technology and caused the problem. It's incredibly important during this phase to describe the logical consequences of the problem so that people in the room can distinguish between the actual problem and an effect caused by the problem. In complex systems it's not uncommon for problems to cascade, so being able to present and understand the single root cause is critical in having an effective post-mortem.

Example

It was identified that one of the three front end servers had a process running that was consuming nearly 100% of the CPU on the machine. This in turn caused incoming requests to not be handled quickly enough, and the load balancer in front of the front end servers timed the requests out after 10 seconds.

This process was only present on one of the three front end servers and was later determined to have been kicked off by a daily cron job that started running just before the outage window. It was also determined via the commit history that the cron job had last been changed two weeks prior, on June 15th, and that the change added considerably more work to the cron job.
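For an investigation like this, a quick per-process CPU snapshot is usually the first piece of evidence. A minimal sketch, assuming the third-party psutil package is available (any process-inspection tool would work just as well):

```python
# Sketch: snapshot per-process CPU usage to spot a runaway process.
# Uses the third-party psutil package (pip install psutil).
import time
import psutil

def top_cpu_processes(count=5, sample_seconds=1.0):
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)        # prime the per-process counters
        except psutil.Error:
            pass
    time.sleep(sample_seconds)         # measure over a short window
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.info["pid"], p.info["name"]))
        except psutil.Error:
            pass                        # process exited during the sample
    return sorted(usage, reverse=True)[:count]

if __name__ == "__main__":
    for cpu, pid, name in top_cpu_processes():
        print(f"{cpu:6.1f}%  pid={pid}  {name}")
```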

Phase 3 - How to prevent it from ever happening again?

This phase is the entire reason for having the post-mortem and is where the room should spend the majority of its time. Usually it's led by a single person who acts as a moderator for the members of the room. The goal of this phase is to determine what actions can be taken as a group to prevent the determined root cause from happening again, thus avoiding the previously described impact and cleanup. Additionally, teams will often include actions that help them be more aware of the system as a whole so that problems are caught sooner. During this phase participation from the entire room is essential.

It is very important to recognize that there are generally two types of changes that can be enacted in order to avoid problems: technology oriented changes and process oriented changes.

Technology oriented changes are changes that impact the way the software and its environment behave. These are things like fixing the underlying bug that caused the problem, adding automated test cases, adding an additional monitor and corresponding alert, and so on.

Process oriented changes are changes that impact the way that the team works with each other and the software. These are things like implementing code reviews, adding steps to a deployment runbook, introducing maintenance windows, and so on.

Typically the way this phase works is that the moderator keeps the discussion focused on these two types of changes and enumerates them on a whiteboard or in a shared document of some sort. A framework that I've seen work successfully as a moderator is to start with process changes and then, for each suggested process change, link one or more technical changes to it. Once complete, the moderator will often take each suggested change and convert it into an executable action item to be prioritized in subsequent development cycles.
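One lightweight way to capture that whiteboard output is as structured records, so that each process change, its linked technical changes, and the resulting action items survive the meeting and can be prioritized later. A minimal sketch; the field names and the sample entry (taken from the example below) are illustrative, not a prescribed format.

```python
# Sketch: record post-mortem outcomes so they can be converted into
# prioritized action items after the meeting. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionItem:
    description: str
    owner: str = "unassigned"
    done: bool = False

@dataclass
class ProcessChange:
    description: str
    linked_technical_changes: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

# One of the process changes from the example below, with a linked
# technical change and a hypothetical action item attached to it.
deploy_more_often = ProcessChange(
    description="Deploy more often, don't allow multiple changes to stack up.",
    linked_technical_changes=[
        "Retire manual deployment and implement an automated deployment pipeline.",
    ],
    action_items=[ActionItem("Evaluate options for an automated deployment pipeline")],
)
```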

Example

Process changes

  • Split cron jobs up into multiple, smaller jobs.
  • Update the code review process to consider CPU impact.
  • Deploy more often, don't allow multiple changes to stack up.

Technical changes

  • Move cron-like tasks to their own server so they don't impact users.
  • Add CPU monitoring and alerting where impactful.
  • Add load balancer monitoring and alerting for when timeouts occur.
  • Retire manual deployment and implement an automated deployment pipeline.
  • Eliminate cron-jobs entirely (long term goal).
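As a concrete illustration of the monitoring and alerting items above, a check along these lines could run on a schedule against recent load balancer data. The 5% threshold and the source of the recent status codes are assumptions made for the sketch, not something prescribed by the post-mortem.

```python
# Sketch: a minimal load balancer timeout alert. The threshold and the
# source of recent status codes are assumptions; a real system would pull
# them from its metrics or log pipeline.
from typing import Iterable

TIMEOUT_RATE_THRESHOLD = 0.05  # assumed: alert if >5% of recent requests time out

def should_alert(recent_statuses: Iterable[int],
                 threshold: float = TIMEOUT_RATE_THRESHOLD) -> bool:
    statuses = list(recent_statuses)
    if not statuses:
        return False
    timeout_rate = sum(1 for s in statuses if s == 504) / len(statuses)
    return timeout_rate > threshold

if __name__ == "__main__":
    recent = [200, 200, 504, 200, 504, 504, 200, 200, 200, 200]
    if should_alert(recent):
        print("ALERT: load balancer timeout rate above threshold")
```

The same pattern applies to the CPU alerting item: sample the metric, compare it against a threshold, and notify someone when it stays elevated.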