@hexgnu
Created June 30, 2016 18:15

Post Mortems for infrastructure

Post mortems are an essential part of keeping our site up and reliable. The point is not to cast blame or point fingers. A post mortem is about one thing only:

  • Learning

This doc explains when to run a post mortem, who should be involved, where it should happen, and what the outcome should be.

When to run a post mortem?

A post mortem should be run if any of the following occurs:

  • User-visible downtime or degradation at or beyond a 5% threshold. If 5% of all traffic results in an error, we should have a post mortem.
  • Data loss of any kind
  • On-call engineer intervention (release rollback, rerouting traffic, production database fixes)
  • Resolution time above 30 minutes
  • Monitoring failure (the incident was discovered manually rather than by a monitor)
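
The criteria above can be read as a simple checklist. The following is a minimal sketch of that checklist in Python, not an existing tool: the `Incident` record and its field names are hypothetical, and it assumes that an error rate of exactly 5% counts as meeting the threshold.

```python
from dataclasses import dataclass

ERROR_RATE_THRESHOLD = 0.05   # 5% of all traffic, from the criteria above
MAX_RESOLUTION_MINUTES = 30   # resolution time limit, from the criteria above


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    total_requests: int
    failed_requests: int
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    detected_by_monitoring: bool


def post_mortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion the incident meets; an empty list means none."""
    reasons = []
    if incident.total_requests > 0:
        error_rate = incident.failed_requests / incident.total_requests
        # Assumption: exactly 5% already counts as reaching the threshold.
        if error_rate >= ERROR_RATE_THRESHOLD:
            reasons.append(f"error rate {error_rate:.1%} reached the 5% threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if incident.resolution_minutes > MAX_RESOLUTION_MINUTES:
        reasons.append("resolution time above 30 minutes")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure (manual discovery)")
    return reasons
```

Any non-empty result means a post mortem should be run. Returning the list of reasons rather than a single yes/no keeps the trigger visible when the post mortem is scheduled.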

Who should be involved?

The people who should be involved are:

  • The on-call engineer at the time of the incident.
  • A product owner.
  • The original author of the code (if applicable).
  • A QA person.
  • An engineer versed in reliability or devops, to help with solutions.

Where should it happen?

The post mortem should happen in the production control center in Sococo.

What should the outcome be?

Post mortems are action based. At the end of a post mortem, at least one of the following should be an outcome:

  • A new monitor signal
  • A new kaizen process in place
  • An adjustment to the current on-call schedule
  • An adjustment to peer review in GitHub
  • A new project to rewrite code that is buggy

In every case, there should also be a written document appended to this one for future reference. In general it should follow this structure: what happened, what was done to intervene, and what we are doing to prevent it from happening in the future.
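
To keep write-ups consistent, the three-part structure can be generated from a small template. This is only a sketch: the function `render_write_up` and its parameters are hypothetical, and the plain-text layout is an assumption, not a prescribed format.

```python
def render_write_up(title: str, what_happened: str,
                    intervention: str, prevention: str) -> str:
    """Render a plain-text write-up with the three sections named above."""
    sections = [
        ("What happened", what_happened),
        ("What was done to intervene", intervention),
        ("What are we doing to prevent it from happening in the future", prevention),
    ]
    lines = [title, ""]
    for heading, body in sections:
        # Each section is a heading line, a blank line, then the body text.
        lines += [heading, "", body.strip(), ""]
    return "\n".join(lines).rstrip() + "\n"
```

The returned text can be appended to this document as is. Fixing the section order in one place means every write-up answers the same three questions in the same sequence.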
