hexgnu/post mortems.md

Post Mortems for Infrastructure

Post mortems are an essential part of keeping our site up and reliable. Their purpose is not to cast blame or point fingers. They are about one thing:

Learning

This doc explains when to run a post mortem, who should be involved, where it should happen, and what the outcome is.

When to run a post mortem?

Run a post mortem if any of the following occurs:

User-visible downtime or degradation beyond a 5% threshold. If 5% of all traffic results in errors, we should have a post mortem.
Data loss of any kind.
On-call engineer intervention (release rollback, rerouting traffic, production database fixes).
Resolution time above 30 minutes.
Monitoring failure (the incident was discovered manually).
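
The criteria above can be sketched as a small check. This is only an illustration, not an existing tool: the `Incident` record and its field names are hypothetical, and treating exactly 5% as meeting the threshold is an assumption, since the text says both "beyond" and "if 5%".

```python
from dataclasses import dataclass
from typing import List

ERROR_THRESHOLD_PERCENT = 5   # from the first criterion above
MAX_RESOLUTION_MINUTES = 30   # from the resolution-time criterion above


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    total_requests: int
    failed_requests: int
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    found_by_monitoring: bool


def post_mortem_reasons(incident: Incident) -> List[str]:
    """Return every criterion the incident meets; an empty list means none."""
    reasons = []
    # Integer comparison avoids floating-point rounding at exactly 5%.
    # Assumption: exactly 5% counts as meeting the threshold.
    if (incident.total_requests > 0 and
            incident.failed_requests * 100
            >= incident.total_requests * ERROR_THRESHOLD_PERCENT):
        reasons.append("user-visible errors at or above 5% of traffic")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if incident.resolution_minutes > MAX_RESOLUTION_MINUTES:
        reasons.append("resolution time above 30 minutes")
    if not incident.found_by_monitoring:
        reasons.append("monitoring failure (manual discovery)")
    return reasons


def needs_post_mortem(incident: Incident) -> bool:
    """Any single criterion is enough to call for a post mortem."""
    return bool(post_mortem_reasons(incident))
```

Returning the list of reasons rather than a bare boolean gives the meeting a starting agenda: each matched criterion is something to discuss.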

Who should be involved?

The people who should be involved are:

The on-call engineer (when the incident happened).
A product owner.
The original code writer (if applicable).
A QA person.
An engineer versed in reliability or DevOps, to propose solutions.

Where should it happen?

The post mortem should happen in the production control center in Sococo.

What should the outcome be?

Post mortems are action-based. Every post mortem should end with at least one of the following outcomes:

A new monitor signal
A new kaizen process in place
An adjustment to the current on-call schedule
An adjustment to peer review in GitHub
A new project to rewrite code that is buggy

In every case, a written document should be appended to this one for future reference. In general it should follow this structure: what happened, what was done to intervene, and what we are doing to prevent it from happening in the future.
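
One way to keep write-ups consistent is to generate them from a fixed three-part structure. The sketch below is a suggestion, not an established tool: the `PostMortemWriteup` class, its field names, and the plain-text layout are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PostMortemWriteup:
    """Hypothetical container for the three sections named above."""
    title: str
    what_happened: str
    intervention: str
    prevention: List[str]  # action items, e.g. a new monitor signal

    def render(self) -> str:
        """Render the write-up as plain text, one section per heading."""
        lines = [
            self.title,
            "",
            "What happened",
            self.what_happened,
            "",
            "What was done to intervene",
            self.intervention,
            "",
            "What are we doing to prevent it from happening in the future",
        ]
        lines.extend(f"- {item}" for item in self.prevention)
        return "\n".join(lines)
```

Keeping prevention items as a list makes it easy to confirm that each post mortem ended with at least one concrete action.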