Post mortems are an essential part of keeping our site up and reliable. The point of them is not to cast blame or to point fingers. They are about one thing:
- Learning
This doc explains when to run a post mortem, who should be involved, where it should happen, and what the outcome is.
Post mortems should be run if any of these circumstances happen:
- User-visible downtime or degradation beyond a 5% threshold. If 5% of all traffic is returning errors, we should have a post mortem.
- Data loss of any kind
- On-call engineer intervention (release rollback, rerouting traffic, production database fixes)
- Resolution time above 30 minutes
- Monitoring failure (the incident was discovered manually rather than by a monitor)
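The criteria above can be sketched as a small check. This is only an illustration, not existing tooling: the `Incident` fields and the function names are hypothetical, and it assumes that an error rate of exactly 5% already counts as crossing the threshold.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the list above.
ERROR_RATE_THRESHOLD = 0.05      # 5% of all traffic
RESOLUTION_LIMIT_MINUTES = 30


@dataclass
class Incident:
    """Hypothetical summary of an incident; field names are assumptions."""
    error_rate: float            # fraction of all traffic that errored, 0.0 to 1.0
    data_lost: bool              # any data loss at all
    on_call_intervened: bool     # rollback, traffic rerouting, production database fix
    resolution_minutes: float    # time from detection to resolution
    discovered_manually: bool    # True if monitoring did not catch it


def post_mortem_reasons(incident: Incident) -> List[str]:
    """Return every criterion from the list above that this incident meets."""
    reasons = []
    # Assumption: exactly 5% is treated as crossing the threshold.
    if incident.error_rate >= ERROR_RATE_THRESHOLD:
        reasons.append("user-visible downtime or degradation")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if incident.resolution_minutes > RESOLUTION_LIMIT_MINUTES:
        reasons.append("resolution time above 30 minutes")
    if incident.discovered_manually:
        reasons.append("monitoring failure")
    return reasons


def needs_post_mortem(incident: Incident) -> bool:
    """A post mortem is needed if any single criterion is met."""
    return bool(post_mortem_reasons(incident))
```

Returning the list of reasons rather than a bare yes/no gives the post mortem a starting agenda: each matched criterion is something the group should discuss.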
The people who should be involved are:
- The engineer who was on call when the incident happened.
- A product owner.
- The original code writer (if applicable).
- A QA person.
- An engineer versed in reliability or DevOps, to help with solutions.
The post mortem should happen in the production control center in Sococo.
Post mortems are action based. At the end of a post mortem, one of the following should be an outcome:
- A new monitor signal
- A new kaizen process in place
- An adjustment to the current on-call schedule
- An adjustment to peer review in GitHub
- A new project to rewrite code that is buggy
In every case, a written-up document should be appended to this doc for future reference. In general it should follow this structure: what happened, what was done to intervene, and what we are doing to prevent it from happening in the future.
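As one way to keep write-ups consistent, the three-part structure can be generated as an empty skeleton for the author to fill in. This is a sketch only: the `write_up_skeleton` helper and the plain-text layout are assumptions, not an established template.

```python
# Section titles follow the structure described above.
SECTIONS = (
    "What happened",
    "What was done to intervene",
    "What are we doing to prevent it from happening in the future",
)


def write_up_skeleton(title: str) -> str:
    """Return an empty plain-text write-up with one heading per section."""
    lines = [f"Post mortem: {title}", ""]
    for section in SECTIONS:
        lines.append(f"{section}:")
        lines.append("")  # blank line left for the author to fill in
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical incident title, for illustration only.
    print(write_up_skeleton("Example incident"))
```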