@joram
Last active August 16, 2021 16:04
Incident Response

TL;DR: What Happened?

Keep this blameless: don’t use people’s names. “John deployed shitty code” is not useful; better wording would be “a developer deployed, bypassing Jenkins”.

Who was affected?

API customers? UI customers? Which products? How long? What sort of impact?

Incident Timeline:

be detailed

be concise

use bullet points

timestamps if possible

Fixes

Short term

what was done to fix the problem and get things working

Prevention, Alerting, and Faster Resolutions:

how will this not happen again?

if it happens again, how can we fix it faster?

process changes? smoke tests? periodic review of existing process?

Long Term Fixes

Links to the tickets that will be done during the regular workflow

Triage (PagerDuty rotation)

One person at a time is on-call. We use PagerDuty to coordinate this, and automated/manual triggers for possible incidents go to their phone (day or night). It’s their job to decide the severity of the incident and, if needed, start the appropriate response. They should feel comfortable summoning anyone else to help triage or prioritize.
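As a concrete illustration, here is a minimal sketch of how an automated trigger might page the on-call through PagerDuty’s Events API v2. The integration key, summary text, and monitoring source are placeholders, not our real configuration.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint (public, documented by PagerDuty).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build a trigger payload that pages whoever is on-call."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # one of: critical, error, warning, info
        },
    }

def send_event(event: dict) -> None:
    """POST the event to PagerDuty (this is what makes the phone ring)."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example: an automated monitor flagging a possible incident.
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, per-service key
    summary="checks API error rate above 5% for 10 minutes",
    source="monitoring",
)
```

The on-call then triages the page; the payload itself carries no severity decision beyond what the monitor guessed.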

Response(s)

Low

A feature is broken or not working as expected, or a data provider (contacted through the pipeline) is down.

Response

Create a ticket (regular workflow), talk to PMs in #triage, and communicate with support/customers about expected delays.

Medium

A small subset of customers are unable to do checks.

Response

This is handled right away during working hours, but can be left until the morning if it happens outside of working hours.

High

  • Hemorrhaging Money/Data/Reputation
  • The system (or a vertical of our system) is not usable by most of our customers.

Response

Start an Incident Response (see below)

The rest of this doc is about responding to HIGH blast radius incidents.
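The triage decision above boils down to a simple mapping from severity to response. This sketch is illustrative only — the enum values and return strings paraphrase the tiers above and are not exact policy.

```python
from enum import Enum

class Severity(Enum):
    LOW = "low"        # feature broken, or a data provider is down
    MEDIUM = "medium"  # small subset of customers unable to do checks
    HIGH = "high"      # hemorrhaging money/data/reputation; system unusable

def response_for(severity: Severity, during_work_hours: bool = True) -> str:
    """Map a triaged severity to the response described above."""
    if severity is Severity.LOW:
        return "create a ticket (regular workflow)"
    if severity is Severity.MEDIUM:
        return ("handle right away" if during_work_hours
                else "leave until morning")
    return "start an incident response"
```

The point of encoding it is that the on-call should never be improvising the response level at 3am.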

Roles

At the start of the incident, the person on call wears all hats. It is their responsibility to delegate these responsibilities and not get overwhelmed. They should also delegate when they need to rotate out and get some rest (don’t wear yourself out).

Incident Coordinator

assign roles

make sure the right people are in the room

make sure the team is diagnosing and moving towards a solution

don’t do the work

coordinate the people

request more people when needed

rotate out people when they are tired

release people when they are not needed

coordinate the post mortem(s)

dev wide for review

company wide if needed for transparency

coordinate the tickets and long-term work needed to avoid these issues in the future

Communicator

periodic updates to the rest of the company when appropriate (every 15 minutes is a good rule of thumb)

setting the appropriate expectations with CS and customers (via the status page, if/when we get one)

pin a message in #fires listing the current roles so people know who is doing what
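The communicator’s 15-minute drumbeat can be templated so updates stay consistent under pressure. A minimal sketch, assuming Slack incoming webhooks (the webhook URL is per-workspace and hypothetical here, as are the role names):

```python
import json
import urllib.request
from datetime import datetime, timezone

def format_update(status: str, roles: dict, next_update_min: int = 15) -> str:
    """Format a periodic incident update for #fires."""
    ts = datetime.now(timezone.utc).strftime("%H:%M UTC")
    who = ", ".join(f"{role}: {name}" for role, name in roles.items())
    return (f"[{ts}] Incident update: {status}\n"
            f"Roles: {who}\n"
            f"Next update in ~{next_update_min} min.")

def post_to_slack(webhook_url: str, text: str) -> None:
    """Post via a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example update (names are placeholders):
msg = format_update(
    "API error rate recovering; bandaid deployed, monitoring",
    {"coordinator": "A", "communicator": "B", "resolver": "C"},
)
```

Announcing when the next update will come is as important as the update itself; it keeps people out of the room.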

Doer(s)/Resolver(s)

Usually Devs and DevOps folks.

figure out what is going wrong

figure out the bandaid solution (if necessary)

do bandaid solution

communicate long term fixes needed (during a post mortem)

Who can play the roles?

Anyone who feels comfortable doing so. If you are new at this, ask for a shadow and feedback afterwards.

For anyone interested in joining the rotation, talk to your team lead and/or someone on rotation. The process for training would be roughly:

onboarding discussion with someone from the rotation

adding you to the rotation when/if you feel ready

have a shadow (someone already on rotation) during your rotations until you feel comfortable going at it alone.
