@joram
Last active August 16, 2021 16:04
Incident Response

TL;DR: What Happened?

Keep this blameless: don’t use people’s names. “John deployed shitty code” is not useful; better wording would be “a developer deployed, bypassing Jenkins”.

Who was affected?

API customers? UI customers? Which products? How long? What sort of impact?

Incident Timeline:

be detailed

be concise

use bullet points

timestamps if possible

Fixes

Short term

what was done to fix the problem and get things working

Prevention, Alerting, and Faster Resolutions:

how will this not happen again?

if it happens again, how can we fix it faster?

process changes? smoke tests? periodic review of existing process?

Long Term Fixes

Links to the tickets that will be done during the regular workflow

Triage (PagerDuty rotation)

One person at a time is on-call. We use PagerDuty to coordinate this, and automated/manual triggers for possible incidents go to their phone (day or night). It’s their job to decide the severity of the incident and, if needed, start the appropriate response. They should feel comfortable summoning anyone else to help triage or prioritize.
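As a concrete illustration, here is a minimal sketch of how an automated trigger might page the on-call through PagerDuty’s Events API v2. The integration key, summary text, and monitoring source are placeholders, not our real configuration.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint (public, documented by PagerDuty).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build a trigger payload that pages whoever is on-call."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # one of: critical, error, warning, info
        },
    }

def send_event(event: dict) -> None:
    """POST the event to PagerDuty (this is what makes the phone ring)."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example: an automated monitor flagging a possible incident.
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, per-service key
    summary="checks API error rate above 5% for 10 minutes",
    source="monitoring",
)
```

The on-call then triages the page; the payload itself carries no severity decision beyond what the monitor guessed.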

Response(s)

Low

A feature is broken or not working as expected, or a data provider (contacted through the pipeline) is down.

Response

Create a ticket (regular workflow), talk to PMs in #triage, and communicate with support/customers about expected delays.

Medium

A small subset of customers are unable to do checks.

Response

This is handled right away during working hours, but can be left until the morning if it happens outside of working hours.

High

  • Hemorrhaging Money/Data/Reputation
  • The system (or a vertical of our system) is not usable by most of our customers.

Response

Start an Incident Response (see below)

The rest of this doc is about responding to HIGH blast radius incidents.
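The triage decision above boils down to a simple mapping from severity to response. This sketch is illustrative only — the enum values and return strings paraphrase the tiers above and are not exact policy.

```python
from enum import Enum

class Severity(Enum):
    LOW = "low"        # feature broken, or a data provider is down
    MEDIUM = "medium"  # small subset of customers unable to do checks
    HIGH = "high"      # hemorrhaging money/data/reputation; system unusable

def response_for(severity: Severity, during_work_hours: bool = True) -> str:
    """Map a triaged severity to the response described above."""
    if severity is Severity.LOW:
        return "create a ticket (regular workflow)"
    if severity is Severity.MEDIUM:
        return ("handle right away" if during_work_hours
                else "leave until morning")
    return "start an incident response"
```

The point of encoding it is that the on-call should never be improvising the response level at 3am.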

Roles

At the start of the incident, the person on call wears all hats. It is their responsibility to delegate these responsibilities and not get overwhelmed. They should also delegate when they need to rotate out and get some rest (don’t wear yourself out).

Incident Coordinator

assign roles

make sure the right people are in the room

make sure the team is diagnosing and moving towards a solution

don’t do the work

coordinate the people

request more people when needed

rotate out people when they are tired

release people when they are not needed

coordinate the post mortem(s)

dev wide for review

company wide if needed for transparency

coordinate the tickets and long-term work needed to avoid these issues in the future

Communicator

periodic updates to the rest of the company when appropriate (every 15 minutes is a good rule of thumb)

setting the appropriate expectations with CS and customers (via the status page, if/when we get one)

pin a message in #fires listing the current roles so people know who is doing what
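The communicator’s 15-minute drumbeat can be templated so updates stay consistent under pressure. A minimal sketch, assuming Slack incoming webhooks (the webhook URL is per-workspace and hypothetical here, as are the role names):

```python
import json
import urllib.request
from datetime import datetime, timezone

def format_update(status: str, roles: dict, next_update_min: int = 15) -> str:
    """Format a periodic incident update for #fires."""
    ts = datetime.now(timezone.utc).strftime("%H:%M UTC")
    who = ", ".join(f"{role}: {name}" for role, name in roles.items())
    return (f"[{ts}] Incident update: {status}\n"
            f"Roles: {who}\n"
            f"Next update in ~{next_update_min} min.")

def post_to_slack(webhook_url: str, text: str) -> None:
    """Post via a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example update (names are placeholders):
msg = format_update(
    "API error rate recovering; bandaid deployed, monitoring",
    {"coordinator": "A", "communicator": "B", "resolver": "C"},
)
```

Announcing when the next update will come is as important as the update itself; it keeps people out of the room.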

Doer(s)/Resolver(s)

Usually Devs and DevOps folks.

figure out what is going wrong

figure out the bandaid solution (if necessary)

do bandaid solution

communicate long term fixes needed (during a post mortem)

Who can play the roles?

Anyone who feels comfortable doing so. If you are new at this, ask for a shadow and feedback afterwards.

For anyone interested in joining the rotation, talk to your team lead and/or someone on rotation. The process for training would be roughly:

onboarding discussion with someone from the rotation

adding you to the rotation when/if you feel ready

have a shadow (someone already on rotation) during your rotations until you feel comfortable going at it alone.
