COE - Correction of Error - document template

Correction of Error (COE)

Incident:

[Incident Title]

Ticket: [Link to ops ticket]
Authors: [Names of COE authors]
Escalations before response: [Number of team members the escalation went through before somebody accepted the page]
Time to first response: [How long from first page until a team member accepted and responded]
Participants: [Number of responding team members]

Incident Description

Provide a high-level summary or abstract of what happened.

  • What system or subsystem was affected?
  • Who are the customers that were affected?
  • What was the incident's impact on the affected customers?
  • Use journalistic style; see: https://en.wikipedia.org/wiki/News_style

Timeline

Briefly describe the incident timeline in chronological order, from the time the issue started, through customer impact, to incident resolution. In retrospect, note the effect of each event on the outcome.

  • Which events had positive impact on the outcome?
  • Which events had no impact on restoring service?
  • Which events had negative impact on the outcome?
  • RUNBOOK: Note the positive events as possible additions to the runbook.
  • Event 1
  • Event 2
  • Event 3

Detection

Describe how the incident was detected or reported, including any supporting graphs.

  • Did we find out about this from an existing alarm?
  • Did we find out from user complaints?
  • Did we find out from some other method?
  • GRAPHS: Show graphs of the primary impact, plus any secondary graphs.
    • If no graphs exist, open a ticket to instrument this part of the system (see the sketch after this list).
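
For teams running on AWS, a minimal sketch of such instrumentation, assuming CloudWatch and boto3, might look like the following: publish a custom metric from the affected code path, then alarm on it. The service name, metric name, thresholds, and SNS topic ARN are placeholders, not values taken from any real incident.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a custom metric each time the affected code path fails.
    cloudwatch.put_metric_data(
        Namespace="ExampleService",              # placeholder namespace
        MetricData=[{
            "MetricName": "CheckoutErrors",      # placeholder metric name
            "Value": 1,
            "Unit": "Count",
        }],
    )

    # Alarm when the error count stays elevated for five consecutive minutes,
    # paging the on-call via an SNS topic (placeholder ARN).
    cloudwatch.put_metric_alarm(
        AlarmName="ExampleService-CheckoutErrors-High",
        Namespace="ExampleService",
        MetricName="CheckoutErrors",
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=5,
        Threshold=10,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
    )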

Investigation

Describe in detail what was inspected and/or investigated to confirm and remedy the problem.

  • What discovery or investigation was done?
  • What process was used?
  • What tools were used?
  • What did you find in each tool?
  • What were your hunches or assumptions?
  • What did you rule out?
  • RUNBOOK: Was there an existing runbook covering confirmation and mitigation steps?
    • If there was no runbook, or its steps were wrong, open a ticket to create or update it.

Response

Describe the specific mitigation steps that were taken during the incident to eliminate or lessen the surface problem and its "Blast Radius", typically before root cause was fully understood.

  • How did you make the pain stop?
  • How effective were the mitigation steps? Did they permanently fix the issue?
  • Step 1
  • Step 2
  • Step 3

Root Cause

AKA the Five Whys Worksheet:

  • Start by asking why the surface problem occurred, and work backwards and/or inwards toward the root cause.
  • Keep asking "why" until you get to the root cause.
  • There can be more than 5 questions, but consider 5 the minimum.
  • It's okay to follow question branches.
  • See: https://en.wikipedia.org/wiki/Five_whys

Problem: [Describe the surface problem]
1. Why? [underlying cause...]
2. Why? [underlying cause...]
3. Why? [underlying cause...]
4. Why? [underlying cause...]
5. Why? [underlying cause...]
Root Cause: [Actual Root Cause]

Resolution

Describe the remedies applied to the system to resolve the problem and its underlying causes once the root cause was fully understood.

  • How was the issue resolved and its impact fully remedied?
  • What resources or teams were required to apply the full remedy?
  • How was the customer made whole post-incident?
  • PENDING WORK: Identify and open tickets for any underlying causes that are not fully resolved.

Correlation

  • Was this a known failure mode?
  • Were there existing backlog items for this issue, and would completing them have prevented it?
  • Was there scheduled work to address it?

Learnings and Recommendations

  • How can we do better (in alarms, process, automation, time to response, etc.)?
  • What could we have done to prevent this issue from occurring?
  • How do we ensure this incident never happens again?
  • What can we do to improve how the incident was handled?
  • Think big, think outside the box.
  • As a thought exercise, how could the blast radius for a similar event be cut in half?
  • What follow-up actions are we taking?
  • What scheduled work was created?

Discussion

Describe any feedback, comments, and questions from stakeholders and observers.

  • What went well?
  • What went wrong?
  • Where did we get lucky?
  • What are the actionable tasks and follow-ups?