@skyzyx
Last active December 8, 2022 19:32
Internal (not customer-facing) Root Cause Analysis (aka Post-Mortem) template

[20XX-XX-XX] Service Name downtime

  • This is a blameless post-mortem.
  • We will not dwell on past events in terms of "could've", "should've", etc.
  • All follow-up action items will be assigned to a team or individual before the end of the meeting.
  • If an item will not be a top priority coming out of the meeting, don't make it a follow-up item.

Business Unit: {UNIT}
Resolver Group Name: {TEAM}
Applications Affected: {APPLICATIONS}
Tracking Ticket: {LINK}
Slack channel: #channel

Incident Leader: @Person

Current Incident Status

  • In-Progress (Red)
  • Monitoring (Yellow)
  • Resolved (Green)

Description

{Service} {Environment} was {Status} for {Duration}

Succinct description of the issue, root cause, stabilization steps, additional info. This section is the most important of the entire RCA, and should be written in Inverted Pyramid style, aka “I’m OK; The Bull is Dead” format. Within a few sentences, the author will describe the customer impact in terms of severity and radius, what actions were taken to remediate, and the initial conclusion as to the preventive actions (automation, additional testing, architecture or design changes, etc.) implemented to prevent a recurrence.
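The one-line summary at the top of this section can be generated from the incident's start and end times. The following is a minimal sketch, not part of the original template: the `summary_line` helper, the service name, and the calendar date are hypothetical, and the times simply mirror the sample timeline (11:10am to 11:25am).

```python
from datetime import datetime


def summary_line(service: str, environment: str, status: str,
                 start: datetime, end: datetime) -> str:
    """Fill in the '{Service} {Environment} was {Status} for {Duration}' line."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Round down to whole minutes; sub-minute precision rarely matters in an RCA.
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    duration = f"{hours}h {minutes}m" if hours else f"{minutes}m"
    return f"{service} {environment} was {status} for {duration}"


# Hypothetical values for illustration only.
start = datetime(2022, 1, 1, 11, 10)
end = datetime(2022, 1, 1, 11, 25)
print(summary_line("Example Service", "Non-Prod", "offline", start, end))
```

Deriving the duration from the recorded timestamps, rather than typing it by hand, keeps the summary consistent with the timeline.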

Timeline (US/Pacific)

| Time | Person | Event | Notes |
| --- | --- | --- | --- |
| 11:10–11:11am | | New Relic Synthetics detected that none of the servers were able to connect to the app and triggered an alert. The alert was routed via PagerDuty to Toph Bei Fong, the primary on-call for this week. | Source |
| 11:19am | Katara | Notified the Slack channel that the Non-Prod environment was offline. | Source, screenshot image |
| 11:25am | Katara | Posted an update to the Slack channel showing that the New Relic APM graphs were reporting service activity again. | Source |
| 11:25am–12:00pm | | Monitoring state. Everything was fine. Moving to Resolved state. | |

ℹ️ Everything above this point should be filled out at the tail end of the incident, while the information is fresh in the author’s mind. The developer or SRE who owns the application/system under discussion is responsible for filling out this section. The timeline is sourced from the information provided in the Slack channel. It is ESSENTIAL that key events from the conference bridge are written down in the Slack channel, as this makes it very easy to reconstruct the timeline at the conclusion of the call. Only key events need to be noted.

Contributing Factors (Root Causes, Triggers)

  • List out the known root causes for this issue

Stabilization Steps

  • List out all the steps taken to stabilize even if they were not successful. Indicate which were successful and which were not. Be explicit here so that future issues can benefit from the information.

Lessons Learned

What went well

  • List what happened that went well for us

What went wrong

  • List what went wrong that we need to improve on

Where we got lucky

  • List things that worked in our favor as a result of blind luck, if any

Corrective Actions

Areas to focus on:

  • Prevention: Root-cause understanding and the resulting fixes go in this bucket.

  • Detection: For P1 events that can be monitored and alerted on, how can we detect and alert within minutes so the team knows about it ASAP?

  • Resolution: Once we were engaged in resolving the issue, what improvements could we make to reduce the resolution time?

The focus of the JIRA tickets should stay away from "hoping for the best" or "we'll do better next time" responses. They should follow the S.M.A.R.T. criteria and be:

  • Specific: Target a specific area for improvement.
  • Measurable: Quantify or at least suggest an indicator of progress.
  • Assignable: Specify who will do it.
  • Realistic: State what results can realistically be achieved, given available resources.
  • Time-related: Specify when the result(s) can be achieved.
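The criteria above can be partly checked mechanically before the meeting ends. The following is a rough sketch under stated assumptions: the `CorrectiveAction` fields and the `missing_criteria` helper are hypothetical and do not correspond to any Jira API. "Realistic" is left out on purpose, since it needs human judgment.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CorrectiveAction:
    """Hypothetical record for one follow-up item (not a Jira data model)."""
    summary: str                               # Specific: the targeted improvement
    progress_indicator: Optional[str] = None   # Measurable: how progress is tracked
    owner: Optional[str] = None                # Assignable: team or individual
    due: Optional[date] = None                 # Time-related: target date


def missing_criteria(action: CorrectiveAction) -> List[str]:
    """Return the S.M.A.R.T. criteria that the action does not yet satisfy."""
    missing = []
    summary = action.summary.strip()
    if not summary or summary.upper() == "TBD":
        missing.append("Specific")
    if not action.progress_indicator:
        missing.append("Measurable")
    if not action.owner:
        missing.append("Assignable")
    if action.due is None:
        missing.append("Time-related")
    return missing


# Hypothetical example: an item with an owner but no indicator or due date.
draft = CorrectiveAction(summary="Add a synthetic check for the login page", owner="@Person")
print(missing_criteria(draft))
```

An empty result does not prove an item is a good one; it only confirms that none of the checkable fields were left blank.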

List of corrective actions:

  • TBD
  • TBD
  • TBD