@toufik-airane
Created September 14, 2021 09:19
Post-mortem Template

Incident Postmortem Template

Clear documentation is key to an effective incident postmortem process. Many teams use a comprehensive template to collect consistent details during each postmortem review. Below is an example of an incident postmortem template, based on the postmortem outlined in our Incident Handbook. You can copy and paste these sections to document your own postmortems.

Incident summary

Write a summary of the incident in a few sentences. Include what happened, why, the severity of the incident and how long the impact lasted.

Leadup

Describe the sequence of events that led to the incident, for example, previous changes that introduced bugs that had not yet been detected.

Fault

Describe how the change that was implemented didn't work as expected. If available, attach screenshots of relevant data visualizations that illustrate the fault.

Impact

Describe how the incident impacted internal and external users. Include how many support cases were raised.

Detection

When did the team detect the incident? How did they know it was happening? How could we improve time-to-detection? Consider: How could we have cut that time in half?
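
As an illustration of the time-to-detection question, the following is a minimal sketch that computes the gap between the first known impact and detection, then derives the "cut in half" target. The function name and both timestamps are hypothetical and only serve as an example; substitute the values from your own incident.

```python
from datetime import datetime, timezone


def time_to_detection(impact_start: datetime, detected_at: datetime) -> float:
    """Return the minutes elapsed between first known impact and detection."""
    if detected_at < impact_start:
        raise ValueError("detection cannot precede the start of impact")
    return (detected_at - impact_start).total_seconds() / 60


# Hypothetical timestamps, for illustration only.
impact_start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
detected_at = datetime(2024, 1, 1, 9, 42, tzinfo=timezone.utc)

ttd_minutes = time_to_detection(impact_start, detected_at)
target_minutes = ttd_minutes / 2  # the "cut that time in half" goal

print(f"Time to detection: {ttd_minutes:.0f} min; target: {target_minutes:.0f} min")
```

Recording both numbers in the postmortem gives the team a concrete goal to measure future detection improvements against.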

Response

Who responded to the incident? When did they respond, and what did they do? Note any delays or obstacles to responding.

Recovery

Describe how the service was restored and how the incident was deemed over. Detail how you knew which steps you needed to take to recover.

Root cause identification (The Five Whys)

Note the final root cause of the incident, the thing identified that needs to change in order to prevent this class of incident from happening again.

The Five Whys is a root cause identification technique. Here’s how you can use it:

Begin with a description of the impact and ask why it occurred. Note the impact that it had.
Ask why this happened, and why it had the resulting impact. Then, continue asking “why” until you arrive at a root cause. List the “whys” in your postmortem documentation.

Backlog check

Review your engineering backlog to find out if there was any unplanned work there that could have prevented this incident, or at least reduced its impact. A clear-eyed assessment of the backlog can shed light on past decisions around priority and risk.

Recurrence

Now that you know the root cause, can you look back and see any other incidents that could have the same root cause? If yes, note what mitigation was attempted in those incidents and ask why this incident occurred again.

Mitigation and resolution

What steps did you take to resolve this incident? Describe the corrective action ordered to prevent this class of incident in the future. Note who is responsible, when they have to complete the work, and where that work is being tracked.

Lessons learned

What went well in the incident response? What could have gone better? What else did you learn? Note where there are opportunities for improvement. Check out our article on postmortem templates for more example questions to include on a postmortem report.

What else to include on a postmortem report

Screenshots

Attach relevant screenshots, especially ones the response team took during the outage. What did you see change in the product? What product behavior didn’t happen as expected?

Tickets

Link to any relevant tickets related to the incident.

Customer feedback

Did any customer feedback come in about the incident? It may have been reported through a help desk, over email, or on social media. Don’t worry about including all of it.

Charts and graphs

What data visualizations help show the impact of the incident?

Data

Any other key data points about the incident or its impact?

Chat exchanges

If the team uses a chat tool like Slack during the response effort, consider including any key messages or exchanges from the chat history.

Timeline

A clear timeline of the incident is an excellent aid for incident analysis. What were the key events during the incident, and what were their timestamps? We recommend using UTC to standardize across time zones.

Include any notable lead-up events, any starts of activity, the first known impact, and escalations. Note any decisions or changes made, and when the incident ended, along with any post-impact events of note.
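
One way to apply the UTC recommendation is sketched below, assuming responders recorded events with their own local UTC offsets. The event labels, timestamps, and the `to_utc_timeline` helper are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical events, each recorded with the responder's local UTC offset.
events = [
    ("Rollback completed", datetime(2024, 1, 1, 12, 30, tzinfo=timezone(timedelta(hours=1)))),
    ("First alert fired", datetime(2024, 1, 1, 5, 45, tzinfo=timezone(timedelta(hours=-5)))),
    ("Incident declared", datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)),
]


def to_utc_timeline(events):
    """Convert every timestamp to UTC and return the events in chronological order."""
    normalized = [(ts.astimezone(timezone.utc), label) for label, ts in events]
    return sorted(normalized)


timeline = to_utc_timeline(events)
for ts, label in timeline:
    print(f"{ts:%Y-%m-%d %H:%M} UTC  {label}")
```

Normalizing before sorting matters: sorted by local clock time, these events would appear in the wrong order.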
