OVO's post-mortem template

Post-mortem title


Incident date:

Post-mortem date:

Attendees

Summary

High-level overview of the problem.

Incident Description

Incident timeline

Record the timeline of the incident here

  • 08:32: Alert triggered ....
  • 09:41: Service became unavailable ....
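If it helps to quantify the gaps between timeline entries (for example, how long after the alert the service became unavailable), the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names `parse_timeline` and `gaps` are hypothetical, and the sketch assumes every entry is written as `HH:MM: description` and falls on the same day.

```python
from datetime import datetime, timedelta


def parse_timeline(lines):
    """Parse 'HH:MM: description' entries into (time, description) pairs.

    Assumes all entries are on the same day and in chronological order.
    """
    entries = []
    for line in lines:
        # Drop surrounding whitespace and an optional leading bullet.
        text = line.strip().lstrip("•").strip()
        # Split on the first ': ', which follows the HH:MM timestamp.
        stamp, _, description = text.partition(": ")
        entries.append((datetime.strptime(stamp, "%H:%M"), description))
    return entries


def gaps(entries):
    """Return (description, elapsed) for each entry after the first,
    where elapsed is the time since the previous entry."""
    return [
        (later[1], later[0] - earlier[0])
        for earlier, later in zip(entries, entries[1:])
    ]


timeline = [
    "08:32: Alert triggered",
    "09:41: Service became unavailable",
]

for description, elapsed in gaps(parse_timeline(timeline)):
    minutes = int(elapsed.total_seconds() // 60)
    print(f"+{minutes} min: {description}")  # prints "+69 min: Service became unavailable"
```

Incidents that span midnight would need full dates rather than bare times.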

Impact

Quantify the impact: for example, the number of users who couldn't log in, or that payments couldn't be taken for 4 hours.

Contributing factors

An explanation of the circumstances in which this incident happened. What do we think caused it? It’s often helpful to use a technique such as the 5 Whys to understand the contributing factors.

Trigger

What set the incident off? Was there a specific event, such as a sudden influx of traffic?

Response

Detection

How were we alerted to the problem? Did the right person (or team) detect the issue or get alerted to it? If not, why not? How long did it take to get the right response?

Resolution

What was done to restore service or resolve the problem?

Response Improvements

Consider both long-term and short-term fixes.

Detection

How could we have spotted this issue sooner? Consider alerting, metrics, access to experts, and escalations.

Resolution

What could be done to speed up recovery next time? Consider development processes, available metrics, and system feedback. Were the right people available?

Mitigation Improvements

How do we prevent this issue from occurring in future? Consider system design, testing, chaos engineering, failure domains.

Actions

Item | Action | Owner | Priority | JIRA ref
1    |        |       |          |
2    |        |       |          |
3    |        |       |          |

Context on actions

Use this section to describe the actions in more detail if needed.
