OVO's post-mortem template

Post-mortem title

Incident date:

Post-mortem date:



High-level overview of the problem

Incident Description

Incident timeline

Record the timeline of the incident here

  • 08:32: Alert triggered ....
  • 09:41: Service became unavailable ....
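When filling in the timeline, it can help to state the elapsed time between entries, for example when answering how long it took to get the right response. A minimal sketch, assuming entries use 24-hour HH:MM times on the same day; the helper name minutes_between and the timeline list are hypothetical and simply reuse the example entries above:

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (assumes 'HH:MM', same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Hypothetical timeline taken from the example entries in this template.
timeline = [
    ("08:32", "Alert triggered"),
    ("09:41", "Service became unavailable"),
]

# Print the gap between each pair of consecutive entries.
for (t0, e0), (t1, e1) in zip(timeline, timeline[1:]):
    print(f"{e0} -> {e1}: {minutes_between(t0, t1)} min")
```

Incidents that cross midnight would need full dates rather than times alone.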


Describe the impact of the incident: for example, the number of users who couldn't log in, or that payments couldn't be taken for 4 hours.

Contributing factors

An explanation of the circumstances in which this incident happened. What do we think caused it? It’s often helpful to use a technique such as the 5 Whys to understand the contributing factors.


What caused the incident? Did something trigger it, such as a sudden influx of traffic?



How were we alerted to the problem? Did the right person (or team) detect the issue or get alerted to it? If not, why not? How long did it take to get the right response?


What was done to restore service and resolve the problem?

Response Improvements

Consider long-term and short-term fixes.


How could we have spotted this issue sooner? Consider alerting, metrics, access to experts, and escalations.


What could be done to speed up recovery next time? Consider development processes, available metrics, and system feedback. Were the right people available?

Mitigation Improvements

How do we prevent this issue from occurring in future? Consider system design, testing, chaos engineering, failure domains.


Item | Action | Owner | Priority | JIRA ref

Context on actions

Use this section to describe the actions in more detail if needed.

