lukebriscoe/OVO Post-mortem template.md

## OVO Post-mortem template.md

      
Post-mortem title


Incident date:
Post-mortem date:
Attendees

Summary

High-level overview of the problem.
Incident Description

Incident timeline

Record the timeline of the incident here

08:32: Alert triggered ....
09:41: Service became unavailable ....

Impact

Quantify the impact, e.g. the number of users who couldn't log in, or that
payments couldn't be taken for 4 hours.
Contributing factors

An explanation of the circumstances in which this incident
happened. What do we think caused it? It’s often helpful to use a technique
such as the 5 Whys to understand the contributing factors.
Trigger

What triggered the incident? Did a specific event set it off, such as a sudden
influx of traffic?
Response

Detection

How were we alerted to the problem? Did the right person (or team) detect the
issue / get alerted to the issue? If not, why not? How long did it take to get
the right response?
Resolution

What was done to restore service / resolve the problem?
Response Improvements

Consider both long-term and short-term fixes.
Detection

How could we have spotted this issue sooner? Consider alerting, metrics,
access to experts, escalations.
Resolution

What could be done to speed up recovery next time? Consider development
processes, available metrics, system feedback. Were the right people available?
Mitigation Improvements

How do we prevent this issue from occurring in future? Consider
system design, testing, chaos engineering, failure domains.
Actions


| Item | Action | Owner | Priority | JIRA ref |
|------|--------|-------|----------|----------|
| 1    |        |       |          |          |
| 2    |        |       |          |          |
| 3    |        |       |          |          |

Context on actions

Use this section to describe the actions in more detail if needed.