High-level overview of the problem.
Record the timeline of the incident here
- 08:32: Alert triggered ....
- 09:41: Service became unavailable ....
Describe the impact, for example: the number of users who couldn't log in, payments that couldn't be taken for 4 hours, etc.
An explanation of the circumstances in which this incident happened. What do we think caused it? It’s often helpful to use a technique such as the 5 Whys to understand the contributing factors.
What caused the incident? Did something trigger it, such as a sudden influx of traffic?
How were we alerted to the problem? Did the right person (or team) detect the issue / get alerted to the issue? If not, why not? How long did it take to get the right response?
What was done to restore service or resolve the problem?
Consider long-term and short-term fixes.
How could we have spotted this issue sooner? Consider alerting, metrics, access to experts, and escalations.
What could be done to speed up recovery next time? Consider development processes, available metrics, and system feedback. Were the right people available?
How do we prevent this issue from occurring in future? Consider system design, testing, chaos engineering, failure domains.
Context on actions
Use this section to describe the actions in more detail if needed.