Incident Commander: [Name of IC]
[A quick one- or two-sentence description of the issue from a high level.]
[The duration of user impact. This excludes any time between restoration of service and the formal end of the incident.]
[Describe the impact to end users, such as which features or services were unavailable.]
[A timeline of events from the start of the incident until the Incident Commander declares it over. All times should be in UTC.]
- 01:23 - Incident start.
- 23:34 - Incident concluded.
[The most direct trigger of the incident. Often this will be a human error, such as a code bug that was missed in review or an operational mistake. Our process is blameless, and we document our mistakes so that we can learn from them. But try not to turn this into a personal callout, even of yourself.]
[Root causes are the deeper underlying problems that led to the incident. For example, if the proximate trigger was a bug missed in review, then a root cause might be missing static analysis tooling in CI that could have caught it, or missing code review guidelines. Root causes should never be human error; they are systemic issues that create the conditions for human error to become a problem. Root causes can also be slippery: you can trace the chain of events back infinitely far if you try hard enough, but putting "Root Cause: 13 billion years ago a singularity expanded into the Big Bang" is not productive. Look for root causes that help guide our future path rather than documenting every contributing factor for its own sake.]
[How was this problem noticed? User reports, automated alerts, etc.]
[What steps were taken to resolve the incident. Try to be specific, such as linking to a commit/PR for code fixes or listing the commands used for an interactive fix, as these can help guide future improvements.]
[Any notes about things that went well in our process during the handling of the incident.]
[Like the above, but for things that went poorly. This covers only the process and handling; the incident itself probably went poorly too, but that is discussed above.]
[Any places where things went well more by happenstance than by design, such as one bug cancelling out another or an issue being noticed early, before automated detection warned us.]
[Tasks we should take away from this incident to prevent it from recurring or to improve our handling of similar incidents in the future. Action items can be divided into the five categories listed below. Not every category will necessarily have an action item; the categories are only there to help guide your thinking.]
[Ways to improve the detection of problems.]
[Ways to get eyeballs on the incident faster.]
[Ways to limit the damage of incidents.]
[Ways to reduce the chances of the proximate triggers of this incident.]
[Ways to solve the root causes of this incident so as to make a recurrence structurally impossible.]
All times in UTC.