dmitri-lerko/postmortem-actions.md

## postmortem-actions.md

      
Investigate this incident: what happened to cause this incident and why? Determining the root causes is your ultimate goal.
Examples: log analysis, diagramming the request path, reviewing heap dumps
Mitigate this incident: what immediate actions can we take to resolve and manage this specific event?
Examples: rolling back, cherry-picking, pushing configs, communicating with affected users
Repair damage from this incident: how can we resolve immediate or collateral damage from this incident?
Examples: restoring data, fixing machines, removing traffic re-routes
Detect future incidents: how can we decrease the time to accurately detect a similar failure?
Examples: monitoring, alerting, plausibility checks on input/output
Mitigate future incidents: how can we decrease the severity and/or duration of future incidents like this? How can we reduce the percentage of users affected by this class of failure next time it happens?
Examples: graceful degradation, dropping non-critical results; failing open; augmenting current practices with dashboards, playbooks, incident management protocols, and/or war rooms
Prevent (mandatory) future incidents: how can we prevent a recurrence of this sort of failure?
Examples: stability improvements in the code base, more thorough unit tests, input validation and robustness to error conditions, provisioning changes
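
Two of the examples above, a plausibility check on input/output and graceful degradation by dropping a non-critical result, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `plausible_latency_ms` and `recommendations_or_default` and the 60-second upper bound are hypothetical and do not come from any particular system.

```python
import math


def plausible_latency_ms(value, upper_bound_ms=60_000):
    """Plausibility check: accept only finite, non-negative numbers below a bound.

    The bound is an assumed value; choose one that fits the service.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(value) and 0 <= value <= upper_bound_ms


def recommendations_or_default(fetch, default=()):
    """Graceful degradation: if the non-critical call fails, return a default
    instead of failing the whole request."""
    try:
        return list(fetch())
    except Exception:
        # A real service would also log the error and increment a metric here,
        # so that the degraded mode is visible to monitoring.
        return list(default)
```

A rejected reading from the first function is a natural place to emit an alert, and the fallback path in the second is worth counting so that the degraded mode does not go unnoticed.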