@dmitri-lerko
Created August 28, 2019 13:13
Google SRE's Postmortem actions
  • Investigate this incident: what happened to cause this incident and why? Determining the root causes is your ultimate goal. Examples: log analysis, diagramming the request path, reviewing heap dumps
  • Mitigate this incident: what immediate actions can we take to resolve and manage this specific event? Examples: rolling back, cherry-picking, pushing configs, communicating with affected users
  • Repair damage from this incident: how can we resolve immediate or collateral damage from this incident? Examples: restoring data, fixing machines, removing traffic re-routes
  • Detect future incidents: how can we decrease the time to accurately detect a similar failure? Examples: monitoring, alerting, plausibility checks on input/output
  • Mitigate future incidents: how can we decrease the severity and/or duration of future incidents like this? How can we reduce the percentage of users affected by this class of failure next time it happens? Examples: graceful degradation; dropping non-critical results; failing open; augmenting current practices with dashboards, playbooks, incident management protocols, and/or war rooms
  • Prevent future incidents (mandatory): how can we prevent a recurrence of this sort of failure? Examples: stability improvements in the code base, more thorough unit tests, input validation and robustness to error conditions, provisioning changes
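
One way to make these categories actionable is to tag every postmortem action item with its category and check that the mandatory one is covered. The following is a minimal sketch, not part of any Google tooling: `ActionType`, `ActionItem`, `MANDATORY_TYPES`, and `missing_mandatory_types` are hypothetical names, and treating only "prevent" as mandatory is an assumption drawn from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    """The six action categories from the list above."""
    INVESTIGATE = "investigate"
    MITIGATE = "mitigate"
    REPAIR = "repair"
    DETECT = "detect"
    MITIGATE_FUTURE = "mitigate_future"
    PREVENT = "prevent"


@dataclass
class ActionItem:
    """A single postmortem action item (hypothetical structure)."""
    type: ActionType
    description: str
    owner: str


# Assumption: only the "prevent" category is mandatory, as marked in the list.
MANDATORY_TYPES = {ActionType.PREVENT}


def missing_mandatory_types(items):
    """Return the mandatory categories that no action item covers."""
    present = {item.type for item in items}
    return MANDATORY_TYPES - present
```

A postmortem review could call `missing_mandatory_types` on its action list and refuse to close the document while the returned set is non-empty.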