@mjpitz
Last active November 2, 2020 15:25
runbook.md

Runbook Template

General

A quick description of the service, 1 to 2 sentences max. Why does this service matter? What is its core functionality? What features does it provide to users?

Failure Mode and Effect Analysis

FMEA is a method of failure analysis that helps teams create reliable systems and develop comprehensive on-call response patterns.

| Service | Failure Mode | Possible Cause | Effects | Probability (P) | Severity (S) | Detection (D) | Risk |
|---|---|---|---|---|---|---|---|
| DockerHub | Outage / Unreachable | DockerHub DDOSd | Cannot update or deploy extractor | remote (B) | no effect (I) | high | low |
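
The Risk column is a judgment derived from the probability, severity, and detection ratings. A minimal sketch of how a team might make that judgment repeatable is shown here. The ordinal ranks, the multiplication rule, and the thresholds are all assumptions made for illustration; they are not part of this template, and the `FailureMode` class is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales (assumptions, not defined by the template).
# The example row uses "remote (B)" and "no effect (I)"; the remaining
# letters and numerals are assumed to extend those scales.
PROBABILITY = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
SEVERITY = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}
# Failures that are easy to detect contribute less risk.
DETECTION = {"high": 1, "moderate": 2, "low": 3}


@dataclass
class FailureMode:
    service: str
    failure_mode: str
    possible_cause: str
    effects: str
    probability: str  # key into PROBABILITY
    severity: str     # key into SEVERITY
    detection: str    # key into DETECTION

    def risk(self) -> str:
        """Bucket the product of the three ranks into a coarse risk label."""
        score = (
            PROBABILITY[self.probability]
            * SEVERITY[self.severity]
            * DETECTION[self.detection]
        )
        # Thresholds are arbitrary illustration values.
        if score <= 6:
            return "low"
        if score <= 20:
            return "moderate"
        return "high"


dockerhub = FailureMode(
    service="DockerHub",
    failure_mode="Outage / Unreachable",
    possible_cause="DockerHub DDOSd",
    effects="Cannot update or deploy extractor",
    probability="B",
    severity="I",
    detection="high",
)
print(dockerhub.risk())
```

Under these assumed scales the DockerHub row scores 2 x 1 x 1 = 2, which lands in the same "low" bucket the table records. A team adopting this approach should replace the scales and thresholds with ones it has agreed on.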

Production Outage Scenarios

Dashboards

Links to the Dashboards for this service

Alerts

Links to the Alerts for this service

Every alert should have a corresponding section, listed in alphabetical order.
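
The one-section-per-alert rule is easy to let drift as alerts are added. The sketch below shows one way it could be checked automatically. It assumes you can already produce two lists: the names of the configured alerts and the section titles in the runbook. How those lists are obtained depends on your tooling and is not shown, and `check_alert_sections` is a hypothetical helper.

```python
def check_alert_sections(alert_names, section_titles):
    """Return a list of human-readable problems; empty means the runbook is consistent."""
    problems = []

    # Every configured alert needs a section with a matching title.
    missing = sorted(set(alert_names) - set(section_titles))
    for name in missing:
        problems.append(f"missing runbook section for alert: {name}")

    # Sections must appear in case-insensitive alphabetical order.
    if list(section_titles) != sorted(section_titles, key=str.casefold):
        problems.append("alert sections are not in alphabetical order")

    return problems


# Example with made-up alert names.
print(check_alert_sections(
    alert_names=["HighLatency", "DiskFull"],
    section_titles=["DiskFull", "HighLatency"],
))
```

Running a check like this in CI would flag an alert that ships without remediation steps before anyone gets paged for it.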

Alert Title

Alert Description: Why do we have this alert? What does it mean? What is typically the cause of this alert?

Impact to Customers:

How does this situation impact our customers? If the customers are not being impacted, this is a good indicator that the alert can be deleted.

Remediation Steps:

Checklist Manifesto-style steps for resolving this alert. A person who has never worked on our stack should be able to follow these steps and remediate the incident. If the incident cannot be remediated, include escalation steps here.

  1. Do this
  2. Check this scenario
  3. Do this thing
  4. Do this other thing
  5. Verify service has recovered
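
The final step, verifying recovery, is often the vaguest one in practice. A minimal sketch of how it could be made concrete follows, assuming the service exposes some health signal that can be wrapped in a function returning `True` when healthy. The `wait_for_recovery` helper, its parameters, and the default values are hypothetical.

```python
import time


def wait_for_recovery(check, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Call check() up to `attempts` times; return True as soon as it reports healthy.

    `check` is any zero-argument callable returning True when the service
    is healthy. `sleep` is injectable so the function can be tested
    without actually waiting.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Do not wait after the final failed attempt.
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Example: a stand-in check that becomes healthy on the third call.
responses = iter([False, False, True])
recovered = wait_for_recovery(lambda: next(responses), delay_seconds=0)
print("recovered" if recovered else "escalate")
```

If the function returns `False`, the responder has a clear signal to move on to the escalation steps rather than continuing to retry.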