Last active March 3, 2020 16:20


Title

The title should be the name of the alert (e.g., Generic Alert_AlertTooGeneric).


Overview

Address the following: What does this alert mean? Is it a paging or an email-only alert? What factors contributed to the alert? What parts of the service are affected? What other alerts accompany this alert? Who should be notified?

Alert Severity

Indicate the reason for the severity (email or paging) of the alert and the impact of the alerted condition on the system or service.


Verification

Provide specific instructions on how to verify that the condition is ongoing.


Troubleshooting

List and describe debugging techniques and related information sources. Include links to relevant dashboards. Include warnings. Address the following: What shows up in the logs when this alert fires? What debug handlers are available? What are some useful scripts or commands? What sort of output do they generate? What additional tasks need to be done after the alert is resolved?


Solution

List and describe possible solutions for addressing this alert. Address the following: How do I fix the problem and stop this alert? What commands should be run to reset things? Who should be contacted if this alert happened due to user behavior? Who has expertise in debugging this issue?


Escalation

List and describe paths of escalation. Identify whom to notify (person or team) and when. If there is no need to escalate, indicate that.

Related Links

Provide links to relevant related alerts, procedures, and overview documentation.

Service overview


Overview

What is the service? What does it do? Describe at a high level the functionality provided to clients (end users, components, etc.).


Architecture

Explain how the architecture works. Describe the data flows between components. Consider adding a system diagram that shows critical dependencies and the request and data flows.

Clients and Dependencies

List any upstream clients (owned by other teams) that rely on this service and any downstream services (owned by other teams) that it relies on. (These can also be shown in the system diagram.)

Code and Configs

Explain the production setup. Where does it run? List binary names, jobs, data centers, and config file setup, or point to canonical location of these. Also provide code location and build info if relevant.

List and describe the configuration files, changes, and ports needed to operate this product or service.

Address the following: What configuration files have been modified for this product or service? How is the configuration handled?


Processes

Address the following: What daemons and other processes must be running to carry out the service? What control scripts were created to manage this service?


Output

List and describe the log files created by or within the component and the monitoring running against it. Address the following: What log files are generated by the component? What does each file contain? What recommendations do you have for examining these log files? What aspects of the component must be monitored to ensure reliable service?

Dashboards and Tools

Link to the relevant dashboards and tools.


Capacity

List the capacity of a single instance, per data center (DC), and globally: QPS, bandwidth, and latency numbers.
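
As an illustration of how the three levels relate, the sketch below rolls a per-instance QPS figure up to per-DC and global capacity. The `DataCenter` type, the function names, and all numbers are hypothetical placeholders, not values from any real service; it also assumes capacity scales linearly with instance count, which should be confirmed by load testing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCenter:
    """Hypothetical description of one data center's deployment."""
    name: str
    instances: int
    qps_per_instance: float  # measured capacity of a single instance


def dc_capacity(dc: DataCenter) -> float:
    """Per-DC capacity, assuming linear scaling across instances."""
    return dc.instances * dc.qps_per_instance


def global_capacity(dcs: List[DataCenter]) -> float:
    """Global capacity as the sum of all per-DC capacities."""
    return sum(dc_capacity(dc) for dc in dcs)


def capacity_without_largest(dcs: List[DataCenter]) -> float:
    """Capacity remaining if the largest data center is lost."""
    if not dcs:
        return 0.0
    return global_capacity(dcs) - max(dc_capacity(dc) for dc in dcs)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    fleet = [
        DataCenter("dc-a", instances=10, qps_per_instance=500.0),
        DataCenter("dc-b", instances=4, qps_per_instance=500.0),
    ]
    for dc in fleet:
        print(f"{dc.name}: {dc_capacity(dc):.0f} QPS")
    print(f"global: {global_capacity(fleet):.0f} QPS")
    print(f"without largest DC: {capacity_without_largest(fleet):.0f} QPS")
```

Recording the capacity that remains after losing the largest DC alongside the global figure makes it easier to judge whether the service can absorb a single-DC outage.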


Availability

Give availability targets.
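
An availability target is easier to reason about when it is stated together with the downtime it permits. The sketch below converts a target fraction into minutes of allowed full outage over a window; the function name and the 30-day default window are assumptions chosen for illustration, so substitute the window your team actually uses.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full outage permitted by an availability target.

    target is a fraction such as 0.999; window_days is the measurement
    window (30 days is an assumed default, not a standard).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days allows {minutes:.1f} minutes of downtime")
```

For example, a 99.9% target over a 30-day window leaves roughly 43 minutes of allowed downtime, since 0.1% of 43,200 minutes is 43.2.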

Common Procedures

Add links to procedures. These could include load testing, updates/pushes/flag flips, etc. Link to alert documentation in the alerts playbook.


References

Link to design docs for the component or related components, typically written by developer teams, and to other related information.
