@c4milo
Last active March 3, 2020 16:20

Title

The title should be the name of the alert (e.g., Generic Alert_AlertTooGeneric).

Overview

Address the following: What does this alert mean? Is it a paging or an email-only alert? What factors contributed to the alert? What parts of the service are affected? What other alerts accompany this alert? Who should be notified?

Alert Severity

Indicate the reason for the severity (email or paging) of the alert and the impact of the alerted condition on the system or service.

Verification

Provide specific instructions on how to verify that the condition is ongoing.

Troubleshooting

List and describe debugging techniques and related information sources. Include links to relevant dashboards. Include warnings. Address the following: What shows up in the logs when this alert fires? What debug handlers are available? What are some useful scripts or commands? What sort of output do they generate? What are some additional tasks that need to be done after the alert is resolved?

Solution

List and describe possible solutions for addressing this alert. Address the following: How do I fix the problem and stop this alert? What commands should be run to reset things? Who should be contacted if this alert happened due to user behavior? Who has expertise at debugging this issue?

Escalation

List and describe paths of escalation. Identify whom to notify (person or team) and when. If there is no need to escalate, indicate that.

Related Links

Provide links to relevant related alerts, procedures, and overview documentation.

Service overview

Overview

What is the service? What does it do? Describe at a high level the functionality it provides to clients (end users, components, etc.).

Architecture

Explain how the architecture works. Describe the data flows between components. Consider adding a system diagram with critical dependencies, and request and data flows.

Clients and Dependencies

List any upstream clients (owned by other teams) that rely on this service and any downstream services (owned by other teams) that it relies on. (These can also be shown in the system diagram.)

Code and Configs

Explain the production setup. Where does it run? List binary names, jobs, data centers, and config file setup, or point to the canonical location of these. Also provide the code location and build info if relevant.

List and describe the configuration files, changes, and ports needed to operate this product or service.

Address the following: What configuration files have been modified for this product or service? How is the configuration handled?

Processes

Address the following: What daemons and other processes must be running to carry out the service? What control scripts were created to manage this service?

Output

List and describe the log files created by or within the component and the monitoring running against it. Address the following: What log files are generated by the component? What does each file contain? What recommendations do you have for examining these log files? What aspects of the component must be monitored to ensure reliable service?

Dashboards and Tools

Link to the relevant dashboards and tools.

Capacity

List the capacity of a single instance, of each data center (DC), and globally. Include QPS, bandwidth, and latency numbers.
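
As an illustration of how a per-instance figure can roll up into per-DC and global figures, here is a minimal Python sketch. The data center names, instance counts, and capacity numbers are hypothetical placeholders, and the linear scaling is an assumption that ignores load-balancing overhead and shared bottlenecks. Latency is left out because it does not add up across instances; record it as measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    """Throughput figures for one instance, one DC, or the whole service."""
    qps: float
    bandwidth_mbps: float


def roll_up(per_instance: Capacity, instances_per_dc: dict[str, int]) -> dict[str, Capacity]:
    """Scale per-instance capacity to each DC and to a global total.

    Assumes capacity grows linearly with instance count, which is an
    approximation; replace with measured numbers where available.
    """
    table = {
        dc: Capacity(per_instance.qps * count, per_instance.bandwidth_mbps * count)
        for dc, count in instances_per_dc.items()
    }
    total = sum(instances_per_dc.values())
    table["global"] = Capacity(per_instance.qps * total, per_instance.bandwidth_mbps * total)
    return table


if __name__ == "__main__":
    # Hypothetical inputs: 500 QPS and 100 Mbps per instance, two DCs.
    for scope, cap in roll_up(Capacity(500.0, 100.0), {"dc-a": 4, "dc-b": 6}).items():
        print(f"{scope}: {cap.qps:.0f} QPS, {cap.bandwidth_mbps:.0f} Mbps")
```

A table produced this way is only a starting estimate; load-test results, where they exist, should take precedence.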

SLA

Give availability targets.
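
It can help to state each availability target alongside the downtime it allows. The following Python sketch shows the arithmetic; the 30-day window and the example targets are assumptions chosen for illustration, not values prescribed by this template.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that a target allows over a window.

    availability_pct is a percentage such as 99.9. The 30-day default
    window is an assumption; use whatever window the SLA defines.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Hypothetical targets for illustration.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows {budget:.1f} minutes of downtime")
```

For example, a 99.9% target over a 30-day window allows about 43.2 minutes of downtime.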

Common Procedures

Add links to procedures. These could include load testing, updates/pushes/flag flips, etc. Link to alert documentation in the alerts playbook.

References

Link to design docs on the component or related components, typically written by developer teams, and other related information.
