A quick description of the service, one to two sentences max. Why does this service matter? What is its core functionality? What features does it provide users?
Failure Mode and Effect Analysis
FMEA is a method of failure analysis that helps teams create reliable systems and develop comprehensive on-call response patterns.
|Service||Failure Mode||Possible Cause||Effects||Probability (P)||Severity (S)||Detection (D)||Risk|
|DockerHub||Outage / Unreachable||DDoS attack on DockerHub||Cannot update or deploy extractor||remote (B)||no effect (I)||high||low|
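One way to keep the Risk column consistent across rows is to derive it from the other three ratings. The sketch below is only an illustration: the numeric scales (A to E for probability, I to V for severity, high/medium/low for detection) and the `low`/`moderate`/`high` thresholds are assumptions, not something this template defines, so substitute whatever scale your team has agreed on.

```python
# Hypothetical ordinal scales; replace them with your team's agreed ratings.
PROBABILITY = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}   # A = least likely
SEVERITY = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}  # I = no effect
DETECTION = {"high": 1, "medium": 2, "low": 3}           # easy to detect scores lowest


def risk_level(probability: str, severity: str, detection: str) -> str:
    """Combine the three ratings into a coarse risk label.

    The score is the product of the three ordinal values (range 1 to 75).
    The thresholds below are illustrative assumptions.
    """
    score = PROBABILITY[probability] * SEVERITY[severity] * DETECTION[detection]
    if score <= 6:
        return "low"
    if score <= 20:
        return "moderate"
    return "high"


# The DockerHub row above: remote (B), no effect (I), high detection.
print(risk_level("B", "I", "high"))  # prints "low"
```

Under these assumed scales the DockerHub row scores 2 × 1 × 1 = 2, which agrees with the "low" risk recorded in the table.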
Production Outage Scenarios
Links to the Dashboards for this service
Links to the Alerts for this service
For every alert, there should be a corresponding section below, listed in alphabetical order.
Alert Description: Why do we have this alert? What does it mean? What is typically the cause of this alert?
Impact to Customers:
How does this situation impact our customers? If the customers are not being impacted, this is a good indicator that the alert can be deleted.
Checklist Manifesto-style steps for resolving this alert. A person who has never worked on our stack should be able to follow these steps and remediate the incident. If the incident cannot be remediated, include escalation steps here.
- Do this
- Check this scenario
- Do this thing
- Do this other thing
- Verify service has recovered
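The final "verify service has recovered" step is easier to follow when it has an explicit pass/fail rule. The sketch below is one possible approach, not part of this template: it polls a check until it passes several times in a row, so that a single lucky response from a flapping service does not count as recovery. The function names, the default thresholds, and the idea of an HTTP health endpoint are all assumptions.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status (hypothetical health check)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def wait_for_recovery(
    check: Callable[[], bool],
    required_successes: int = 3,
    interval_s: float = 10.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it passes `required_successes` times in a row.

    Returns False if that never happens within `max_attempts` polls,
    which is the signal to follow the escalation steps.
    """
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # any failure resets the streak
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False
```

A responder could run it as `wait_for_recovery(lambda: http_ok(HEALTH_URL))`, where `HEALTH_URL` is a placeholder for the service's own health endpoint, and escalate if it returns `False`.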