A quick description of the service, 1 to 2 sentences max. Why does this service matter? What is its core functionality? What features does it provide to users?
Failure Mode and Effects Analysis (FMEA) is a method of failure analysis that helps teams build reliable systems and develop comprehensive on-call response patterns.
Service | Failure Mode | Possible Cause | Effects | Probability (P) | Severity (S) | Detection (D) | Risk |
---|---|---|---|---|---|---|---|
DockerHub | Outage / Unreachable | DDoS attack on DockerHub | Cannot update or deploy extractor | remote (B) | no effect (I) | high | low |
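
To keep the Risk column consistent across rows, the rating can be derived from P, S, and D rather than assigned by feel. The following is a minimal sketch, not a prescribed method: the numeric weights, the `low`/`moderate`/`high` thresholds, and the `FailureMode` class are all assumptions made for illustration. Only the labels `B`, `I`, and `high` come from the DockerHub row above.

```python
from dataclasses import dataclass

# Hypothetical ordinal weights. The letter and numeral labels mirror the
# style used in the table, but the numbers themselves are assumptions.
PROBABILITY = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
SEVERITY = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}
# Better detection lowers risk, so "high" detection gets the smallest weight.
DETECTION = {"high": 1, "medium": 2, "low": 3}


@dataclass
class FailureMode:
    service: str
    failure_mode: str
    probability: str  # key into PROBABILITY
    severity: str     # key into SEVERITY
    detection: str    # key into DETECTION

    def score(self) -> int:
        """Multiply the three weights into a single priority number."""
        return (
            PROBABILITY[self.probability]
            * SEVERITY[self.severity]
            * DETECTION[self.detection]
        )

    def risk(self) -> str:
        """Bucket the score; the thresholds are illustrative assumptions."""
        s = self.score()
        if s <= 8:
            return "low"
        if s <= 27:
            return "moderate"
        return "high"


dockerhub = FailureMode("DockerHub", "Outage / Unreachable", "B", "I", "high")
print(dockerhub.score(), dockerhub.risk())
```

Under these assumed weights the DockerHub row scores 2 and lands in the `low` bucket, which agrees with the Risk value recorded in the table. Teams that adopt a scheme like this should agree on the scales and thresholds before relying on the output.
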
Links to the Dashboards for this service
Links to the Alerts for this service
Every alert should have a corresponding section, listed in alphabetical order.
Alert Description: Why do we have this alert? What does it mean? What is typically the cause of this alert?
How does this situation impact our customers? If customers are not impacted, that is a good indicator that the alert can be deleted.
Steps in the style of The Checklist Manifesto for resolving this alert. A person who has never worked on our stack should be able to follow these steps and remediate the incident. If the incident cannot be remediated, include escalation steps here.
- Do this
- Check this scenario
- Do this thing
- Do this other thing
- Verify service has recovered