- Date: 17/09/2018
- Author: Filipe Mendes
- Team: Android Team
- Severity: Minor/Moderate/Major/Critical
- Status: Work in Progress / Complete
Summary
Short description (about 5 sentences). List the duration along with the start and end times. State the impact (e.g. most user requests resulted in 500 errors, 100% at peak) and the main root cause.
Root cause
Go as deep as you can to understand what needs to be improved. Do not sugarcoat.
Trigger
What triggered the incident.
Detection
How the incident was detected. If it was detected by multiple sources (monitoring system, clients, customer support, accidental discovery), list them all.
Impact
How many requests failed, and how many users and companies were affected. If unsure, estimate based on historical data.
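When exact counts are unavailable, one rough way to estimate failed requests is baseline request rate × incident duration × error ratio. The Python sketch below only illustrates that arithmetic. The function name, the requests-per-minute baseline, and every number in it are hypothetical, so replace them with figures from your own historical data.

```python
from datetime import datetime


def estimate_failed_requests(start, end, baseline_rpm, error_ratio):
    """Estimate failed requests as duration (minutes) x baseline rate x error ratio.

    baseline_rpm is an assumed historical average of requests per minute
    for a comparable time window; error_ratio is the fraction that failed.
    """
    if end <= start:
        raise ValueError("end must be after start")
    if not 0 <= error_ratio <= 1:
        raise ValueError("error_ratio must be between 0 and 1")
    minutes = (end - start).total_seconds() / 60
    return round(minutes * baseline_rpm * error_ratio)


# Hypothetical incident: 45 minutes, 1200 requests/min baseline, half failing.
start = datetime(2018, 9, 17, 10, 0)
end = datetime(2018, 9, 17, 10, 45)
failed = estimate_failed_requests(start, end, baseline_rpm=1200, error_ratio=0.5)
print(failed)  # 27000
```

State in the postmortem that the figure is an estimate and which baseline period it was derived from.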
Resolution
How the problem was solved. If an RFC was created, reference it here.
Timeline
Date | Event |
---|---|
1st January | Describe what happened... |
2nd February | Describe what action was taken... |
Lessons learned
- What went well
- What went wrong
- Where we got lucky
Other notes
Add anything else you feel is important about the incident, e.g. screenshots of relevant graphs or links to related resources.
Minor: an issue not visible to most customers, or one that does not disturb customers' regular workflow. Example: response times of some APIs have doubled.
Moderate: low-importance functionality is unavailable for all or most customers, or important functionality doesn't work for several companies. Example: some users can't add an activity.
Major: high-importance functionality is unavailable for all or most users, or the app doesn't work for several companies. Examples: search doesn't work, login takes too long, user 123 can't log in.
Critical: critical functionality is unavailable for all or most users, the issue is easy to reproduce, or there is data loss. Examples: deals don't load, users can't log in, users can't see the homepage.