🎥 Video walkthrough of emergency handling
`[VIDEO IN TOOL LIKE LOOM]`
- Grafana
- Exception tracker
- CDN
- [DATABASE MONITORING TOOLS]
- Logs
- Status pages
- Documentation
Steps to diagnose an incident:


- Communicate
  1. Post every action you take related to the emergency
  2. All communication should be in #engineering-emergency
  3. Acknowledge the incident in #feedback
- Investigate the issue
- Fix the issue
- Monitor if fix worked
- Do postmortem
- Implement improvements
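The "post every action" step can be supported by a small timeline helper. The following is a minimal sketch, not part of any existing tooling: `ActionLog`, `record`, and `timeline` are hypothetical names, and it assumes entries are pasted into the emergency channel by hand.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActionLog:
    """In-memory timeline of actions taken during an incident (hypothetical helper)."""

    entries: list = field(default_factory=list)

    def record(self, author, action, when=None):
        # Normalize to UTC so entries from people in different time zones line up.
        when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
        line = f"[{when.strftime('%Y-%m-%d %H:%M')} UTC] {author}: {action}"
        self.entries.append(line)
        return line  # paste this line into the emergency channel

    def timeline(self):
        # Entries in the order they were recorded, ready to copy into the postmortem.
        return "\n".join(self.entries)
```

Calling `record` for each action and `timeline()` at the end gives a ready-made sequence of events for the postmortem.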
- Revert deploys till the last working deploy
- In monitoring tools
  - expand to a 48-hour period and look for spikes
  - watch time, CPU load, or memory
- Check all recent changes; don’t ignore any of them
  - focus first on database, dependency, or infrastructure changes
- Have a theory about every unusual behavior of the system and test it; together, your theories should explain every one
- Isolate the issue to the lowest point in the tech stack
- No need to open a pull request for a hotfix; you can merge directly into master
[ADD PROJECT SPECIFIC TIPS]
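The "look for spikes" tip can also be applied to exported metrics. This is a rough sketch only: it assumes the samples (for example 48 hours of CPU load) have already been exported as `(timestamp, value)` pairs, and both the `find_spikes` name and the factor of 3 over the median are illustrative assumptions rather than recommended thresholds.

```python
from statistics import median


def find_spikes(samples, factor=3.0):
    """Return the (timestamp, value) pairs whose value exceeds factor * median.

    samples: list of (timestamp, value) pairs covering the window to inspect.
    factor: how far above the median a value must be to count as a spike
            (3.0 is an arbitrary starting point).
    """
    if not samples:
        return []
    # The median is less distorted by the spikes themselves than the mean.
    baseline = median(value for _, value in samples)
    threshold = baseline * factor
    return [(ts, value) for ts, value in samples if value > threshold]


# Usage: 48 hourly samples at 20% CPU with one jump to 95% at hour 40.
samples = [(hour, 20.0) for hour in range(48)]
samples[40] = (40, 95.0)
print(find_spikes(samples))  # [(40, 95.0)]
```

Comparing spike timestamps against the times of recent changes is one way to narrow down which change to investigate first.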
This list is not exhaustive. It is a shortlist of symptoms of a bad deployment, with ways to investigate and possibly fix them.
My changes are not appearing after deploy
`[PROJECT SPECIFIC REASON AND HOW TO HANDLE]`
Database is under heavy load
`[PROJECT SPECIFIC REASON AND HOW TO HANDLE]`
Site performance is reduced or requests result in 503s
`[PROJECT SPECIFIC REASON AND HOW TO HANDLE]`
When in doubt - roll back
`[HOW TO ROLLBACK DEPLOY]`
[PROJECT SPECIFIC ISSUES]
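Choosing which deploy to roll back to can be sketched as follows. This is only an illustration under assumptions: the `Deploy` record, its fields, and the `healthy` flag are hypothetical, and the actual rollback procedure is whatever the project documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deploy:
    version: str      # e.g. a git SHA or release tag (hypothetical field)
    deployed_at: int  # Unix timestamp of the deploy
    healthy: bool     # whether the deploy was confirmed working


def rollback_target(history):
    """Return the most recent deploy marked healthy, or None if there is none."""
    healthy = [d for d in history if d.healthy]
    return max(healthy, key=lambda d: d.deployed_at, default=None)


# Usage: the newest deploy is broken, so the target is the one before it.
history = [
    Deploy("a1", 100, True),
    Deploy("b2", 200, True),
    Deploy("c3", 300, False),
]
print(rollback_target(history).version)  # b2
```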
How to restart services
`[LINK TO DOCUMENTATION]`
How to revert deploy
`[LINK TO DOCUMENTATION]`
How to scale number of servers
`[LINK TO DOCUMENTATION]`
How to fast track emergency fixes
`[LINK TO DOCUMENTATION]`
[PROJECT SPECIFIC HOW-TOS]
- Write a postmortem; use this template.
- Share in the #engineering Slack channel
- Add to the agenda of the next engineering meeting