RStankov/emergency_kit.md

## emergency_kit.md

      
    ⛑️ Emergency Kit


  🎥 Video walkthrough of emergency handling
  `[VIDEO IN TOOL LIKE LOOM]`

Tools


GRAFANA
Exception tracker
CDN
[DATABASE MONITORING TOOLS]
LOGS
Status pages

AWS
Sendgrid


Documentation

System architecture


Process

Steps to handle an incident:


Communicate
1. Keep all communication in #engineering-emergency
2. Post every action you take related to the emergency
3. Acknowledge the incident in #feedback
Investigate the issue
Fix the issue
Monitor whether the fix worked
Do a postmortem

Implement improvements


Tips for handling issues


Revert deploys one by one until you reach the last working deploy
In monitoring tools

expand the time range to a 48-hour period and look for spikes
watch response time, CPU load, and memory usage


Check all recent changes; don’t ignore any of them

focus first on database, dependency, or infrastructure changes


Have a theory about every unusual behavior of the system and test it. Your theories should explain every unusual behavior you see
Isolate the issue to the lowest point in the tech stack
No need to open a pull request for a hotfix; you can merge directly into master

[ADD PROJECT SPECIFIC TIPS]
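
The tip about expanding monitoring to a 48-hour window and looking for spikes can be sketched in code. This is a minimal illustration, not a replacement for your monitoring tool: the function name `find_spikes`, the `threshold` multiplier, and the sample data are all assumptions made up for the example. It flags any sample that exceeds a multiple of the window's median.

```python
from statistics import median


def find_spikes(samples, threshold=3.0):
    """Return (timestamp, value) pairs above threshold x the window median.

    `samples` is a list of (timestamp, value) pairs covering the window
    being inspected, for example the last 48 hours of CPU load.
    `threshold` is an arbitrary multiplier chosen for this sketch.
    """
    values = [value for _, value in samples]
    if not values:
        return []
    baseline = median(values)
    if baseline <= 0:
        # With a zero baseline, treat any positive value as a spike.
        return [(ts, v) for ts, v in samples if v > 0]
    return [(ts, v) for ts, v in samples if v > threshold * baseline]


# Hypothetical hourly CPU load (percent) over 48 hours, jumping at hour 40.
cpu_load = [(hour, 20.0) for hour in range(48)]
cpu_load[40] = (40, 95.0)
cpu_load[41] = (41, 90.0)

spikes = find_spikes(cpu_load)
print(spikes)  # [(40, 95.0), (41, 90.0)]
```

The timestamps of the flagged samples are the place to start when comparing against recent deploys and other changes.
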
Common Issues and Solutions

This list is not exhaustive. It is a shortlist of ways to investigate the symptoms of a bad deployment, along with possible fixes.

  My changes are not appearing after deploy
  `[PROJECT SPECIFIC REASON AND HOW TO HANDLE]`


  Database is under heavy load
  `[PROJECT SPECIFIC REASON AND HOW TO HANDLE]`


  Site performance is degraded or requests are returning 503s
  `[PROJECT SPECIFIC REASON AND HOW TO HANDLE]`


  When in doubt, roll back
  `[HOW TO ROLLBACK DEPLOY]`

[PROJECT SPECIFIC ISSUES]
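
As a rough sketch of the "when in doubt, roll back" advice, the snippet below walks back through recent deploys and picks the newest one that passes a health check. Everything here is hypothetical: `last_working_deploy`, the `is_healthy` callback, and the deploy identifiers stand in for whatever your project actually uses, and the rollback command itself is project specific.

```python
def last_working_deploy(deploys, is_healthy):
    """Walk back from the newest deploy and return the first healthy one.

    `deploys` is ordered oldest to newest. `is_healthy` is a hypothetical
    callback standing in for your project's own check (smoke test,
    error rate, health endpoint). Returns None if no deploy passes.
    """
    for deploy in reversed(deploys):
        if is_healthy(deploy):
            return deploy
    return None


# Hypothetical deploy identifiers, oldest first.
deploys = ["v101", "v102", "v103", "v104"]
known_bad = {"v103", "v104"}

target = last_working_deploy(deploys, lambda d: d not in known_bad)
print(target)  # v102
```

If the function returns `None`, none of the listed deploys passed the check, which suggests the cause lies outside the deployed code, for example in the database or infrastructure.
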
How-tos


  How to restart services
  `[LINK TO DOCUMENTATION]`


  How to revert deploy
  `[LINK TO DOCUMENTATION]`


  How to scale number of servers
  `[LINK TO DOCUMENTATION]`


  How to fast track emergency fixes
  `[LINK TO DOCUMENTATION]`

[PROJECT SPECIFIC HOW-TOS]

Postmortem


Write a postmortem using this template.
Share it in the #engineering Slack channel
Add it to the agenda of the next engineering meeting