Incident Response

Objectives

The objective of any incident response is to resolve the impact as quickly as possible (responsibly). This incident response framework provides process and guidelines to achieve maximum availability for our Customers while taking into consideration the wellbeing of our engineers and stakeholders.

Note: The "Customers" are usually end-users in most contexts, but can vary by incident. For instance, if all of our build agents broke and pipelines were completely unavailable, the "Customers" would be internal engineers in the context of that incident.

Suggested Reading

Incident Severity

Incident severity will vary depending on the impact. If the impact is unclear or you're stuck between two severity levels (e.g., "Is this a P2 or a P3?"), use the higher severity level to start the incident. The severity level can be adjusted after remediation.

Most organizations will have five different levels of severity:

SEV-1 / P1 / Major (notification: phone call + text; resolution target: within 30 minutes)
  • Customer unable to log in
  • Customer unable to make purchases
  • Severe system-wide degradation
  • Critical third-party integration outage (e.g., Auth0)
  • Security event (DDoS, exploitation, etc.)

SEV-2 / P2 / Major (notification: phone call + text; resolution target: within 60 minutes)
  • Significant impact to customer experience, but customers are still able to log in and transact
  • Latency much slower than usual
  • Delayed events
  • Impact worsens over time

SEV-3-5 / P3-5 / Minor (notification: text / Slack / email; resolution target: more than 24 hours or next business day)
  • No or minor impact to customer experience
  • Impact will not worsen over time

TLDR:
Major Incidents (SEV-1 and 2) are active issues that result in active loss of income, reputation, etc. An on-call engineer should always be automatically called and texted for these issues, regardless of time or day.
Minor Incidents (SEV-3-5) are active issues that do not require immediate response. An on-call engineer will be notified, but not in a way that will wake them from their beauty sleep.
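
The notification and resolution targets above can also be encoded as data so paging behavior stays consistent. The sketch below is a minimal, tool-agnostic illustration; the class, channel names, and values are assumptions, not part of any specific paging product.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical encoding of the severity table above. The class, channel names,
# and values are illustrative only, not tied to any specific paging tool.
@dataclass(frozen=True)
class SeverityPolicy:
    notification_channels: tuple[str, ...]  # how the on-call engineer is paged
    resolution_target: timedelta            # target time to resolution

SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(("phone", "text"), timedelta(minutes=30)),
    "SEV-2": SeverityPolicy(("phone", "text"), timedelta(minutes=60)),
    "SEV-3": SeverityPolicy(("text", "slack", "email"), timedelta(hours=24)),
    "SEV-4": SeverityPolicy(("text", "slack", "email"), timedelta(hours=24)),
    "SEV-5": SeverityPolicy(("text", "slack", "email"), timedelta(hours=24)),
}

def is_major(severity: str) -> bool:
    """Major incidents (SEV-1/SEV-2) always page by phone, regardless of time of day."""
    return severity in ("SEV-1", "SEV-2")
```

A paging integration could look up `SEVERITY_POLICIES[severity]` when an alert fires, so only major incidents wake someone by phone.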

Incident Lifecycle for Technical Responders (A Generalized Playbook)

  1. Identification
    • A monitor or human alerts the on-call engineer to an issue
    • The on-call engineer quickly reviews the alert and confirms impact
    • An incident is created. The on-call engineer automatically becomes the Incident Commander
    • The Incident Commander pages necessary technical resources and stakeholders
    • An "Impact Statement", a brief description of the impact, is shared with all responders, along with a link to a voice bridge if there is major severity.
    • (Optional) The Incident Commander position can be reassigned at this point if needed
    • (Optional) If possible, a Scribe / Internal Liaison should be assigned.
  2. Triage
    • Troubleshoot the platform based on the impact statement.
    • When did the issue start to occur?
    • Did any changes occur before the issue started? (Both human- and machine-initiated changes are always a possibility; always ask.)
    • Who is impacted? Quantify if possible
    • Is the impact intermittent in nature, steady, or continually getting worse?
    • Did any monitors fire? What were they alerting?
    • Are there any known / recurring issues that could be causing the observed impact?
    • Are any of our third-party dependencies (Azure, Auth0, Experian, etc) reporting outages?
    • Isolate the cause of the incident by sharing anomalous logs and metrics in chat. If you're sharing a screenshot, include a link to what you're looking at
  3. Remediation
    • After the cause has been properly isolated, discuss possible solutions with technical responders on the call
    • Pick a solution by weighing several factors: probability of success, technical difficulty, and potential side effects
    • Always ask "How could implementing this fix go wrong?" Avoid solutions that could increase the severity of impact
    • Receive clear and obvious acknowledgement that resources are ready to initiate the fix and perform validation
  4. Validation
    • Perform a functional test manually if possible
    • Share telemetry showing that platform / system metrics have returned to baseline and have held at that level for an appropriate amount of time (a minimal sketch of such a check follows this list)
    • If validation fails, repeat step 3 (Remediation). Otherwise, the incident commander should excuse everyone from the bridge.
  5. Incident Follow up
    • Communicate to stakeholders and customers that the incident is resolved
    • If there was a monitoring gap, create a monitor to close it ASAP
    • Ask how we can prevent this in the future, or reduce the resolution time and severity if it recurs
    • If a permanent fix is still required, or the root cause is not apparent, initiate those workstreams
    • Create, finalize, and share a post-mortem
    • Schedule a retro to review the incident
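
For the validation step above, "returned to baseline and sustained it" can be made concrete. The sketch below is a minimal illustration; the function name, baseline, tolerance, and window are hypothetical, and real validation should come from your actual telemetry source.

```python
from datetime import datetime, timedelta

# Hypothetical helper for the Validation step: confirm a metric has returned to
# baseline and stayed within tolerance for a sustained window. The baseline,
# tolerance, and window values are assumptions for illustration only.
def sustained_return_to_baseline(
    samples: list[tuple[datetime, float]],  # (timestamp, value) pairs, oldest first
    baseline: float,
    tolerance: float,
    window: timedelta = timedelta(minutes=15),
) -> bool:
    if not samples:
        return False
    cutoff = samples[-1][0] - window
    recent = [(ts, value) for ts, value in samples if ts >= cutoff]
    # Require the window to actually be covered by data, not just the latest point.
    if recent[0][0] - cutoff > timedelta(minutes=1):
        return False
    # Every sample in the window must be close to baseline, not just the last one.
    return all(abs(value - baseline) <= tolerance for _, value in recent)
```

Requiring every sample in the window (rather than just the latest point) to sit near baseline avoids declaring victory on a brief dip back to normal.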

Incident Roles

Many companies have formalized incident roles for specific duties and to define chain-of-command. Given our size, and for the sake of simplicity, we have only three roles. These roles are independent of your position / level within the company. For instance, a software engineer might be best suited to perform the Commander position, while the emperor of the entire galaxy might be best suited to perform the scribe duties (and vice versa). The key to success with this model is to work together, stay ego-free, don't take things personally, and focus on the Objective: resolving the impact as quickly as possible.

Incident Commander

An Incident Commander has sufficient technical and leadership attributes to drive an incident to a resolved state. As the name implies, they are commanding the current incident and should be comfortable delegating technical work, investigation, and communications. They should have at least a broad / high-level technical understanding of the platform affected. This role is automatically assigned to the first responder, but they are strongly encouraged to hand off this position if they feel another responder is better suited to handle the incident.

Technical Responder

A technical responder is an engineer with working domain knowledge of the services impacted. They will be delegated tasks to assist with incident resolution.

Scribe / Internal Liaison

Any resource capable of documenting a timeline of decisions and actions performed. They should communicate workstreams and updates to key stakeholders throughout the incident lifecycle.

Incident Etiquette

The key to success is to work together, stay ego-free, don't take things personally, and focus on the Objective: resolving the impact as quickly as possible.

Major outages are high-stress, adrenaline-fueled rollercoasters. During an incident, while we are interfacing with machines, it is important to remember that we are also interfacing with humans. Currently, unlike humans, machines lack emotions, temperament, ego, and cortisol. Friction between humans is inevitable during these high-stress situations, so Incident Etiquette is prescribed to reduce the frequency and severity of human friction during an incident's lifecycle.

Instead of re-inventing the wheel, some of the following excerpts have been adapted from PagerDuty's Call Etiquette:

For Everyone:

  • The incident voice bridge should be reserved for technical communication only.
  • Keep your microphone muted until you have something to say.
  • Identify yourself when you join the call; state your name and the system you are the expert for.
  • Be direct and factual. Avoid common logical fallacies during disagreements.
  • Keep conversations/discussions short and to the point.
  • Bring any concerns to the Incident Commander (IC) on the call.
  • Respect time constraints given by the Incident Commander.
  • Avoid talking over people. Chaos will ensue when there is chatter from multiple parties at once. The Incident Commander should make sure this doesn't happen.

For Technical Responders:

  • Follow all instructions from the Incident Commander - without exception.
  • Do not perform any actions unless the Incident Commander has told you to do so.
  • The Incident Commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them.
  • Once the Incident Commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll.
  • Answer any questions the Incident Commander asks you in a clear and concise way.
    • Answering that you "don't know" something is perfectly acceptable. Do not try to guess.
  • The Incident Commander may ask you to investigate something and get back to them in X minutes. Make sure you are ready with an answer within that time.
    • Answering that you need more time is perfectly acceptable, but you need to give the Incident Commander an estimate of how much time.

For Incident Commanders:

  • Make sure the voice bridge stays productive and forward-moving. Actively stop side-discussions and distractions.
  • Make sure the voice bridge stays technical.
  • Ensure communications to stakeholders outside the voice bridge are delivered by you or the scribe in a timely manner.
  • Squash and deter any finger-pointing or heated arguments
  • Delegate as much as possible
  • You will issue a lot of imperatives; make sure you delegate kindly.
  • Establish and negotiate clear time-constraints for workstreams to avoid the frequent, "Hey X, are you done yet?" questions that only delay and stress responders. Make sure these constraints are communicated to all stakeholders, so they are not left wondering.
  • Hand over your IC role to someone else on the call when appropriate. Do not feel bad for doing this; it is strongly encouraged if you deem they are better suited.
  • Poll for any objections before performing a significant action.
  • Be mindful of people's time. Excuse responders from voice bridges as soon as reasonably possible after remediation.

On-Call

Rotations

With complex systems, multiple teams are often needed to handle the ongoing maintenance and development of their respective domains. While it is good for all engineers to maintain a level of cross-domain knowledge, it is impossible for all engineers to have deep technical knowledge of every domain of the platform. As such, it is appropriate and necessary for each team to have its own on-call rotation. Also, there is power in numbers.

Additionally, with complex systems, redundancy is important for the reliability of the platform in case of failures. Likewise, with on-call rotations, it is important to have a backup / secondary on-call engineer in case the primary on-call engineer doesn't answer their phone.

Primary On-Call Responsibilities:

  • Keep your phone / text / Slack notification settings up to date in OpsGenie
  • Ack P1/P2 alerts within 15 minutes
  • Keep your phone ringer on 24/7 during your rotation, unless secondary coverage is notified and available
  • Keep your laptop with you 24/7 during your rotation, unless secondary coverage is notified and available
  • Coordinate with your secondary ASAP for planned dates/times you will be unavailable so they can provide coverage
  • Coordinate planned PTO with your Rotation Manager

Rotation Manager

  • Ensure your team's on-call notification settings are up to date. Keep this in mind especially when onboarding or offboarding engineers
  • Make adjustments to the schedule as needed. If a primary on-call engineer is on PTO or OOO, make sure coverage is in place
  • If you are a manager/owner of a system, it is recommended that you join all major incidents.
  • Ensure your escalation policies never stop escalating for major-severity alerts until someone acknowledges. You should be on this escalation path
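
As a rough, tool-agnostic sketch of the "never stop escalating" behavior described above: the responder names, interval, and helpers below are hypothetical, and real paging tools (OpsGenie, PagerDuty, etc.) express this declaratively rather than as a loop.

```python
import itertools
import time

# Hypothetical escalation path for major-severity alerts. Values are illustrative.
ESCALATION_PATH = ["primary on-call", "secondary on-call", "rotation manager"]
ESCALATION_INTERVAL = 5 * 60  # seconds to wait for an ack before paging the next person

def page(responder: str) -> None:
    print(f"Paging {responder} by phone + text ...")

def acknowledged() -> bool:
    # Placeholder: a real implementation would check the alert's ack state.
    return False

def escalate_until_acknowledged() -> None:
    # itertools.cycle loops back to the primary, so escalation never stops
    # until someone acknowledges the alert.
    for responder in itertools.cycle(ESCALATION_PATH):
        page(responder)
        time.sleep(ESCALATION_INTERVAL)
        if acknowledged():
            return
```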