Incident Response

Objectives

The objective of any incident response is to resolve the impact as quickly as possible (responsibly). This incident response framework provides process and guidelines to achieve maximum availability for our Customers while taking into consideration the wellbeing of our engineers and stakeholders.

Note: The "Customers" are usually end-users in most contexts, but can vary by incident. For instance, if all of our build agents broke and pipelines were completely unavailable, the "Customers" would be internal engineers in the context of that incident.

Suggested Reading

Incident Severity

Incident severity will vary depending on the impact. If the impact is unclear or you're stuck between two severity levels (e.g., "Is this a P2 or a P3?"), use the higher severity level to start the incident. The severity level can be adjusted after remediation.

Most organizations will have five different levels of severity:

SEV-1 / P1 / Major (notification: phone call + text; resolution target: within 30 minutes)
  • Customer unable to log in
  • Customer unable to make purchases
  • Severe system-wide degradation
  • Critical third-party integration outage (e.g., Auth0)
  • Security event (DDoS, exploitation, etc.)

SEV-2 / P2 / Major (notification: phone call + text; resolution target: within 60 minutes)
  • Significant impact to customer experience, but customers are still able to log in and transact
  • Latency much slower than usual
  • Delayed events
  • Impact worsens over time

SEV-3-5 / P3-5 / Minor (notification: text / Slack / email; resolution target: more than 24 hours or next business day)
  • No or minor impact to customer experience
  • Impact will not worsen over time

TLDR:
Major Incidents (SEV-1 and 2) are active issues that result in active loss of income, reputation, etc. An on-call engineer should always be automatically called and texted for these issues, regardless of time or day.
Minor Incidents (SEV-3-5) are active issues that do not require immediate response. An on-call engineer will be notified, but not in a way that will wake them from their beauty sleep.
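
The notification and resolution targets above can also be encoded as data so paging behavior stays consistent. The sketch below is a minimal, tool-agnostic illustration; the class, channel names, and values are assumptions, not part of any specific paging product.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical encoding of the severity table above. The class, channel names,
# and values are illustrative only, not tied to any specific paging tool.
@dataclass(frozen=True)
class SeverityPolicy:
    notification_channels: tuple[str, ...]  # how the on-call engineer is paged
    resolution_target: timedelta            # target time to resolution

SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(("phone", "text"), timedelta(minutes=30)),
    "SEV-2": SeverityPolicy(("phone", "text"), timedelta(minutes=60)),
    "SEV-3": SeverityPolicy(("text", "slack", "email"), timedelta(hours=24)),
    "SEV-4": SeverityPolicy(("text", "slack", "email"), timedelta(hours=24)),
    "SEV-5": SeverityPolicy(("text", "slack", "email"), timedelta(hours=24)),
}

def is_major(severity: str) -> bool:
    """Major incidents (SEV-1/SEV-2) always page by phone, regardless of time of day."""
    return severity in ("SEV-1", "SEV-2")
```

A paging integration could look up `SEVERITY_POLICIES[severity]` when an alert fires, so only major incidents wake someone by phone.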

Incident Lifecycle for Technical Responders (A Generalized Playbook)

  1. Identification
    • A monitor or human alerts the on-call engineer to an issue
    • The on-call engineer quickly reviews the alert and confirms impact
    • An incident is created. The on-call engineer automatically becomes the Incident Commander
    • The Incident Commander pages necessary technical resources and stakeholders
    • An "Impact Statement", a brief description of the impact, is shared with all responders, along with a link to a voice bridge if there is major severity.
    • (Optional) The Incident Commander position can be reassigned at this point if needed
    • (Optional) If possible, a Scribe / Internal Liaison should be assigned.
  2. Triage
    • Troubleshoot the platform based on the impact statement.
    • When did the issue start to occur?
    • Did any changes occur before the issue started? (Both human- and machine-initiated changes are always a possibility; always ask.)
    • Who is impacted? Quantify if possible
    • Is the impact intermittent in nature, steady, or continually getting worse?
    • Did any monitors fire? What were they alerting?
    • Are there any known / recurring issues that could be causing the observed impact?
    • Are any of our third-party dependencies (Azure, Auth0, Experian, etc) reporting outages?
    • Isolate the cause of the incident by sharing anomalous logs and metrics in chat. If you're sharing a screenshot, include a link to what you're looking at
  3. Remediation
    • After the cause has been properly isolated, discuss possible solutions with technical responders on the call
    • Pick a solution by weighing several factors: probability of success, technical difficulty, and potential side effects
    • Always ask "How could implementing this fix go wrong?" Avoid solutions that could increase the severity of impact
    • Receive clear and obvious acknowledgement that resources are ready to initiate the fix and perform validation
  4. Validation
    • Perform a functional test manually if possible
    • Share telemetry showing that platform / system metrics have returned to baseline and have held at that level for an appropriate amount of time (a minimal sketch of such a check follows this list)
    • If validation fails, repeat step 3 (Remediation). Otherwise, the incident commander should excuse everyone from the bridge.
  5. Incident Follow up
    • Communicate to stakeholders and customers that the incident is resolved
    • If there was a monitoring gap, create a monitor to close it ASAP
    • Ask how we can prevent this in the future, or reduce the resolution time and severity if it recurs
    • If a permanent fix is still required, or the root cause is not apparent, initiate those workstreams
    • Create, finalize, and share a post-mortem
    • Schedule a retro to review the incident
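
For the validation step above, "returned to baseline and sustained it" can be made concrete. The sketch below is a minimal illustration; the function name, baseline, tolerance, and window are hypothetical, and real validation should come from your actual telemetry source.

```python
from datetime import datetime, timedelta

# Hypothetical helper for the Validation step: confirm a metric has returned to
# baseline and stayed within tolerance for a sustained window. The baseline,
# tolerance, and window values are assumptions for illustration only.
def sustained_return_to_baseline(
    samples: list[tuple[datetime, float]],  # (timestamp, value) pairs, oldest first
    baseline: float,
    tolerance: float,
    window: timedelta = timedelta(minutes=15),
) -> bool:
    if not samples:
        return False
    cutoff = samples[-1][0] - window
    recent = [(ts, value) for ts, value in samples if ts >= cutoff]
    # Require the window to actually be covered by data, not just the latest point.
    if recent[0][0] - cutoff > timedelta(minutes=1):
        return False
    # Every sample in the window must be close to baseline, not just the last one.
    return all(abs(value - baseline) <= tolerance for _, value in recent)
```

Requiring every sample in the window (rather than just the latest point) to sit near baseline avoids declaring victory on a brief dip back to normal.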

Incident Roles

Many companies have formalized incident roles for specific duties and to define chain-of-command. Given our size, and for the sake of simplicity, we have only three roles. These roles are independent of your position / level within the company. For instance, a software engineer might be best suited to perform the Commander position, while the emperor of the entire galaxy might be best suited to perform the scribe duties (and vice versa). The key to success with this model is to work together, stay ego-free, don't take things personally, and focus on the Objective: resolving the impact as quickly as possible.

Incident Commander

An Incident Commander has sufficient technical and leadership attributes to drive an incident to a resolved state. As the name implies, they are commanding the current incident and should be comfortable delegating technical work, investigation, and communications. They should have at least a broad / high-level technical understanding of the platform affected. This role is automatically assigned to the first responder, but they are strongly encouraged to hand off this position if they feel another responder is better suited to handle the incident.

Technical Responder

A technical responder is an engineer with working domain knowledge of the services impacted. They will be delegated tasks to assist with incident resolution.

Scribe / Internal Liaison

Any resource capable of documenting a timeline of decisions and actions performed. They should communicate workstreams and updates to key stakeholders throughout the incident lifecycle.

Incident Etiquette

The key to success is to work together, stay ego-free, don't take things personally, and focus on the Objective: resolving the impact as quickly as possible.

Major outages are high-stress, adrenaline-fueled rollercoasters. During an incident, while we are interfacing with machines, it is important to remember that we are also interfacing with humans. Currently, unlike humans, machines lack emotions, temperament, ego, and cortisol. Friction between humans is inevitable during these high-stress situations, so Incident Etiquette is prescribed to reduce the frequency and severity of human friction during an incident's lifecycle.

Instead of re-inventing the wheel, some of the following excerpts have been adapted from PagerDuty's Call Etiquette:

For Everyone:

  • The incident voice bridge should be reserved for technical communication only.
  • Keep your microphone muted until you have something to say.
  • Identify yourself when you join the call; state your name and the system you are the expert for.
  • Be direct and factual. Avoid common logical fallacies during disagreements.
  • Keep conversations/discussions short and to the point.
  • Bring any concerns to the Incident Commander (IC) on the call.
  • Respect time constraints given by the Incident Commander.
  • Avoid talking over people. Chaos will ensue when there is chatter from multiple parties at once. The Incident Commander should make sure this doesn't happen.

For Technical Responders:

  • Follow all instructions from the Incident Commander - without exception.
  • Do not perform any actions unless the Incident Commander has told you to do so.
  • The Incident Commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them.
  • Once the Incident Commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll.
  • Answer any questions the Incident Commander asks you in a clear and concise way.
    • Answering that you "don't know" something is perfectly acceptable. Do not try to guess.
  • The Incident Commander may ask you to investigate something and get back to them in X minutes. Make sure you are ready with an answer within that time.
    • Answering that you need more time is perfectly acceptable, but you need to give the Incident Commander an estimate of how much time.

For Incident Commanders:

  • Make sure the voice bridge stays productive and forward-moving. Actively stop side-discussions and distractions.
  • Make sure the voice bridge stays technical.
  • Ensure communications to stakeholders outside the voice bridge are delivered by you or the scribe in a timely manner.
  • Squash and deter any finger-pointing or heated arguments
  • Delegate as much as possible
  • You will issue a lot of imperatives; make sure you delegate kindly.
  • Establish and negotiate clear time-constraints for workstreams to avoid the frequent, "Hey X, are you done yet?" questions that only delay and stress responders. Make sure these constraints are communicated to all stakeholders, so they are not left wondering.
  • Hand over your IC role to someone else on the call when appropriate. Do not feel bad for doing this; it is strongly encouraged if you deem they are better suited.
  • Poll for any objections before performing a significant action.
  • Be mindful of people's time. Excuse responders from voice bridges as soon as reasonably possible after remediation.

On-Call

Rotations

With complex systems, multiple teams are often needed to handle the ongoing maintenance and development of their respective domains. While it is good for all engineers to maintain a level of cross-domain knowledge, it is impossible for all engineers to have deep technical knowledge of every domain of the platform. As such, it is appropriate and necessary for each team to have its own on-call rotation. Also, there is power in numbers.

Additionally, with complex systems, redundancy is important for the reliability of the platform in case of failures. Likewise, with on-call rotations, it is important to have a backup / secondary on-call engineer in case the primary on-call engineer doesn't answer their phone.

Primary On-Call Responsibilities:

  • Keep your phone / text / Slack notification settings up to date in OpsGenie
  • Ack P1/P2 alerts within 15 minutes
  • Keep your phone ringer on 24/7 during your rotation, unless secondary coverage is notified and available
  • Keep your laptop with you 24/7 during your rotation, unless secondary coverage is notified and available
  • Coordinate with your secondary ASAP for planned dates/times you will be unavailable so they can provide coverage
  • Coordinate planned PTO with your Rotation Manager

Rotation Manager

  • Ensure your team's on-call notification settings are up to date. Keep this in mind especially when onboarding or offboarding engineers
  • Make adjustments to the schedule as needed. If a primary on-call engineer is on PTO or OOO, make sure coverage is in place
  • If you are a manager/owner of a system, it is recommended that you join all major incidents.
  • Ensure your escalation policies never stop escalating for major-severity alerts until someone acknowledges. You should be on this escalation path
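
As a rough, tool-agnostic sketch of the "never stop escalating" behavior described above: the responder names, interval, and helpers below are hypothetical, and real paging tools (OpsGenie, PagerDuty, etc.) express this declaratively rather than as a loop.

```python
import itertools
import time

# Hypothetical escalation path for major-severity alerts. Values are illustrative.
ESCALATION_PATH = ["primary on-call", "secondary on-call", "rotation manager"]
ESCALATION_INTERVAL = 5 * 60  # seconds to wait for an ack before paging the next person

def page(responder: str) -> None:
    print(f"Paging {responder} by phone + text ...")

def acknowledged() -> bool:
    # Placeholder: a real implementation would check the alert's ack state.
    return False

def escalate_until_acknowledged() -> None:
    # itertools.cycle loops back to the primary, so escalation never stops
    # until someone acknowledges the alert.
    for responder in itertools.cycle(ESCALATION_PATH):
        page(responder)
        time.sleep(ESCALATION_INTERVAL)
        if acknowledged():
            return
```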