Google's SRE Books

Some notes on and excerpts from Google's SRE trifecta. These are just my jottings, so they may be incomplete, lacking in detail, and structured differently from the books.

  1. the original: https://sre.google/sre-book/table-of-contents/
    • philosophy and the principles of production engineering and operations at Google
  2. the follow-up: https://sre.google/workbook/table-of-contents/
    • companion to the first SRE book to address questions, requests, and comments
    • frequent question: how to put these principles into practice in my team/company?
  3. the security add-on: https://google.github.io/building-secure-and-reliable-systems/raw/toc.html
    • focus on the security aspect of reliability engineering

Tenets of SRE

  • https://sre.google/sre-book/introduction/
    • comparison of Google's SREs vs traditional sysadmins
    • have software engineers do operations
    • results:
      • SREs would be bored by doing tasks by hand
      • have the skill set necessary to automate tasks
      • do the same work as an operations team, but with automation instead of manual labor
    • to avoid the manual-labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of "ops" work for SREs (this is an upper bound; the actual amount of ops work is expected to be much lower)
    • Pros
      • cheaper to scale
      • circumvents Devs/Ops split
    • Cons
      • hard to hire for
      • may be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)

Change Management

  • https://sre.google/sre-book/introduction/
    • roughly 70% of outages are due to changes in a live system; mitigations:
      • Implement progressive rollouts
      • Monitoring
      • Rollback
    • Remove humans from the loop to avoid the standard human problems with repetitive tasks
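
A minimal sketch of what an automated progressive rollout with rollback could look like; the deploy/error_rate/rollback helpers, stage fractions, and threshold are hypothetical stand-ins, not anything prescribed by the book:

```python
import time

# Hypothetical helpers standing in for real deployment and monitoring tooling.
def deploy(version: str, fraction: float) -> None:
    print(f"routing {fraction:.0%} of traffic to {version}")

def error_rate(version: str) -> float:
    return 0.001  # placeholder: would be read from monitoring in a real system

def rollback(version: str) -> None:
    print(f"rolling back {version}")

def progressive_rollout(version: str, threshold: float = 0.01) -> bool:
    """Roll out in stages; roll back automatically if monitoring regresses."""
    for fraction in (0.01, 0.10, 0.50, 1.00):
        deploy(version, fraction)
        time.sleep(1)  # bake time per stage (much longer in practice)
        if error_rate(version) > threshold:
            rollback(version)
            return False
    return True

progressive_rollout("release-2024-05-01")
```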

Risk Management

  • https://sre.google/sre-book/embracing-risk/
    • 100% availability/reliability is the wrong goal
    • the last increments toward 100% reliability aren't noticeable to most users but require tremendous effort and cost
    • users mostly can't tell difference between 99.99% and 99.999% reliability
    • goal of SRE team isn't "zero outages"
    • goal: make systems reliable enough, but not too reliable!
    • the goal should acknowledge the trade-off and leave an error budget (see the example after this list)
    • error budget can be spent on anything: launching features, experiments, etc.
    • SREs and product devs are encouraged to spend the error budget to get maximum feature velocity
    • if you are consistently well under the error budget, perhaps you are not moving fast enough
    • set quarterly reliability targets
    • work with product owners to translate business objectives into explicit SLOs
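
As a toy illustration of the error-budget idea, here is the budget implied by an availability SLO; the 99.9% target and 30-day window are example numbers only:

```python
def error_budget(slo: float, window_days: float = 30) -> dict:
    """Translate an availability SLO into an allowed-downtime budget."""
    budget_fraction = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return {
        "budget_fraction": budget_fraction,
        "allowed_downtime_min": budget_fraction * window_minutes,
    }

# A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime to "spend"
# on launches, experiments, planned maintenance, etc.
print(error_budget(0.999))
```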

Service Level Objectives (SLOs)

  • take the service down deliberately from time to time to signal that it is not 100% reliable (so users don't come to rely on more reliability than is promised)
  • different indicators for different kinds of systems (user-facing, storage, big data)
  • how many indicators to monitor:
    • too many: hard to model and pay attention to
    • too few: could ignore important behavior
  • studies show that users usually prefer a slightly slower average with better tail latency over a faster average with high variance
  • how to choose targets?
    • avoid absolutes: "infinite" scale or "always" available not feasible
    • keep it simple, perfection can wait, SLOs can be redefined later
    • keep a safety margin: define internal SLOs more strictly than the externally advertised SLOs, so that alerts fire on an internal violation before the external SLO is at risk (see the sketch after this list)
    • don't overachieve
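
A small sketch of measuring a request-success SLI and comparing it against both the external SLO and a stricter internal target used as a safety margin; all numbers are made up:

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    return (total_requests - failed_requests) / total_requests

EXTERNAL_SLO = 0.999   # promised to users
INTERNAL_SLO = 0.9995  # stricter safety margin; alerts fire here first

sli = availability_sli(total_requests=1_000_000, failed_requests=700)
print(f"SLI = {sli:.4f}")
print("breach" if sli < EXTERNAL_SLO else "within external SLO")
print("alert" if sli < INTERNAL_SLO else "within internal SLO")
```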

Eliminating Toil

  • definition 'toil':
    • manual
    • repetitive
    • automatable
    • tactical
    • no enduring value
  • toil > 50% is a sign that the manager should spread toil load more evenly

Capacity Planning / Demand Forecasting

  • example: capacity planning tool
  • transition from implementation-based to intent-based
  • instead of implementation "Want 50 cores in clusters X, Y and Z", specify requirements "Want to meet demand with N+2 redundancy"
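
A toy version of intent-based sizing, assuming a made-up per-replica capacity and an N+2 redundancy requirement (the sizing rule and numbers are illustrative, not Google's actual tooling):

```python
import math

def replicas_needed(peak_qps: float, qps_per_replica: float, redundancy: int = 2) -> int:
    """Intent-based sizing: enough replicas for peak demand, plus N+redundancy spares."""
    return math.ceil(peak_qps / qps_per_replica) + redundancy

# "Meet 12,000 QPS with N+2 redundancy" instead of "want 50 cores in clusters X, Y and Z".
print(replicas_needed(peak_qps=12_000, qps_per_replica=800))  # 15 + 2 = 17
```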

Monitoring & Alerting

  • https://sre.google/sre-book/monitoring-distributed-systems/
    • alerts: human needs to take action immediately
    • tickets: human needs to take action eventually
    • logs: no action needed
    • TODO when to use metrics and when logs?
  • monitoring is non-trivial and requires an ongoing investment of work, time, and effort
  • trend towards simpler/faster monitoring systems, avoid "magic" systems
  • rules that generate alerts for humans should be clear and easy to understand
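
A sketch of the alerts/tickets/logs distinction as a routing decision; the Signal fields and routing targets are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    actionable: bool  # does a human need to do something?
    urgent: bool      # does it need to happen right now?

def route(signal: Signal) -> str:
    """Page only when a human must act immediately; otherwise ticket or just log."""
    if signal.actionable and signal.urgent:
        return "page the on-call"
    if signal.actionable:
        return "open a ticket"
    return "log only"

print(route(Signal("user-facing error rate above SLO", actionable=True, urgent=True)))
print(route(Signal("disk 80% full, days of headroom left", actionable=True, urgent=False)))
print(route(Signal("single request retried successfully", actionable=False, urgent=False)))
```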

Emergency / Incident Response

  • https://sre.google/sre-book/emergency-response/

    • reliability as a function of MTTF (mean time to failure) and MTTR (mean time to recovery); see the availability sketch after this list
    • focus on MTTR when evaluating responses
    • the presence of a playbook results in roughly 3x lower MTTR
    • fostering a culture of documentation
  • https://sre.google/workbook/incident-response/#putting-best-practices-into-practice

    • chapter outlines Incident Command System (ICS), a standardized approach to emergency response
    • managed incidents resolved faster with clear roles such as Incident Commander, Operations Lead, Communications Lead, and Scribe
    • defining severity levels to categorize incidents and determine appropriate response
    • prioritizing stopping the impact first, then root cause analysis (unless the root cause is identified early on)
    • SREs gathering in a "war room" for incidents
    • clear and timely communication using standardized templates for updates
    • generic mitigations as actions to alleviate pain before full root cause understanding
      • examples:
        • rolling back a recent release correlated with an outage
        • reconfiguring load balancers to avoid a problematic region
      • caution against mitigations as blunt instruments potentially causing service disruptions
      • creation of general-purpose mitigation tools before incidents
    • conducting blameless postmortems (see below) to learn from incidents
    • regular training on incident response procedures and simulated incidents (game days)
    • developing tools for automating common response tasks to reduce human error
    • browsing postmortems to discover useful mitigations and tools for future incident management
    • learning from previous incidents (postmortem docs) for better future major incident response
    • using insights from incidents to drive system and process improvements
    • encouraging a culture of continuous learning and improvement
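
The availability sketch referenced above: the standard steady-state relationship between MTTF and MTTR, with arbitrary example numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Cutting MTTR from 4h to 1h (e.g., via playbooks and practiced response)
# improves availability without making failures any rarer.
print(f"{availability(mttf_hours=700, mttr_hours=4):.5f}")  # ~0.99432
print(f"{availability(mttf_hours=700, mttr_hours=1):.5f}")  # ~0.99857
```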

Postmortem Culture

NOTE: Here the term "postmortem" is used both for the postmortem document/report and for the postmortem review meeting.

What are postmortems?

  • postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
  • primary goals:
    • ensure that the incident is documented
    • all contributing root cause(s) are well understood
    • effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
  • reader should get a complete view of the outage and, more importantly, learn something new

How to introduce postmortems?

  • introducing postmortems into an organization is as much a cultural change as it is a technical one
  • senior leadership should actively support and promote the postmortem culture (visible rewards for doing the right thing)
  • start simple and basic, adapt to your organization, there's no one size fits all
  • tools to help bootstrap the process: standardized postmortem template, checklists, some automation (to populate postmortems with metadata)
  • if value of postmortems is questioned in relation to the cost of their preparation:
    • Ease postmortems into the workflow. A trial period with several complete and successful postmortems may help prove their value, in addition to helping to identify what criteria should initiate a postmortem.
    • Make sure that writing effective postmortems is a rewarded and celebrated practice, both publicly through the social methods mentioned earlier, and through individual and team performance management.
    • Encourage senior leadership's acknowledgment and participation.

Postmortem Document / Report

  • postmortem doc best practices:
    • write postmortems for the whole company as the audience, share and announce proactively
    • involve all incident participants in authoring the doc so that no contributing factors are overlooked
    • add context (introduce terminology or link to glossary pages) to make the document accessible and comprehensible to a broader audience
      • otherwise postmortems might be misunderstood or even ignored
    • add measurable data, such as number of failed requests, cache hit ratio, duration of the incident
      • when you can't measure failure, you can't know that it is fixed
      • even rough estimates are better than nothing to show the extent of the outage
      • conclusions should be based on facts and data linked in the doc
    • add trigger and root cause; some of the most important sections of a postmortem doc
    • avoid finger pointing
      • do not mention individuals or teams in the postmortem in a blaming way
      • otherwise team members will become risk averse or try to hide mistakes
    • avoid animated language
      • subjective and dramatic expression distracts from key message and erodes psychological safety
      • instead: provide verifiable data to justify severity of statements
    • derive concrete action items and prioritize them, group them if there are too many, make sure at least some are prevention- or mitigation-focused
    • declare ownership of postmortem doc and action items
      • this promotes accountability and encourages following up on and completing the action items
    • add machine-readable metadata like tags to enable later analytics
    • keep the doc concise to balance detail with readability (move extensive log excerpts to a separate doc)
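
A sketch of the machine-readable metadata point: attach a few structured fields to each postmortem so that simple analytics can be run across them later. The field names and values are hypothetical:

```python
from collections import Counter

# Hypothetical structured metadata attached to postmortem docs.
postmortems = [
    {"id": "PM-101", "severity": "SEV-2", "tags": ["config-change", "rollback"], "duration_min": 42},
    {"id": "PM-102", "severity": "SEV-1", "tags": ["capacity", "cascading-failure"], "duration_min": 180},
    {"id": "PM-103", "severity": "SEV-3", "tags": ["config-change"], "duration_min": 15},
]

# Later analytics: which contributing factors recur most often?
tag_counts = Counter(tag for pm in postmortems for tag in pm["tags"])
print(tag_counts.most_common())
```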

Postmortem Review Meeting

  • review every postmortem to ensure lessons are learned and improvements are made
  • conduct regular review sessions to finalize postmortems and foster knowledge sharing and collaboration

Cultural Decay

  • sign that culture is decaying: disengagement from postmortem process ("glad that I don't have to write that postmortem now")
  • when blame sneaks into postmortem process:
    • move the narrative back in a more constructive direction
    • focus on investigating the root cause (also process-related) instead of assigning blame
  • lacking time to write postmortems:
    • when the team is occupied with other tasks, it is hard to keep up the quality of postmortems
    • try to find out why there is not enough time (not a priority?)
  • repeating similar incidents:
    • implementing action items takes too long?
    • perhaps feature velocity is considered more important
    • are issues always just mitigated ad hoc instead of being properly addressed?

Reinforce Postmortem Culture

  • view postmortems as opportunities to fix weaknesses and enhance overall resilience, not just as formalities
  • ensure confidence in escalating issues by keeping the postmortem process blameless
  • rewards for working on postmortem docs and action items:
    • give owners opportunity to present lessons learned
    • peer bonus
    • positive performance reviews
    • even promotion
  • intrinsic motivation to live postmortem culture:
    • over time fewer incidents, more focus time for feature velocity
    • gamification: leaderboards showing 'scores' for closing action items
    • Wheel of Misfortune: training new on-call engineers on simulations of real incidents
  • regularly ask for feedback on effectiveness of the postmortem process and seek ways to improve it
  • ask questions such as:
    • is the culture supporting your work?
    • does writing a postmortem entail too much toil?
    • what best practices does your team recommend for other teams?
    • what kinds of tools would you like to see developed?
  • tools are no silver bullet, but they can make the process smoother and free up time

Tracking Outages

  • Escalator / Outalator principles

Preventing Cascading Failures

  • load testing in a realistic environment is crucial to anticipate traffic spikes
  • fail quickly and with minimal resource usage when overloaded to prevent exacerbating the situation
  • implement request rejection at various system levels (reverse proxy, load balancer, task) to prevent overloading downstream services
  • conduct regular capacity planning to ensure readiness for traffic increases
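
A minimal sketch of rejecting work early when overloaded (load shedding), so that excess requests fail quickly and cheaply instead of piling up; the concurrency limit and handler are placeholders:

```python
import threading

class LoadShedder:
    """Reject requests beyond a concurrency limit instead of queueing them."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def handle(self, request: str) -> str:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return "503 overloaded"  # fail fast, minimal resources spent
            self.in_flight += 1
        try:
            return f"200 handled {request}"  # placeholder for real work
        finally:
            with self.lock:
                self.in_flight -= 1

shedder = LoadShedder(max_in_flight=100)
print(shedder.handle("GET /"))
```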

Graceful Degradation

  • serve degraded results to maintain service availability under stress
  • regular testing of the degradation path is essential, potentially by overloading a subset of servers to ensure reliability
  • maintain simplicity in degradation mechanisms for predictability and ease of understanding
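
A sketch of a degraded-results path: under high load, serve a cheaper (cached, possibly stale or partial) response instead of failing outright. The load signal and the result helpers are made up:

```python
def full_results(query: str) -> list[str]:
    return [f"fresh result for {query!r}"]         # expensive: hits backends

def cached_results(query: str) -> list[str]:
    return [f"stale cached result for {query!r}"]  # cheap: possibly incomplete

def search(query: str, load: float, degrade_above: float = 0.8) -> list[str]:
    """Serve degraded results when the load signal crosses a threshold."""
    if load > degrade_above:
        return cached_results(query)
    return full_results(query)

print(search("sre books", load=0.95))  # degraded path under stress
print(search("sre books", load=0.30))  # normal path
```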

Retries

  • use retries with randomized exponential backoff to avoid synchronized retry storms
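
A sketch of retries with capped, randomized exponential backoff ("full jitter"); the attempt count and delays are arbitrary:

```python
import random
import time

def call_with_retries(op, max_attempts: int = 5, base: float = 0.1, cap: float = 10.0):
    """Retry op() with randomized exponential backoff to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))  # jitter de-synchronizes clients

# Example: an operation that fails twice before succeeding.
attempts = {"count": 0}
def flaky_op():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_retries(flaky_op))  # succeeds on the third attempt
```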

Timeouts vs. Deadlines

  • set thoughtful deadlines based on system load and failure scenarios rather than arbitrary round numbers to prevent the accumulation of "zombie" requests
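
A sketch of propagating one absolute deadline through a request instead of per-call timeouts, so that work nobody is waiting for anymore is never started; the backend calls are stand-ins:

```python
import time

class DeadlineExceeded(Exception):
    pass

def backend_call(name: str, deadline: float) -> str:
    if time.monotonic() >= deadline:
        raise DeadlineExceeded(name)  # don't start work nobody is waiting for
    time.sleep(0.05)                  # placeholder for real work
    return f"{name} ok"

def handle_request() -> list[str]:
    # One absolute deadline for the whole request, passed down to every call,
    # instead of arbitrary per-call timeouts that add up to "zombie" requests.
    deadline = time.monotonic() + 0.5
    return [backend_call(svc, deadline) for svc in ("auth", "storage", "render")]

print(handle_request())
```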

Data Integrity

  • data integrity is more critical than availability
  • backups are complex due to transactional diversity and versioning
  • replicas are not a sufficient solution because corrupted data will be synced

Defense Layers

  • soft deletion:
    • protects against unintended data loss by delaying permanent deletion
  • backups:
    • recovery objectives dictate data loss tolerance, recovery time, and backup duration
    • Google typically maintains a 30 to 90-day backup window
  • early detection:
    • integrity checks are vital but difficult to balance for accuracy without false positives or negatives
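
A sketch of soft deletion as the first defense layer: mark data as deleted and purge it only after a delay, so accidental or malicious deletions can be undone. The field names and the 30-day delay are illustrative:

```python
import datetime as dt

PURGE_DELAY = dt.timedelta(days=30)  # illustrative; choose per recovery objectives

class SoftDeleteStore:
    """Records are marked as deleted first and only purged after a delay."""

    def __init__(self):
        self.rows = {}  # key -> {"value": ..., "deleted_at": datetime or None}

    def put(self, key, value):
        self.rows[key] = {"value": value, "deleted_at": None}

    def delete(self, key):
        self.rows[key]["deleted_at"] = dt.datetime.now(dt.timezone.utc)

    def undelete(self, key):
        self.rows[key]["deleted_at"] = None  # recovery path for accidental deletes

    def purge_expired(self):
        now = dt.datetime.now(dt.timezone.utc)
        expired = [k for k, r in self.rows.items()
                   if r["deleted_at"] and now - r["deleted_at"] > PURGE_DELAY]
        for key in expired:
            del self.rows[key]  # only now is the data truly gone

store = SoftDeleteStore()
store.put("doc-1", "important data")
store.delete("doc-1")
store.undelete("doc-1")  # still recoverable within the purge window
```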