Some notes on and excerpts from Google's SRE trifecta. These are just my jottings, so they may be incomplete, lacking in detail, and structured differently from the books.
- the original: https://sre.google/sre-book/table-of-contents/
- philosophy and the principles of production engineering and operations at Google
- the follow-up: https://sre.google/workbook/table-of-contents/
- companion to the first SRE book to address questions, requests, and comments
- frequent question: how to put these principles into practice in my team/company?
- the security add-on: https://google.github.io/building-secure-and-reliable-systems/raw/toc.html
- focus on the security aspect of reliability engineering
- https://sre.google/sre-book/introduction/
- comparison of Google's SREs vs traditional sysadmins
- have software engineers do operations
- results:
- SREs would be bored by doing tasks by hand
- have the skill set necessary to automate tasks
- do the same work as an operations team, but with automation instead of manual labor
- to avoid the manual-labor trap that causes team size to scale with service load, Google caps "ops" work for SREs at 50% of their time (an upper bound; the actual amount of ops work is expected to be much lower)
- Pros
- cheaper to scale
- circumvents Devs/Ops split
- Cons
- hard to hire for
- may be unorthodox in ways that require management support (e.g., the product team may push back against the decision to stop releases for the quarter because the error budget is depleted)
- https://sre.google/sre-book/introduction/
- ~70% of outages are due to changes in a live system. Mitigations:
- Implement progressive rollouts
- Monitoring
- Rollback
- Remove humans from the loop to avoid the usual human errors on repetitive tasks
- https://sre.google/sre-book/embracing-risk/
- 100% availability/reliability is the wrong goal
- chasing that last sliver of reliability isn't noticeable to most users but requires tremendous effort and cost
- users mostly can't tell difference between 99.99% and 99.999% reliability
- goal of SRE team isn't "zero outages"
- goal: make systems reliable enough, but not too reliable!
- the goal should acknowledge the trade-off and leave an error budget (see the sketch below)
- error budget can be spent on anything: launching features, experiments, etc.
- SREs and product devs are encouraged to spend the error budget to get maximum feature velocity
- if constantly under error budget: perhaps you are not moving fast enough
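- a minimal sketch (my own illustrative numbers, not from the book) of how an SLO turns into an error budget:

  ```python
  # Hypothetical quarterly numbers for an availability SLO of 99.9%.
  slo = 0.999                          # availability target
  error_budget = 1 - slo               # fraction of requests allowed to fail

  quarterly_requests = 2_500_000_000   # assumed traffic for the quarter
  allowed_failures = quarterly_requests * error_budget

  failed_so_far = 1_200_000            # assumed failures observed to date
  budget_remaining = 1 - failed_so_far / allowed_failures

  print(f"allowed failures this quarter: {allowed_failures:,.0f}")  # 2,500,000
  print(f"error budget remaining: {budget_remaining:.0%}")          # 52%
  ```

- while budget remains, launches and experiments can proceed; once it is spent, releases pause until reliability work (or the next quarter) restores headroom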
- set quarterly reliability targets
- work with product owners to translate business objectives into explicit SLOs
- take the service down deliberately to signal that it is not 100% reliable (so dependent teams don't get used to over-availability)
- different indicators for different kinds of systems (user-facing, storage, big data)
- how many indicators to monitor:
- too many: hard to model and pay attention to
- too few: could ignore important behavior
- studies show that users usually prefer a slightly slower average with less variance, i.e., better tail latency (see the percentile sketch after this list)
- how to choose targets?
- avoid absolutes: "infinite" scale or "always" available are not feasible targets
- keep it simple, perfection can wait, SLOs can be redefined later
- keep a safety margin: the internal SLO should be stricter than the externally advertised SLO/SLA, so alarms fire on missing the internal target before the external commitment is violated
- don't overachieve
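- a minimal sketch (synthetic data and an illustrative target, not from the book) of measuring a tail-latency SLI against an SLO:

  ```python
  import random

  random.seed(0)
  # Synthetic request latencies in milliseconds for one measurement window.
  latencies_ms = [random.lognormvariate(4.0, 0.5) for _ in range(10_000)]

  def percentile(samples, pct):
      """Nearest-rank percentile of a list of samples."""
      ordered = sorted(samples)
      rank = max(0, int(len(ordered) * pct / 100) - 1)
      return ordered[rank]

  slo_target_ms = 300   # assumed target: 99% of requests complete within 300 ms
  p99 = percentile(latencies_ms, 99)
  print(f"p99 = {p99:.0f} ms -> SLO {'met' if p99 <= slo_target_ms else 'missed'}")
  ```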
- definition of 'toil':
- manual
- repetitive
- automatable
- tactical
- no enduring value
- toil above 50% for an individual is a sign that the manager should spread the toil load more evenly
- example: capacity planning tool
- transition from implementation-based to intent-based
- instead of the implementation ("I want 50 cores in clusters X, Y, and Z"), specify the requirement ("I want to meet demand with N+2 redundancy"); see the sketch below
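- a small sketch to make the contrast concrete (the spec format and field names are made up, not Google's actual tool):

  ```python
  # Implementation-based: the requester hard-codes the "how".
  implementation_spec = {"cluster_x": 50, "cluster_y": 50, "cluster_z": 50}  # cores

  # Intent-based: the requester states the "what"; a solver decides placement
  # and can re-plan when clusters, demand, or failure domains change.
  intent_spec = {
      "service": "example-frontend",   # hypothetical service
      "peak_demand_qps": 5000,
      "redundancy": "N+2",             # survive two concurrent cluster outages
      "latency_goal_ms": 50,           # serve users within 50 ms
  }
  ```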
- https://sre.google/sre-book/monitoring-distributed-systems/
- alerts: human needs to take action immediately
- tickets: human needs to take action eventually
- logs: no action needed
- TODO: when to use metrics and when to use logs?
- monitoring is non-trivial and requires ongoing work/time/effort
- trend towards simpler/faster monitoring systems, avoid "magic" systems
- rules that generate alerts for humans should be clear and easy to understand (see the small triage sketch below)
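- a minimal sketch (my own example) of the alert/ticket/log taxonomy as a rule simple enough to reason about:

  ```python
  def triage(needs_human: bool, urgent: bool) -> str:
      """Classify a monitoring signal by the response it should trigger."""
      if needs_human and urgent:
          return "page"    # alert: a human must take action immediately
      if needs_human:
          return "ticket"  # a human must take action eventually
      return "log"         # recorded for diagnostics; no action needed

  print(triage(needs_human=True, urgent=True))    # page
  print(triage(needs_human=True, urgent=False))   # ticket
  print(triage(needs_human=False, urgent=False))  # log
  ```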
- https://sre.google/sre-book/emergency-response/
- reliability is a function of MTTF (mean time to failure) and MTTR (mean time to recovery); see the quick arithmetic below
- focus on MTTR for evaluating responses
- having a playbook at hand resulted in roughly 3x lower MTTR compared to improvising a response
- fostering a culture of documentation
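- quick arithmetic (my own illustrative numbers) for why MTTR matters: steady-state availability ≈ MTTF / (MTTF + MTTR):

  ```python
  def availability(mttf_hours: float, mttr_hours: float) -> float:
      return mttf_hours / (mttf_hours + mttr_hours)

  # Hypothetical service failing on average every 1000 hours:
  print(f"{availability(1000, 1.0):.4%}")  # 99.9001% with a 1-hour recovery
  print(f"{availability(1000, 0.5):.4%}")  # 99.9500% -- halving MTTR halves downtime
  ```

- shaving time off recovery (e.g., via playbooks) therefore translates directly into availability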
- https://sre.google/workbook/incident-response/#putting-best-practices-into-practice
- chapter outlines Incident Command System (ICS), a standardized approach to emergency response
- managed incidents resolved faster with clear roles such as Incident Commander, Operations Lead, Communications Lead, and Scribe
- defining severity levels to categorize incidents and determine appropriate response
- prioritizing stopping the impact first, then root-cause analysis (unless the root cause is identified early on)
- SREs gathering in a "war room" for incidents
- clear and timely communication using standardized templates for updates
- generic mitigations as actions to alleviate pain before full root cause understanding
- examples:
- rolling back a recent release correlated with an outage
- reconfiguring load balancers to avoid a problematic region
- caution: mitigations are blunt instruments and can themselves cause service disruptions
- creation of general-purpose mitigation tools before incidents occur
- further preparation practices:
- conducting blameless postmortems (see below) to learn from incidents
- regular training on incident response procedures and simulated incidents (game days)
- developing tools for automating common response tasks to reduce human error
- browsing postmortems to discover useful mitigations and tools for future incident management
- learning from previous incidents (postmortem docs) for better future major incident response
- using insights from incidents to drive system and process improvements
- encouraging a culture of continuous learning and improvement
NOTE: below, the term "postmortem" is used both for the postmortem document/report and for the postmortem review meeting.
- postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
- primary goals:
- ensure that the incident is documented
- all contributing root cause(s) are well understood
- effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
- reader should get a complete view of the outage and, more importantly, learn something new
- introducing postmortems into an organization is as much a cultural change as it is a technical one
- senior leadership should actively support and promote the postmortem culture (visible rewards for doing the right thing)
- start simple and basic, adapt to your organization, there's no one size fits all
- tools to help bootstrap the process: a standardized postmortem template, checklists, some automation (to populate postmortems with metadata)
- if value of postmortems is questioned in relation to the cost of their preparation:
- Ease postmortems into the workflow. A trial period with several complete and successful postmortems may help prove their value, in addition to helping to identify what criteria should initiate a postmortem.
- Make sure that writing effective postmortems is a rewarded and celebrated practice, both publicly through the social methods mentioned earlier, and through individual and team performance management.
- Encourage senior leadership's acknowledgment and participation.
- postmortem doc best practices:
- write postmortems for the whole company as the audience, share and announce proactively
- involve all incident participants in authoring the doc so that no contributing factors are overlooked
- add context (introduce terminology or link to glossary pages) to make the document accessible and comprehensible to a broader audience
- otherwise postmortems might be misunderstood or even ignored
- add measurable data, such as number of failed requests, cache hit ratio, duration of the incident
- when you can't measure failure, you can't know that it is fixed
- even rough estimates are better than nothing to show the extent of the outage
- conclusions should be based on facts and data linked in the doc
- add trigger and root cause; some of the most important sections of a postmortem doc
- avoid finger pointing
- do not mention individuals or teams in the postmortem in a blaming way
- otherwise team members will become risk averse or try to hide mistakes
- avoid animated language
- subjective and dramatic expression distracts from key message and erodes psychological safety
- instead: provide verifiable data to justify severity of statements
- derive concrete action items and prioritize them, group them if there are too many, make sure at least some are prevention- or mitigation-focused
- declare ownership of postmortem doc and action items
- this promotes accountability and encourages following up on and completing the action items
- add machine-readable metadata, like tags, to enable later analytics (see the sketch after this list)
- keep the doc concise to balance detail with readability (move long log excerpts into a separate, linked doc)
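- a small sketch of machine-readable postmortem metadata (field names and values are my own, not a Google template):

  ```python
  # Hypothetical tags so that later questions ("how many incidents were
  # config-change related this year?") become a query, not a manual read-through.
  postmortem_metadata = {
      "incident_id": "2024-03-17-checkout-outage",  # made-up identifier
      "severity": "SEV-2",
      "duration_minutes": 83,
      "detection": "monitoring-alert",              # vs. "customer-report"
      "root_cause_tags": ["config-change", "missing-canary"],
      "action_items_open": 4,
  }
  ```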
- review every postmortem to ensure lessons are learned and improvements are made
- conduct regular review sessions to finalize postmortems and foster knowledge sharing and collaboration
- sign that culture is decaying: disengagement from postmortem process ("glad that I don't have to write that postmortem now")
- when blame sneaks into the postmortem process:
- moving the narrative back in a more constructive direction
- focus on investigating the root cause (also process-related) instead of assigning blame
- lacking time to write postmortems:
- when the team is occupied with other tasks, it is hard to keep up the quality of postmortems
- try to find out why there is not enough time (not a priority?)
- repeating similar incidents:
- implementing action items takes too long?
- perhaps feature velocity is considered more important
- are issues always just mitigated ad hoc instead of being properly addressed?
- view postmortems as opportunities to fix weaknesses and enhance overall resilience, not just as formalities
- ensure confidence in escalating issues by keeping the postmortem process blameless
- rewards for working on postmortem docs and action items:
- give owners opportunity to present lessons learned
- peer bonus
- positive performance reviews
- even promotion
- intrinsic motivation to live postmortem culture:
- over time fewer incidents, more focus time for feature velocity
- gamification: leaderboards showing 'scores' for closing action items
- Wheel of Misfortune: training new on-call engineers on simulations of real incidents
- regularly ask for feedback on effectiveness of the postmortem process and seek ways to improve it
- ask questions such as:
- is the culture supporting your work?
- does writing a postmortem entail too much toil?
- what best practices does your team recommend for other teams?
- what kinds of tools would you like to see developed?
- tools are no silver bullet but can make the process smoother and free up time
- Escalator / Outalator principles (Google-internal tools for escalating unacknowledged alerts and for aggregating and annotating outage notifications)
- load testing is crucial in a realistic environment to anticipate traffic spikes
- fail quickly and with minimal resource usage when overloaded to prevent exacerbating the situation
- implement request rejection at various system levels (reverse proxy, load balancer, task) to prevent overloading downstream services
- conduct regular capacity planning to ensure readiness for traffic increases
- serve degraded results to maintain service availability under stress
- regular testing of the degradation path is essential, potentially by overloading a subset of servers to ensure reliability
- maintain simplicity in degradation mechanisms for predictability and ease of understanding
- use retries with randomized exponential backoff to avoid synchronized retry storms (see the sketch after this list)
- set thoughtful deadlines based on system load and failure scenarios rather than arbitrary round numbers to prevent the accumulation of "zombie" requests
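- a minimal sketch (illustrative parameters) of retries with full-jitter exponential backoff bounded by an overall deadline:

  ```python
  import random
  import time

  def call_with_retries(do_request, deadline_s=5.0, base_s=0.1, cap_s=2.0):
      """Retry a flaky call, sleeping a random ("jittered") amount between attempts."""
      start = time.monotonic()
      attempt = 0
      while True:
          try:
              return do_request()
          except Exception:
              attempt += 1
              # Full jitter: a random sleep up to the capped exponential bound,
              # so many clients don't retry in lockstep after a shared failure.
              backoff = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
              if time.monotonic() + backoff - start > deadline_s:
                  raise  # respect the deadline instead of becoming a "zombie" request
              time.sleep(backoff)
  ```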
- data integrity is more critical than availability: an outage eventually ends, but lost or corrupted data can be unrecoverable
- backups are complex due to transactional diversity and versioning
- replicas are not a sufficient backup because corrupted or deleted data is faithfully synced to them
- soft deletion:
- protects against unintended data loss by delaying permanent deletion
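- a minimal sketch of the idea (schema and retention period are my own assumptions):

  ```python
  import datetime

  PURGE_AFTER = datetime.timedelta(days=30)  # assumed grace period before real deletion

  def soft_delete(record: dict) -> None:
      """Mark a record as deleted instead of destroying it immediately."""
      record["deleted_at"] = datetime.datetime.now(datetime.timezone.utc)

  def purge_expired(records: list[dict]) -> list[dict]:
      """Permanently drop only records whose grace period has passed."""
      now = datetime.datetime.now(datetime.timezone.utc)
      return [r for r in records
              if r.get("deleted_at") is None or now - r["deleted_at"] < PURGE_AFTER]
  ```

- anything "deleted" by a buggy pipeline or a compromised account can still be restored within the window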
- backups:
- recovery objectives dictate data loss tolerance, recovery time, and backup duration
- Google typically maintains a 30 to 90-day backup window
- early detection:
- integrity checks are vital but hard to tune: too strict produces false positives, too lax misses real corruption
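- a minimal sketch (my own example) of an out-of-band check comparing stored data against checksums recorded at write time:

  ```python
  import hashlib

  def checksum(payload: bytes) -> str:
      return hashlib.sha256(payload).hexdigest()

  def find_corrupted(store: dict, expected: dict) -> list:
      """Return keys whose current contents no longer match their recorded checksum."""
      return [key for key, digest in expected.items()
              if checksum(store.get(key, b"")) != digest]

  store = {"user/1": b"alice", "user/2": b"bob"}
  expected = {key: checksum(value) for key, value in store.items()}
  store["user/2"] = b"b0b"                # simulated silent corruption
  print(find_corrupted(store, expected))  # ['user/2']
  ```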