Some notes on and excerpts from Google's SRE trifecta. These are just my jottings, so they may be incomplete, lacking in detail, and structured differently from the books.
- the original: https://sre.google/sre-book/table-of-contents/
- philosophy and the principles of production engineering and operations at Google
- the follow-up: https://sre.google/workbook/table-of-contents/
- companion to the first SRE book to address questions, requests, and comments
- frequent question: how to put these principles into practice in my team/company?
- the security add-on: https://google.github.io/building-secure-and-reliable-systems/raw/toc.html
- focus on the security aspect of reliability engineering
- https://sre.google/sre-book/introduction/
- comparison of Google's SREs vs traditional sysadmins
- have software engineers do operations
- results:
- SREs would be bored by doing tasks by hand
- have the skill set necessary to automate tasks
- do the same work as an operations team, but with automation instead of manual labor
- to avoid the manual-labor trap that causes team size to scale with service load, Google caps "ops" work for SREs at 50% of their time (an upper bound; the actual amount of ops work is expected to be much lower)
- Pros
- cheaper to scale
- circumvents Devs/Ops split
- Cons
- hard to hire for
- may be unorthodox in ways that require management support (e.g., the product team may push back against the decision to stop releases for the quarter because the error budget is depleted)
- https://sre.google/sre-book/introduction/
- ~70% of outages are due to changes in a live system. Mitigations:
- Implement progressive rollouts
- Monitoring
- Rollback
- Remove humans from the loop to avoid the usual human errors on repetitive tasks
- https://sre.google/sre-book/embracing-risk/
- 100% availability/reliability is the wrong goal
- chasing that last sliver of reliability isn't noticeable to most users but requires tremendous effort and cost
- users mostly can't tell difference between 99.99% and 99.999% reliability
- goal of SRE team isn't "zero outages"
- goal: make systems reliable enough, but not too reliable!
- the goal should acknowledge the trade-off and leave an error budget (see the sketch below)
- error budget can be spent on anything: launching features, experiments, etc.
- SREs and product devs are encouraged to spend the error budget to get maximum feature velocity
- if constantly under error budget: perhaps you are not moving fast enough
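- a minimal sketch (my own illustrative numbers, not from the book) of how an SLO turns into an error budget:

  ```python
  # Hypothetical quarterly numbers for an availability SLO of 99.9%.
  slo = 0.999                          # availability target
  error_budget = 1 - slo               # fraction of requests allowed to fail

  quarterly_requests = 2_500_000_000   # assumed traffic for the quarter
  allowed_failures = quarterly_requests * error_budget

  failed_so_far = 1_200_000            # assumed failures observed to date
  budget_remaining = 1 - failed_so_far / allowed_failures

  print(f"allowed failures this quarter: {allowed_failures:,.0f}")  # 2,500,000
  print(f"error budget remaining: {budget_remaining:.0%}")          # 52%
  ```

- while budget remains, launches and experiments can proceed; once it is spent, releases pause until reliability work (or the next quarter) restores headroom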
- set quarterly reliability targets
- work with product owners to translate business objectives into explicit SLOs
- take the service down deliberately to signal that it is not 100% reliable (so dependent teams don't get used to over-availability)
- different indicators for different kinds of systems (user-facing, storage, big data)
- how many indicators to monitor:
- too many: hard to model and pay attention to
- too few: could ignore important behavior
- studies show that users usually prefer a slightly slower average with less variance, i.e., better tail latency (see the percentile sketch after this list)
- how to choose targets?
- avoid absolutes: "infinite" scale or "always" available are not feasible targets
- keep it simple, perfection can wait, SLOs can be redefined later
- keep a safety margin: the internal SLO should be stricter than the externally advertised SLO/SLA, so alarms fire on missing the internal target before the external commitment is violated
- don't overachieve
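- a minimal sketch (synthetic data and an illustrative target, not from the book) of measuring a tail-latency SLI against an SLO:

  ```python
  import random

  random.seed(0)
  # Synthetic request latencies in milliseconds for one measurement window.
  latencies_ms = [random.lognormvariate(4.0, 0.5) for _ in range(10_000)]

  def percentile(samples, pct):
      """Nearest-rank percentile of a list of samples."""
      ordered = sorted(samples)
      rank = max(0, int(len(ordered) * pct / 100) - 1)
      return ordered[rank]

  slo_target_ms = 300   # assumed target: 99% of requests complete within 300 ms
  p99 = percentile(latencies_ms, 99)
  print(f"p99 = {p99:.0f} ms -> SLO {'met' if p99 <= slo_target_ms else 'missed'}")
  ```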
- definition of 'toil':
- manual
- repetitive
- automatable
- tactical
- no enduring value
- toil above 50% for an individual is a sign that the manager should spread the toil load more evenly
- example: capacity planning tool
- transition from implementation-based to intent-based
- instead of the implementation ("I want 50 cores in clusters X, Y, and Z"), specify the requirement ("I want to meet demand with N+2 redundancy"); see the sketch below
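- a small sketch to make the contrast concrete (the spec format and field names are made up, not Google's actual tool):

  ```python
  # Implementation-based: the requester hard-codes the "how".
  implementation_spec = {"cluster_x": 50, "cluster_y": 50, "cluster_z": 50}  # cores

  # Intent-based: the requester states the "what"; a solver decides placement
  # and can re-plan when clusters, demand, or failure domains change.
  intent_spec = {
      "service": "example-frontend",   # hypothetical service
      "peak_demand_qps": 5000,
      "redundancy": "N+2",             # survive two concurrent cluster outages
      "latency_goal_ms": 50,           # serve users within 50 ms
  }
  ```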
- https://sre.google/sre-book/monitoring-distributed-systems/
- alerts: human needs to take action immediately
- tickets: human needs to take action eventually
- logs: no action needed
- TODO: when to use metrics and when to use logs?
- monitoring is non-trivial and requires ongoing work/time/effort
- trend towards simpler/faster monitoring systems, avoid "magic" systems
- rules that generate alerts for humans should be clear and easy to understand (see the small triage sketch below)
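- a minimal sketch (my own example) of the alert/ticket/log taxonomy as a rule simple enough to reason about:

  ```python
  def triage(needs_human: bool, urgent: bool) -> str:
      """Classify a monitoring signal by the response it should trigger."""
      if needs_human and urgent:
          return "page"    # alert: a human must take action immediately
      if needs_human:
          return "ticket"  # a human must take action eventually
      return "log"         # recorded for diagnostics; no action needed

  print(triage(needs_human=True, urgent=True))    # page
  print(triage(needs_human=True, urgent=False))   # ticket
  print(triage(needs_human=False, urgent=False))  # log
  ```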
- https://sre.google/sre-book/emergency-response/
- reliability is a function of MTTF (mean time to failure) and MTTR (mean time to recovery); see the quick arithmetic below
- focus on MTTR for evaluating responses
- having a playbook at hand resulted in roughly 3x lower MTTR compared to improvising a response
- fostering a culture of documentation
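- quick arithmetic (my own illustrative numbers) for why MTTR matters: steady-state availability ≈ MTTF / (MTTF + MTTR):

  ```python
  def availability(mttf_hours: float, mttr_hours: float) -> float:
      return mttf_hours / (mttf_hours + mttr_hours)

  # Hypothetical service failing on average every 1000 hours:
  print(f"{availability(1000, 1.0):.4%}")  # 99.9001% with a 1-hour recovery
  print(f"{availability(1000, 0.5):.4%}")  # 99.9500% -- halving MTTR halves downtime
  ```

- shaving time off recovery (e.g., via playbooks) therefore translates directly into availability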
- https://sre.google/workbook/incident-response/#putting-best-practices-into-practice
- chapter outlines Incident Command System (ICS), a standardized approach to emergency response
- managed incidents resolved faster with clear roles such as Incident Commander, Operations Lead, Communications Lead, and Scribe
- defining severity levels to categorize incidents and determine appropriate response
- prioritizing stopping the impact first, then root-cause analysis (unless the root cause is identified early on)
- SREs gathering in a "war room" for incidents
- clear and timely communication using standardized templates for updates
- generic mitigations as actions to alleviate pain before full root cause understanding
- examples:
- rolling back a recent release correlated with an outage
- reconfiguring load balancers to avoid a problematic region
- caution: mitigations are blunt instruments and can themselves cause service disruptions
- creation of general-purpose mitigation tools before incidents occur
- further preparation practices:
- conducting blameless postmortems (see below) to learn from incidents
- regular training on incident response procedures and simulated incidents (game days)
- developing tools for automating common response tasks to reduce human error
- browsing postmortems to discover useful mitigations and tools for future incident management
- learning from previous incidents (postmortem docs) for better future major incident response
- using insights from incidents to drive system and process improvements
- encouraging a culture of continuous learning and improvement
NOTE: below, the term "postmortem" is used both for the postmortem document/report and for the postmortem review meeting.
- postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
- primary goals:
- ensure that the incident is documented
- all contributing root cause(s) are well understood
- effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
- reader should get a complete view of the outage and, more importantly, learn something new
- introducing postmortems into an organization is as much a cultural change as it is a technical one
- senior leadership should actively support and promote the postmortem culture (visible rewards for doing the right thing)
- start simple and basic, adapt to your organization, there's no one size fits all
- tools to help bootstrap the process: a standardized postmortem template, checklists, some automation (to populate postmortems with metadata)
- if value of postmortems is questioned in relation to the cost of their preparation:
- Ease postmortems into the workflow. A trial period with several complete and successful postmortems may help prove their value, in addition to helping to identify what criteria should initiate a postmortem.
- Make sure that writing effective postmortems is a rewarded and celebrated practice, both publicly through the social methods mentioned earlier, and through individual and team performance management.
- Encourage senior leadership's acknowledgment and participation.
- postmortem doc best practices:
- write postmortems for the whole company as the audience, share and announce proactively
- involve all incident participants in authoring the doc so that no contributing factors are overlooked
- add context (introduce terminology or link to glossary pages) to make the document accessible and comprehensible to a broader audience
- otherwise postmortems might be misunderstood or even ignored
- add measurable data, such as number of failed requests, cache hit ratio, duration of the incident
- when you can't measure failure, you can't know that it is fixed
- even rough estimates are better than nothing to show the extent of the outage
- conclusions should be based on facts and data linked in the doc
- add trigger and root cause; some of the most important sections of a postmortem doc
- avoid finger pointing
- do not mention individuals or teams in the postmortem in a blaming way
- otherwise team members will become risk averse or try to hide mistakes
- avoid animated language
- subjective and dramatic expression distracts from key message and erodes psychological safety
- instead: provide verifiable data to justify severity of statements
- derive concrete action items and prioritize them, group them if there are too many, make sure at least some are prevention- or mitigation-focused
- declare ownership of postmortem doc and action items
- this promotes accountability and encourages following up on and completing the action items
- add machine-readable metadata, like tags, to enable later analytics (see the sketch after this list)
- keep the doc concise to balance detail with readability (move long log excerpts into a separate, linked doc)
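- a small sketch of machine-readable postmortem metadata (field names and values are my own, not a Google template):

  ```python
  # Hypothetical tags so that later questions ("how many incidents were
  # config-change related this year?") become a query, not a manual read-through.
  postmortem_metadata = {
      "incident_id": "2024-03-17-checkout-outage",  # made-up identifier
      "severity": "SEV-2",
      "duration_minutes": 83,
      "detection": "monitoring-alert",              # vs. "customer-report"
      "root_cause_tags": ["config-change", "missing-canary"],
      "action_items_open": 4,
  }
  ```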
- review every postmortem to ensure lessons are learned and improvements are made
- conduct regular review sessions to finalize postmortems and foster knowledge sharing and collaboration
- sign that culture is decaying: disengagement from postmortem process ("glad that I don't have to write that postmortem now")
- when blame sneaks into the postmortem process:
- moving the narrative back in a more constructive direction
- focus on investigating the root cause (also process-related) instead of assigning blame
- lacking time to write postmortems:
- when the team is occupied with other tasks, it is hard to keep up the quality of postmortems
- try to find out why there is not enough time (not a priority?)
- repeating similar incidents:
- implementing action items takes too long?
- perhaps feature velocity is considered more important
- are issues always just mitigated ad hoc instead of being properly addressed?
- view postmortems as opportunities to fix weaknesses and enhance overall resilience, not just as formalities
- ensure confidence in escalating issues by keeping the postmortem process blameless
- rewards for working on postmortem docs and action items:
- give owners opportunity to present lessons learned
- peer bonus
- positive performance reviews
- even promotion
- intrinsic motivation to live postmortem culture:
- over time fewer incidents, more focus time for feature velocity
- gamification: leaderboards showing 'scores' for closing action items
- Wheel of Misfortune: training new on-call engineers on simulations of real incidents
- regularly ask for feedback on effectiveness of the postmortem process and seek ways to improve it
- ask questions such as:
- is the culture supporting your work?
- does writing a postmortem entail too much toil?
- what best practices does your team recommend for other teams?
- what kinds of tools would you like to see developed?
- tools are no silver bullet but can make the process smoother and free up time
- Escalator / Outalator principles (Google-internal tools for escalating unacknowledged alerts and for aggregating and annotating outage notifications)
- load testing is crucial in a realistic environment to anticipate traffic spikes
- fail quickly and with minimal resource usage when overloaded to prevent exacerbating the situation
- implement request rejection at various system levels (reverse proxy, load balancer, task) to prevent overloading downstream services
- conduct regular capacity planning to ensure readiness for traffic increases
- serve degraded results to maintain service availability under stress
- regular testing of the degradation path is essential, potentially by overloading a subset of servers to ensure reliability
- maintain simplicity in degradation mechanisms for predictability and ease of understanding
- use retries with randomized exponential backoff to avoid synchronized retry storms (see the sketch after this list)
- set thoughtful deadlines based on system load and failure scenarios rather than arbitrary round numbers to prevent the accumulation of "zombie" requests
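- a minimal sketch (illustrative parameters) of retries with full-jitter exponential backoff bounded by an overall deadline:

  ```python
  import random
  import time

  def call_with_retries(do_request, deadline_s=5.0, base_s=0.1, cap_s=2.0):
      """Retry a flaky call, sleeping a random ("jittered") amount between attempts."""
      start = time.monotonic()
      attempt = 0
      while True:
          try:
              return do_request()
          except Exception:
              attempt += 1
              # Full jitter: a random sleep up to the capped exponential bound,
              # so many clients don't retry in lockstep after a shared failure.
              backoff = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
              if time.monotonic() + backoff - start > deadline_s:
                  raise  # respect the deadline instead of becoming a "zombie" request
              time.sleep(backoff)
  ```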
- data integrity is more critical than availability: an outage eventually ends, but lost or corrupted data can be unrecoverable
- backups are complex due to transactional diversity and versioning
- replicas are not a sufficient backup because corrupted or deleted data is faithfully synced to them
- soft deletion:
- protects against unintended data loss by delaying permanent deletion
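- a minimal sketch of the idea (schema and retention period are my own assumptions):

  ```python
  import datetime

  PURGE_AFTER = datetime.timedelta(days=30)  # assumed grace period before real deletion

  def soft_delete(record: dict) -> None:
      """Mark a record as deleted instead of destroying it immediately."""
      record["deleted_at"] = datetime.datetime.now(datetime.timezone.utc)

  def purge_expired(records: list[dict]) -> list[dict]:
      """Permanently drop only records whose grace period has passed."""
      now = datetime.datetime.now(datetime.timezone.utc)
      return [r for r in records
              if r.get("deleted_at") is None or now - r["deleted_at"] < PURGE_AFTER]
  ```

- anything "deleted" by a buggy pipeline or a compromised account can still be restored within the window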
- backups:
- recovery objectives dictate data loss tolerance, recovery time, and backup duration
- Google typically maintains a 30 to 90-day backup window
- early detection:
- integrity checks are vital but hard to tune: too strict produces false positives, too lax misses real corruption
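- a minimal sketch (my own example) of an out-of-band check comparing stored data against checksums recorded at write time:

  ```python
  import hashlib

  def checksum(payload: bytes) -> str:
      return hashlib.sha256(payload).hexdigest()

  def find_corrupted(store: dict, expected: dict) -> list:
      """Return keys whose current contents no longer match their recorded checksum."""
      return [key for key, digest in expected.items()
              if checksum(store.get(key, b"")) != digest]

  store = {"user/1": b"alice", "user/2": b"bob"}
  expected = {key: checksum(value) for key, value in store.items()}
  store["user/2"] = b"b0b"                # simulated silent corruption
  print(find_corrupted(store, expected))  # ['user/2']
  ```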