- don’t expect a tool to solve the problem
- it is a cultural change; need “believers” in senior roles to advocate within the company
- people need to absorb info within their own mindset
- it is a process that can span 6-9 months in orgs w/ 5000 engineers; nothing happens immediately
- Step 1: “I want to be reliable when I grow up” (you must first believe you have a problem)
- Step 2: “Read the book!” and watch “SRE vs. DevOps”
- Step 3: “Panic!” (myth: you have to fire the team; in fact the existing team can be retrained in house)
- Step 4: Start small, be patient, celebrate each step - spread the word
- developers are incentivized by rapid change (ship more)
- operators are incentivized by stability (limit change)
- research in cognitive psychology: human perception cannot tell the difference between three 9s and four 9s
- “art of an SLO” - figure out how much error budget we can spend while still keeping customers happy
- not about number of 9s (not badge of honor) “we need to be good enough and keep our customers happy”
- it may be more beneficial to invest engineering efforts in features vs. more reliability
- defining goals is fundamental collaboration between dev and ops together
- not everything needs to be measured (e.g. something 3 clicks deep in the left nav panel)
- Shared ownership - reduce org silos
- Error budgets - accept failure as normal (blame is not helpful)
- Reduce cost of failure - implement gradual changes
- Automate common case - leverage tooling and automation (automation helps if you have a decision tree, or processes that are not written down)
- Measure toil and reliability - measure everything* (measure all, but alert only on critical user journeys, CUJs)
- need to be very transparent in what we measure
- latency is difficult to measure end-to-end (from browser through load balancer to backend)
- initial discussion “how many 9s can we achieve” (wrong approach)
- correct: what does success look like to customer and their expectation
- customer research, review reported cases, measure actual experienced latency
- if lacking info: 1st goal should be current experience (current latency, availability)
- should be re-evaluated week 1, 4, 6 - first 6 weeks critical to nail down goals
- after 6 weeks, associate an alert with it (the SLO needs to be in place long enough first; alerting earlier creates too much noise)
- at least once/quarter re-evaluate; sit with all parties involved and decide whether to increase target
- SLOs have a life of their own; think about how to maintain over time (not set it and forget it)
- need to be meaningful to usage of customer; if customer use changes then need to change/evaluate
- avoid “collecting 9s as badge of honor”; starting with 3 9s is great (or even 2)
- SLA: our internal SLO exposed to the public
- legal consequences; refund is strongest use case
- if you have an SLA, create a stricter internal SLO so you catch violations earlier (creates an error budget buffer)
- should not publish SLA to customer with unreasonable number of 9s
- should not publish SLA on something you historically fail on
- be very careful how you think about it to not erode trust of customers
- consider: global and local SLAs (a CUJ for each region, e.g. NA, EMEA, APAC)
- error budget is like dev teams receiving an allowance for ice cream (spend it all in 1 day, or spread it out)
- if you don’t spend the budget, it’s taken away - and if you don’t spend it you miss an innovation opportunity (don’t hoard it)
- gives opportunity to take risk without angering others
- historically, if you don’t know your budget, you become risk averse
- Common incentives for DEVs and SREs - find right balance between innovation and reliability
- DEV teams can manage the risk themselves - they decide how to spend their error budget
- Unrealistic reliability goals are unattractive - dampen velocity of innovation
- DEV team becomes self-policing - error budget is valuable resource for them
- Shared responsibility for system uptime - infra failures eat into dev’s error budget
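
The error budget arithmetic behind the notes above can be sketched in a few lines. This is a minimal illustration, not something shown in the talk: the function names, the 30-day window, and the request-based accounting are all assumptions chosen for the example.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime the availability target tolerates over the window (assumed 30 days)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1, e.g. 0.999")
    return window_days * MINUTES_PER_DAY * (1.0 - target)


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once it is exhausted."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the target allows in this window.
    allowed_failures = total_requests * (1.0 - target)
    return 1.0 - failed_requests / allowed_failures


def sla_buffer_minutes(sla_target: float, slo_target: float, window_days: int = 30) -> float:
    """Extra downtime the public SLA allows beyond the stricter internal SLO."""
    if slo_target <= sla_target:
        raise ValueError("the internal SLO should be stricter than the SLA")
    return (allowed_downtime_minutes(sla_target, window_days)
            - allowed_downtime_minutes(slo_target, window_days))
```

Over a 30-day window, three 9s (0.999) allows about 43.2 minutes of downtime and four 9s (0.9999) about 4.32 minutes. An internal SLO of 0.9995 behind a 0.999 SLA leaves a buffer of about 21.6 minutes in which a violation is caught before the SLA is breached.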
Error budget agreement
What happens if we exhaust our budget? (agree up front, not when things on fire)
![Screen Shot 2021-05-27 at 9 14 12 AM](https://user-images.githubusercontent.com/5553105/119856132-b9d23180-becf-11eb-825a-5e4007757738.png)
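
One way to make such an agreement explicit is to write it down as data before anything is on fire. The sketch below is hypothetical: the thresholds, the actions, and the names `PolicyStep`, `POLICY`, and `agreed_action` are illustrative assumptions, not the policy from the talk or the screenshot.

```python
from typing import NamedTuple, Sequence


class PolicyStep(NamedTuple):
    min_budget_spent: float  # fraction of the error budget consumed (1.0 = exhausted)
    action: str              # what dev and SRE agreed to do from this point on


# Hypothetical thresholds and actions; a real agreement is negotiated by the teams.
POLICY = [
    PolicyStep(0.0, "ship normally"),
    PolicyStep(0.75, "review risky launches with SRE before shipping"),
    PolicyStep(1.0, "freeze feature releases; reliability fixes only"),
]


def agreed_action(budget_spent: float, policy: Sequence[PolicyStep] = POLICY) -> str:
    """Return the action for the highest threshold that budget_spent has reached."""
    if budget_spent < 0:
        raise ValueError("budget_spent cannot be negative")
    steps = sorted(policy)  # ascending by threshold
    action = steps[0].action
    for step in steps:
        if budget_spent >= step.min_budget_spent:
            action = step.action
    return action
```

Because the table is agreed in advance, the question "what do we do now?" has an answer that nobody has to argue about during an incident.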
SRE Customer Journey
SLO Setup Guide