@brennovich
Last active April 25, 2019 09:58
Key points of the [Embracing Risk chapter of Google's SRE book](https://landing.google.com/sre/sre-book/chapters/embracing-risk/)

We conceptualize risk as a continuum

Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.
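
To make the "same absolute number of errors" point concrete, here is a small arithmetic sketch. The traffic level, the 30-day window, and the function names are hypothetical, picked only so the numbers come out evenly; they are not from the book.

```python
# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
REQUESTS_PER_SECOND = 200
WINDOW_SECONDS = 30 * 24 * 60 * 60  # a 30-day window


def errors_from_constant_rate(failures_per_million: int) -> int:
    """Failed requests when a fixed fraction of traffic fails all window long."""
    total_requests = REQUESTS_PER_SECOND * WINDOW_SECONDS
    return total_requests * failures_per_million // 1_000_000


def errors_from_full_outage(outage_seconds: int) -> int:
    """Failed requests when every request fails for the outage duration."""
    return REQUESTS_PER_SECOND * outage_seconds


steady = errors_from_constant_rate(1_000)  # 0.1% of requests fail, all month
outage = errors_from_full_outage(2_592)    # one full outage of 43.2 minutes
print(steady, outage)
```

Both profiles fail the same number of requests, yet users would likely experience them very differently, which is why the raw error count alone does not settle which one is worse.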


Software fault tolerance

How hardened do we make the software to unexpected events? Too little, and we have a brittle, unusable product. Too much, and we have a product no one wants to use (but that runs very stably).

Testing

Again, not enough testing and you have embarrassing outages, privacy data leaks, or a number of other press-worthy events. Too much testing, and you might lose your market.

Push frequency

Every push is risky. How much should we work on reducing that risk, versus doing other work?

Canary duration and size

It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?


Usually, preexisting teams have worked out some kind of informal balance between them as to where the risk/effort boundary lies. Unfortunately, one can rarely prove that this balance is optimal, rather than just a function of the negotiating skills of the engineers involved. Nor should such decisions be driven by politics, fear, or hope.

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
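
As a rough sketch of how such a metric can be computed, assume the budget is simply one minus the SLO, applied to the requests served in the quarter. The function names and the traffic figure below are illustrative assumptions, not something these notes specify.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests allowed to fail in the window, assuming budget = (1 - SLO) * total."""
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Budget left after subtracting the failures already observed."""
    return error_budget(slo, total_requests) - failed_requests


# Hypothetical quarter: 1,000,000 requests against a 99.9% success SLO.
allowed = error_budget(0.999, 1_000_000)
left = budget_remaining(0.999, 1_000_000, failed_requests=400)
print(allowed, left)
```

With these assumed inputs the service may fail 1,000 requests in the quarter, and 600 of those are still unspent after 400 failures.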

What happens if a network outage or datacenter failure reduces the measured SLO? Such events also eat into the error budget. As a result, the number of new pushes may be reduced for the remainder of the quarter. The entire team supports this reduction because everyone shares the responsibility for uptime.
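
One way to picture this is a budget that every failure draws from, whatever its cause, with pushes gated on what is left. The `QuarterlyBudget` class and its numbers are a hypothetical sketch, not a policy taken from the book.

```python
from dataclasses import dataclass


@dataclass
class QuarterlyBudget:
    """Hypothetical tracker: all failures, whatever the cause, spend the budget."""

    allowed_failures: int
    consumed: int = 0

    def record(self, failures: int) -> None:
        """Charge a batch of failed requests against the budget."""
        self.consumed += failures

    @property
    def remaining(self) -> int:
        return self.allowed_failures - self.consumed

    def can_push(self) -> bool:
        """Allow new pushes only while some budget is left."""
        return self.remaining > 0


budget = QuarterlyBudget(allowed_failures=1_000)
budget.record(300)  # failures caused by a bad release
push_ok_before = budget.can_push()
budget.record(800)  # failures caused by a datacenter outage
push_ok_after = budget.can_push()
print(push_ok_before, push_ok_after, budget.remaining)
```

In this sketch the datacenter outage alone overspends the budget, so pushes stop even though the release itself stayed well within it.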

