Observability

Monitoring

Some thoughts on monitoring.

Source Documents:

Organization Metrics:

From DevOps Research and Assessment (DORA) and "Accelerate: The Science of Lean Software and DevOps":

  • Deployment Frequency
  • Lead Time for Changes
  • MTTR - Mean Time To Restore
  • Change Failure Rate

The thorough State of DevOps reports have focused on data-driven and statistical analysis of high-performing organizations. The result of this multiyear research, published in Accelerate, demonstrates a direct link between organizational performance and software delivery performance. The researchers have determined that only four key metrics differentiate between low, medium and high performers: lead time, deployment frequency, mean time to restore (MTTR) and change fail percentage. Indeed, we've found that these four key metrics are a simple and yet powerful tool to help leaders and teams focus on measuring and improving what matters. A good place to start is to instrument the build pipelines so you can capture the four key metrics and make the software delivery value stream visible.

Team Metrics:

  • Story Counting = Number of items in the Backlog
  • Lead Time = Average Time From Backlog to Done
  • Cycle Time = Average Time From Started to Done
  • Wait/Queue Time = Lead Time - Cycle Time
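
To make the arithmetic concrete, here is a minimal sketch of computing these team metrics from work-item timestamps. The item fields (created, started, done) are hypothetical and would map to whatever your tracker exports.

```python
from datetime import datetime

# Hypothetical work items exported from an issue tracker: each has the timestamps
# when it entered the backlog, was started, and was finished.
items = [
    {"created": datetime(2023, 3, 1), "started": datetime(2023, 3, 6), "done": datetime(2023, 3, 9)},
    {"created": datetime(2023, 3, 2), "started": datetime(2023, 3, 10), "done": datetime(2023, 3, 14)},
]

def avg_days(deltas):
    """Average a list of timedeltas, expressed in days."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 86400

lead_time = avg_days([i["done"] - i["created"] for i in items])    # backlog -> done
cycle_time = avg_days([i["done"] - i["started"] for i in items])   # started -> done
wait_time = lead_time - cycle_time                                 # time spent queued

print(f"Lead time:  {lead_time:.1f} days")
print(f"Cycle time: {cycle_time:.1f} days")
print(f"Wait time:  {wait_time:.1f} days")
```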

SLIs drive SLOs, which inform SLAs.

A Service Level Indicator (SLI) is a metric-derived measure of health for a service. For example, I could have an SLI that says the 95th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.

A Service Level Objective (SLO) is a goal or target for an SLI. We take an SLI, and extend its scope to quantify how we expect our service to perform over a strategic time interval. Using the SLI from the previous example, we could say that we want to meet the criteria set by that SLI for 99.9% of a trailing year window.
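
As a rough sketch of how the SLI and SLO relate in practice, the following computes the 95th-percentile-latency SLI for one 5-minute window and then expresses the SLO as the fraction of windows that met it over a trailing year. The latency samples and window counts are made up for illustration.

```python
import math

# Hypothetical latencies (in milliseconds) for homepage requests over the last 5 minutes.
latencies_ms = [120, 95, 180, 240, 310, 150, 90, 200, 280, 170]

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that pct% of samples are <= it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

SLI_TARGET_MS = 300                       # the 300 ms threshold from the SLI above
p95 = percentile(latencies_ms, 95)
window_ok = p95 < SLI_TARGET_MS           # did this 5-minute window meet the SLI?

# The SLO then asks: over a trailing year, what fraction of 5-minute windows met the SLI?
good_windows, total_windows = 104_950, 105_120   # hypothetical counts (~1 year of 5-minute windows)
slo_attainment = good_windows / total_windows

print(f"p95 = {p95} ms, window ok: {window_ok}")
print(f"SLO attainment: {slo_attainment:.4%} (target 99.9%)")
```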

A Service Level Agreement (SLA) is an agreement between a business and a customer, defining the consequences for failing to meet an SLO. Generally, the SLOs your SLA is based upon will be more relaxed than your internal SLOs, because we want our internal-facing targets to be stricter than our external-facing targets.

We also recommend watching the YouTube video "SLIs, SLOs, SLAs, oh my!" from Seth Vargo and Liz Fong-Jones to get an in-depth understanding of the difference between SLIs, SLOs, and SLAs.

SLIs are RED

What SLIs best quantify host and service health? Over the past several years, there have been a number of emerging standards. The top standards are the USE method, the RED method, and the “four golden signals” discussed in the Google SRE book.

USE = utilization, saturation, errors
  • Utilization: the average time that the resource was busy servicing work
  • Saturation: the degree to which the resource has extra work which it can't service, often queued
  • Errors: the count of error events

This disambiguates utilization and saturation, making it clear that utilization is "busy time %" and saturation is “backlog.” These terms are very different from things a person might confuse with them, such as “disk utilization” as an expression of how much disk space is left.
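
Below is a small sketch of how USE might be reported for a hypothetical worker pool, where utilization is busy-time percentage, saturation is queue depth, and errors is an event count. The structure and numbers are assumptions, not tied to any particular system.

```python
from dataclasses import dataclass

@dataclass
class WorkerPoolStats:
    """Hypothetical counters sampled from a resource (e.g. a worker pool) over an interval."""
    interval_s: float      # length of the sampling interval in seconds
    busy_s: float          # total time workers spent servicing work
    workers: int           # number of workers in the pool
    queued: int            # requests waiting because no worker was free
    errors: int            # error events observed during the interval

def use_report(s: WorkerPoolStats) -> dict:
    return {
        # Utilization: fraction of available worker-time that was busy ("busy time %").
        "utilization": s.busy_s / (s.interval_s * s.workers),
        # Saturation: extra work the resource could not service immediately (the backlog).
        "saturation": s.queued,
        # Errors: count of error events in the interval.
        "errors": s.errors,
    }

print(use_report(WorkerPoolStats(interval_s=60, busy_s=432, workers=8, queued=12, errors=3)))
```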

RED = Rate, Errors, and Duration

Tom Wilkie introduced the RED method a few years ago. With RED we monitor request rate, request errors, and request duration. The Google SRE book talks about using latency, traffic, errors, and saturation metrics. These "four golden signals" are targeted at service health and are similar to the RED method, but extend it with saturation. In practice, it can be difficult to quantify service saturation.

Again, the RED Method defines the three key metrics you should measure for every microservice in your architecture as:

  • (Request) Rate - the number of requests, per second, your services are serving.
  • (Request) Errors - the number of failed requests per second.
  • (Request) Duration - distributions of the amount of time each request takes.
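
As an illustration, here is a minimal sketch of RED instrumentation using the Python prometheus_client library; the metric names and the handle_request function are hypothetical, not part of any particular service. Rate and errors both fall out of a single counter labelled by status, and duration comes from a latency histogram; a query layer such as PromQL can then derive per-second rates from the counter.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors: one counter labelled by path and status code.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
# Duration: a histogram of request latencies in seconds.
DURATION = Histogram("http_request_duration_seconds", "HTTP request latency", ["path"])

def handle_request(path: str) -> None:
    """Hypothetical request handler instrumented for RED."""
    start = time.monotonic()
    status = "500"
    try:
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
        status = "200"
    finally:
        REQUESTS.labels(path=path, status=status).inc()                  # Rate and Errors
        DURATION.labels(path=path).observe(time.monotonic() - start)     # Duration

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a scraper such as Prometheus
    while True:
        handle_request("/homepage")
```
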
USE and RED Together?

What may not be obvious is that USE and RED are complementary to one another. The USE method is an internal, service-centric view. The system or service’s workload is assumed, and USE directs attention to the resources that handle the workload. The goal is to understand how these resources are behaving in the presence of the load.

The RED method, on the other hand, is about the workload itself, and treats the service as a black box. It’s an externally-visible view of the behavior of the workload as serviced by the resources. Here workload is defined as a population of requests over a period of time. It is important to measure the workload, since the system’s raison d’etre is to do useful work.

Taken together, RED and USE comprise minimally complete, maximally useful observability—a way to understand both aspects of a system: its users/customers and the work they request, as well as its resources/components and how they react to the workload. (I include users in the system. Users aren’t separate from the system; they’re an inextricable part of it.)

  • U = Utilization, as canonically defined
  • S = Saturation - Measure Concurrency
  • E = Error Rate, as a throughput metric
  • R = Rate - Request Throughput, in requests per second
  • E = Error - Request Error Rate, as either a throughput metric or a fraction of overall throughput
  • D = Duration - Request Latency, Residence Time, or Response Time; all three are widely used

SLOs

Once we define all the indicators and collect metrics for them, we then need to decide what is good and what is bad. This takes two steps: baseline the metrics, then decide what is acceptable for each metric and where the acceptable range ends.

With the numeric definition of the acceptable ranges we define Service Level Objectives (SLOs). You can read more about SLOs on Wikipedia. Examples of SLOs are that a service should have 99.9% availability over a year, or that the 95th percentile of latency for responses should be below 300ms over the course of a month. It's always better to keep some buffer between the announced SLO and the point where things start going really badly.
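
The arithmetic behind such an SLO is worth spelling out: the error budget is simply the fraction of the window that the SLO allows to be bad. A quick worked example:

```python
# Error budget implied by an availability SLO: the fraction of the window
# during which the service is allowed to be unavailable (or out of SLI).
def error_budget(slo: float, window_hours: float) -> float:
    """Hours of allowed bad time for a given SLO over a given window."""
    return (1.0 - slo) * window_hours

print(f"99.9% over a year  -> {error_budget(0.999, 365 * 24):.2f} hours of budget")   # ~8.76
print(f"99.9% over 30 days -> {error_budget(0.999, 30 * 24):.2f} hours of budget")    # ~0.72
print(f"99.99% over a year -> {error_budget(0.9999, 365 * 24):.2f} hours of budget")  # ~0.88
```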

SLAs

An SLA is a two-way agreement: clients of our service agree on conditions for using it, and we promise them that under those conditions the service will perform within certain boundaries.

Clients of the service want to know what they can expect from it in terms of performance and availability: how many requests per second it can process, the length of expected downtime during maintenance, or how long it takes on average to process a request. Usually, the performance and availability of a service can be expressed with very few parameters, and in most cases the same list applies to other services as well.

If the request rate grows beyond the agreed level, we can start throttling requests or refusing to serve them (first communicating this action to the client, of course). If latency grows beyond the declared limits and there is no significant increase in the request rate, then we know that something is wrong on our side and it's time for us to begin troubleshooting.
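
As a sketch of the throttling case, a simple token bucket is one way to enforce the agreed request rate; the 100 requests/second figure and the serve function are assumptions for illustration.

```python
import time

class TokenBucket:
    """Minimal token bucket: allow up to `rate` requests/second with a burst of `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over the agreed rate: throttle or reject this request

# Hypothetical SLA term: the client agreed to at most 100 requests per second.
bucket = TokenBucket(rate=100, capacity=100)

def serve(request):
    if not bucket.allow():
        return 429, "rate limit exceeded"   # throttled, as communicated to the client
    return 200, "ok"
```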

Divide and Conquer strategy

When a product is a network of interconnected services with a rich collection of external dependencies, it’s really hard to identify bottlenecks or unhealthy services. A single model for the whole system is too complex to define and understand.

A good strategy here is to divide and conquer. Monitoring can be simplified dramatically by focusing on every service separately and tracking how others use it and how it uses the services it depends on. This can be accomplished by following three simple rules:

  1. Every service should have its own Service Level Agreement (SLA).
  2. Every instance of every service monitors how others use it and how it responds.
  3. Every instance of every service monitors how it uses other services and how they respond.

Of course every instance of every service has a health check and produces metrics about its internal state to ease troubleshooting. In other words, every instance of every service is a white box for owners of the service but it’s a black box for everyone else.
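
Here is a minimal sketch of such a health check, using only the Python standard library; the /healthz path and the dependency checks are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies() -> dict:
    """Hypothetical internal checks (database reachable, queue depth sane, etc.)."""
    return {"database": "ok", "queue_depth": 3}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        body = json.dumps({"status": "ok", "checks": check_dependencies()}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```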

Alerts

Adapted from Rob Ewaschuk's chapter in Google's Site Reliability Engineering

The underlying point is to create a system that still has accountability for responsiveness, but doesn't have the high cost of waking someone up.

Summary

When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:

  • Pages should be urgent, important, actionable, and real.
  • They should represent either ongoing or imminent problems with your service.
  • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
  • You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems.
  • Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
  • Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
  • The further up your serving stack you go, the more distinct problems you catch in a single rule. But don't go so far you can't sufficiently distinguish what's going on.
  • If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical.

Playbooks

Playbooks (or runbooks) are an important part of an alerting system; it's best to have an entry for each alert or family of alerts that catch a symptom, which can further explain what the alert means and how it might be addressed. The best playbooks I've seen have a few notes about exactly what the alert means, and what's currently interesting about an alert ("We've had a spate of power outages from our widgets from VendorX; if you find this, please add it to Bug 12345 where we're tracking things for patterns."). Most such notes should be ephemeral, so a wiki or similar is a great tool.

Matthew Skelton & Rob Thatcher have an excellent run book template. This template will help teams to fully consider most aspects of reliably operating most interesting software systems, if only to confirm that "this section definitely does not apply here" - a valuable realization.

Tracking & Accountability

Track your pages, and all your other alerts. If a page is firing and people just say "I looked, nothing was wrong", that's a pretty strong sign that you need to remove the paging rule, or demote it or collect data in some other way. Alerts that are less than 50% accurate are broken; even those that are false positives 10% of the time merit more consideration.
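
The accuracy test is just precision over your tracked pages; a toy computation, with made-up page records:

```python
# Hypothetical page log from a weekly review: True = actionable, False = "looked, nothing was wrong".
pages = [True, False, True, False, False, True, False, True, False, False]

accuracy = sum(pages) / len(pages)
print(f"Page accuracy: {accuracy:.0%}")   # 40% here: below 50%, so this paging rule needs rework
```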

Having a system in place (e.g. a weekly review of all pages, and quarterly statistics) can help keep a handle on the big picture of what's going on, and tease out patterns that are lost when the pager is handed from one human to the next.
