nathanl/observability.md

Observability

What Is Observability?

Based on my reading and listening, observability is the ability to answer a wide range of questions about a system's behavior based on previously captured data. Ideally, it lets you see how a system is performing for various use cases and users in real time, and watch how that changes as new code goes into production.
Observability folks like to talk about it as "testing in production", to which they add that everyone does this, like it or not, because only in production can we see the kinds of edge cases that happen with real data, real traffic, real network conditions, etc. Observability's goal is that when we test in production, we can get much more detailed information than "it works" or "it doesn't work", and thus find and fix problems much more easily.
For example, a user emails and says "doing X in the system is slow for me this morning." With poor observability, you might be able to look at the system's overall latency, or the overall CPU load of the servers, etc. This might indicate that overall, things are fine, which tells you nothing about why this one user is seeing slowness.
With good observability, you might be able to query for events specific to doing X, specific to the user or their company, and specific to the time period they mention. You could then notice which server they were interacting with, examine the overall response time and the response time of the database queries, and find that this particular user is hitting a performance edge case due to the shape of their data.
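To make the scenario above concrete, here is a minimal sketch of that kind of narrowing query over in-memory events. The event fields (`action`, `customer_id`, `server`, `duration_ms`, `db_ms`), the sample data, and the helper names are all assumptions made up for illustration; a real event store would expose its own query interface.

```python
from datetime import datetime
from statistics import mean

# Hypothetical captured events; every field name and value here is illustrative.
EVENTS = [
    {"time": datetime(2024, 1, 8, 9, 5), "action": "export_report",
     "customer_id": "acme", "server": "web-1", "duration_ms": 180, "db_ms": 120},
    {"time": datetime(2024, 1, 8, 9, 12), "action": "export_report",
     "customer_id": "globex", "server": "web-2", "duration_ms": 4200, "db_ms": 3900},
    {"time": datetime(2024, 1, 8, 9, 40), "action": "export_report",
     "customer_id": "globex", "server": "web-2", "duration_ms": 3800, "db_ms": 3500},
    {"time": datetime(2024, 1, 8, 9, 45), "action": "login",
     "customer_id": "globex", "server": "web-1", "duration_ms": 90, "db_ms": 20},
]


def find_events(events, action, customer_id, start, end):
    """Return events for one action and one customer within [start, end)."""
    return [
        e for e in events
        if e["action"] == action
        and e["customer_id"] == customer_id
        and start <= e["time"] < end
    ]


def summarize(events):
    """Summarize total time, database time, and the servers involved."""
    return {
        "count": len(events),
        "avg_duration_ms": mean(e["duration_ms"] for e in events),
        "avg_db_ms": mean(e["db_ms"] for e in events),
        "servers": sorted({e["server"] for e in events}),
    }


# "Doing X is slow for me this morning": narrow by action, customer, and time.
morning = find_events(
    EVENTS, "export_report", "globex",
    start=datetime(2024, 1, 8, 8, 0), end=datetime(2024, 1, 8, 12, 0),
)
summary = summarize(morning)
print(summary)
```

In this made-up data, most of the customer's response time is spent in the database, which is the kind of detail that overall latency graphs would hide.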
How Do We Get Observability?

Observability requires capturing a lot of data as a system is running, but it doesn't require capturing everything all the time. You decide what to capture based on your business goals. If keeping high-paying customers happy is the main goal, you might decide to capture details of every single request that produces an error, and every single request by your premium customers, but only a random sampling of requests for basic customers.
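The capture policy described above could be sketched roughly as follows. The `error` and `plan` fields, the `should_capture` function, and the 5% sample rate are assumptions for illustration, not values from any particular tool.

```python
import random

# Assumed sample rate for basic customers; tune to your own cost constraints.
BASIC_SAMPLE_RATE = 0.05


def should_capture(request, rng=random):
    """Decide whether to record full details for this request.

    `request` is a hypothetical dict with an `error` flag and a `plan` name.
    """
    if request["error"]:
        return True  # keep every request that produced an error
    if request["plan"] == "premium":
        return True  # keep every request from premium customers
    # Keep only a random sample of successful requests from basic customers.
    return rng.random() < BASIC_SAMPLE_RATE


# Usage: a seeded generator makes the sampling repeatable in this demo.
rng = random.Random(0)
basic_requests = [{"error": False, "plan": "basic"} for _ in range(10_000)]
kept = sum(should_capture(r, rng) for r in basic_requests)
print(f"kept {kept} of {len(basic_requests)} basic requests")
```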
You also have to decide which pieces of information to capture per event, based on the kinds of questions you think your business will want to ask. Here you have to make educated guesses. If you want to be able to query error rates by region of the world, you need to capture region as part of each measurement. If querying by the user's browser type or IP address range is less important, don't capture that. The trade-off here is that the more data you capture, the more kinds of questions you can ask in the future, but capturing more data is more expensive.
In a sense, this is like detailed logging, but with the understanding that there will be far too many logs for a human to read through them all, so the data needs to be structured for querying.
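As a rough illustration of "logging structured for querying", the sketch below writes one JSON object per line instead of a free-form message. The attribute names and the `build_event` and `emit` helpers are hypothetical; the point is only that each attribute becomes a named field a query can filter or group on.

```python
import io
import json
from datetime import datetime, timezone


def build_event(**attributes):
    """Build one event: a timestamp plus whatever attributes the caller chose."""
    event = {"timestamp": datetime.now(timezone.utc).isoformat()}
    event.update(attributes)
    return event


def emit(event, stream):
    """Write the event as a single JSON line so it can be parsed later."""
    stream.write(json.dumps(event, sort_keys=True) + "\n")


# Usage: an in-memory stream stands in for a log file or an event pipeline.
stream = io.StringIO()
emit(build_event(region="eu-west", plan="premium", endpoint="/reports",
                 duration_ms=212, error=False), stream)

# Reading it back gives a dict with named fields, not a string to grep.
parsed = json.loads(stream.getvalue().splitlines()[0])
print(parsed["region"], parsed["duration_ms"])
```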
Storage and Querying Requirements

The structure of the event data has implications for where you store it and how you query it.
Each event will have some number of attributes, e.g., region, customer type, customer ID, server, endpoint, etc. Some attributes, like plan type, will be "low cardinality", meaning that they have few possible values. Some attributes, like customer ID, will be "high cardinality", meaning that they have many possible values. Wherever you store the data, it needs to support querying by both low-cardinality and high-cardinality attributes.
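Cardinality is simply the number of distinct values an attribute takes. A small sketch, using made-up events and hypothetical helper names, shows the difference between the two kinds of attribute:

```python
from collections import Counter

# Illustrative events; the attribute names and values are assumptions.
EVENTS = [
    {"plan": "basic", "customer_id": "c-1001"},
    {"plan": "premium", "customer_id": "c-1002"},
    {"plan": "basic", "customer_id": "c-1003"},
    {"plan": "basic", "customer_id": "c-1001"},
]


def cardinality(events, attribute):
    """Count the distinct values seen for one attribute."""
    return len({e[attribute] for e in events if attribute in e})


def count_by(events, attribute):
    """Group events by one attribute and count each group."""
    return Counter(e[attribute] for e in events if attribute in e)


print(cardinality(EVENTS, "plan"))         # few distinct values
print(cardinality(EVENTS, "customer_id"))  # grows with the customer base
print(count_by(EVENTS, "plan"))
```

With real traffic the gap is far larger: a plan attribute stays at a handful of values, while customer ID grows with every new customer, so grouping by it produces a very large number of groups.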
In addition, you will likely change your mind about which attributes you need to capture as time passes. Maybe you'll add two attributes and take away one. So the data store needs to support a flexible schema, and allow querying older events which don't include the new attributes.
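One way to picture the flexible-schema requirement: a query that groups by a newly added attribute should still work over older events that never recorded it. The sketch below assumes a hypothetical `region` attribute added later and puts older events into an explicit placeholder group instead of failing.

```python
from collections import defaultdict

# Older events predate the hypothetical `region` attribute; newer ones have it.
EVENTS = [
    {"endpoint": "/reports", "error": False},
    {"endpoint": "/reports", "error": True},
    {"endpoint": "/reports", "error": True, "region": "eu-west"},
    {"endpoint": "/reports", "error": False, "region": "eu-west"},
    {"endpoint": "/reports", "error": False, "region": "us-east"},
]


def error_rate_by(events, attribute, missing="(not recorded)"):
    """Compute the error rate per attribute value, tolerating missing attributes."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(attribute, missing)  # old events fall into `missing`
        totals[key] += 1
        if event.get("error", False):
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


rates = error_rate_by(EVENTS, "region")
print(rates)
```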
Honeycomb

honeycomb.io is a service that supports storing and querying this kind of many-dimensional event data, and its founders' public blogs and podcast appearances are where I learned about these concepts.
For example:

The Changelog Podcast: "Observability is for your unknown unknowns" with Christine Yen
O11ycast #1: "Monitoring vs Observability" with Rachel Chalmers and Charity Majors
Charity Majors' blog