
@spencerwilson
Created September 7, 2022 07:34
otel_balancers_and_limiters

Two kinds of samplers

Consider two broad goals in sampling:

  • Avoid {over,under}sampling of data
  • Ensure that overall data collection doesn't exceed acceptable limits

These can be treated separately. Doing so may yield a simpler and more flexible conceptual foundation for sampling.

Balancers

Define a balancer to be a sampler that does the following: For each input trace,

  1. Assign a "frequency" score to the trace.
  2. Sample the trace with a probability that's inversely related to the frequency score.

One implementation of (1) is to partition input traces into strata, compute from historical data the relative frequency of each stratum among the input traces, and assign each trace a frequency score equal to the relative frequency of the stratum to which it belongs. For example, if a trace comes in and belongs to the "route = /health" stratum, and that stratum constitutes 10% of recent traces, then any trace belonging to that stratum has frequency score 0.1. Combined with step (2), this scoring downsamples higher-volume strata more aggressively, pulling their throughput down closer to that of the minimal-volume stratum and reducing the dynamic range of strata throughputs.
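
A minimal sketch of step (1), not a prescribed implementation; the "route" stratum key and the window size are assumptions for the example:

    from collections import Counter, deque

    class RelativeFrequencyScorer:
        """Scores each trace with the recent relative frequency of its stratum."""

        def __init__(self, window_size=10_000):
            self.window = deque(maxlen=window_size)  # recent stratum keys, oldest first
            self.counts = Counter()                  # stratum key -> count within the window

        def score(self, stratum_key):
            # Account for the key the bounded deque is about to evict.
            if len(self.window) == self.window.maxlen:
                self.counts[self.window[0]] -= 1
            self.window.append(stratum_key)
            self.counts[stratum_key] += 1
            # Relative frequency of this trace's stratum among recent traces.
            return self.counts[stratum_key] / len(self.window)

    scorer = RelativeFrequencyScorer()
    # After enough traffic, a stratum making up 10% of recent traces scores ~0.1:
    # scorer.score("route=/health")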

One implementation of (2) is to perform logarithmic balancing. Under this scheme, the frequency ratio and the expected-throughput ratio of any pair of traces are related logarithmically. By way of example,

  • 1:1 frequency => 1:1 throughput
  • 10:1 frequency => 2:1 throughput
  • 100:1 frequency => 3:1 throughput

Calculation details:

  • Traces with minimal frequency are sampled with probability 1. Others are sampled with probability < 1.
  • For any pair of traces a and b such that a's frequency f_a >= b's frequency f_b, define C = 1 + log10(f_a/f_b). Note that C >= 1.
  • Pick p-values s.t. for any pair of traces, ratio of trace expected throughputs = C. E.g., 10:1 frequency => 2:1 throughput. In this case reduce the p of the more frequent trace by a factor of (f_a/f_b)/C = 5.
  • Could make the base of the logarithm in C's definition a configurable parameter: the squeeze factor, since it reduces the variance in trace throughputs (equivalently, strata throughputs, if assigning frequency scores as described above). E.g.,
    • base-2: 10:1 frequency => 4.3:1 throughput
    • base-10: 10:1 frequency => 2:1 throughput
    • base-50: 10:1 frequency => 1.6:1 throughput
    • base-∞: 10:1 frequency => 1:1 throughput
      • All traces have throughput equal to that of the minimal-frequency traces
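
A sketch of the calculation above, assuming the minimal frequency score currently observed is known (how it is tracked is left out here):

    import math

    def balancer_probability(freq, min_freq, squeeze_base=10):
        """Sampling probability under logarithmic balancing.

        Traces at the minimal frequency are kept with probability 1; a trace whose
        frequency is `ratio` times the minimum is kept with probability C / ratio,
        where C = 1 + log_base(ratio), so the ratio of expected throughputs is C."""
        ratio = freq / min_freq                      # f_a / f_b, >= 1 by construction
        if ratio <= 1:
            return 1.0                               # minimal-frequency traces: keep all
        if math.isinf(squeeze_base):
            return 1.0 / ratio                       # base-∞: equalize strata throughputs
        c = 1.0 + math.log(ratio, squeeze_base)
        return min(1.0, c / ratio)

    # 10:1 frequency with base 10 => p = 0.2, i.e. 2:1 throughput:
    # balancer_probability(0.5, 0.05)  # -> 0.2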

Limiters

A limiter is a sampler whose one job is to sample such that output throughputs are at or below some given threshold. For example,

  • Per-stratum limiting: Partition input traces into strata, and sample such that each stratum's throughput does not exceed a threshold.
  • Global limiting: Sample such that total throughput doesn't exceed a threshold.

Note that in limiting contexts, the throughput one wants to limit is probably span throughput, since spans per unit time is the unit most trace storage systems work in, and possibly define quotas in. In such cases the limiter implementation should take care not to introduce bias by systematically preferring traces with fewer spans over traces with many spans.
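
As an illustration, a per-stratum span limiter might look roughly like the following: a sketch under simplifying assumptions (a fixed time window, no shared state across nodes), with class and parameter names made up for this example. Every trace in a stratum gets the same admission probability, rather than one that depends on its own span count, so small traces aren't systematically favored over large ones.

    import time

    class PerStratumSpanLimiter:
        """Keeps each stratum's sampled-span throughput at or below a threshold."""

        def __init__(self, max_spans_per_sec, window_sec=10.0):
            self.max_spans_per_sec = max_spans_per_sec
            self.window_sec = window_sec
            self.window_start = time.monotonic()
            self.spans_seen = {}  # stratum key -> spans observed in the current window

        def admit_probability(self, stratum_key, span_count):
            now = time.monotonic()
            if now - self.window_start >= self.window_sec:
                self.spans_seen = {}
                self.window_start = now
            self.spans_seen[stratum_key] = self.spans_seen.get(stratum_key, 0) + span_count
            elapsed = max(now - self.window_start, 1e-3)
            incoming_rate = self.spans_seen[stratum_key] / elapsed  # spans/sec in this stratum
            # Scale every trace in the stratum by the same factor, so the expected
            # sampled-span throughput stays at or below the threshold.
            return min(1.0, self.max_spans_per_sec / incoming_rate)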

Observations

Existing coarse-grained adaptive sampling implementations fuse together balancing and limiting into a single construct.

  • Jaeger adaptive: This attempts to sample all operations (each a pair of service and endpoint) at a per-operation target throughput. This is equivalent to partitioning traces along those two dimensions, running them through a base-∞ balancer, and finally through a per-stratum limiter with threshold equal to --sampling.target-samples-per-second traces per second.
  • AWS X-Ray: Not quite "coarse-grained" adaptive sampling, since its configuration requires individual target throughputs, and its rule semantics map each trace to exactly one target throughput.
  • Honeycomb Refinery: Because Refinery nodes have no shared state, their limiting is not configured in terms of total cluster throughput, nor of per-node throughput, but rather of sampling probability: in EMADynamicSampler samplers, the GoalSampleRate value. This sampler performs a user-configured partitioning of input traces, scores traces according to the estimated relative frequency of their respective strata, computes the current per-node target throughput from per-node strata sizes and GoalSampleRate, and then allocates shares of that target throughput to strata in proportion to the base-10 logarithm of each stratum's size. This is equivalent to running all traces through a base-10 logarithmic balancer, followed by a global limiter whose threshold dynamically adjusts so that the global average trace sampling probability equals some configured constant.
    • "limiter whose threshold is dynamically adjusting"—this may not be the behavior most people want from their limiters. That is, the configuration knob maybe should ideally be a throughput, not a sample rate.
@yurishkuro

Avoid {over,under}sampling of data

Not clear what you mean by that. Are you referring to trying to achieve proportional distribution of samples that reflects real distribution? Because I would argue that's not a goal of sampling, but a side-effect of some sampling strategies.

@spencerwilson
Author

spencerwilson commented Sep 7, 2022

By "Avoid {over,under}sampling of data" I mean "Collect a Goldilocks quantity of data, neither too little nor too much, such that estimates resulting from the data have error neither too large nor too small".

Thinking more, I'm sure that at different times I've used "oversampled" (e.g.) to mean both "collected more data than necessary to get estimation error to an acceptable level" and "collected less data than necessary ...". The difficulty seems to come from how "to sample" can refer to opposing notions:

  • including a unit in the sample (broadly: collecting more of something)
    • example: The usage of "sample" in the RECORD_AND_SAMPLE value of Decision enum in the trace SDK
  • systematically collecting only a subset of a population (broadly: collecting less of something).

🤦‍♂️ After reading https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis I'm now thinking that my historical usage of over/undersampling has not been right at all, and that your interpretation was a natural one. That is, AFAICT it'd be accurate to say that the OTel system of p-values is a technique whereby a biased sample is taken, but to each sampling unit (span) there is attached metadata (its p-value) that enables unbiased estimates to be computed from the biased sample. Does that sound right?
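
(A minimal sketch of that idea, assuming each retained span carries the effective probability with which it was kept, whatever encoding the spec uses to convey it: weighting each span by the reciprocal of that probability gives an unbiased, Horvitz-Thompson-style estimate of population totals.)

    def estimate_population_span_count(sampled_spans):
        """Unbiased estimate of how many spans existed before sampling, given only
        the retained spans and each one's effective sampling probability."""
        return sum(1.0 / span["sampling_probability"] for span in sampled_spans)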

@yurishkuro

I never think of "sampling" as collecting more, always as collecting a subset of individuals from a population. In traditional statistics definition it is done "to estimate characteristics of the whole population", which I think broadly fits for telemetry as well.

@kentquirk

I like to think about it in practical terms for users of telemetry. To me, the danger of "undersampling" in the context of telemetry is that you get an inaccurate view of the shape of your telemetry. For example, if you're shooting for something like "99.9% of requests complete within 1 second" and then you sample randomly at 1 in 10000, every long-duration request you sample is an indication that you've failed your SLO.

The primary danger of "oversampling" is that you send more telemetry than you can handle, either technically (overwhelms your systems for processing it) or financially (you get expensive bills for handling it).

I don't think most users are optimizing error rates; they're trying to understand their systems.

What I think we want is to help people achieve an understanding that is reasonably accurate and that enables them to make predictions about those systems.

@yurishkuro

@kentquirk I mostly agree with what you're saying, but you're only focusing on deriving accurate estimates, and I don't think that's the only use of telemetry. For example, if I know my SLO is broken (from metrics) and I want to investigate why it's broken, then I am interested in having exemplars of traces and logs for those 0.1% of requests, but that doesn't mean I need those samples to be produced probabilistically.
