Consider two broad goals in sampling:
- Avoid {over,under}sampling of data
- Ensure that overall data collection doesn't exceed acceptable limits
These can be treated separately. Doing so may yield a simpler and more flexible conceptual foundation for sampling.
Define a balancer to be a sampler that does the following: For each input trace,
- Assign a "frequency" score to the trace.
- Sample the trace with a probability that's inversely related to the frequency score.
One implementation of (1) is to partition input traces into strata, compute from historical data the relative frequency of each stratum among the input traces, and assign traces frequency scores equal to the relative frequency of the stratum to which they belong. For example, if a trace comes in and belongs to the "route = /health
" stratum, and that stratum constitutes 10% of recent traces, then any trace belonging to that stratum has frequency score 0.1. This scoring algorithm will result in sampling higher-volume strata, pulling their throughput down to be closer to that of the minimal-volume stratum, reducing the dynamic range of strata throughputs.
One implementation of (2) is to perform logarithmic balancing. Under this scheme, the frequency ratios and throughput ratios of a pair of spans are related in a certain way. By way of example,
- 1:1 frequency => 1:1 throughput
- 10:1 frequency => 2:1 throughput
- 100:1 frequency => 3:1 throughput
Calculation details:
- Traces with minimal frequency are sampled with probability 1. Others are sampled with probability < 1.
- For any pair of traces a and b such that a's frequency f_a >= b's frequency f_b, define C = 1 + log10(f_a/f_b). Note that C >= 1.
- Pick p-values s.t. for any pair of traces, ratio of trace expected throughputs = C. E.g., 10:1 frequency => 2:1 throughput. In this case reduce the p of the more frequent trace by a factor of (f_a/f_b)/C = 5.
- Could make the base of the logarithm in C's definition a configurable parameter: the squeeze factor, since it reduces the variance in trace throughputs (equivalently, strata throughputs, if assigning frequency scores as described above). E.g.,
- base-2: 10:1 frequency => 4.3:1 throughput
- base-10: 10:1 frequency => 2:1 throughput
- base-50: 10:1 frequency => 1.6:1 throughput
- base-∞: 10:1 frequency => 1:1 throughput
- All traces have throughput equal to that of the minimal-frequency traces
A limiter is a sampler whose one job is to sample such that output throughputs are at or below some given threshold. For example,
- Per-stratum limiting: Partition input traces into strata, and sample such that each stratum's throughput does not exceed a threshold.
- Global limiting: Sample such that total throughput doesn't exceed a threshold.
Note that in limiting contexts, the throughput one wants to limit is probably span throughput, since that is the unit most trace storage systems work in, and possibly define quotas in: spans per unit time. In such cases the limiter implementation should take care not to impart bias by systematically preferring traces with fewer spans over traces with many spans.
Existing coarse-grained adaptive sampling implementations fuse together balancing and limiting into a single construct.
- Jaeger
adaptive
: This attempts to sample all operations (pair of service and endpoint) at a per-operation target throughput. This is equivalent to partitioning traces along those two dimensions, running them through a base-∞ balancer, and finally through a per-stratum limiter with threshold equal to--sampling.target-samples-per-second
many traces per second. - AWS X-Ray: Not quite "coarse-grained" adaptive sampling, since its configuration requires individual target throughputs, and its rule semantics map each trace to exactly one target throughput.
- Honeycomb Refinery: Because Refinery nodes have no shared state, their limiting is not configured in terms of total cluster throughput, nor is it per-node throughput, but rather sampling probability: in
EMADynamicSampler
samplers the valueGoalSampleRate
. This sampler performs a user-configured partitioning of input traces, scores those traces according to estimated relative frequency of their respective strata, computes the current per-node target throughput using per-node strata sizes andGoalSampleRate
, and then allocates shares of that target throughput to strata in proportion to the base-10 logarithm of each stratum's size. This is equivalent to running all traces through a base-10 logarithmic balancer, followed by a global limiter whose threshold is dynamically adjusting so that the global average trace sampling probability is some configured constant.- "limiter whose threshold is dynamically adjusting"—this may not be the behavior most people want from their limiters. That is, the configuration knob maybe should ideally be a throughput, not a sample rate.
By "Avoid {over,under}sampling of data" I mean "Collect a Goldilocks quantity of data, neither too little nor too much, such that estimates resulting from the data have error neither too large nor too small".
Thinking more, I'm sure at different times I've both used "oversampled" (e.g.) to mean "collected more data than necessary to get estimation error to an acceptable level" and "collected less data than necessary ...". The difficulty seems to come from how "to sample" can refer to opposing notions:
RECORD_AND_SAMPLE
value ofDecision
enum in the trace SDK🤦♂️ After reading https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis I'm now thinking that my historical usage of over/undersampling has not been right at all, and that your interpretation was a natural one. That is, AFAICT it'd be accurate to say that the OTel system of p-values is a technique whereby a biased sample is taken, but to each sampling unit (span) there is attached metadata (its p-value) that enables unbiased estimates to be computed from the biased sample. Does that sound right?