@spencerwilson
Last active September 7, 2022 16:41
OTel: Sampler survey

Sampler survey

This doc compares the capabilities of popular telemetry sampling systems. The dimensions that are compared:

Dimensions related to limiting throughput

Temporal resolution: The time range that the limiting occurs on. E.g., limit the number of...

  • Spans per second
  • Spans per calendar month

Degree of limiting: In a steady state where spans are created at a rate of R spans/s that is greater than the desired limit,

  • hard limiting: throughput = limit
  • soft limiting: E[throughput] = limit
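
One way to obtain soft limiting is to keep each span independently with probability limit / R, so that the expected throughput R × p equals the limit. The sketch below is only an illustration of that arithmetic; the function name is hypothetical and not taken from any of the systems surveyed.

```python
def soft_limit_probability(limit, rate):
    """Return the keep-probability that makes expected throughput equal `limit`."""
    if rate <= limit:
        return 1.0  # under the limit: keep everything
    return limit / rate


rate = 400.0   # spans created per second (R)
limit = 100.0  # desired spans per second
p = soft_limit_probability(limit, rate)
expected_throughput = rate * p
print(p, expected_throughput)
```

A hard limiter, by contrast, would never emit more than `limit` spans in any one second, regardless of R.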

Horizontally scalable: Is the desired limit enforced per-sampler, or is it a global limit?

  • Yes: Global
  • No: Per-sampler

Responsiveness: How quickly does the system return to steady state when perturbed (i.e., when R changes)?

Other dimensions

Supports statistical estimation: Modifies span metadata such that post hoc analysis can compute unbiased estimates from the data ("count the spans").

  • Yes
  • No
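
As a minimal sketch of what "count the spans" means, assume each kept span records the probability p with which it was sampled. The `p` field and the helper name here are hypothetical, not any system's API. Weighting each kept span by 1/p gives an unbiased estimate of the number of spans that existed before sampling:

```python
def estimate_total(sampled_spans):
    """Estimate how many spans existed before sampling.

    Each kept span stands in for 1/p original spans, where p is the
    probability with which it was sampled (hypothetical `p` field).
    """
    return sum(1.0 / span["p"] for span in sampled_spans)


# Three spans kept at p = 0.25 and one span kept at p = 1.0.
kept = [{"p": 0.25}, {"p": 0.25}, {"p": 0.25}, {"p": 1.0}]
print(estimate_total(kept))
```

A system that drops spans without recording p (or an equivalent weight) leaves no way to do this after the fact.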

Sampling systems

otelcol's tailsampling processor

  • supports estimation: No
  • limiting:
    • temporal resolution: Spans per second
    • degree of limiting: Hard
    • horizontally scalable: No
    • responsiveness: < 1 s (token buckets are replenished each second)

The tailsampling processor implements a ratelimiting policy (src) equivalent to a token bucket with capacity of spans_per_second many tokens, replenished every second. Sampling a trace costs trace.SpanCount many tokens. Support for updating span p-values has been requested in #7962.
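
The behavior described above can be sketched roughly as follows. This is an illustration of a per-second token bucket, not the processor's actual code; the class and method names are assumptions, and the clock is passed in explicitly so the example is deterministic.

```python
class SpanRateLimiter:
    """Token bucket refilled to full capacity at the start of each second."""

    def __init__(self, spans_per_second):
        self.capacity = spans_per_second
        self.tokens = spans_per_second
        self.current_second = None

    def should_sample(self, span_count, now):
        """Decide on a trace of `span_count` spans at time `now` (in seconds)."""
        second = int(now)
        if second != self.current_second:
            # A new second has started: refill the bucket.
            self.current_second = second
            self.tokens = self.capacity
        if span_count <= self.tokens:
            # Sampling the trace costs one token per span.
            self.tokens -= span_count
            return True
        return False


limiter = SpanRateLimiter(spans_per_second=10)
decisions = [
    limiter.should_sample(6, now=0.1),  # 6 <= 10 tokens: sampled, 4 left
    limiter.should_sample(6, now=0.5),  # 6 > 4 tokens: not sampled
    limiter.should_sample(4, now=0.9),  # 4 <= 4 tokens: sampled, 0 left
    limiter.should_sample(6, now=1.2),  # new second, bucket refilled: sampled
]
print(decisions)
```

Because the bucket never holds more than `spans_per_second` tokens, no second can emit more spans than the limit, which is why the limiting is classified as hard.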

It also has a composite policy, which is characterized by a sequence of sub-policies, each of which is subject to its own token bucket limiting. Each bucket's capacity is computed as a share of an overall max_total_spans_per_second, but otherwise the decisions are identical to those made by ratelimiting (src).

The composite policy is built around the idea of "allocating bandwidth" (span throughput) to different families of traces. See the design doc linked from open-telemetry/opentelemetry-collector-contrib#1410.

If there's more than one otelcol instance in the system, then in order to guarantee complete traces you need to ensure that all spans in a given trace are routed to the same otelcol instance. One way to do that is with the loadbalancing exporter.

Jaeger

  • supports estimation: No
  • limiting (sampler.type == 'ratelimiting'):
    • temporal resolution: Traces per second
    • degree of limiting: Hard
    • horizontally scalable: No
    • responsiveness: < 1 s (token buckets are replenished each second)
  • limiting (SAMPLING_CONFIG_TYPE == 'adaptive')
    • temporal resolution: Traces per second
    • degree of limiting: Soft (typically) or none (if data is generated at a high enough volume for --sampling.min-sampling-probability to overtake --sampling.target-samples-per-second)
    • horizontally scalable: Yes
    • responsiveness: Configurable (at most jaeger-client's polling interval + jaeger-collector's --sampling.calculation-interval)

Jaeger SDKs (jaeger-client) obtain their sampling policy in one of several ways:

  • local: hardcoded AlwaysOn, AlwaysOff, probability (static p), ratelimiting (token bucket, parameter: maximum samples per sec). No stratification.
  • remote, file: per-stratum probability or ratelimiting. jaeger-collector reloads the policy from the filesystem or a URL; clients poll jaeger-agent, which proxies requests to jaeger-collector.
  • remote, adaptive: each stratum has a target throughput plus some minimums. jaeger-collector maintains the policy based on the spans it has received; clients poll jaeger-agent, which proxies requests to jaeger-collector.
  • First two options use local memory for ratelimiting. Third option has cluster-level coordination.
  • Spans are stratified by a list of priority-ordered rules: (Service name, Span name) > Span name default > (Service name) > global default.
  • In adaptive, many jaeger-collectors write strata statistics to a shared storage backend. From this data, every jaeger-collector can independently calculate the whole-system stats needed to adjust sampling probabilities. A collector reads statistics (from a configurable number of epochs back; 1 by default), combines them to get whole-cluster strata stats, and recalculates new per-stratum sampling probabilities. Defaults:
    • stratum sampling probability: initial (1 in 1,000), minimum (1 in 100,000)
    • stratum throughput: target (1 /s), minimum (1 /min)
  • Because collectors receive spans, clients don't need to explicitly send statistics themselves (contrast w/ X-Ray, whose sampling and collection APIs are independent)
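
The recalculation step can be pictured with a simplified sketch. This is not Jaeger's actual algorithm; it only assumes that a stratum's probability is scaled by the ratio of target to observed sampled throughput and then clamped to the configured minimum. The function and parameter names are hypothetical; the default values are the ones listed above.

```python
def recalculate_probability(current_p, observed_per_sec,
                            target_per_sec=1.0, min_p=1e-5):
    """Scale a stratum's sampling probability toward the target throughput."""
    if observed_per_sec <= 0:
        return current_p  # no traffic observed: leave the probability unchanged
    new_p = current_p * target_per_sec / observed_per_sec
    # Never drop below the minimum probability, never exceed 1.
    return min(1.0, max(min_p, new_p))


# Sampled throughput is 4x the target, so the probability is cut to a quarter.
print(recalculate_probability(0.001, observed_per_sec=4.0))

# Already at the minimum probability and still over target: the clamp wins,
# so throughput is no longer limited.
print(recalculate_probability(1e-5, observed_per_sec=50.0))
```

The second call shows the "none" case from the summary above: once the minimum sampling probability takes over, throughput grows with traffic.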

AWS X-Ray

  • supports estimation: No
  • limiting:
    • temporal resolution: Traces per second
    • degree of limiting: Soft
    • horizontally scalable: Yes
    • responsiveness: < 10 s (token buckets are replenished via GetSamplingTargets requests, which occur every 10 s by default)

Each actor performing sampling sends statistics to a central API describing how many spans it's seen in a period. At least two SDKs (Java, Go) have contrib Sampler implementations that obtain sampling configuration from AWS X-Ray. Like Jaeger's adaptive remote sampling, X-Ray serves advisory sampling policies to clients. An X-Ray based sampling system behaves like so (on average):

  1. Define a rule as a triple: a predicate over span attributes, a token bucket (e.g.), and a number in [0, 1] called the rule's fixed rate.
  2. Define the global sampling policy as an ordered collection of rules.
  3. Given a root span in need of a sampling decision,
    1. Match the span to the first rule whose predicate it satisfies.
    2. If the token bucket contains at least 1 token, deduct 1 token from the bucket and sample the span and its descendants.
    3. Else, sample with probability equal to the matched rule's fixed rate.
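
The steps above can be sketched as follows. The `Rule` class and `should_sample` function are hypothetical names, token replenishment is omitted, and the behavior when no rule matches is an assumption rather than something the X-Ray docs are quoted on here.

```python
import random


class Rule:
    def __init__(self, predicate, reservoir_tokens, fixed_rate):
        self.predicate = predicate      # callable: span attributes -> bool
        self.tokens = reservoir_tokens  # tokens left in this rule's bucket
        self.fixed_rate = fixed_rate    # probability in [0, 1]


def should_sample(attributes, rules, rng=random):
    """Apply the first matching rule to a root span's attributes."""
    for rule in rules:
        if rule.predicate(attributes):
            if rule.tokens >= 1:
                rule.tokens -= 1  # spend one token and sample
                return True
            # Bucket is empty: fall back to the rule's fixed rate.
            return rng.random() < rule.fixed_rate
    return False  # assumption: no rule matched, so do not sample


rules = [
    Rule(lambda a: a.get("http.route") == "/checkout", 1, 0.0),
    Rule(lambda a: True, 0, 1.0),  # catch-all rule
]
results = [
    should_sample({"http.route": "/checkout"}, rules),  # spends the one token
    should_sample({"http.route": "/checkout"}, rules),  # empty bucket, fixed rate 0.0
    should_sample({"http.route": "/health"}, rules),    # catch-all, fixed rate 1.0
]
print(results)
```

The fixed rates 0.0 and 1.0 are chosen only so the example is deterministic; the fall-through to a fixed rate is what makes the limiting soft rather than hard.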

Docs refer to "reservoirs", which are per-rule token buckets: https://github.com/open-telemetry/opentelemetry-java-contrib/blob/42818333e243682bb50e510f4f91381016f61f71/aws-xray/src/main/java/io/opentelemetry/contrib/awsxray/SamplingRuleApplier.java#L272. Actors doing sampling are dynamically allotted portions of the desired reservoir size (token bucket capacity) called ReservoirQuota in the GetSamplingTargets API response (docs).

Honeycomb Refinery

  • supports estimation: Yes, via span attribute SampleRate value = N in "1-in-N" (feature request to support p-value here)
  • limiting (EMADynamicSampler):
    • temporal resolution: Spans per second
    • degree of limiting: Soft
    • horizontally scalable: No (limiting is per Refinery node)
    • responsiveness: Configurable as AdjustmentInterval
  • limiting (TotalThroughputSampler):
    • temporal resolution: Spans per second
    • degree of limiting: Hard
    • horizontally scalable: No (limiting is per Refinery node)
    • responsiveness: Configurable as ClearFrequencySec

Horizontally scales by forwarding spans to the appropriate node as necessary. The node which ought to handle a given trace is determined via consistent hashing of trace ID (src). Peers are discovered via either Redis or specified in Refinery's configuration file (docs).
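
To illustrate the idea of routing by consistent hashing of trace ID, here is a generic hash-ring sketch. It is not Refinery's implementation; the class name, node names, hash function, and replica count are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a stable 64-bit integer."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=50):
        # Each node gets several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._points]

    def node_for(self, trace_id):
        """Return the node owning the first ring point after the trace ID's hash."""
        idx = bisect.bisect(self._keys, _hash(trace_id)) % len(self._keys)
        return self._points[idx][1]


ring = HashRing(["refinery-0", "refinery-1", "refinery-2"])
owner = ring.node_for("4bf92f3577b34da6a3ce929d0e0e4736")
print(owner)
```

Every node that builds the same ring computes the same owner for a trace ID, so any node can forward a span to the right peer. When a node leaves, only the traces it owned are reassigned.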

Not set-it-and-forget-it: as a system's rate of telemetry production increases over time, either GoalSampleRate or the Honeycomb events-per-month quota will need to be adjusted.

Opinion: Ideal state

  • limiting: Support all of: spans per second, spans per month, and GB per month (approximated)
  • degree of limiting: Soft is ok
  • horizontally scalable: Yes
  • Prioritize tail sampling in Collector over head sampling in SDK
  • Strive for a configuration that is "set it and forget it" (notwithstanding ad hoc changes to aid in investigation or incident response)
@yurishkuro

Jaeger: supports estimation: No

I think this is very much incorrect. On the contrary, we went to great lengths to make sure that probabilistic sampling is the prevailing mode, where trace_weight = 1 / p, and p is captured on the root span. There are various rate-limiting capabilities, but they are mostly for overload protection and other edge cases. E.g. with adaptive sampling, you do not expect the rate limiters to ever fire.

@spencerwilson
Author

p is captured on the root span

Ah, I didn't know this. Apologies. Haven't used Jaeger myself so all I'm piecing together is primarily from docs. Where is it stored?

@yurishkuro

  1. All Jaeger samplers (in the SDKs) store sampler.type and sampler.param tags on the root span, which allows separating truly probabilistic decisions (which are suitable for extrapolations) from other decisions (like rate limiting).

  2. Yes, the core sampling strategy that is manipulated by the adaptive sampling framework is the probabilistic sampling. There are additional rate limiters in the mix, but they are primarily for edge cases, not for steady state sampling.
