@codefromthecrypt
Created July 2, 2017 18:54
When you derive metrics from sampled traces

I have heard that a number of APMs create "spans" (distributed tracing lingo for an operation) and aggregate them for purposes like latency metrics.

In a way, Zipkin does this. The ever-popular service dependency diagram is an aggregated view of parent/child links between services, with the number of calls between them added for color.

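To make that aggregation concrete, here is a minimal Java sketch of counting calls per (parent service, child service) pair. The simplified Span type is an assumption for illustration, not Zipkin's actual span model or dependency linker.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of what a dependency diagram aggregates: the number of calls
// between each (parent service, child service) pair. The Span record below is
// a simplified assumption, not Zipkin's data model.
class DependencyLinker {
  record Span(String parentService, String childService) {}

  static Map<String, Long> linkCounts(List<Span> spans) {
    Map<String, Long> counts = new HashMap<>();
    for (Span s : spans) {
      String edge = s.parentService() + " -> " + s.childService();
      counts.merge(edge, 1L, Long::sum); // increment the call count for this edge
    }
    return counts;
  }
}
```
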
The biggest issue with using a tracing api to back metrics is that most of the time, tracing is sampled (like 1 out of 1000). Sampling is done to reduce costs or prevent a surge of traffic from taking out the tracing system. Unlike tracing data, metric data does not grow linearly with request volume. That means it is safe to "save metrics" always, and operations like saving request duration with metrics apis are intended to be invoked on every request.

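To illustrate that split, the Java sketch below records a duration metric on every request while only marking a trace as sampled occasionally. The sampler, counters, and handleRequest wrapper are hypothetical stand-ins, not any particular tracing or metrics api.

```java
import java.util.Random;
import java.util.concurrent.atomic.LongAdder;

// Sketch: metrics are recorded on every request, while spans would only be
// reported for the (rare) sampled requests. Everything here is a stand-in.
class RequestInstrumentation {
  static final double TRACE_SAMPLE_RATE = 0.001;         // ~1 out of 1000
  static final Random RANDOM = new Random();
  static final LongAdder requestCount = new LongAdder();  // fixed-size metric state:
  static final LongAdder totalMicros = new LongAdder();   // does not grow with traffic

  static void handleRequest(Runnable work) {
    boolean sampled = RANDOM.nextDouble() < TRACE_SAMPLE_RATE; // trace sampling decision
    long start = System.nanoTime();
    try {
      work.run();
    } finally {
      long durationMicros = (System.nanoTime() - start) / 1_000;
      requestCount.increment();          // always record metrics
      totalMicros.add(durationMicros);
      if (sampled) {
        // only here would a span be built and reported to the tracing system
      }
    }
  }
}
```
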
What happens if we don't use normal metrics hooks for latency and instead derive them from tracing data? Well, the signal would be weaker, especially if you aggressively sample during traffic spikes. You will almost certainly miss outliers, and this will impact the quality of your data. Anything layered on top of metrics will inherit this signal loss, which in the worst case could mean alarms going off when they shouldn't, or not going off when they should.

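A toy simulation makes the outlier problem visible: compare the maximum latency seen across all requests with the maximum seen across 1-in-1000 sampled traces. The latency distribution below is invented purely to illustrate the effect.

```java
import java.util.Random;

// Toy illustration of signal loss under 1-in-1000 sampling. Most requests
// take ~10ms, a rare few take 1s; the sampled view will usually miss them.
class SamplingSignalLoss {
  public static void main(String[] args) {
    Random random = new Random(42);
    long maxAll = 0, maxSampled = 0;
    for (int i = 0; i < 1_000_000; i++) {
      long latencyMs = random.nextInt(10_000) == 0 ? 1_000 : 10 + random.nextInt(10);
      maxAll = Math.max(maxAll, latencyMs);
      if (random.nextInt(1_000) == 0) {                // 1-out-of-1000 sampling
        maxSampled = Math.max(maxSampled, latencyMs);
      }
    }
    System.out.println("max latency over all requests:   " + maxAll + "ms");
    System.out.println("max latency over sampled traces: " + maxSampled + "ms");
  }
}
```
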
Why would someone use a tracing api to capture metrics? The first reason is probably convenience. If you have one api, you might be tempted to use it for all things related to requests, including logs and derived data like metrics. In doing so, querying across common dimensions would be easy, as essentially everything is stained the same way. The increased overhead and storage costs might be worth it to a subset of users. Regardless, they need to be very careful not to misinterpret data when sampling.

Personally, I believe in using the right tool for the job. Metrics can be stained with dimensions to allow for querying across them or for roll-ups. That said, well written apis can safely interact, if system concerns like these are thought through. Google's emerging Census project has separate controls to measure tracing data and to export it. This allows some advanced functionality to help mitigate these concerns. Keep an eye out!

@beberlei commented Jul 3, 2017

You can run the tracer in "metrics mode" when the sampling decision was negative. The code is still being executed, so the same api can collect timing information for spans. This is what we do in Tideways, with two modes: full tracing and monitoring only.

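In other words, something like the sketch below: the operation is always timed and the metric is always recorded, but the span is only reported when the sampling decision was positive. MetricsSink and SpanReporter are hypothetical interfaces for illustration, not Tideways' or Zipkin's actual apis.

```java
// Hypothetical interfaces, only to show the shape of the idea.
interface MetricsSink { void recordDuration(String operation, long micros); }
interface SpanReporter { void report(String operation, long durationMicros); }

class DualModeTracer {
  private final MetricsSink metrics;
  private final SpanReporter reporter;

  DualModeTracer(MetricsSink metrics, SpanReporter reporter) {
    this.metrics = metrics;
    this.reporter = reporter;
  }

  void trace(String operation, boolean sampled, Runnable work) {
    long start = System.nanoTime();
    try {
      work.run();
    } finally {
      long durationMicros = (System.nanoTime() - start) / 1_000;
      metrics.recordDuration(operation, durationMicros); // always: "metrics mode"
      if (sampled) {
        reporter.report(operation, durationMicros);      // only when fully traced
      }
    }
  }
}
```
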
@codefromthecrypt (Author) commented Jul 3, 2017

@beberlei yep, some concept of this "metrics mode" indeed solves the local signal loss problem. This seems like RECORD_EVENTS in Census. One interesting point with "metrics mode" is it hints that maybe not everything will be recorded vs when normal tracing is on. Or is metrics mode in Tideways still recording everything, just not sending it out of process?
