@codefromthecrypt
Created July 2, 2017 18:54
When you derive metrics from sampled traces

I have heard that a number of APMs create "spans" (distributed tracing lingo for an operation) and aggregate them for purposes like latency metrics.

In a way, Zipkin does this. The ever-popular service dependency diagram is an aggregated view of parent/child links between services, with the number of calls between them added for color.

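To make that aggregation concrete, here is a minimal Java sketch of counting calls per (parent service, child service) pair. The simplified Span type is an assumption for illustration, not Zipkin's actual span model or dependency linker.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of what a dependency diagram aggregates: the number of calls
// between each (parent service, child service) pair. The Span record below is
// a simplified assumption, not Zipkin's data model.
class DependencyLinker {
  record Span(String parentService, String childService) {}

  static Map<String, Long> linkCounts(List<Span> spans) {
    Map<String, Long> counts = new HashMap<>();
    for (Span s : spans) {
      String edge = s.parentService() + " -> " + s.childService();
      counts.merge(edge, 1L, Long::sum); // increment the call count for this edge
    }
    return counts;
  }
}
```
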
The biggest issue with using a tracing api to back metrics is that most of the time, tracing is sampled (like 1 out of 1000). Sampling is done to reduce costs or prevent a surge of traffic from taking out the tracing system. Unlike tracing data, metric data does not grow linearly with request volume. That means it is safe to "save metrics" always, and operations like saving request duration with metrics apis are intended to be invoked on every request.

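To illustrate that split, the Java sketch below records a duration metric on every request while only marking a trace as sampled occasionally. The sampler, counters, and handleRequest wrapper are hypothetical stand-ins, not any particular tracing or metrics api.

```java
import java.util.Random;
import java.util.concurrent.atomic.LongAdder;

// Sketch: metrics are recorded on every request, while spans would only be
// reported for the (rare) sampled requests. Everything here is a stand-in.
class RequestInstrumentation {
  static final double TRACE_SAMPLE_RATE = 0.001;         // ~1 out of 1000
  static final Random RANDOM = new Random();
  static final LongAdder requestCount = new LongAdder();  // fixed-size metric state:
  static final LongAdder totalMicros = new LongAdder();   // does not grow with traffic

  static void handleRequest(Runnable work) {
    boolean sampled = RANDOM.nextDouble() < TRACE_SAMPLE_RATE; // trace sampling decision
    long start = System.nanoTime();
    try {
      work.run();
    } finally {
      long durationMicros = (System.nanoTime() - start) / 1_000;
      requestCount.increment();          // always record metrics
      totalMicros.add(durationMicros);
      if (sampled) {
        // only here would a span be built and reported to the tracing system
      }
    }
  }
}
```
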
What happens if we don't use normal metrics hooks for latency and instead derive them from tracing data? Well, the signal would be weaker, especially if you aggressively sample during traffic spikes. You will almost certainly miss outliers, and this will impact the quality of your data. Anything layered on top of metrics will inherit this signal loss, which in the worst case could mean alarms going off when they shouldn't, or not going off when they should.

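A toy simulation makes the outlier problem visible: compare the maximum latency seen across all requests with the maximum seen across 1-in-1000 sampled traces. The latency distribution below is invented purely to illustrate the effect.

```java
import java.util.Random;

// Toy illustration of signal loss under 1-in-1000 sampling. Most requests
// take ~10ms, a rare few take 1s; the sampled view will usually miss them.
class SamplingSignalLoss {
  public static void main(String[] args) {
    Random random = new Random(42);
    long maxAll = 0, maxSampled = 0;
    for (int i = 0; i < 1_000_000; i++) {
      long latencyMs = random.nextInt(10_000) == 0 ? 1_000 : 10 + random.nextInt(10);
      maxAll = Math.max(maxAll, latencyMs);
      if (random.nextInt(1_000) == 0) {                // 1-out-of-1000 sampling
        maxSampled = Math.max(maxSampled, latencyMs);
      }
    }
    System.out.println("max latency over all requests:   " + maxAll + "ms");
    System.out.println("max latency over sampled traces: " + maxSampled + "ms");
  }
}
```
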
Why would someone use a tracing api to capture metrics? The first reason is probably convenience. If you have one api, you might be tempted to use it for all things related to requests, including logs and derived data like metrics. In doing so, querying across common dimensions would be easy, as essentially everything is stained the same way. The increased overhead and storage costs might be worth it to a subset of users. Regardless, they need to be very careful not to misinterpret data when sampling.

Personally, I believe in using the right tool for the job. Metrics can be stained with dimensions to allow for querying across them or for roll-ups. That said, well written apis can safely interact, if system concerns like these are thought through. Google's emerging Census project has separate controls to measure tracing data and to export it. This allows some advanced functionality to help mitigate these concerns. Keep an eye out!

@beberlei commented Jul 3, 2017

You can run the tracer in "metrics mode" when the sampling decision was negative. The code is still being executed, so the same api can collect timing information for spans. This is what we do in Tideways, with two modes: full tracing and monitoring only.

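In other words, something like the sketch below: the operation is always timed and the metric is always recorded, but the span is only reported when the sampling decision was positive. MetricsSink and SpanReporter are hypothetical interfaces for illustration, not Tideways' or Zipkin's actual apis.

```java
// Hypothetical interfaces, only to show the shape of the idea.
interface MetricsSink { void recordDuration(String operation, long micros); }
interface SpanReporter { void report(String operation, long durationMicros); }

class DualModeTracer {
  private final MetricsSink metrics;
  private final SpanReporter reporter;

  DualModeTracer(MetricsSink metrics, SpanReporter reporter) {
    this.metrics = metrics;
    this.reporter = reporter;
  }

  void trace(String operation, boolean sampled, Runnable work) {
    long start = System.nanoTime();
    try {
      work.run();
    } finally {
      long durationMicros = (System.nanoTime() - start) / 1_000;
      metrics.recordDuration(operation, durationMicros); // always: "metrics mode"
      if (sampled) {
        reporter.report(operation, durationMicros);      // only when fully traced
      }
    }
  }
}
```
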
@codefromthecrypt (Author) commented Jul 3, 2017

@beberlei yep, some concept of this "metrics mode" indeed solves the local signal loss problem. This seems like RECORD_EVENTS in Census. One interesting point with "metrics mode" is it hints that maybe not everything will be recorded vs when normal tracing is on. Or is metrics mode in Tideways still recording everything, just not sending it out of process?
