@arkgil
Last active October 30, 2018 13:59
Telemetry glossary

Glossary

In this document we'll try to come up with a glossary for the Telemetry.Metrics project.

Comparing metric types from various systems

In order to create a glossary, it's beneficial to look at how various metric systems name different entities. Note that this comparison aims to highlight only those differences related to the data model and metric types.

StatsD

StatsD is not a full-fledged metric system, but an agent which aggregates metrics and forwards them to another system (Graphite by default) at a specified interval. StatsD implements a one-dimensional data model, i.e. a metric has a name and a value. DogStatsD, which is DataDog's implementation of the agent, supports optional tagging.

Since it's the StatsD agent that aggregates the metrics, each sample (or measurement, if you will) needs to be sent to it. This means that UDP might become a bottleneck under bigger workloads.

The main issue with StatsD is that it has many implementations, and there are small differences in metrics' behaviour between them. The overview below is based on the original Etsy implementation.

Counter

A StatsD counter can be incremented and decremented. The StatsD agent publishes both the total count and the rate. After publishing, the counter is reset. You can specify the counter's sampling rate.

Gauge

A gauge can be set, incremented and decremented. The StatsD agent publishes only the gauge value. The gauge is not reset when published.

Timer

The timer metric produces summary statistics of the measured value, i.e. mean, maximum, minimum, quantiles, etc. Optionally, it can maintain a histogram of measured values. As with the counter, you can specify the sampling rate.

Set

A set, as the name suggests, is a collection of unique measurements; the agent publishes the number of distinct values observed.
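The four StatsD metric types above map onto a simple plain-text datagram format sent over UDP. The sketch below is illustrative (the helper name is ours, not part of any StatsD client library), but the `name:value|type` wire format with an optional `|@rate` sampling suffix is the one the original Etsy implementation accepts:

```python
# Illustrative sketch of the plain-text StatsD datagram format.
# Each metric is sent as "name:value|type" over UDP; "|@rate" is an
# optional sampling rate (used with counters and timers).
def statsd_packet(name, value, metric_type, sample_rate=None):
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        packet += f"|@{sample_rate}"
    return packet

print(statsd_packet("page.views", 1, "c", 0.1))    # counter, 10% sampling
print(statsd_packet("queue.size", 42, "g"))        # gauge
print(statsd_packet("request.time", 320, "ms"))    # timer (milliseconds)
print(statsd_packet("unique.users", "u123", "s"))  # set member
```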

InfluxDB

InfluxDB is a time-series database. It doesn't have a notion of a metric. In InfluxDB, values are organized into measurements, which are conceptually similar to relational tables. Each point (row) in a measurement consists of a timestamp, a set of fields, and a set of tags. Field and tag values can be numbers, strings or booleans. A series is a collection of points within a single measurement that have the same tags.

Note: the only difference between fields and tags is that tags are indexed, thus they are usually used to break down the collection of points by some feature.

More information can be found in the InfluxDB key concepts documentation.
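Points are written to InfluxDB using its line protocol, which makes the measurement/tag/field split concrete. The helper below is a simplified sketch (it skips the escaping rules for spaces and commas), but the `measurement,tags fields timestamp` shape is the real protocol:

```python
# Simplified sketch of the InfluxDB line protocol: one point per line,
# "measurement,tag_set field_set timestamp". Tags are indexed; fields
# are not. Escaping of special characters is omitted for brevity.
def influx_point(measurement, tags, fields, timestamp_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

# A point with two tags and two fields (multiple values per point,
# unlike Prometheus):
print(influx_point("http_requests", {"host": "web-1", "method": "GET"},
                   {"duration_ms": 42, "status": 200}, 1540900000000000000))
```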

Prometheus

Prometheus is a pull-based metric system. A Prometheus time series is uniquely identified by a metric name and a set of labels. Each sample in the series has a timestamp and a single floating-point value. As you can see, the model is very similar to the one used by InfluxDB, except that it doesn't support multiple values per sample and the value needs to be a number.

Unfortunately, Prometheus is not entirely consistent in its use of the word "metric", e.g. a histogram is a "metric", but it produces multiple time series, each with a different "metric" name.

Source

Counter

A Prometheus counter is monotonically increasing. Only its value is published, but the Prometheus query language allows you to calculate the rate of whatever the counter is counting over a selected time window.

Gauge

Very similar to the StatsD gauge: it can be set, but also incremented or decremented.

Histogram

A standard histogram of observations, i.e. it tracks the number of observations which fall into configurable buckets. The histogram metric produces three series:

  • <metric_name>_sum with the sum of observations
  • <metric_name>_count with the count of observations
  • <metric_name>_bucket with le ("less than or equal") label for actual distribution of values

You can calculate the mean of observations using the Prometheus query language. This metric also allows for more advanced calculations, e.g. the percentage of requests served in under X milliseconds.
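The key detail of the Prometheus histogram is that its `le` buckets are cumulative. A minimal sketch of how observations turn into the three series above (the function and series names are illustrative, but the `le` semantics match Prometheus):

```python
# Sketch of how a Prometheus histogram turns observations into the
# three series described above. Buckets are cumulative: each "le"
# bucket counts observations less than or equal to its upper bound.
import bisect

def histogram_series(name, observations, bounds):
    bounds = sorted(bounds)
    counts = [0] * (len(bounds) + 1)  # last slot is the implicit +Inf bucket
    for v in observations:
        counts[bisect.bisect_left(bounds, v)] += 1
    cumulative, series = 0, {}
    for bound, c in zip(bounds + [float("inf")], counts):
        cumulative += c
        label = "+Inf" if bound == float("inf") else bound
        series[f'{name}_bucket{{le="{label}"}}'] = cumulative
    series[f"{name}_sum"] = sum(observations)
    series[f"{name}_count"] = len(observations)
    return series

# Request durations in milliseconds, with buckets at 100, 300 and 500 ms:
print(histogram_series("http_request_ms", [50, 120, 340, 900], [100, 300, 500]))
```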

Summary

Tracks quantiles, the sum and the number of observations. The summary metric produces three series:

  • <metric_name>_sum with the sum of observations
  • <metric_name>_count with the count of observations
  • <metric_name> with quantile label for quantiles

You can calculate the mean of observations using the Prometheus query language.

Note: a histogram allows you to estimate quantiles from multiple instances exposing the same metric. With a summary, we get almost exact quantiles, but aggregating them across multiple instances doesn't make statistical sense.
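A small numeric illustration of why summary quantiles can't be merged: the average of two per-instance medians is generally not the median of the combined data.

```python
# Averaging per-instance quantiles is not the quantile of the combined
# data, which is why pre-computed summary quantiles cannot be merged
# across instances. The data sets here are arbitrary examples.
import statistics

instance_a = [1, 2, 3]           # median 2
instance_b = [10, 20, 30, 40]    # median 25
merged = sorted(instance_a + instance_b)

avg_of_medians = (statistics.median(instance_a) + statistics.median(instance_b)) / 2
true_median = statistics.median(merged)
print(avg_of_medians, true_median)  # 13.5 vs 10 -- they disagree
```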

OpenCensus

OpenCensus is not a metric system per se, but rather a standard for instrumenting code across multiple languages and technology stacks. You can plug in an exporter to expose the data to an external metric system, like Prometheus or Zipkin (since OpenCensus supports tracing as well). In short, OpenCensus tries to do for all programming languages what Telemetry tries to do for Elixir.

Data model

OpenCensus has a very detailed glossary around its data model. I think that we can learn much from it and piggyback on it a little.

All information here is taken from OpenCensus metrics documentation.

Measure

A measure represents a type of metric to be recorded. A measure has a name, a description and a unit. A measure does not describe how the values are aggregated. For example, a library in some programming language could expose a measure, and the user of the library could choose how to aggregate it later; by itself, a measure is just a logical stream of measurements (see below).

Measurement

A measurement is a data point/value collected for a measure. Each measurement has a value and a set of tags. OpenCensus has a dedicated API for recording measurements, and an important fact about it is that it doesn't support sampling.

View

A view describes how the data is aggregated. A view takes measurements from the specified measure and aggregates them with the selected aggregation method. Aggregations are broken down by a selected set of tags, much like in Prometheus.
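The measure/measurement/view split can be sketched in a few lines. This is a hypothetical model, not the real OpenCensus API: a measure is just a named stream, and a view subscribes to it and aggregates measurements broken down by tag values.

```python
# Hypothetical sketch (not the OpenCensus API) of the relationship
# between a measure, its measurements, and a view that aggregates them.
from collections import defaultdict

class View:
    def __init__(self, measure_name, tag_keys, aggregate):
        self.measure_name = measure_name
        self.tag_keys = tag_keys
        self.aggregate = aggregate               # aggregation method, e.g. sum
        self.measurements = defaultdict(list)

    def record(self, measure_name, value, tags):
        if measure_name != self.measure_name:
            return                               # measurement for another measure
        key = tuple(tags.get(k) for k in self.tag_keys)
        self.measurements[key].append(value)

    def data(self):
        return {k: self.aggregate(v) for k, v in self.measurements.items()}

# A view summing latency measurements, broken down by the "method" tag:
latency_sum = View("http/latency", ["method"], sum)
latency_sum.record("http/latency", 120, {"method": "GET"})
latency_sum.record("http/latency", 80, {"method": "GET"})
latency_sum.record("http/latency", 200, {"method": "POST"})
print(latency_sum.data())  # {('GET',): 200, ('POST',): 200}
```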

Aggregation types

OpenCensus aggregations are the closest entity to metrics in other systems/standards.

Counter

Counts the number of measurements.

Distribution

Tracks a histogram distribution of measurement values.

Sum

Sums up the measurement values.

LastValue

Keeps track of the last measurement value.

Telemetry glossary

Metrics are responsible for aggregating Telemetry events with the same name in order to gain useful knowledge about them. A single metric may generate multiple aggregations, each aggregation being bound to a unique set of tag values. Tags are key-value pairs derived from event metadata; in the simplest case, tags are a subset of the metadata. Based on the tag values, the value of the event is used to update one of the aggregations. The metric type defines how the values are aggregated (e.g. a sum or a distribution). Each aggregation may itself contain many values, depending on the metric type.
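A minimal sketch of this model (the event shape and names are illustrative, not the Telemetry.Metrics API): events carry metadata, tags are extracted as a subset of that metadata, and each unique set of tag values gets its own aggregation, here a simple count:

```python
# Illustrative model of the glossary above: tags are derived from event
# metadata, and a counter-type metric keeps one aggregation per unique
# set of tag values. Event names/fields are hypothetical.
from collections import Counter

tag_keys = ["controller", "action"]
events = [
    {"name": "http.request", "value": 120,
     "metadata": {"controller": "UserController", "action": "show", "request_id": "a1"}},
    {"name": "http.request", "value": 80,
     "metadata": {"controller": "UserController", "action": "show", "request_id": "b2"}},
    {"name": "http.request", "value": 95,
     "metadata": {"controller": "UserController", "action": "index", "request_id": "c3"}},
]

# One count per unique (controller, action) tag set; event values are
# ignored, as a counter only counts emitted events.
counts = Counter(tuple(e["metadata"][k] for k in tag_keys) for e in events)
print(counts)
```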

Event

A Telemetry event, with a name, a numerical value and metadata.

Metric

Consumes events from the stream and aggregates them according to each unique set of tags derived from those events. A metric has a name, a description, a type and a unit. You also need to specify the name of the events consumed by the metric.

Tags

A collection of key-value pairs derived from event metadata.

Metric type

Telemetry supports the following metric types:

Counter

The aggregation value is the number of emitted events, regardless of their values. In multi-node deployments, counter values can be safely merged (by adding) without losing statistical correctness.

Sum

The aggregation value is the sum of event values. Sum values can be safely merged without losing correctness.

LastValue

The aggregation value is the value carried by the most recent event in the stream. Values of this metric cannot, in the general case, be merged without losing correctness.

Distribution

The aggregation is a histogram distribution of event values, i.e. how many events were emitted with values falling into the defined buckets. The aggregation contains a value for each bucket. Values of this metric can be safely merged by summing up per-bucket counts.
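The merge property can be sketched directly: given two nodes exposing the same distribution with identical bucket configuration, the combined distribution is just the element-wise sum (bucket labels below are hypothetical):

```python
# Sketch of merging distribution aggregations from two nodes by summing
# per-bucket counts. Bucket labels are illustrative; the only requirement
# is that both nodes use the same bucket configuration.
def merge_distributions(a, b):
    assert a.keys() == b.keys(), "buckets must be configured identically"
    return {bucket: a[bucket] + b[bucket] for bucket in a}

node1 = {"<=100": 4, "<=500": 2, "+Inf": 1}
node2 = {"<=100": 3, "<=500": 5, "+Inf": 0}
print(merge_distributions(node1, node2))  # {'<=100': 7, '<=500': 7, '+Inf': 1}
```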

(optionally) Summary

The aggregation contains summary statistics of the values of events in the stream, such as mean, minimum, maximum, count and selected quantiles. Values of this metric can't be merged without losing correctness.

Note: summaries won't be available in the first version of Telemetry.Metrics, but may be added later if there is a need for them.
