@OsQu
Created July 18, 2019 09:57
Statsd + InfluxDB + Grafana


  +-------------+       +--------+     Flush     +----------+
  |             |  Push |        |  Periodically |          |
  | Application +-------> StatsD +---------------> InfluxDB |
  |             |       |        |               |          |
  +-------------+       +--------+               +-----^----+
                                                       |
                                                       |
                                                       |
                                                 +-----+----+
                                                 |          |
                                                 | Grafana  |
                                                 |          |
                                                 +----------+

    Figure 1. Data flow

How data flows from one application to another

Figure 1 shows the data flow. The application pushes updates to different StatsD buckets over UDP (or optionally TCP). StatsD flushes them periodically (every 10s by default) to InfluxDB. Different data visualization tools, such as Grafana, can then query the data using InfluxDB's query language [1] to fetch and visualize it.
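As a rough illustration of the push step, the sketch below formats a metric in StatsD's plain-text line format (bucket:value|type) and sends it as a single UDP datagram. The host, the port 8125 (StatsD's usual default) and the helper names format_metric and send_metric are assumptions made for this example, not taken from any particular setup.

```python
import socket


def format_metric(bucket, value, metric_type):
    """Build a StatsD line such as 'foobar:1|c' (c = counter, g = gauge, ms = timing)."""
    return f"{bucket}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line as a UDP datagram (fire and forget, no reply expected).

    host and port are assumed values; adjust them to wherever StatsD listens.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling send_metric(format_metric("foobar", 1, "c")) would then increment the counter foobar by one. Because UDP is connectionless, the call does not tell you whether StatsD actually received the datagram.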

Different StatsD metric types explained

The list goes from the simplest to the most complex metric.

  • Counter: Increment a number and flush it periodically.
  • Gauge: Set a value (deltas are also supported) and flush it periodically. If the value has not changed since the last flush, the previous value is sent again.
  • Timing: Send timing data. StatsD automatically calculates different metrics from it, such as percentiles, mean, standard deviation, sum, and upper and lower bounds, and flushes them periodically.

It is possible to sample counters. When sampling, the client sends only the given portion of the metrics to StatsD but includes the sample rate in each metric, so that the StatsD server can compensate for it. For example, when sending 3 with a sample rate of 1/10, the resulting bucket will have the value 30 (3 * (1/10)^-1).

StatsD also has a Set metric type, but I have no experience using it.
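The compensation can be sketched in a few lines. This is not StatsD's actual server code, only a minimal model of the arithmetic: it parses a counter line with an optional |@rate suffix and scales the value by the inverse of the sample rate. The function names are made up for the example.

```python
def parse_counter(line):
    """Parse 'bucket:value|c' or 'bucket:value|c|@rate' into (bucket, value, rate)."""
    bucket, rest = line.split(":", 1)
    fields = rest.split("|")
    if len(fields) < 2 or fields[1] != "c":
        raise ValueError("not a counter line: " + line)
    value = float(fields[0])
    rate = 1.0  # no sample rate given means every event was sent
    if len(fields) > 2 and fields[2].startswith("@"):
        rate = float(fields[2][1:])
    return bucket, value, rate


def compensated_value(line):
    """Scale a sampled counter value back up by the inverse of its sample rate."""
    _, value, rate = parse_counter(line)
    return value * (1 / rate)
```

With this model, compensated_value("foobar:3|c|@0.1") gives 30, matching the example above, while an unsampled line such as "foobar:3|c" stays at 3.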

So what goes to InfluxDB?

Here we assume that the flush interval is the default, 10s.

Let's increment a counter at key foobar over time as follows.

Increment: 2 3    2 1  3  2    8   1  1  2  3   2    1  1   1    2     3     4  1
T:         1    5    10    15    20    25    30    35    40    45    50    55     60

InfluxDB then receives this after 60 seconds:

                time      foobar
                ----      ------
                0
                10        8
                20        13
                30        7
                40        4
                50        3
                60        8

    Table 1. InfluxDB after StatsD has flushed the values

And what does this all actually mean?
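The flush behaviour behind Table 1 can be modelled with a small simulation. The exact timestamps below are an assumption: they are read approximately off the timeline above, since only the positions of the increments are given. The function itself is a simplified model of a counter flush, not StatsD's implementation.

```python
import math
from collections import defaultdict

# (second, increment) pairs; timestamps are approximate readings of the timeline.
events = [
    (1, 2), (2, 3), (6, 2), (8, 1),
    (11, 3), (13, 2), (17, 8),
    (21, 1), (23, 1), (26, 2), (28, 3),
    (32, 2), (36, 1), (39, 1),
    (42, 1), (46, 2),
    (51, 3), (56, 4), (59, 1),
]


def flush_counters(events, flush_interval=10):
    """Sum increments per flush interval, keyed by the time of the flush."""
    flushed = defaultdict(int)
    for second, increment in events:
        # An increment is reported at the first flush at or after it happened.
        flush_time = math.ceil(second / flush_interval) * flush_interval
        flushed[flush_time] += increment
    return dict(sorted(flushed.items()))


table_1 = flush_counters(events)
```

Under these assumed timestamps, table_1 holds the same values as Table 1: 8, 13, 7, 4, 3 and 8 at the flushes from 10s to 60s.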

What is the unit of the query select foobar from some_measurement;? The correct answer is increments / flush_interval. And this is what, in my opinion, makes things complicated. Neither StatsD nor InfluxDB does any magic, and Grafana just draws whatever data we throw at it. Gauges are easier to understand, since a gauge is always just a plain number at a given point in time. For example, a gauge that tracks logged-in users displays just that and nothing else: select logged_users from some_measurement; gives a table of logged-in users at each flush interval.

A metric with a unit such as increments / 10s is not very intuitive. We need to do better than that. That's why InfluxDB offers a group by clause that can be used to group the data. Probably the most useful option is grouping over time. For example, running the query select sum(foobar) from some_measurement group by time(20s) against the data in Table 1 would give the following table:

                time      foobar
                ----      ------
                0
                20        21
                40        11
                60        11

    Table 2. Data grouped by 20s

Notice that you should always give an aggregation function when working with group by. Without one, InfluxDB does not know how to aggregate the values that fall into the same bucket. Choosing the correct aggregation function depends on the input data. For example, if you aggregate counters, you almost always want to use the sum function, because it adds the values in one bucket together. This is exactly what StatsD does within a 10s interval!

With gauges, however, you most definitely don't want to use sum, because adding together your current user counts at given points in time does not make sense. For that, the first or last functions are probably more suitable. Or if you want a metric about users logging in and out, you can use the derivative function. It really depends on your use case.

To familiarize yourself with the different functions and what they do, log in to your InfluxDB and play around with the different functions listed in [2].
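To make the grouping concrete, here is a small sketch that re-aggregates the Table 1 values into 20s windows with a sum. It only mirrors the arithmetic and the labelling used in Table 2 (each bucket is labelled by the time it ends at); how a real InfluxDB draws bucket boundaries and labels them may differ, so treat it as a model rather than as InfluxDB's behaviour.

```python
import math

# Flushed counter values from Table 1 as (time in seconds, value).
table_1 = [(10, 8), (20, 13), (30, 7), (40, 4), (50, 3), (60, 8)]


def group_sum(points, window):
    """Sum the values that fall into the same time window of `window` seconds."""
    grouped = {}
    for time, value in points:
        # Label the window by its end time, as the tables in this text do.
        window_end = math.ceil(time / window) * window
        grouped[window_end] = grouped.get(window_end, 0) + value
    return grouped


table_2 = group_sum(table_1, 20)
```

Here table_2 contains 21, 11 and 11 for the windows ending at 20s, 40s and 60s, the same numbers as Table 2.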
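The difference between the aggregation functions is easy to see on a made-up gauge. The logged_users values below are hypothetical, and group_by_time is a simplified stand-in for a group by time query, not InfluxDB's own logic.

```python
import math
from collections import defaultdict

# Hypothetical gauge readings as (time in seconds, logged-in users).
logged_users = [(10, 5), (20, 7), (30, 6), (40, 9)]


def group_by_time(points, window, aggregate):
    """Collect values per time window and reduce each window with `aggregate`."""
    buckets = defaultdict(list)
    for time, value in points:
        buckets[math.ceil(time / window) * window].append(value)
    return {end: aggregate(values) for end, values in sorted(buckets.items())}


def last(values):
    """Return the most recent value in a window (points are in time order)."""
    return values[-1]


with_last = group_by_time(logged_users, 20, last)
with_sum = group_by_time(logged_users, 20, sum)
```

With last, each window reports a plausible user count (7 and 9). With sum, the same windows report 12 and 15, numbers that never corresponded to any real count of logged-in users.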

What happens to empty values

Now that we know how to group data, we might ask what happens if some bucket does not contain a value. For example, if we grouped by 1s, the buckets for seconds 1-9 would be empty because there is no data. InfluxDB does not try to guess any value there; the bucket really is empty. This is also shown in both Tables 1 and 2 at time 0.

InfluxDB has a fill option that can be used to fill the null values with some number, usually zero. Another way is to tell Grafana how to handle null values. When editing a metric in Grafana, there is a "Null value" option under the Display tab that can be set to "connected", "null" or "null as zero". Connected means that Grafana will just ignore the empty value and connect the two neighbouring data points with a line. Null will not draw a line at all, and Null as zero will draw a zero line for those points.
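What filling with zero amounts to can be sketched as follows. The helper fill_empty is a made-up name and only models the idea: every expected bucket gets a value, and buckets without data get the fill value instead of staying empty.

```python
# Table 1 values keyed by time in seconds; time 0 has no value.
table_1 = {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}


def fill_empty(grouped, start, stop, step, fill_value=0):
    """Return one entry per bucket from start to stop, using fill_value for gaps."""
    return {
        time: grouped.get(time, fill_value)
        for time in range(start, stop + step, step)
    }


filled = fill_empty(table_1, start=0, stop=60, step=10)
```

After filling, the bucket at time 0 holds 0 instead of being empty, and the other buckets keep their values.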

Conclusion

tl;dr: To get a sensible value from the counter, use:

select sum(requests) from web_app group by time(1m) fill(0);

This will return requests/min.
