+-------------+       +--------+     Flush     +----------+
|             | Push  |        | Periodically  |          |
| Application +-------> StatsD +---------------> InfluxDB |
|             |       |        |               |          |
+-------------+       +--------+               +-----^----+
                                                     |
                                                     |
                                                     |
                                               +-----+----+
                                               |          |
                                               | Grafana  |
                                               |          |
                                               +----------+
Figure 1. Data flow
Figure 1 shows the data flow. The application pushes updates to different StatsD buckets over UDP (or optionally TCP). StatsD flushes them periodically (every 10s by default) to InfluxDB. Data visualization tools such as Grafana can then query InfluxDB using its query language [1] to fetch and visualize the data.
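To make the first hop in Figure 1 concrete, here is a minimal Python sketch of an application pushing an update to StatsD over UDP. It assumes StatsD listens on localhost port 8125 (its usual default) and uses the plain-text bucket:value|type line format. The helper names format_metric and push are hypothetical and only serve the illustration.

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # assumed address of the StatsD daemon


def format_metric(bucket: str, value: int, metric_type: str = "c") -> bytes:
    """Build one StatsD line, e.g. b"foobar:2|c" for a counter increment."""
    return f"{bucket}:{value}|{metric_type}".encode("ascii")


def push(bucket: str, value: int, metric_type: str = "c") -> None:
    """Fire-and-forget a single metric over UDP; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_metric(bucket, value, metric_type), STATSD_ADDR)
```

Calling push("foobar", 2) would increment the counter foobar by 2, and push("logged_users", 42, "g") would set a gauge. Because UDP is connectionless, the application does not block or fail if StatsD happens to be down.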
The following list goes from the simplest to the most complex metric type.
- Counter: Increment a number and flush periodically.
- Gauge: Set a value (deltas are also supported) and flush periodically. If the value has not changed since the last flush, the previous value is sent again.
- Timing: Send timing data. StatsD automatically calculates different metrics from it, such as percentiles, mean, standard deviation, sum, and upper and lower bounds, and flushes them periodically.
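The flush behaviour of the three metric types above can be sketched with a toy in-memory model. This is not StatsD's actual implementation. The class ToyStatsD and the output keys such as req.mean are made up for illustration, and only a few of the timing statistics are computed.

```python
import statistics


class ToyStatsD:
    """Toy model of counter, gauge and timing buckets (not real StatsD)."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}
        self.timers = {}

    def incr(self, key, value=1):
        self.counters[key] = self.counters.get(key, 0) + value

    def gauge(self, key, value, delta=False):
        # With delta=True the value is added to the current gauge value.
        self.gauges[key] = self.gauges.get(key, 0) + value if delta else value

    def timing(self, key, ms):
        self.timers.setdefault(key, []).append(ms)

    def flush(self):
        """Return what would be sent downstream, then reset per-interval state."""
        out = dict(self.counters)  # counters: sum of increments in this interval
        out.update(self.gauges)    # gauges: latest value, kept across flushes
        for key, samples in self.timers.items():
            out[f"{key}.mean"] = statistics.mean(samples)
            out[f"{key}.lower"] = min(samples)
            out[f"{key}.upper"] = max(samples)
            out[f"{key}.sum"] = sum(samples)
        self.counters.clear()      # counters start from scratch
        self.timers.clear()        # so do the timing samples
        return out
```

After one flush the counter and timing buckets are empty again, while a gauge keeps reporting its previous value until it is set anew.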
It is possible to sample counters. When sampling, the client sends only the given fraction of the metrics but includes the sample rate in each one, so the StatsD server can compensate for it. For example, when sending 3 with a sample rate of 1/10, the resulting bucket will have the value 30 (3 * (1/10)^-1).
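The sampling arithmetic can be sketched in Python as follows. The |@rate suffix is the StatsD way of tagging a sampled counter. The two functions are hypothetical stand-ins for the client and server sides and are written only to show the compensation step.

```python
import random


def sampled_packet(bucket, value, rate, rng=random.random):
    """Client side: emit the metric only `rate` of the time, tagged with the rate."""
    if rng() < rate:
        return f"{bucket}:{value}|c|@{rate}"
    return None  # dropped by sampling


def apply_packet(counters, packet):
    """Server side: scale the value by 1/rate to make up for the dropped packets."""
    name, rest = packet.split(":", 1)
    fields = rest.split("|")
    value = float(fields[0])
    rate = float(fields[2][1:]) if len(fields) > 2 else 1.0
    counters[name] = counters.get(name, 0.0) + value * (1 / rate)
    return counters
```

Feeding the packet foobar:3|c|@0.1 to apply_packet adds 30 to the foobar bucket, which matches the example above.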
StatsD also has a Set metric type, but I have no experience using it.
Here we assume that the flush interval is the default, 10s.
Let's increment a counter at key foobar over time as follows.
Increment: 2 3 2 1    3  2  8   1 1 2 3   2  1  1    1   2    3  4  1
T:         1   5    10   15   20   25   30   35   40   45   50   55   60
InfluxDB then receives this after 60 seconds:
time foobar
---- ------
0
10 8
20 13
30 7
40 4
50 3
60 8
Table 1. InfluxDB contents after StatsD has flushed the values
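Table 1 can be reproduced with a short Python sketch that sums the increments per flush window. The exact timestamps below are hypothetical. They are chosen only so that every increment lands in the same 10s window as in the table.

```python
FLUSH_INTERVAL = 10  # seconds, the StatsD default assumed in the text

# (time in seconds, increment); timestamps are made up to be consistent
# with the per-window sums of Table 1.
EVENTS = [
    (1, 2), (3, 3), (5, 2), (7, 1),
    (12, 3), (15, 2), (18, 8),
    (21, 1), (23, 1), (25, 2), (27, 3),
    (31, 2), (34, 1), (37, 1),
    (42, 1), (46, 2),
    (51, 3), (54, 4), (57, 1),
]


def flushed_counts(events, interval=FLUSH_INTERVAL):
    """Sum the increments per flush window, keyed by the flush time."""
    totals = {}
    for t, inc in events:
        flush_time = -(-t // interval) * interval  # round t up to the next flush
        totals[flush_time] = totals.get(flush_time, 0) + inc
    return dict(sorted(totals.items()))


print(flushed_counts(EVENTS))
```

Each key is a flush time and each value is the number of increments StatsD accumulated during the 10s before it.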
What is the unit of the values returned by the following query?
select foobar from some_measurement;
The correct answer is increments / flush_interval, and in my opinion this is what makes things complicated. Neither StatsD nor InfluxDB does any magic, and Grafana just draws whatever data we throw at it. Gauges are easier to understand, since a gauge is always just a plain number at a given point in time. For example, a gauge that tracks logged-in users displays just that and nothing else:
select logged_users from some_measurement;
gives a table of logged-in users at each flush_interval.
Having a metric with the unit increments / 10s, for example, is not very intuitive. We need to do better than that. That's why InfluxDB offers the group by clause, which can be used to group the data over a given field. Probably the most useful feature is the ability to group over time. For example, running the following query against the data in Table 1:
select sum(foobar) from some_measurement group by time(20s)
gives the following table:
time foobar
---- ------
0
20 21
40 11
60 11
Table 2. Data grouped into 20s buckets
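The regrouping from Table 1 to Table 2 can be sketched in Python. The function group_sum is a hypothetical helper that follows the labelling used in Table 2, where each bucket is named after the time at which it ends. The bucket boundaries and labels that InfluxDB itself reports may differ, so treat this as an illustration of the summing only.

```python
TABLE_1 = {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}  # time -> foobar


def group_sum(points, width):
    """Sum values per `width`-second bucket, labelled by bucket end as in Table 2."""
    buckets = {}
    for t, value in points.items():
        end = -(-t // width) * width  # round t up to the end of its bucket
        buckets[end] = buckets.get(end, 0) + value
    return dict(sorted(buckets.items()))


print(group_sum(TABLE_1, 20))
```

The resulting values now have the unit increments / 20s instead of increments / 10s.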
Notice that you should always give an aggregation function when working with group by. Without one, InfluxDB does not know how to aggregate the values that fall into the same bucket. Choosing the correct aggregation function depends on the input data. For example, if you aggregate counters, you almost always want to use the sum function, because it adds the values in one bucket together. That is exactly what StatsD does within the 10s interval! With gauges, however, you most definitely don't want to use sum, because adding together your current user counts at given points in time does not make sense. For that, the first or last functions are probably more suitable. Or, if you want a metric about users logging in and out, you can use the derivative function. It really depends on your use case.
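To see why the choice of aggregation function matters, here is a small Python sketch that applies sum and last to a gauge. The readings in logged_users are hypothetical, the helpers are not InfluxDB functions, and the list of differences at the end is only a rough, unit-less stand-in for what a derivative would express.

```python
def bucketize(points, width):
    """Group (time, value) pairs into lists per `width`-second bucket."""
    buckets = {}
    for t, value in sorted(points):
        buckets.setdefault(-(-t // width) * width, []).append(value)
    return buckets


def aggregate(points, width, func):
    """Apply `func` to the values of every bucket."""
    return {end: func(vals) for end, vals in bucketize(points, width).items()}


def last(values):
    return values[-1]


# Hypothetical gauge readings of logged-in users, one per 10s flush.
logged_users = [(10, 5), (20, 7), (30, 6), (40, 9)]

per_20s_sum = aggregate(logged_users, 20, sum)    # adds user counts: meaningless
per_20s_last = aggregate(logged_users, 20, last)  # users at the end of each bucket

# Change between consecutive readings (net logins minus logouts).
changes = [b - a for (_, a), (_, b) in zip(logged_users, logged_users[1:])]
```

Here sum reports 12 and 15 users for the two buckets even though there were never more than 9 logged in, while last reports 7 and 9.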
To familiarize yourself with the different functions and what they do, log in to your InfluxDB and play around with the functions listed in [2].
Now that we know how to group data, we might ask what happens if some bucket does not contain a value. For example, if we grouped by 1s, the buckets for seconds 1-9 would be empty because there is no data. InfluxDB does not try to guess a value there; the bucket really is empty. This is also shown in both Tables 1 and 2 at time 0.
InfluxDB has a fill option that can be used to fill the null values with some number, usually zero. Another way is to tell Grafana how to handle null values. When editing a metric in Grafana, there is a "Null value" option under the Display tab that can be set to "connected", "null" or "null as zero". Connected means that Grafana will just ignore the empty value and connect the two data points with a line. Null will not draw a line at all, and null as zero will draw a line at zero for those points.
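The effect of filling empty buckets can be sketched in Python by regrouping the Table 1 data into 5s buckets, which leaves every other bucket without data. The helper group_sum_with_fill is hypothetical and only mimics the idea of replacing nulls with a fixed number.

```python
TABLE_1 = {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}  # time -> foobar


def group_sum_with_fill(points, width, end, fill=None):
    """Sum per `width`-second bucket up to `end`; empty buckets get `fill`."""
    result = {}
    for bucket_end in range(width, end + 1, width):
        values = [v for t, v in points.items()
                  if bucket_end - width < t <= bucket_end]
        result[bucket_end] = sum(values) if values else fill
    return result


print(group_sum_with_fill(TABLE_1, 5, 20))          # empty buckets stay None
print(group_sum_with_fill(TABLE_1, 5, 20, fill=0))  # empty buckets become 0
```

With fill left at None the gaps remain nulls for the visualization layer to handle, whereas fill=0 turns them into explicit zeros.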
tl;dr: To get a sensible value from the counter, use:
select sum(requests) from web_app group by time(1m) fill(0);
This will return requests/min.