OsQu/doc.md

    StatsD + InfluxDB + Grafana

  +-------------+       +--------+     Flush     +----------+
  |             |  Push |        |  Periodically |          |
  | Application +-------> StatsD +---------------> InfluxDB |
  |             |       |        |               |          |
  +-------------+       +--------+               +-----^----+
                                                       |
                                                       |
                                                       |
                                                 +-----+----+
                                                 |          |
                                                 | Grafana  |
                                                 |          |
                                                 +----------+

    Figure 1. Data flow

How data flows from one application to another

Figure 1 shows the data flow. The application pushes updates to different StatsD
buckets over UDP (or optionally TCP). StatsD flushes them periodically
(every 10s by default) to InfluxDB. Data visualization tools such
as Grafana can then use InfluxDB's query language [1] to fetch
and visualize the data.
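To make the "push over UDP" step concrete, here is a minimal sketch of what a client does. It assumes the plain-text StatsD line format (name:value|type) and the usual default port 8125; the function names are made up for this example, and a real project would more likely use an existing StatsD client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'foobar:1|c' for a counter increment."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one metric line to StatsD over UDP.

    host and port are assumptions; 8125 is the usual StatsD default.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Example: send_metric(format_metric("foobar", 1, "c"))
```

Because UDP is connectionless, the application does not wait for StatsD and does not notice if the packet is lost.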

Different StatsD metric types explained

The list goes from the simplest metric type to the most complex.

Counter: Increment a number and flush it periodically.

Gauge: Set a value (deltas are also supported) and flush it periodically. If the
value has not changed since the last flush, the previous value is sent again.

Timing: Send timing data. StatsD automatically calculates different metrics
from it, such as percentiles, mean, standard deviation, sum, and upper and
lower bounds, and flushes them periodically.
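To get a feeling for what StatsD derives from timing data, the following is a simplified sketch of the aggregates for one flush interval. It is not StatsD's actual implementation: the function name, the key names, the nearest-rank percentile and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
import statistics


def summarize_timings(values_ms, percentile=90):
    """Aggregate the timing samples collected during one flush interval."""
    ordered = sorted(values_ms)
    # Nearest-rank percentile: the smallest sample that is >= `percentile`
    # percent of all samples (a simplification, see the text above).
    rank = math.ceil(percentile / 100 * len(ordered))
    return {
        "count": len(ordered),
        "lower": ordered[0],
        "upper": ordered[-1],
        "sum": sum(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.pstdev(ordered),
        f"upper_{percentile}": ordered[rank - 1],
    }
```

So a single timing key fans out into several values per flush, which is why timings are the most complex type in the list.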

It is possible to sample counters. When sampling, the client sends only the
given portion of the metrics but includes the sample rate in each metric, so the
StatsD server can compensate for it. For example, when sending 3 with a
sample rate of 1/10, the resulting bucket will have the value 30 (3 * (1/10)^-1).

StatsD also has a Set metric type, but I have no experience using it.
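The sample rate arithmetic for counters can be sketched like this. The function names are hypothetical; the only thing taken from StatsD is the "|@rate" suffix of the line format. Fraction is used so that the example from the text comes out as exactly 30.

```python
from fractions import Fraction


def format_sampled_counter(name, value, sample_rate):
    """Build a counter line that carries its sample rate, e.g. 'foobar:3|c|@0.1'."""
    return f"{name}:{value}|c|@{float(sample_rate)}"


def compensated_value(value, sample_rate):
    """What the server adds to the bucket: value * (sample_rate)^-1."""
    return value / sample_rate


rate = Fraction(1, 10)
line = format_sampled_counter("foobar", 3, rate)
bucket_value = compensated_value(3, rate)  # Fraction(30, 1), i.e. 30
```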

So what goes to InfluxDB?

Here we assume that the flush interval is the default, 10s.
Let's increment a counter at key foobar over time as follows:

    Increment: 2 3    2 1  3  2    8   1  1  2  3   2    1  1   1    2     3     4  1
    T:         1    5    10    15    20    25    30    35    40    45    50    55     60

InfluxDB then receives this after 60 seconds:
                time      foobar
                ----      ------
                0
                10        8
                20        13
                30        7
                40        4
                50        3
                60        8

    Table 1. InfluxDB after StatsD has flushed the values
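The flushing can be reproduced with a small simulation. The exact seconds of the increments are my approximate reading of the timeline and therefore an assumption; only the increments themselves and the 10s interval come from the text.

```python
import math
from collections import defaultdict

FLUSH_INTERVAL = 10  # seconds, the StatsD default

# (second, increment) pairs. The seconds are approximate readings of the
# timeline and are an assumption; the increments are the ones listed there.
EVENTS = [
    (1, 2), (3, 3), (7, 2), (9, 1),
    (12, 3), (14, 2), (18, 8),
    (22, 1), (24, 1), (27, 2), (29, 3),
    (32, 2), (37, 1), (39, 1),
    (42, 1), (47, 2),
    (52, 3), (56, 4), (58, 1),
]


def flush_counters(events, interval=FLUSH_INTERVAL):
    """Sum all increments that arrive before each flush, keyed by flush time."""
    buckets = defaultdict(int)
    for second, increment in events:
        flush_time = math.ceil(second / interval) * interval
        buckets[flush_time] += increment
    return dict(sorted(buckets.items()))


flushed = flush_counters(EVENTS)
# flushed == {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}, the same as Table 1
```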

And what does this all actually mean?

What is the unit of the query select foobar from some_measurement;? The
correct answer is increments / flush_interval, and this is what in my opinion
makes things complicated. Neither StatsD nor InfluxDB does any magic, and Grafana
just draws whatever data we throw at it. Gauges are easier to understand, since
a gauge is always just a plain number at a given point in time. For example, a gauge
that tracks logged-in users displays just that and nothing else:
select logged_users from some_measurement; gives a table of logged-in users at
each flush interval.
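Since the raw counter values are in increments / flush_interval, one way to read them is to divide by the interval and get increments per second. A small sketch using the values of Table 1 (the function name is made up for this example):

```python
import math

FLUSH_INTERVAL = 10  # seconds

# Counter values as stored after each flush (Table 1).
TABLE_1 = {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}


def per_second(points, flush_interval=FLUSH_INTERVAL):
    """Convert increments per flush interval into increments per second."""
    return {time: value / flush_interval for time, value in points.items()}


rates = per_second(TABLE_1)
# The value 8 at time 10 means 8 increments per 10s, i.e. about 0.8 per second.
```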

A metric with the unit increments / 10s, for example, is not very intuitive. We
need to do better than that. That's why InfluxDB offers the group by clause, which
can be used to group the data over a given field. Probably the most useful feature is
the ability to group over time. For example, using the data in Table 1 with the
query select sum(foobar) from some_measurement group by time(20s), we would get
the following table:
                time      foobar
                ----      ------
                0
                20        21
                40        11
                60        11

    Table 2. Data grouped by 20s
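The grouping that produces Table 2 can be sketched in a few lines. This mirrors the tables in this document, where each bucket is labelled by its end time; that labelling is an assumption of the sketch, and the bucket boundaries InfluxDB actually uses may differ.

```python
import math

# Counter values as stored after each flush (Table 1).
TABLE_1 = {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}


def group_by_time(points, width, aggregate=sum):
    """Put each point into a bucket of `width` seconds and aggregate per bucket."""
    grouped = {}
    for time, value in points.items():
        # Label the bucket by its end time, as the tables here do (assumption).
        label = math.ceil(time / width) * width
        grouped.setdefault(label, []).append(value)
    return {label: aggregate(values) for label, values in sorted(grouped.items())}


table_2 = group_by_time(TABLE_1, 20)
# table_2 == {20: 21, 40: 11, 60: 11}
```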

Notice that you should always give an aggregation function when working with
group by. Without one, InfluxDB does not know how to aggregate the values
that fall into the same bucket. Choosing the correct aggregation function
depends on the input data. For example, if you aggregate counters, you almost
always want to use the sum function, because it adds the values in one
bucket together, which is exactly what StatsD does within a 10s interval.
With gauges, however, you most definitely don't want to use sum, because adding
together your current user count at given points in time does not make sense. For
that, the first or last functions are probably more suitable. Or if you want
a metric about users logging in and out, you can use the derivative function.
It really depends on your use case.

To familiarize yourself with the different functions and what they do, log in to
your InfluxDB and play around with the functions listed in [2].
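To see why the choice of aggregation function matters for gauges, here is a small sketch with made-up logged_users values. The data and helper names are hypothetical, and differences() is only a rough stand-in for what a derivative function gives you.

```python
# Hypothetical gauge readings: (flush time, logged-in users).
GAUGE = [(10, 5), (20, 7), (30, 7), (40, 6), (50, 9), (60, 8)]


def aggregate_pairs(points, func):
    """Aggregate two consecutive 10s flushes into one 20s bucket."""
    values = [value for _, value in points]
    return [func(values[i:i + 2]) for i in range(0, len(values), 2)]


def last(values):
    """Keep only the most recent value in a bucket."""
    return values[-1]


def differences(points):
    """Change between consecutive readings (users logging in minus out)."""
    values = [value for _, value in points]
    return [after - before for before, after in zip(values, values[1:])]


summed = aggregate_pairs(GAUGE, sum)    # [12, 13, 17], not a real user count
latest = aggregate_pairs(GAUGE, last)   # [7, 6, 8], a sensible gauge value
changes = differences(GAUGE)            # [2, 0, -1, 3, -1]
```

The summed values are larger than any user count that was ever reported, which is the problem with using sum on gauges.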

What happens to empty values

Now that we know how to group data, we might ask what happens if some bucket
does not contain a value. For example, if we grouped by 1s, the buckets
for seconds 1-9 would be empty because there is no data. InfluxDB does not try to
guess any value there; the bucket really is empty. This is also shown in both Tables
1 and 2 at time 0.

InfluxDB has a fill option that can be used to fill the null values with some
number, usually zero. Another way is to tell Grafana how to handle null
values. When editing a metric in Grafana, there is a "Null value" option under
the Display tab that can be set to "connected", "null" or "null as zero".
Connected means that Grafana will just ignore the empty value and connect the
two data points with a line. Null will not draw a line at all, and null as
zero will draw a zero line for those points.
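The effect of filling can be sketched as follows, using the Table 1 values and a 5s step so that every second bucket is empty. This only illustrates the idea of fill; the function name and parameters are made up and it is not how InfluxDB implements it.

```python
# Counter values as stored after each flush (Table 1).
TABLE_1 = {10: 8, 20: 13, 30: 7, 40: 4, 50: 3, 60: 8}


def fill(points, start, stop, step, value=None):
    """Return one entry per bucket, using `value` where no data exists."""
    return {time: points.get(time, value) for time in range(start, stop + 1, step)}


with_nulls = fill(TABLE_1, 0, 60, 5)            # empty buckets stay None
with_zeros = fill(TABLE_1, 0, 60, 5, value=0)   # empty buckets become 0
```

With None left in place, the drawing tool has to decide what to do with the gaps, which is what the Grafana "Null value" option controls.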

Conclusion

tl;dr: To get a sensible value from the counter, use:
select sum(requests) from web_app group by time(1m) fill(0);

This will return requests/min.
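If you want to run such a query from a script instead of the InfluxDB console, a rough sketch is to build the request URL yourself. This assumes the InfluxDB 1.x HTTP API with its /query endpoint and the db and q parameters; the host, port and database name are hypothetical, and the snippet only builds the URL without sending anything.

```python
from urllib.parse import urlencode

QUERY = "select sum(requests) from web_app group by time(1m) fill(0);"


def build_query_url(query, database, base_url="http://localhost:8086"):
    """Build a URL for the /query endpoint (assumes the InfluxDB 1.x HTTP API)."""
    params = urlencode({"db": database, "q": query})
    return f"{base_url}/query?{params}"


# "metrics" is a hypothetical database name.
url = build_query_url(QUERY, "metrics")
```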