Sampling

Overview

pointsWrittenOK represents the cumulative number of values successfully written to InfluxDB since the last restart.

Therefore, each time we read the debug/vars endpoint and inspect the value of pointsWrittenOK, we see the total number of values written successfully up to that point in time. InfluxDB updates this counter each time it successfully writes data, and when a request is made to debug/vars, the current value of the counter is reported as pointsWrittenOK.
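For illustration, here is a minimal sketch of reading the counter over HTTP in Python. The URL and the exact location of pointsWrittenOK inside the JSON response are assumptions; check your own server's /debug/vars output for the real layout.

```python
# A minimal sketch of reading pointsWrittenOK from the /debug/vars endpoint.
# The URL and the exact shape of the JSON are assumptions; inspect your own
# server's /debug/vars output to see where the counter actually lives.
import json
import urllib.request

DEBUG_VARS_URL = "http://localhost:8086/debug/vars"  # assumed default bind address


def read_points_written_ok() -> int:
    """Fetch /debug/vars and return the current value of pointsWrittenOK."""
    with urllib.request.urlopen(DEBUG_VARS_URL) as resp:
        stats = json.load(resp)
    # The counter is nested under a module-specific key, so walk the top-level
    # blocks looking for a "values" map that contains pointsWrittenOK.
    for block in stats.values():
        if isinstance(block, dict):
            values = block.get("values", block)
            if isinstance(values, dict) and "pointsWrittenOK" in values:
                return int(values["pointsWrittenOK"])
    raise KeyError("pointsWrittenOK not found in /debug/vars output")


if __name__ == "__main__":
    print(read_points_written_ok())
```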

Scenario

InfluxDB has just been started and the time is 2018-12-10T00:00:00Z. Imagine the following sequence of events:

  • At 00:00:00Z (0s since we restarted), we read debug/vars; pointsWrittenOK returns 0, because we haven't written anything yet.
  • At 00:00:10Z (10s since we restarted), we write 5 values to InfluxDB.
  • At 00:00:11Z, we again read debug/vars and now pointsWrittenOK returns 5, because we have written a total of 5 values.
  • At 00:00:13Z, reading debug/vars again still produces a value of 5 for pointsWrittenOK, because we have written no more data.

If we keep reading debug/vars without writing another value, pointsWrittenOK will return 5 indefinitely.

Continuing the narrative,

  • At 00:00:15Z, we write 10 values.
  • At 00:00:20Z, we read debug/vars and pointsWrittenOK returns 15, because we have written a total of 15 values since we started.

If we restart InfluxDB and don't write any values, pointsWrittenOK will return 0, because the counter resets to 0 after a restart.

Something you might have observed is that at any point in time, reading pointsWrittenOK only tells us the total number of values written. How can we use this value to give us a rate of change?

Sampling

Another term for reading the debug/vars endpoint is sampling.

Imagine InfluxDB receives the following sequence of writes:

| time of write (since restart) | # of values written | pointsWrittenOK counter |
| --- | --- | --- |
| 0s | 0 | 0 |
| 1s | 5 | 5 |
| 3s | 2 | 7 |
| 9s | 10 | 17 |
| 11s | 1 | 18 |
| 16s | 3 | 21 |
| 22s | 1 | 22 |

Note that pointsWrittenOK is just the sum of the # of values written.
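To make that concrete, here is a tiny sketch (using the write sizes from the table above) showing that the counter is just a running total:

```python
# pointsWrittenOK is a running total (cumulative sum) of the values written.
# The write sizes are taken from the table above (at 0s, 1s, 3s, 9s, 11s, 16s, 22s).
from itertools import accumulate

values_written = [0, 5, 2, 10, 1, 3, 1]
points_written_ok = list(accumulate(values_written))
print(points_written_ok)  # [0, 5, 7, 17, 18, 21, 22]
```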

If we read, or sample, pointsWrittenOK every 10 seconds and store these values somewhere (for example, in the _internal database), we now have a history of these values. I've replicated the previous table and added rows for the 10-second intervals at which we would sample the value of pointsWrittenOK:

| time (since restart) | # of values written | pointsWrittenOK | sample |
| --- | --- | --- | --- |
| 0s | 0 | 0 | ☑️ |
| 1s | 5 | 5 | |
| 3s | 2 | 7 | |
| 9s | 10 | 17 | |
| 10s | 0 | 17 | ☑️ |
| 11s | 1 | 18 | |
| 16s | 3 | 21 | |
| 20s | 0 | 21 | ☑️ |
| 22s | 1 | 22 | |
| 30s | 0 | 22 | ☑️ |

Summarized, we now have the following list of values for pointsWrittenOK as a new time series in the _internal database, at 10-second intervals:

| time | pointsWrittenOK |
| --- | --- |
| 0s | 0 |
| 10s | 17 |
| 20s | 21 |
| 30s | 22 |
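A sampler that produces a series like this could look roughly like the sketch below; read_counter stands in for whatever function reads pointsWrittenOK (for example, the reader sketched earlier). In practice the _internal database or telegraf does this collection for you.

```python
# A rough sketch of a sampler: poll the counter at a fixed interval and record
# (elapsed seconds, counter value) pairs, producing a series like the one above.
import time

SAMPLE_INTERVAL = 10  # seconds


def sample_counter(read_counter, num_samples, interval=SAMPLE_INTERVAL):
    """Poll read_counter() every `interval` seconds, `num_samples` times."""
    samples = []
    start = time.monotonic()
    for _ in range(num_samples):
        elapsed = round(time.monotonic() - start)
        samples.append((elapsed, read_counter()))
        time.sleep(interval)
    return samples
```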

These still only show the total number of values written at that point in time. If we take the differences of adjacent pointsWrittenOK values, we know how many points were written during each 10-second period. In a table, the values would be:

| time | pointsWrittenOK | difference (from previous) |
| --- | --- | --- |
| 0s | 0 | 0 |
| 10s | 17 | 17 (17 - 0) |
| 20s | 21 | 4 (21 - 17) |
| 30s | 22 | 1 (22 - 21) |

InfluxQL provides a function for this called NON_NEGATIVE_DERIVATIVE. It is a little more sophisticated than just taking the difference between adjacent values: it accounts for the time elapsed between samples in order to normalize the units, which matters especially if your sample rate is not consistent. With that data, we can now determine the number of points written per 10-second interval.
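As a rough illustration of the idea (and not InfluxDB's actual implementation), the sketch below divides the difference between adjacent samples by the time between them, scales the result to a 10-second unit, and clamps negative values to zero; on the sampled series above it reproduces the difference column:

```python
# A rough approximation of what NON_NEGATIVE_DERIVATIVE computes (not the actual
# InfluxDB implementation): the difference between adjacent samples, normalized
# by the time between them, scaled to the requested unit, and clamped at zero
# (a negative difference would indicate something like a counter reset).
samples = [(0, 0), (10, 17), (20, 21), (30, 22)]  # (seconds, pointsWrittenOK)


def non_negative_derivative(samples, unit_seconds):
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rate = (v1 - v0) / (t1 - t0) * unit_seconds
        rates.append((t1, max(rate, 0.0)))
    return rates


print(non_negative_derivative(samples, unit_seconds=10))
# [(10, 17.0), (20, 4.0), (30, 1.0)] -- points written per 10-second interval
```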

Sample Rate

Another important attribute of sampling is the rate at which the data is sampled. In this example we recorded the samples at 10-second intervals, which happens to be the default behavior for the _internal database and Telegraf. Another way to think of the sample rate is as a measure of accuracy: in our previous example, the smallest interval over which we can accurately report changes is 10 seconds. Observing the interval between 0s and 10s, we see that we wrote 17 points, or 1.7 points per second. This is obviously not a true representation of what actually happened, as the server only took writes at 1s, 3s, and 9s.

If we were to sample the pointsWrittenOK metric every 5 seconds, we would improve the accuracy of our observations, at the cost of:

  • additional CPU resources to sample the data, and
  • additional memory and disk resources to store the additional volume of data.

Here is the same write sequence again, this time sampled at 5-second intervals:

| time (since restart) | # of values written | pointsWrittenOK | sample |
| --- | --- | --- | --- |
| 0s | 0 | 0 | ☑️ |
| 1s | 5 | 5 | |
| 3s | 2 | 7 | |
| 5s | 0 | 7 | ☑️ |
| 9s | 10 | 17 | |
| 10s | 0 | 17 | ☑️ |
| 11s | 1 | 18 | |
| 15s | 0 | 18 | ☑️ |
| 16s | 3 | 21 | |
| 20s | 0 | 21 | ☑️ |
| 22s | 1 | 22 | |
| 25s | 0 | 22 | ☑️ |
| 30s | 0 | 22 | ☑️ |

Our new table, with differences, looks like the following:

| time | pointsWrittenOK | difference (from previous) |
| --- | --- | --- |
| 0s | 0 | 0 |
| 5s | 7 | 7 |
| 10s | 17 | 10 |
| 15s | 18 | 1 |
| 20s | 21 | 3 |
| 25s | 22 | 1 |
| 30s | 22 | 0 |
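Using the sampled series from both tables, a quick sketch of the per-second rates shows how the finer interval localizes the writes more accurately:

```python
# Per-second rates computed from the 10-second and 5-second samples above.
# The 5-second series attributes the writes to narrower windows, so the
# reported rates track what actually happened more closely.
samples_10s = [(0, 0), (10, 17), (20, 21), (30, 22)]
samples_5s = [(0, 0), (5, 7), (10, 17), (15, 18), (20, 21), (25, 22), (30, 22)]


def per_second_rates(samples):
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


print(per_second_rates(samples_10s))  # [(10, 1.7), (20, 0.4), (30, 0.1)]
print(per_second_rates(samples_5s))   # [(5, 1.4), (10, 2.0), (15, 0.2), (20, 0.6), (25, 0.2), (30, 0.0)]
```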

Therefore, the sample rate is a trade-off, where an increase in the rate improves our accuracy at the cost of additional compute and storage resources.
