Sampling

Overview

pointsWrittenOK represents the cumulative number of values successfully written to InfluxDB since the last restart.

Therefore, each time we read the debug/vars endpoint and inspect the value of pointsWrittenOK, we see the total number of values written successfully up to that point in time. InfluxDB updates this counter each time it successfully writes data, and when a request is made to debug/vars, the current value of the counter is reported as pointsWrittenOK.
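For illustration, here is a minimal sketch of reading the counter over HTTP in Python. The URL and the exact location of pointsWrittenOK inside the JSON response are assumptions; check your own server's /debug/vars output for the real layout.

```python
# A minimal sketch of reading pointsWrittenOK from the /debug/vars endpoint.
# The URL and the exact shape of the JSON are assumptions; inspect your own
# server's /debug/vars output to see where the counter actually lives.
import json
import urllib.request

DEBUG_VARS_URL = "http://localhost:8086/debug/vars"  # assumed default bind address


def read_points_written_ok() -> int:
    """Fetch /debug/vars and return the current value of pointsWrittenOK."""
    with urllib.request.urlopen(DEBUG_VARS_URL) as resp:
        stats = json.load(resp)
    # The counter is nested under a module-specific key, so walk the top-level
    # blocks looking for a "values" map that contains pointsWrittenOK.
    for block in stats.values():
        if isinstance(block, dict):
            values = block.get("values", block)
            if isinstance(values, dict) and "pointsWrittenOK" in values:
                return int(values["pointsWrittenOK"])
    raise KeyError("pointsWrittenOK not found in /debug/vars output")


if __name__ == "__main__":
    print(read_points_written_ok())
```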

Scenario

InfluxDB has just been started and the time is 2018-12-10T00:00:00Z. Imagine the following sequence of events:

  • At 00:00:00Z (0s since we restarted), we read debug/vars; pointsWrittenOK returns 0, because we haven't written anything yet.
  • At 00:00:10Z (10s since we restarted), we write 5 values to InfluxDB.
  • At 00:00:11Z, we again read debug/vars and now pointsWrittenOK returns 5, because we have written a total of 5 values.
  • At 00:00:13Z, reading debug/vars again still produces a value of 5 for pointsWrittenOK, because we have written no more data.

If we keep reading debug/vars without writing another value, pointsWrittenOK will return 5 indefinitely.

Continuing the narrative,

  • At 00:00:15Z, we write 10 values.
  • At 00:00:20Z, we read debug/vars and pointsWrittenOK returns 15, because we have written a total of 15 values since we started.

If we restart InfluxDB and don't write any values, pointsWrittenOK will return 0, because the counter resets to 0 after a restart.

Something you might have observed is that at any point in time, reading pointsWrittenOK only tells us the total number of values written. How can we use this value to give us a rate of change?

Sampling

Another term for reading the debug/vars endpoint is sampling.

Imagine InfluxDB receives the following sequence of writes:

| time of write (since restart) | # of values written | pointsWrittenOK counter |
| --- | --- | --- |
| 0s | 0 | 0 |
| 1s | 5 | 5 |
| 3s | 2 | 7 |
| 9s | 10 | 17 |
| 11s | 1 | 18 |
| 16s | 3 | 21 |
| 22s | 1 | 22 |

Note that pointsWrittenOK is just the sum of the # of values written.
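To make that concrete, here is a tiny sketch (using the write sizes from the table above) showing that the counter is just a running total:

```python
# pointsWrittenOK is a running total (cumulative sum) of the values written.
# The write sizes are taken from the table above (at 0s, 1s, 3s, 9s, 11s, 16s, 22s).
from itertools import accumulate

values_written = [0, 5, 2, 10, 1, 3, 1]
points_written_ok = list(accumulate(values_written))
print(points_written_ok)  # [0, 5, 7, 17, 18, 21, 22]
```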

If we read, or sample, pointsWrittenOK every 10 seconds and store these values somewhere (for example, in the _internal database), we now have a history of these values. I've replicated the previous table and added rows for the 10-second intervals at which we would sample the value of pointsWrittenOK:

| time (since restart) | # of values written | pointsWrittenOK | sample |
| --- | --- | --- | --- |
| 0s | 0 | 0 | ☑️ |
| 1s | 5 | 5 | |
| 3s | 2 | 7 | |
| 9s | 10 | 17 | |
| 10s | 0 | 17 | ☑️ |
| 11s | 1 | 18 | |
| 16s | 3 | 21 | |
| 20s | 0 | 21 | ☑️ |
| 22s | 1 | 22 | |
| 30s | 0 | 22 | ☑️ |

Summarized, we now have the following list of values for pointsWrittenOK as a new time series in the _internal database, at 10-second intervals:

| time | pointsWrittenOK |
| --- | --- |
| 0s | 0 |
| 10s | 17 |
| 20s | 21 |
| 30s | 22 |
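A sampler that produces a series like this could look roughly like the sketch below; read_counter stands in for whatever function reads pointsWrittenOK (for example, the reader sketched earlier). In practice the _internal database or telegraf does this collection for you.

```python
# A rough sketch of a sampler: poll the counter at a fixed interval and record
# (elapsed seconds, counter value) pairs, producing a series like the one above.
import time

SAMPLE_INTERVAL = 10  # seconds


def sample_counter(read_counter, num_samples, interval=SAMPLE_INTERVAL):
    """Poll read_counter() every `interval` seconds, `num_samples` times."""
    samples = []
    start = time.monotonic()
    for _ in range(num_samples):
        elapsed = round(time.monotonic() - start)
        samples.append((elapsed, read_counter()))
        time.sleep(interval)
    return samples
```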

These still only show the total number of values written at that point in time. If we take the differences of adjacent pointsWrittenOK values, we know how many points were written during each 10-second period. In a table, the values would be:

| time | pointsWrittenOK | difference (from previous) |
| --- | --- | --- |
| 0s | 0 | 0 |
| 10s | 17 | 17 (17 - 0) |
| 20s | 21 | 4 (21 - 17) |
| 30s | 22 | 1 (22 - 21) |

InfluxQL provides a function for this called NON_NEGATIVE_DERIVATIVE. It is a little more sophisticated than just taking the difference between adjacent values: it accounts for the time elapsed between samples in order to normalize the units, which matters especially if your sample rate is not consistent. With that data, we can now determine the number of points written per 10-second interval.
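As a rough illustration of the idea (and not InfluxDB's actual implementation), the sketch below divides the difference between adjacent samples by the time between them, scales the result to a 10-second unit, and clamps negative values to zero; on the sampled series above it reproduces the difference column:

```python
# A rough approximation of what NON_NEGATIVE_DERIVATIVE computes (not the actual
# InfluxDB implementation): the difference between adjacent samples, normalized
# by the time between them, scaled to the requested unit, and clamped at zero
# (a negative difference would indicate something like a counter reset).
samples = [(0, 0), (10, 17), (20, 21), (30, 22)]  # (seconds, pointsWrittenOK)


def non_negative_derivative(samples, unit_seconds):
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rate = (v1 - v0) / (t1 - t0) * unit_seconds
        rates.append((t1, max(rate, 0.0)))
    return rates


print(non_negative_derivative(samples, unit_seconds=10))
# [(10, 17.0), (20, 4.0), (30, 1.0)] -- points written per 10-second interval
```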

Sample Rate

Another important attribute of sampling is the rate at which the data is sampled. In this example we recorded the samples at 10-second intervals, which happens to be the default behavior for the _internal database and Telegraf. Another way to think of the sample rate is as a measure of accuracy: in our previous example, the smallest interval over which we can accurately report changes is 10 seconds. Observing the interval between 0s and 10s, we see that we wrote 17 points, or 1.7 points per second. This is obviously not a true representation of what actually happened, as the server only took writes at 1s, 3s, and 9s.

If we were to sample the pointsWrittenOK metric every 5 seconds, we would improve the accuracy of our observations, at the cost of:

  • additional CPU resources to sample the data, and
  • additional memory and disk resources to store the additional volume of data.

Here is the same write sequence again, this time sampled at 5-second intervals:

| time (since restart) | # of values written | pointsWrittenOK | sample |
| --- | --- | --- | --- |
| 0s | 0 | 0 | ☑️ |
| 1s | 5 | 5 | |
| 3s | 2 | 7 | |
| 5s | 0 | 7 | ☑️ |
| 9s | 10 | 17 | |
| 10s | 0 | 17 | ☑️ |
| 11s | 1 | 18 | |
| 15s | 0 | 18 | ☑️ |
| 16s | 3 | 21 | |
| 20s | 0 | 21 | ☑️ |
| 22s | 1 | 22 | |
| 25s | 0 | 22 | ☑️ |
| 30s | 0 | 22 | ☑️ |

Our new table, with differences, looks like the following:

| time | pointsWrittenOK | difference (from previous) |
| --- | --- | --- |
| 0s | 0 | 0 |
| 5s | 7 | 7 |
| 10s | 17 | 10 |
| 15s | 18 | 1 |
| 20s | 21 | 3 |
| 25s | 22 | 1 |
| 30s | 22 | 0 |
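Using the sampled series from both tables, a quick sketch of the per-second rates shows how the finer interval localizes the writes more accurately:

```python
# Per-second rates computed from the 10-second and 5-second samples above.
# The 5-second series attributes the writes to narrower windows, so the
# reported rates track what actually happened more closely.
samples_10s = [(0, 0), (10, 17), (20, 21), (30, 22)]
samples_5s = [(0, 0), (5, 7), (10, 17), (15, 18), (20, 21), (25, 22), (30, 22)]


def per_second_rates(samples):
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


print(per_second_rates(samples_10s))  # [(10, 1.7), (20, 0.4), (30, 0.1)]
print(per_second_rates(samples_5s))   # [(5, 1.4), (10, 2.0), (15, 0.2), (20, 0.6), (25, 0.2), (30, 0.0)]
```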

Therefore, the sample rate is a trade-off, where an increase in the rate improves our accuracy at the cost of additional compute and storage resources.
