`pointsWrittenOK` represents the cumulative number of values written to InfluxDB since the last restart. Therefore, each time we read the `debug/vars` endpoint and inspect the value of `pointsWrittenOK`, it will return the total number of values written successfully at that point in time. InfluxDB updates this counter each time it successfully writes data. When a request is made to `debug/vars`, the current value of this counter is reported as `pointsWrittenOK`.
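To make this concrete, here is a small Python sketch of pulling the counter out of a `debug/vars` response. The JSON shape below is only an illustrative assumption (the exact keys vary across InfluxDB versions); the point is that the endpoint reports a single cumulative value:

```python
import json

# A trimmed, hypothetical example of what a debug/vars response might
# contain. The key layout here is an assumption for illustration only.
sample_response = """
{
  "httpd::8086": {
    "name": "httpd",
    "tags": {"bind": ":8086"},
    "values": {"pointsWrittenOK": 5, "writeReq": 1}
  }
}
"""

def points_written_ok(payload: str) -> int:
    """Scan a debug/vars-style payload for the pointsWrittenOK counter."""
    data = json.loads(payload)
    for entry in data.values():
        if isinstance(entry, dict) and "pointsWrittenOK" in entry.get("values", {}):
            return entry["values"]["pointsWrittenOK"]
    raise KeyError("pointsWrittenOK not found")

print(points_written_ok(sample_response))  # → 5
```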
InfluxDB has just been started and the time is `2018-12-10T00:00:00Z`. Imagine the following sequence of events:
- At `00:00:00Z` (0s since we restarted), we read `debug/vars`; `pointsWrittenOK` returns `0`, because we haven't written anything yet.
- At `00:00:10Z` (10s since we restarted), we write 5 values to InfluxDB.
- At `00:00:11Z`, we again read `debug/vars` and now `pointsWrittenOK` returns `5`, because we have written a total of 5 values.
- At `00:00:13Z`, reading `debug/vars` again still produces a value of `5` for `pointsWrittenOK`, given we have written no more data.
If we keep reading `debug/vars` without writing another value, `pointsWrittenOK` will return `5` indefinitely.
Continuing the narrative:

- At `00:00:15Z`, we write 10 values.
- At `00:00:20Z`, we read `debug/vars` and `pointsWrittenOK` returns `15`, because we have written a total of 15 values since we started.
If we restart InfluxDB and don't write any values, it will return `0`, because `pointsWrittenOK` resets to 0 after a restart.
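The counter behavior described so far can be sketched with a toy model. This `WriteCounter` class is purely illustrative (it is not InfluxDB code), but it captures the three rules: the counter accumulates on writes, reads don't change it, and a restart resets it to zero:

```python
class WriteCounter:
    """Toy model of a cumulative counter like pointsWrittenOK."""

    def __init__(self):
        self.points_written_ok = 0

    def write(self, n):
        # Each successful write adds to the cumulative total.
        self.points_written_ok += n

    def restart(self):
        # The counter lives in memory, so a restart resets it to zero.
        self.points_written_ok = 0


c = WriteCounter()
print(c.points_written_ok)  # → 0 (nothing written yet)
c.write(5)
print(c.points_written_ok)  # → 5
print(c.points_written_ok)  # → 5 (reading doesn't change the counter)
c.write(10)
print(c.points_written_ok)  # → 15
c.restart()
print(c.points_written_ok)  # → 0 (reset after restart)
```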
Something you might have observed is that at any point in time, reading `pointsWrittenOK` only tells us the total number of values written. How can we use this value to give us a rate of change?
Another term for reading the `debug/vars` endpoint is sampling.
Imagine InfluxDB receives the following sequence of writes:
| time of write (since restart) | # of values written | pointsWrittenOK counter |
|---|---|---|
| 0s | 0 | 0 |
| 1s | 5 | 5 |
| 3s | 2 | 7 |
| 9s | 10 | 17 |
| 11s | 1 | 18 |
| 16s | 3 | 21 |
| 22s | 1 | 22 |
Note that `pointsWrittenOK` is just the running sum of the # of values written.
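A quick Python check of this, using the values from the table above (`itertools.accumulate` computes exactly this kind of running sum):

```python
from itertools import accumulate

# "# of values written" column from the table above
writes = [0, 5, 2, 10, 1, 3, 1]

# pointsWrittenOK at each write is the cumulative sum of the writes
counter = list(accumulate(writes))
print(counter)  # → [0, 5, 7, 17, 18, 21, 22]
```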
If we read, or sample, `pointsWrittenOK` every 10 seconds and store these values somewhere (such as the `_internal` database), we now have a history of these values. I've replicated the previous table and included the 10s intervals at which we would sample the value of `pointsWrittenOK`:
| time (since restart) | # of values written | pointsWrittenOK | sample |
|---|---|---|---|
| 0s | 0 | 0 | ☑️ |
| 1s | 5 | 5 | |
| 3s | 2 | 7 | |
| 9s | 10 | 17 | |
| 10s | 0 | 17 | ☑️ |
| 11s | 1 | 18 | |
| 16s | 3 | 21 | |
| 20s | 0 | 21 | ☑️ |
| 22s | 1 | 22 | |
| 30s | 0 | 22 | ☑️ |
Summarized, we now have the following list of values for `pointsWrittenOK` as a new time series in the `_internal` database, at 10-second intervals:
| time | pointsWrittenOK |
|---|---|
| 0s | 0 |
| 10s | 17 |
| 20s | 21 |
| 30s | 22 |
These still only show the total number of values written at that point in time. If we take the differences of adjacent `pointsWrittenOK` values, we know how many points were written during each 10-second period. In a table, the values would be:
| time | pointsWrittenOK | difference (from previous) |
|---|---|---|
| 0s | 0 | 0 |
| 10s | 17 | 17 (17 - 0) |
| 20s | 21 | 4 (21 - 17) |
| 30s | 22 | 1 (22 - 21) |
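Computing those differences is a one-liner in Python; here is a quick sketch using the sampled values from the table above:

```python
# pointsWrittenOK sampled every 10 seconds
samples = [0, 17, 21, 22]

# difference between each pair of adjacent samples
diffs = [after - before for before, after in zip(samples, samples[1:])]
print(diffs)  # → [17, 4, 1]
```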
InfluxQL provides a function for this called `NON_NEGATIVE_DERIVATIVE`. It is a little more sophisticated than simply taking the difference between adjacent values, as it accounts for the elapsed time between samples to normalize the units, which matters especially if your sample rate is not consistent. With that data, we can now determine the number of points written per 10-second interval.
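The core idea can be sketched in a few lines of Python. This is only a simplified model of what `NON_NEGATIVE_DERIVATIVE` computes, not InfluxDB's implementation: each adjacent pair of samples becomes a rate scaled to a chosen unit of time, and negative results (such as those caused by a counter reset) are dropped:

```python
def non_negative_derivative(points, unit=1.0):
    """Simplified sketch of a non-negative derivative.

    points: list of (time_seconds, value) samples.
    unit:   the time window to normalize the rate to, in seconds.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        # Scale the difference by elapsed time, so uneven sample
        # intervals still produce comparable rates per `unit` seconds.
        rate = (v1 - v0) / (t1 - t0) * unit
        if rate >= 0:  # drop negative rates, e.g. after a counter reset
            rates.append((t1, rate))
    return rates

# (time, pointsWrittenOK) samples from the table above,
# normalized to points per 10-second interval
samples = [(0, 0), (10, 17), (20, 21), (30, 22)]
print(non_negative_derivative(samples, unit=10))
# → [(10, 17.0), (20, 4.0), (30, 1.0)]
```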
Another important attribute of sampling is the rate at which the data is sampled. In this example we recorded the samples at 10-second intervals, which happens to be the default behavior for the `_internal` database and Telegraf. Another way to think of the sample rate is as the accuracy: in our previous example, the minimum interval at which we can accurately report changes is every 10 seconds. Observing the interval between 0s and 10s, we see that we wrote 17 points, or 1.7 points per second. This is obviously not a true representation of what actually happened, as the server only took writes at 1s, 3s, and 9s.
If we were to sample the `pointsWrittenOK` metric every 5 seconds, we would improve the accuracy of our observations, however at the cost of:

- additional CPU resources to sample the data, and
- additional memory and disk resources to store the additional volume of data.
| time (since restart) | # of values written | pointsWrittenOK | sample |
|---|---|---|---|
| 0s | 0 | 0 | ☑️ |
| 1s | 5 | 5 | |
| 3s | 2 | 7 | |
| 5s | 0 | 7 | ☑️ |
| 9s | 10 | 17 | |
| 10s | 0 | 17 | ☑️ |
| 11s | 1 | 18 | |
| 15s | 0 | 18 | ☑️ |
| 16s | 3 | 21 | |
| 20s | 0 | 21 | ☑️ |
| 22s | 1 | 22 | |
| 25s | 0 | 22 | ☑️ |
| 30s | 0 | 22 | ☑️ |
Our new table, with differences, looks like the following:
| time | pointsWrittenOK | difference (from previous) |
|---|---|---|
| 0s | 0 | 0 |
| 5s | 7 | 7 |
| 10s | 17 | 10 |
| 15s | 18 | 1 |
| 20s | 21 | 3 |
| 25s | 22 | 1 |
| 30s | 22 | 0 |
Therefore, the sample rate is a trade-off: increasing the rate improves our accuracy at the cost of additional compute resources.