A quick write-up on how different error budget calculation methods produce different outcomes from the same data.
Scenario data is summarized as arrays of timestamp/value pairs, e.g.:
Good Events
[ t0: 4, t1: 10, t2: 0, t3: 15 ...]
Total Events
[ t0: 4, t1: 11, t2: 0, t3: 15 ...]
In this scenario, we're looking at traffic that slowly rises and falls, as it might over the course of a day.
During peak load, the service begins to suffer, so some requests fail.
It also suffers a small performance hiccup at launch due to startup costs.
Our target for the window is 95%.
Traffic | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
---|---|---|---|---|---|---|---|---|---|---|
Good Events | 2 | 10 | 12 | 13 | 13 | 12 | 13 | 13 | 12 | 11 |
Valid Events | 4 | 10 | 12 | 13 | 14 | 15 | 14 | 13 | 12 | 11 |
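For reference, the three methods below can be summarized compactly. The notation is introduced here for convenience only: $g_i$ and $t_i$ are the Good and Valid (total) events in interval $i$, $N$ is the number of intervals, $T$ is the target, and $P$ is the resulting window performance. Each method yields a different $P$, but the error budget burnt is computed the same way in all three, and a value above 100% means the window overspent its budget:

$$
P_{\text{occurrence}} = \frac{\sum_{i} g_i}{\sum_{i} t_i}, \qquad
P_{\text{timeslice}} = \frac{1}{N}\sum_{i} \mathbf{1}\!\left[\frac{g_i}{t_i} > T\right], \qquad
P_{\text{weighted}} = \frac{1}{N}\sum_{i} \frac{g_i}{t_i}
$$

$$
\text{burn} = \frac{1 - P}{1 - T}
$$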
With occurrence-based error budgeting, we simply sum the Good and Valid (total) events for the window and divide the first sum by the second.
It's straightforward, and it automatically weights each interval's impact by the number of requests served in it.
Good events summed -> 111, Total events summed -> 118
Window performance -> Good/Total -> 111/118 -> 94.0678%
Error budget burnt during window -> (1 - 0.940678) / (1 - 0.95) -> 118.644%
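A minimal Python sketch of the occurrence-based calculation, using the counts from the scenario table. The names `GOOD`, `TOTAL`, `TARGET`, `budget_burn` and `occurrence_performance` are illustrative only, not taken from any particular SLO tool:

```python
# Per-interval counts from the scenario table, t0..t9.
GOOD = [2, 10, 12, 13, 13, 12, 13, 13, 12, 11]
TOTAL = [4, 10, 12, 13, 14, 15, 14, 13, 12, 11]  # the "Valid Events" row
TARGET = 0.95  # 95% target for the window


def budget_burn(performance: float, target: float) -> float:
    """Observed failure rate divided by the failure rate the target allows."""
    return (1 - performance) / (1 - target)


def occurrence_performance(good: list[int], total: list[int]) -> float:
    """Sum good and total events over the whole window, then divide."""
    return sum(good) / sum(total)


perf = occurrence_performance(GOOD, TOTAL)
burn = budget_burn(perf, TARGET)
print(f"performance={perf:.4%} burn={burn:.3%}")
```

Running it prints `performance=94.0678% burn=118.644%`, matching the figures above.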
With timeslice-based error budgeting, we evaluate each time interval individually against the target, and then
divide the number of successful intervals by the total number of intervals.
For example, t4's interval value is a boolean denoting whether 13/14 > 0.95; since 13/14 ≈ 0.929, it is false/0.
Interval results (success=1, failure=0):
Interval | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
---|---|---|---|---|---|---|---|---|---|---|
Result | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
Good intervals -> 6, Total intervals -> 10
Window performance -> 6/10 -> 60.0%
Error budget burnt during window -> (1 - 0.6) / (1 - 0.95) -> 800%
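The same calculation as a short Python sketch. How to score an interval with no traffic (like t2 in the array example at the top) isn't covered here, so this sketch assumes such an interval counts as good; that is a labeled assumption, not a standard rule. All names are illustrative:

```python
GOOD = [2, 10, 12, 13, 13, 12, 13, 13, 12, 11]
TOTAL = [4, 10, 12, 13, 14, 15, 14, 13, 12, 11]
TARGET = 0.95


def budget_burn(performance: float, target: float) -> float:
    """Observed failure rate divided by the failure rate the target allows."""
    return (1 - performance) / (1 - target)


def timeslice_results(good: list[int], total: list[int], target: float) -> list[int]:
    """Score each interval 1 if its good/total ratio beats the target, else 0."""
    results = []
    for g, t in zip(good, total):
        if t == 0:
            # Assumption: an interval with no traffic counts as good.
            results.append(1)
        else:
            results.append(1 if g / t > target else 0)
    return results


results = timeslice_results(GOOD, TOTAL, TARGET)
perf = sum(results) / len(results)  # fraction of good intervals
burn = budget_burn(perf, TARGET)
print(results)
print(f"performance={perf:.1%} burn={burn:.1%}")
```

It prints `[0, 1, 1, 1, 0, 0, 0, 1, 1, 1]` followed by `performance=60.0% burn=800.0%`.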
With weighted timeslicing, rather than storing a boolean for each time interval, we store a float
representing how successful the interval was (its Good/Total ratio). These values are then averaged across all intervals in the window.
For example, t4's interval value is 13/14 -> 0.92857
Interval | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
---|---|---|---|---|---|---|---|---|---|---|
Result | 0.5 | 1 | 1 | 1 | 0.92857 | 0.8 | 0.92857 | 1 | 1 | 1 |
Average of all intervals = 9.15714 / 10 -> 91.5714%
Error budget burnt during window -> (1 - 0.915714) / (1 - 0.95) -> 168.57%
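A Python sketch of weighted timeslicing, under the same labeled assumption that an interval with no traffic scores as fully successful. Names are illustrative only:

```python
GOOD = [2, 10, 12, 13, 13, 12, 13, 13, 12, 11]
TOTAL = [4, 10, 12, 13, 14, 15, 14, 13, 12, 11]
TARGET = 0.95


def budget_burn(performance: float, target: float) -> float:
    """Observed failure rate divided by the failure rate the target allows."""
    return (1 - performance) / (1 - target)


def weighted_timeslice_performance(good: list[int], total: list[int]) -> float:
    """Average each interval's good/total ratio, giving every interval equal weight."""
    # Assumption: an interval with no traffic scores 1.0 (fully successful).
    ratios = [g / t if t else 1.0 for g, t in zip(good, total)]
    return sum(ratios) / len(ratios)


perf = weighted_timeslice_performance(GOOD, TOTAL)
burn = budget_burn(perf, TARGET)
print(f"performance={perf:.4%} burn={burn:.3%}")
```

It prints `performance=91.5714% burn=168.571%`. Because every interval carries equal weight here, t0 (4 requests) counts as much as t5 (15 requests); the occurrence-based method instead weights each interval by its traffic.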
Some considerations from @mmazur on Edge-based performance evaluation
https://gist.github.com/mmazur/e94e6c31cb3821c15a948c24d6baaece