vsliouniaev/Aggregating Percentiles.md

## Aggregating Percentiles.md

      
    Raw
  

              Aggregating Percentiles.md
            
          
    https://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html:
I run into the following situation way too often:
You have some means of measuring and collecting latency, and you want to report on it's percentile behavior over time. You usually capture the data either in some form of high-fidelity histogram, or as raw data (each operation has it's own latency info). You then summarize the data on a per-interval basis and place it in some log. E.g. you may have a log with a summary line per interval, describing something like the 90%/ile, 95%'lie, 99%'lie, 99.9%'lie and Max of all results seen in the last 5 seconds.
Since raw data is (unfortunately) much more commonly used for this because it's hard to get accurate percentile information from most histograms, you worry about the space needed to keep the full data used to produce this summary over time. So you only keep the summaries but throw away the data.
Then you have this nice log file, with one line per summary interval, and over a run (hour, day, whatever), the log file has all sorts of interval data points for each percentile level, which will often shift rapidly over time. And while this data is useful for "what happened when?" needs, it is terrible for summarizing and getting a feel for the overall behavior of the system during the run. It needs to be further summarized to be comprehensible.
So how do you produce a summary report on the behavior over longer periods of time (e.g. per hour, or per day, or of the whole run) from such per-interval summary reports?
Well, here is what I see many people do at this point. Ready for this?
They produce a summary from the summaries. Since they really want to report the 90%'lie, 95%'lie, 99%'lie, 99.9%'lie, and Max values for the whole run, but only have the summary data on a per-interval basis, they average the interval summaries to build the overall summary.
E.g. The 99%'lie for the run is computed as the average of the 99%'lie report of all intervals in the log. Same for the other percentiles.
Any time you see timeline chart plotting some percentile, and there is an average reported in text to give you a feel for what the overall average behavior of the spiky line you are looking at is, that's what you are most likely looking at...
So what's wrong with that?
At first blush, you may think that the hourly average of the per-5-second-interval 99%ile value has some meaning. you know averages can be inaccurate and misrepresenting at times, but this value could still give you some "feel" for the 99%'lie system, right?
Wrong.
The average of 1,000 99%'lie measurement has as much to do with the 99%'lie behavior of a system as the current temperature on the surface of mars does. Maybe less.
The simplest way to demonstrate the absurdity of this "average of percentiles" calculation is to examine it's obvious effect on two "percentile" values that are hard to hide from: The Max (the 100%'lie value) and the Min (the 0%'lie value).
If you had the Maximum value recorded for each 5 second interval over a 5000 second run, and want to deduce the Maximum value for the whole run, would an average of the 1,000 max values produce anything useful? No. You'd need to use the Maximum of the Maximums. Which will actually work, but only for the unique edge case of 100%...
But if you had the Minimum value recorded for each 5 second interval over a 5000 second run, and want to deduce the Minimum value for the whole run, you'd have to look for the Minimum of the Minimums. Which also works, but only for the unique edge case of 0%.
The only thing you can deduce about the 99%'lie of a 5000 second run for which you have 1,000 5 second summary 99%'lie values is that it falls somewhere between the Maximum and Minimum of those 1,000 99%'lie values...
Bottom line: You can't average percentiles. It doesn't work. Period.
-- Gil Tene