I ran a smaller-scale test to check whether skew affects the system. Test parameters:
- 10 "nodes"
- 300 "queries"
- 5 "metrics"
- 300 hours
Totaling 15,000 time series and 4,500,000 data points.
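The setup above can be sketched roughly like this (the ranges for the per-series random means and stddevs are illustrative placeholders, not the values from my actual run):

```python
import numpy as np

rng = np.random.default_rng(0)

N_NODES, N_QUERIES, N_METRICS, N_HOURS = 10, 300, 5, 300
N_SERIES = N_NODES * N_QUERIES * N_METRICS  # 10 * 300 * 5 = 15,000 series

# Each series gets its own random mean and stddev; the ranges here are
# placeholders for whatever dynamic range the real test used.
means = rng.uniform(10.0, 1000.0, size=N_SERIES)
stddevs = rng.uniform(1.0, 100.0, size=N_SERIES)

# One sample per series per hour: 15,000 * 300 = 4,500,000 data points.
data = rng.normal(means[:, None], stddevs[:, None], size=(N_SERIES, N_HOURS))
```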
I ran one test with no "disruptions" so I could plot the baseline data spread, then re-ran the simulation with a single disruption added to the timeline.
Note: "single disruption" here means what it did in the article: an entire metric/query/node experiences a disruption, so more than a single time series is always affected, usually a few thousand at minimum.
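A minimal sketch of injecting one such disruption, assuming a node-level event that shifts every series on that node (the layout, timing, and shift magnitude are all illustrative choices, not the ones from my run):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy layout: (nodes, queries, metrics, hours).
data = rng.normal(100.0, 10.0, size=(10, 300, 5, 300))

# A "single disruption" hits one whole node: every query/metric series on it.
DISRUPTED_NODE, START_HOUR, SHIFT = 3, 200, 50.0
data[DISRUPTED_NODE, :, :, START_HOUR:] += SHIFT

# Affected time series = queries * metrics on that node.
n_affected = data.shape[1] * data.shape[2]  # 300 * 5 = 1,500 series here
```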
In the graph, I plot the data distribution ("Normal -- All -- no disruption"), the distribution of the 90th percentile of surprises ("Normal -- 90th percentile -- no disruption"), and the 90th percentile of surprises over time ("Normal -- 90th over time -- no disruption"). The same figures are plotted for the single-disruption case on the second row.
Next, I repeated the above two experiments using a LogNormal generator to produce a skewed distribution, with the same dynamic range of random means/stddevs. Same set of plots:
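One way to give the LogNormal runs the same dynamic range as the Normal runs is to solve for the underlying mu/sigma that yield a desired mean and stddev per series. This is a sketch with placeholder ranges, not necessarily how my generator did it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same per-series mean/stddev ranges as the Normal runs (placeholders).
m = rng.uniform(10.0, 1000.0, size=15_000)
s = rng.uniform(1.0, 100.0, size=15_000)

# LogNormal(mu, sigma) has mean m and stddev s when:
#   sigma^2 = ln(1 + s^2 / m^2)
#   mu      = ln(m) - sigma^2 / 2
sigma = np.sqrt(np.log1p((s / m) ** 2))
mu = np.log(m) - sigma**2 / 2

data = rng.lognormal(mu[:, None], sigma[:, None], size=(15_000, 300))
```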
So it looks like the Atlas system works well for at least LogNormal skews. I realize not all skewed distributions are LogNormal, but it was low-hanging fruit to test. I believe this works because Atlas doesn't actually care about the distribution of the data; instead, it monitors the distribution of the 90th percentile of surprise over time. Based on my empirical testing and eBay's paper, the distribution of the 90th over time appears to be skewed itself, but it throws really large outliers whenever there is a disruption to surprise in the data.
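That "90th over time" signal can be sketched as a toy version like this. Note the stand-in: the paper's surprise is forecast-based, and here absolute deviation from each series' own mean is used as a crude proxy:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 15,000 series x 300 hours; 10% of series (one node's worth)
# get disrupted for the last 50 hours.
data = rng.normal(100.0, 10.0, size=(15_000, 300))
data[:1_500, 250:] += 50.0

# Crude stand-in for "surprise": absolute deviation from each series'
# own mean (the real definition is forecast-based, not this).
surprise = np.abs(data - data.mean(axis=1, keepdims=True))

# The monitored signal: 90th percentile of surprise across series, per hour.
p90 = np.percentile(surprise, 90, axis=0)

# During the disruption, p90 jumps well clear of its baseline spread,
# which is the large outlier that makes the disruption easy to flag.
```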
The 4th page of eBay's paper looks similar to the results I simulated:
Of course, this was pretty simple and homogeneous. It would be interesting to see how Atlas does with a mixture of normal and skewed distributions. And based on the response from the authors when I asked about seasonal data a few months ago, it probably doesn't work "out of the box" with seasonality. I'm also curious how it responds if some "adversarial" time series are essentially just random noise generators: does that throw off the ability to detect changes in surprise? Not sure.