ztellman/top-k-threshold.md

## top-k-threshold.md

      
    Raw
  

              top-k-threshold.md
            
          
    This relates to work I'm doing on a system for distributed outlier detection, as described here.
In this system, the count-min sketch is used as a distributed filter.  Over some interval, the CMS is populated on all the edge nodes, sent to a server for merging, and the merged CMS is broadcast back out to the edges.  On the following interval, each new key is run through the "global" CMS, and if it's over some threshold, it is added to the set of potential outliers, which are precisely counted.
This is, as described so far, an adaptation of this 2003 paper from single-node to multi-node.  However, that paper assumes that the threshold is known ahead of time.
In most cases, though, what constitutes an "outlier" is a moving target.  We'd like to adapt as the distribution changes.  Previously, I was using a consistent uniform sample of the keys to identify the threshold, but we'd have to take a very large sample to identify the threshold for the top-10 keys in a billion key metric space.
Based on some preliminary tests, though, it looks like the CMS itself can be used to identify the threshold.  If you have a sketch of w by h registers, then the top-k threshold can be estimated by sorting all the registers, and taking the (w * h) - (k * h)th register.
The intuition for this is pretty straightforward; consider the top-1 case where there's a clear power-law distribution.  One register in each row will be much higher than all the others, that row corresponds to the largest key in the metric space, and we choose the thresold corresponding to the least value across all the rows.   For k > 1, it's a bit less cut and dry, because the least register for a particular key may not be above the threshold we choose; our threshold ends up being too high, and our outlier set too small.  For Pareto distributions with greater alpha values, this becomes more pronounced.  However, unlike the uniform sample, a smaller k makes it easier to be accurate.
Note that this filter mechanism doesn't work at all if there aren't enough outliers.  If we're looking for top-10, but only one key differs from all the others, then our threshold won't do any filtering at all.  However, this is pretty easy to detect: we take our calculated threshold, adjust it down by some epsilon, and see how many registers are over that value.  If it's too many, we just adjust the threshold up until it excludes most registers.  If that isn't possible (which would be true of a completely uniform distribution), then we just skip this interval, and wait for some outliers to appear.