Hefeweizen/prom-outlier.md

## prom-outlier.md

      
Outlier Detection

We sometimes see a node, or several, running hot.  What to do about it depends on the rest of the cluster.  If one node out of 50 is running hot, we can chalk it up to a “bad node” and kill it.  However, if more than 50% of the nodes are running hot, we should seek to understand what’s happening within the service.  To that end, I wanted to create an alert on outlier nodes.
Let’s start with cumulative idle on a host:
sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m]))

Idle is the measure of unused CPU, so matching mode="idle" is a shortcut for summing every other mode (mode !~ "idle") and taking the complement.  Should we care about consumed CPU, we’d subtract each per-CPU idle rate from 1 (or the summed idle from the host’s CPU count).  Also of note: we are accumulating idle across all CPUs on the host, not averaging per CPU.
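If consumed CPU is what we want, a minimal sketch (assuming the same node_cpu metric and host selector as above) averages idle per CPU and takes the complement, giving a consumed fraction per host between 0 and 1:

```promql
# Consumed CPU fraction per host: 1 minus the average idle rate across its CPUs.
1 - avg without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m]))
```

Because only idle is subtracted, every other mode (iowait and steal included) counts as consumed here.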
Now, let’s get the average idle for the cluster:
avg by (role) (sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))

Again, this is accumulated idle per host, now averaged over the hosts in each role.
Finally, dividing each host’s value by the cluster average gives us a factor for how far out of band it is:
sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])) /on (role) group_left avg by (role) (sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))

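As an example, consider a host consuming 80% of its CPU while the cluster average is 50% consumption.  In terms of consumed CPU, 80/50 gives a factor of 1.6: this host is using 60% more CPU than the average host in the cluster.  Note that the query above divides idle by idle, so for the same host it yields 20/50 = 0.4; on the idle scale, a hot host shows up as a factor below 1.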
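The 80/50 example is expressed in consumed CPU.  To get a factor on that scale, one option is to take the complement of idle on both sides of the division.  This is a sketch under the same assumptions as the queries above (a role label on every series, one cluster per role), not a tested rule:

```promql
# Per-host consumed fraction divided by the role-wide average consumed fraction.
# A host at 0.8 in a role averaging 0.5 comes out at 1.6.
(1 - avg without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))
/ on (role) group_left
avg by (role) (1 - avg without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))
```

Averaging per CPU on both sides also keeps the factor meaningful when hosts in a role have different core counts.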
Commentary
Normally, we show CPU utilization as an average per CPU.  These formulas do not do that.  However, because neither the numerator nor the denominator is averaged per CPU, the CPU count divides out in the final formula, provided every host in the role has the same number of CPUs.  This gives us the same scale as if we had calculated the average per CPU.
I’ll experiment with this for a bit.  My next experiment: can I count outlier nodes and alert only on “(high cpu) and (outlier count < 10% of cluster size)”?
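One possible shape for the outlier-count half of that condition is sketched below.  The 1.5 factor threshold is a made-up value for illustration, the factor is computed on consumed CPU rather than idle, and the role label is assumed to identify the cluster:

```promql
# Fraction of hosts per role whose consumed CPU exceeds 1.5x the role average
# (1.5 is a hypothetical threshold). Returns a value only when that fraction is under 10%.
count by (role) (
  (
    (1 - avg without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))
    / on (role) group_left
    avg by (role) (1 - avg without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))
  ) > 1.5
)
/
# Number of hosts per role: collapse CPUs to one series per host, then count hosts.
count by (role) (count without(cpu, job) (node_cpu{host=~"prod-foo.*",mode="idle"}))
< 0.10
```

When no host exceeds the threshold, the numerator is empty and the expression returns nothing, so an alert built on it stays silent.  A per-host high-CPU expression could then be joined to this result with `and on (role)`.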