@Hefeweizen
Created October 22, 2021 21:59
Prometheus Outlier Detection

Outlier Detection

We sometimes see one or more nodes running hot. What to do about it depends on the rest of the cluster. If it’s one node out of 50 that’s running hot, we can chalk it up to a “bad node” and kill it. However, if more than 50% of the nodes are running hot, we should seek to understand what’s happening within the service. To that end, I wanted to create an alert on outlier nodes.

Let’s start with cumulative idle on a host:

sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m]))

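Idle is the measure of unused CPU, so it’s a shortcut for adding up everything matching mode !~ "idle": whatever is not idle is consumed. Should we care, we’d subtract the per-CPU idle fraction from 1 to get the consumed CPU (for the summed figure here, subtract from the host’s CPU count instead). Also of note: we are accumulating idle across all CPUs on the host, not averaging per CPU.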
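To make the scale of the summed idle figure concrete, here is a small Python sketch. The per-CPU idle rates are made-up numbers, and it assumes each rate is a fraction between 0 and 1 (idle seconds per second on one CPU).

```python
# Hypothetical idle rates for one 4-CPU host, one value per CPU.
idle_per_cpu = [0.75, 0.5, 1.0, 0.75]

# Summing across CPUs mirrors `sum without(cpu, job)`: the result ranges
# from 0 to the number of CPUs, not from 0 to 1.
idle_sum = sum(idle_per_cpu)

# Consumed CPU on the same summed scale: CPU count minus summed idle.
busy_sum = len(idle_per_cpu) - idle_sum

# Consumed CPU as a 0..1 fraction: 1 minus the average idle per CPU.
busy_fraction = 1 - idle_sum / len(idle_per_cpu)

print(idle_sum, busy_sum, busy_fraction)
```

So a 4-CPU host with a summed idle of 3.0 is using one CPU’s worth of capacity, or a quarter of the host.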

Now, let’s get the average idle for the cluster:

avg by (role) (sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))

Again, this number is built from accumulated idle per host: it is the average, across the hosts in a role, of each host’s summed idle.

Finally, dividing each host’s value by the cluster average gives us a factor for how far out of band it is.

sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m]))
  / on (role) group_left
avg by (role) (sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m])))
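The same arithmetic can be sketched outside Prometheus. The following Python is only an illustration of the host-over-role-average division; the host names, roles, and idle values are invented, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def idle_ratio_to_role_average(idle_by_host, role_by_host):
    """Divide each host's summed idle by the average summed idle of its role."""
    # Group the per-host values by role (the `avg by (role)` step).
    by_role = defaultdict(list)
    for host, idle in idle_by_host.items():
        by_role[role_by_host[host]].append(idle)
    role_avg = {role: mean(values) for role, values in by_role.items()}

    # Match every host to its own role's average (the `on (role) group_left` step).
    return {
        host: idle / role_avg[role_by_host[host]]
        for host, idle in idle_by_host.items()
    }


# Made-up summed idle per host (4-CPU hosts, so values range from 0 to 4).
idle = {"prod-foo-1": 3.0, "prod-foo-2": 3.0, "prod-foo-3": 1.5, "prod-foo-4": 2.5}
roles = {host: "foo" for host in idle}

ratios = idle_ratio_to_role_average(idle, roles)
print(ratios)
```

Because the inputs are idle values, a ratio below 1 marks a host with less idle CPU than its peers; here prod-foo-3 comes out at 0.6.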

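As an example, in terms of consumed CPU: consider a host consuming 80% of its CPU, and imagine the cluster average is 50% consumption. 80/50 would give us a factor of 1.6; this host is using 60% more resources than the average host in the cluster. Note that the query above divides idle values, so the same host shows up there as 20/50 = 0.4: with idle, a factor below 1 means the host is busier than average.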

Commentary

Normally, we show CPU utilization as an average per CPU. These formulas do not do that. However, because neither the numerator nor the denominator is averaged per CPU, the CPU count cancels out in the final formula, provided every host in the role has the same number of CPUs. This gives us the same scale as if we had calculated average-per-CPU.
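A quick numeric check of that cancellation, as a Python sketch with invented values. It assumes every host has the same CPU count; with mixed host sizes the two ratios would no longer agree.

```python
import math

n_cpus = 4  # assumed identical on every host
idle_sums = [3.0, 3.0, 1.5, 2.5]  # made-up summed idle per host


def ratio_to_average(values):
    """Divide each value by the average of all values."""
    average = sum(values) / len(values)
    return [value / average for value in values]


# Ratio computed from summed idle, as in the formulas here.
ratio_from_sums = ratio_to_average(idle_sums)

# Ratio computed from average-per-CPU idle instead.
idle_per_cpu_avgs = [total / n_cpus for total in idle_sums]
ratio_from_avgs = ratio_to_average(idle_per_cpu_avgs)

same_scale = all(
    math.isclose(a, b) for a, b in zip(ratio_from_sums, ratio_from_avgs)
)
print(same_scale)
```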

I’ll experiment with this for a bit. My next experiment is: can I count outlier nodes and just alert on “(high cpu) and (outlier count <10% of cluster size)”?
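One way that condition might look, sketched in Python rather than as an alert rule. Everything here is an assumption for illustration: the function names, the 0.5 idle-ratio cutoff standing in for “high cpu”, and the 10% limit are hypothetical, and the input is the per-host idle ratio described above.

```python
def hot_outliers(idle_ratio_by_host, ratio_threshold=0.5):
    """Hosts whose idle is well below the role average (hypothetical cutoff)."""
    return sorted(
        host for host, ratio in idle_ratio_by_host.items() if ratio < ratio_threshold
    )


def should_alert(idle_ratio_by_host, ratio_threshold=0.5, max_outlier_fraction=0.10):
    """Alert only when some hosts are hot but they are a small part of the cluster."""
    outliers = hot_outliers(idle_ratio_by_host, ratio_threshold)
    limit = max_outlier_fraction * len(idle_ratio_by_host)
    return 0 < len(outliers) < limit


# 50 made-up hosts: one hot node among healthy peers.
one_bad_node = {f"prod-foo-{i}": 1.0 for i in range(50)}
one_bad_node["prod-foo-7"] = 0.3

# 50 made-up hosts: 30 of them hot, which points at the service, not a node.
many_hot = {f"prod-foo-{i}": (0.3 if i < 30 else 1.0) for i in range(50)}

print(should_alert(one_bad_node), should_alert(many_hot))
```

The first case is the “bad node” situation worth a node-level alert; the second is left for a separate, service-level signal.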
