@thealmightygrant
Last active January 7, 2022 03:33
A Deep Dive into an Alert on Prometheus
<section>
<h1>A Deep Dive into an Alert on Prometheus</h1>
<h2>Grant Sherrick</h2>
</section>
<section id="where-we-started">
<h2>We started with an alert.</h2>
<br>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5
  for: 1m
  labels:
    app: entitled
  annotations:
    description: '{{$labels.instance}} of job {{$labels.job}} has experienced increased
      error rates for more than 1 minute.'
    summary: Instance {{$labels.instance}} is experiencing increased error rates</code></pre>
</section>
<section id="an-aside-on-rate">
<h2>A brief aside on <code>rate()</code></h2>
<br>
<div class="fragment">
<p style="text-align: left; margin-left: 1.3em;"><code>rate(v range-vector)</code> calculates the per-second average rate of increase of the time series in the range vector.</p>
<p style="text-align: left; margin-left: 1.3em;"><code>rate()</code> should only be used with counters.</p>
</div>
<p style="text-align: left; margin-left: 1.3em;"><a href="https://github.com/prometheus/prometheus/blob/release-2.0/promql/functions.go#L135">The rate computation</a></p>
</section>
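<section id="an-aside-on-increase">
<h2>A brief aside on <code>increase()</code></h2>
<br>
<p style="text-align: left; margin-left: 1.3em;">Because <code>rate()</code> is per-second, multiplying a <code>[1h]</code> rate by <code>60 * 60</code> turns it into an approximate count over the hour. A minimal sketch of the same idea, using <code>increase()</code>, which the Prometheus docs describe as syntactic sugar for <code>rate()</code> multiplied by the number of seconds in the range. The selectors are shortened here for illustration.</p>
<pre><code># per-second average over the last hour
rate(http_requests_total{app="entitled",statusClass="5XX"}[1h])

# scaled to "per hour", as the alert does
rate(http_requests_total{app="entitled",statusClass="5XX"}[1h]) * 60 * 60

# should give the same result as the scaled rate above
increase(http_requests_total{app="entitled",statusClass="5XX"}[1h])</code></pre>
</section>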
<section id="issue1">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">1. It's not specific enough.</h4>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5</code></pre>
<a href="https://prometheus.commonstack.io/graph?g0.range_input=1h&g0.expr=http_requests_total%7Bapp%3D%22entitled%22%2Cpath!%3D%22%2Fmetrics%2Fhealthcheck%22%7D&g0.tab=1"><pre class="fragment"><code>http_requests_total{app="entitled",path!="/metrics/healthcheck"}</code></pre></a>
</section>
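<section id="issue1-sketch">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">1. A sketch of a more specific expression.</h4>
<p style="text-align: left; margin-left: 1.3em;">One possible way to narrow the alert, assuming a per-path breakdown is what we want (this is an illustration, not necessarily the fix from the talk): group the sum with <code>by (path)</code> so each path is evaluated, and alerts, on its own.</p>
<pre><code>- alert: EntitledErrorRateIncreased
  # "by (path)" is an assumption about which breakdown is useful
  expr: sum by (path) (rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5</code></pre>
</section>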
<section id="issue2">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">2. What does <code>rate[1h]</code> <code>for: 1m</code> mean?</h4>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5
  for: 1m</code></pre>
<div class="fragment">
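<p>To fire, the expression has to stay above the threshold for 1 minute; until then the alert is only pending. But the expression is itself a rate averaged over the past hour, so a single burst of errors can keep it above the threshold for up to an hour...</p>
</div>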
</section>
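<section id="issue2-sketch">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">2. A sketch with the window and <code>for</code> on the same scale.</h4>
<p style="text-align: left; margin-left: 1.3em;">A minimal sketch, assuming we want the alert to track recent behavior rather than the whole past hour: shrink the rate window so that it and <code>for</code> cover similar time spans. The <code>5m</code> values and the threshold are illustrative assumptions, not values from the talk.</p>
<pre><code>- alert: EntitledErrorRateIncreased
  # roughly "more than 5 5XX responses in the last 5 minutes"
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[5m])
    * 60 * 5) > 5
  # ...and still true 5 minutes later
  for: 5m</code></pre>
</section>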
<section id="issue3">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">3. Why do we care if the error rate > 5?</h4>
<pre><code>- alert: EntitledErrorRateIncreased
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5</code></pre>
<div class="fragment">
<p>The alert fires when we've seen roughly more than 5 5XX responses over the past hour, summed across all paths except the healthcheck.</p>
</div>
</section>
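<section id="issue3-sketch">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">3. A sketch of a threshold relative to traffic.</h4>
<p style="text-align: left; margin-left: 1.3em;">An absolute count of 5 means something very different at 10 requests per hour than at 10,000. One common alternative, sketched here under assumptions, is to alert on the fraction of requests that fail. The <code>5m</code> window and the <code>0.05</code> threshold are illustrative, not values from the talk.</p>
<pre><code>- alert: EntitledErrorRateIncreased
  # 5XX requests divided by all requests; 0.05 means 5% of requests failed
  expr: sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[5m]))
    / sum(rate(http_requests_total{app="entitled",path!="/metrics/healthcheck"}[5m])) > 0.05</code></pre>
</section>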
<section id="issue4">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">4. The description and summary are not clear and do not show the data we'd like to see.</h4>
<pre><code>annotations:
  description: '{{$labels.instance}} of job {{$labels.job}} has experienced increased
    error rates for more than 1 minute.'
  summary: Instance {{$labels.instance}} is experiencing increased error rates</code></pre>
<div class="fragment">
<p><a href="https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators">Aggregation operators, like sum, min, max, etc.</a>, drop every label that is not listed in a <code>by</code> clause.</p>
<p><a href="https://prometheus.commonstack.io/graph?g0.range_input=1h&g0.expr=sum(rate(http_requests_total%7Bapp%3D%22entitled%22%2Cpath!%3D%22%2Fmetrics%2Fhealthcheck%22%7D%5B1h%5D)%20*%2060%20*%2060)%20&g0.tab=1">There's not even instance data in this alert.</a></p>
</div>
</section>
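<section id="issue4-sketch">
<h2>This alert had a few issues.</h2>
<br>
<h4 style="text-align: left; margin-left: 1.3em;">4. A sketch of annotations that show the data.</h4>
<p style="text-align: left; margin-left: 1.3em;">Because a bare <code>sum(...)</code> drops every label, there is no <code>instance</code> or <code>job</code> label left for the templates to show. A minimal sketch, assuming we keep <code>app</code> and <code>path</code> through the aggregation and report the measured value with <code>{{ $value }}</code>. The wording of the annotations is illustrative.</p>
<pre><code>- alert: EntitledErrorRateIncreased
  # keep the labels the annotations refer to
  expr: sum by (app, path) (rate(http_requests_total{app="entitled",path!="/metrics/healthcheck",statusClass="5XX"}[1h])
    * 60 * 60) > 5
  for: 1m
  labels:
    app: entitled
  annotations:
    summary: '{{ $labels.app }} is returning 5XX responses on {{ $labels.path }}'
    description: '{{ $labels.path }} has returned about {{ $value }} 5XX responses
      over the past hour.'</code></pre>
</section>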
<section id="fix">
<h2>Let's <a href="https://prometheus.commonstack.io/graph?g0.range_input=1h&g0.expr=sum(rate(http_requests_total%7Bapp%3D%22entitled%22%2Cpath!%3D%22%2Fmetrics%2Fhealthcheck%22%7D%5B1h%5D)%20*%2060%20*%2060)%20&g0.tab=1">fix</a> it!</h2>
</section>
<section id="conclusion">
<h2>Thanks!</h2>
</section>
<section id="useful-links">
<h2>Some Useful Links:</h2>
<ul>
<li><a href="https://github.com/google/re2/wiki/Syntax">RE2, the regular expression syntax used by Prometheus</a></li>
<li><a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">Querying Basics with Prometheus</a></li>
<li><a href="https://www.weave.works/blog/">WeaveWorks Blog</a></li>
</ul>
</section>