To figure out what is going on with a generic Grafana managed alerts issue, we can use (or need) the following information.
Less may be needed depending on how far along the pipeline the issue is (if it is at the end, like notifications, then we may need all of the information that takes us to that point).
From the Edit Alert Rule view (Grafana managed rule):
- The queries (and their data sources) and expressions
- Example of data returned by the queries
- The [Condition] dropdown field setting (refId)
- The [Evaluate Every] and [For] settings
- The [NoData/Error Handling] dropdown options
From the Notification Policy View:
- The notification policy that will apply to the above alert rule and its settings:
  - [Group by] label matchers
  - Timings: [Group wait], [Group interval], [Repeat interval]
- The associated contact point
From the Contact Point View:
- All settings that are not sensitive and/or person/company identifiable
- Also, if a [template] or sub-template is referenced instead of being entirely in the field, that template should also be included.
Also possibly any active or recently active silences.
Edit Alert Rule view:
- Screenshots of the data from "run queries" can help, as well as the output of "test alert"
Administrative:
- Grafana logs with debug enabled (I am not sure we currently have a consistent pattern to get only alerting-related entries)
- Main alerting database tables: alert_rule, alert_instance, alert_configuration. (There are other tables that matter, for things like permissions.)
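For the database tables, here is a minimal sketch of pulling instance state counts out of the database (Grafana's default store is a SQLite grafana.db). The simplified alert_instance schema below is an assumption for illustration; the real table has more columns, so check the actual schema before running this against a live instance.

```python
import sqlite3

# Minimal sketch: summarize what the state manager currently believes,
# by counting alert instances per rule and state. The columns used here
# (rule_org_id, rule_uid, labels_hash, current_state) are an assumed,
# simplified version of the real alert_instance schema.
conn = sqlite3.connect(":memory:")  # point at grafana.db for a real instance
conn.execute(
    """CREATE TABLE alert_instance (
           rule_org_id INTEGER, rule_uid TEXT,
           labels_hash TEXT, current_state TEXT)"""
)
conn.executemany(
    "INSERT INTO alert_instance VALUES (?, ?, ?, ?)",
    [
        (1, "rule-a", "h1", "Alerting"),
        (1, "rule-a", "h2", "Normal"),
        (1, "rule-b", "h3", "Pending"),
    ],
)

rows = conn.execute(
    """SELECT rule_uid, current_state, COUNT(*)
       FROM alert_instance
       GROUP BY rule_uid, current_state
       ORDER BY rule_uid, current_state"""
).fetchall()
for rule_uid, state, n in rows:
    print(rule_uid, state, n)
```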
Dashboard view:
- A screenshot of the Grafana annotations that are created for the alert rule (I believe a dashboard/panel has to be associated with the rule for these to exist, or perhaps just for them to show up?)
Alert Overview view:
- Expanded view of the alert rule (that shows the instances).
Alert Groups view:
- Active alert groups, perhaps multiple screenshots over time.
This describes the alert cycle through the various alerting components (the alert pipeline, naming?) and how the above pieces of information fit into it. (Note: see image in gist.)
The rule stage consists of all the pieces before sending to Grafana's embedded Alert Manager component. It is analogous to how Prometheus and Alertmanager interact, except with some extra features to accommodate other data sources. One notable difference: with Grafana managed alerts the Alert Manager component is embedded in Grafana and is always required for notifications.
Scheduler: loads alert rules and signals the Evaluation component to run each alert rule on the [Evaluate Every]
interval set in the Rule Edit view. It also provides information to the State Manager component and coordinates other components.
Since the scheduler has no direct UI, its information is found in Grafana's logs.
Evaluation: runs the queries and expressions from an alert rule and determines whether each alert instance is alerting or not (one rule can produce many instances, each with its own state management). Those instances are passed to the State Manager.
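The fan-out from one rule to many instances can be sketched like this, under the assumption that the condition is a simple threshold (the function and series below are made up for illustration, not Grafana code):

```python
# Sketch of how one rule evaluation fans out into many alert instances:
# each label set returned by the query becomes its own instance. The
# threshold condition stands in for the [Condition] expression.
def evaluate(series, threshold):
    """Return {labels: firing?} -- one entry per alert instance."""
    return {labels: value > threshold for labels, value in series.items()}

# One query, three series -> three alert instances with independent states.
series = {
    ("host=a",): 0.92,
    ("host=b",): 0.15,
    ("host=c",): 0.80,
}
instances = evaluate(series, threshold=0.75)
print(instances)
```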
State Manager: detects state changes and handles the For/Pending logic. The [For]
setting and the [NoData/Error Handling]
options can affect state transitions.
When an alert instance transitions to Alerting
from another state, or from Alerting
to Normal
, the alert instance is sent to the Alert Manager component. The Scheduler also makes sure the Alert Manager component stays updated about active alert instances (if alert instances disappear, the Alert Manager will consider them resolved after a period of time).
Alert instances are saved to the alert_instance
table in the database; the Grafana alert annotations in the Dashboard view can show when the State Manager detected a state change.
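The For/Pending behavior can be sketched roughly as follows. This is a simplified model for intuition, not the actual state manager code:

```python
# Simplified For/Pending model: an instance that starts firing enters
# Pending, and only becomes Alerting once it has been firing
# continuously for at least the [For] duration.
def next_state(firing, firing_for, for_duration):
    """Return the next instance state (durations in seconds)."""
    if not firing:
        return "Normal"
    if firing_for >= for_duration:
        return "Alerting"
    return "Pending"

# [For] = 300s: firing for 60s is still Pending, 300s+ is Alerting.
assert next_state(True, 60, 300) == "Pending"
assert next_state(True, 300, 300) == "Alerting"
assert next_state(False, 0, 300) == "Normal"
```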
The rule stage handles rendering of templates in alert rule annotations, whereas the Alert Manager renders contact point templates (and receives annotations that have already been rendered before arriving at the Alert Manager).
The Alert Manager component handles:
- Routing alert instances from their associated alert rules to the correct notification policy and contact points
- Grouping of alert instances, based on the [Group by] label matchers
- Deduplication of alert instances
- Dispatching of notifications for alert instance groups based on the associated timings: [Group wait], [Group interval], [Repeat interval]
- Rendering of notification templates
The groups that have been formed can be viewed in Grafana's Alert Groups view.
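The grouping step can be sketched as bucketing instances by their [Group by] label values (a simplified model; group_instances is a made-up helper for illustration, not Grafana code):

```python
from collections import defaultdict

# Sketch of Alert Manager grouping: instances are bucketed by the values
# of the [Group by] labels, and each bucket becomes one notification group.
def group_instances(instances, group_by):
    groups = defaultdict(list)
    for labels in instances:
        key = tuple((k, labels.get(k)) for k in group_by)
        groups[key].append(labels)
    return dict(groups)

instances = [
    {"alertname": "HighCPU", "host": "a"},
    {"alertname": "HighCPU", "host": "b"},
    {"alertname": "DiskFull", "host": "a"},
]
groups = group_instances(instances, group_by=["alertname"])
# Two groups: HighCPU on hosts a and b share one notification.
print({k: len(v) for k, v in groups.items()})
```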
Explaining how the timings work together across the components in the context of Grafana managed alerts is a TODO. But see https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html, since the rule stage above behaves like Prometheus, and the Alert Manager stage behaves like Alertmanager.
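As a rough back-of-the-envelope in the meantime, following the model in that article (an approximation that ignores evaluation time and Alert Manager internals):

```python
import math

# Approximate worst-case delay before the first notification: up to one
# full evaluation interval before the condition is first seen, then the
# For period (rounded up to whole evaluation ticks), then the group wait
# before the group's first notification is dispatched.
def worst_case_first_notification(evaluate_every, for_duration, group_wait):
    """All durations in seconds."""
    for_ticks = math.ceil(for_duration / evaluate_every) if evaluate_every else 0
    return evaluate_every + for_ticks * evaluate_every + group_wait

# Evaluate every 60s, For 5m, Group wait 30s -> up to ~6m30s.
print(worst_case_first_notification(60, 300, 30))  # 390
```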
This is a lot of information for a user to collect.
Thinking forward, perhaps we can create an API endpoint that gathers a good chunk of this information in one call (permission complications may exist, but it could also be a way to make calls that tie the system's components together for testing).
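As a sketch of what such a collection helper could look like today: the endpoint paths below are assumptions for illustration and should be checked against the actual Grafana HTTP API before use.

```python
# Hypothetical helper listing the endpoints a diagnostic collection
# could poll. The paths are assumptions for illustration, not a
# confirmed Grafana API surface.
def diagnostic_urls(base_url):
    paths = [
        "/api/v1/provisioning/alert-rules",                # rule definitions
        "/api/v1/provisioning/policies",                   # notification policy tree
        "/api/v1/provisioning/contact-points",             # contact points
        "/api/alertmanager/grafana/api/v2/alerts/groups",  # active groups
    ]
    return [base_url.rstrip("/") + p for p in paths]

urls = diagnostic_urls("https://grafana.example.com/")
print(urls[0])
```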
Some of this information could also be added to documentation (or maybe the UX), in particular if it fits with what Josh was saying about users understanding the components better in order to understand the expected alerting behavior.