Grafana Managed Alert: Needed info for debugging issues

Grafana Managed Alert Rule Issue Debugging

Needed context

In order to figure out what is going on with a generic Grafana managed alerts issue, we can use (or need) the following information.

Less may be needed depending on how far along in the pipeline the issue is (if the issue is near the end, like notifications, then we may need everything that takes us to that point).

Configuration

Grafana Managed Rule Edit Alert Rule View:

  • The queries (and their data sources) and expressions
  • Example of data returned by the queries
  • The [Condition] drop down field setting (refId)
  • The [Evaluate Every] and [For] settings
  • The [NoData/Error Handling] dropdown options
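
Much of this rule configuration can be captured as JSON instead of screenshots. A minimal sketch, assuming a unified alerting install, an API key with read access, and the Cortex-style ruler endpoint /api/ruler/grafana/api/v1/rules (the URL, key, and endpoint path are assumptions about the deployment):

```python
import json

import requests

GRAFANA_URL = "http://localhost:3000"               # assumed local Grafana instance
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed API key

# Fetch every Grafana managed rule group; the payload should include the
# queries/expressions, condition refId, For duration, and NoData/Error settings.
resp = requests.get(f"{GRAFANA_URL}/api/ruler/grafana/api/v1/rules", headers=HEADERS)
resp.raise_for_status()

# Pretty-print so the definitions can be attached to an issue report.
print(json.dumps(resp.json(), indent=2))
```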

From the Notification Policy View:

  • The notification policy that will apply to the above alert rule and its settings:
    • [Group by] label matchers
    • Timings: [Group wait], [Group interval], [Repeat interval]
    • The associated contact point

From the Contact Point View:

  • All settings that are not sensitive and/or personally/company identifiable
  • Also, if a [template] or sub-template is referenced instead of being written entirely in the field, that template should also be included.

Also possibly any active or recently active silences.
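
The notification policy tree, contact points, and silences can likewise be exported for an issue report. A sketch assuming the Alertmanager-compatible endpoints /api/alertmanager/grafana/config/api/v1/alerts and /api/alertmanager/grafana/api/v2/silences (paths and credentials are assumptions); remember to redact anything sensitive before sharing:

```python
import json

import requests

GRAFANA_URL = "http://localhost:3000"               # assumed local Grafana instance
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed API key

# Notification policy tree and contact points (redact secrets before sharing).
cfg = requests.get(
    f"{GRAFANA_URL}/api/alertmanager/grafana/config/api/v1/alerts",
    headers=HEADERS,
)
cfg.raise_for_status()
print(json.dumps(cfg.json(), indent=2))

# Active (and recently expired) silences.
silences = requests.get(
    f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/silences",
    headers=HEADERS,
)
silences.raise_for_status()
print(json.dumps(silences.json(), indent=2))
```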

Runtime

Edit Alert Rule view:

  • Screenshots of the data from "run queries" can help, as well as the output of "test alert"

Administrative:

  • Grafana logs with debug enabled (I am not sure we currently have a consistent pattern to get only alerting-related entries)
  • Main alerting database tables: alert_rule, alert_instance, alert_configuration. (There are other tables that matter for things like permissions.)
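
For installs that use the default SQLite database, those tables can be dumped with a few lines of Python. A rough sketch, assuming the default grafana.db location (MySQL/Postgres deployments would need the equivalent queries against their database):

```python
import json
import sqlite3

DB_PATH = "/var/lib/grafana/grafana.db"  # assumed default SQLite location

conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row

# Dump the main alerting tables so they can be attached to an issue report.
for table in ("alert_rule", "alert_instance", "alert_configuration"):
    rows = [dict(r) for r in conn.execute(f"SELECT * FROM {table}")]
    print(f"--- {table} ({len(rows)} rows) ---")
    print(json.dumps(rows, indent=2, default=str))

conn.close()
```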

Dashboard view:

  • A screenshot of the Grafana annotations that are created for the alert rule (I believe a dashboard/panel has to be associated with it for these to exist, or perhaps just for them to show up?)
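
Those annotations can also be pulled through the annotations API instead of screenshotted. A sketch assuming the /api/annotations endpoint and a type=alert filter (the filter value and time window are assumptions):

```python
import json
import time

import requests

GRAFANA_URL = "http://localhost:3000"               # assumed local Grafana instance
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed API key

# Alert state-change annotations for the last 24 hours (times are epoch milliseconds).
now_ms = int(time.time() * 1000)
params = {"type": "alert", "from": now_ms - 24 * 3600 * 1000, "to": now_ms}

resp = requests.get(f"{GRAFANA_URL}/api/annotations", headers=HEADERS, params=params)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```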

Alert Overview view:

  • Expanded view of the alert rule (that shows the instances).

Alert Groups view:

  • Active alert groups, perhaps multiple screenshots over time.
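
Instead of (or in addition to) repeated screenshots, the group data can be polled. A sketch assuming the Alertmanager-compatible /api/alertmanager/grafana/api/v2/alerts/groups endpoint (path and credentials are assumptions):

```python
import json
import time

import requests

GRAFANA_URL = "http://localhost:3000"               # assumed local Grafana instance
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed API key

# Capture the active alert groups once a minute for five minutes.
for _ in range(5):
    resp = requests.get(
        f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/alerts/groups",
        headers=HEADERS,
    )
    resp.raise_for_status()
    print(time.strftime("%Y-%m-%dT%H:%M:%S"), json.dumps(resp.json()))
    time.sleep(60)
```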

Alert Component Pipeline

This describes the alert cycle through the various alerting components (the "alert pipeline", naming?) and how the above pieces of information fit into it. (Note: see image in gist.)

Rule Stage

The rule stage consists of all the pieces before sending to Grafana's embedded Alert Manager component. It is analogous to how Prometheus and Alertmanager interact, except with some extra features to accommodate other data sources, and with another notable difference: Prometheus can send notifications, whereas the Alert Manager component is required for notifications with Grafana managed alerts.

Scheduler

The scheduler loads alert rules and signals the Evaluation component to run each alert rule on the [Evaluate Every] interval set in the Rule Edit view. It also provides information to the State Manager component and coordinates other components.

Since the scheduler has no direct UI, its information is found in Grafana's logs.
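
Until there is a better pattern, a crude filter over the log file can at least narrow things down. A sketch assuming the default log location and that alerting components tag their lines with an "ngalert" logger name (both are assumptions):

```python
LOG_PATH = "/var/log/grafana/grafana.log"  # assumed default log path

# Keep only lines that look alerting-related; the "ngalert" logger name is
# an assumption about how the unified alerting components label their output.
with open(LOG_PATH) as f:
    for line in f:
        if "ngalert" in line or "alerting" in line:
            print(line, end="")
```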

Evaluation

Runs the queries and expressions from an Alert Rule and determines whether each Alert Instance is alerting or not (many instances can come from one rule, and each gets its own state management). Those instances are passed to the state manager.
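
As a conceptual illustration (not Grafana code): each series returned by the rule's queries becomes its own alert instance, keyed by its labels, and each instance is checked against the condition independently:

```python
# One rule, two series, two alert instances (values and labels are made up).
series = {
    ("host", "web-1"): 0.92,  # last value of the queried metric
    ("host", "web-2"): 0.41,
}
threshold = 0.8               # e.g. a [Condition] of "is above 0.8"

# Each labeled series gets its own firing/not-firing result.
instances = {labels: value > threshold for labels, value in series.items()}
print(instances)  # {('host', 'web-1'): True, ('host', 'web-2'): False}
```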

State Manager

Detects state changes and also handles the For/Pending logic. The [For] setting and the [NoData/Error Handling] options can affect these state transitions.

When an alert instance has a state change to Alerting from another state, or from Alerting to Normal, the alert instance is sent to the Alert Manager component. The Scheduler also makes sure that the Alert Manager component is kept updated about active alert instances (since if alert instances stop being sent, the Alert Manager will consider them resolved after a period of time).

Alert instances are saved to the alert_instance table in the database, and the Grafana alert annotations in the Dashboard view can show when the state manager detected a state change.
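
A toy model of that For/Pending behavior (this is an illustration of the idea, not Grafana's actual state manager code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstanceState:
    state: str = "Normal"                 # Normal, Pending, or Alerting
    firing_since: Optional[float] = None  # when the condition first held (epoch seconds)

def transition(current, is_firing, now, for_seconds):
    """The condition must hold for the whole [For] duration before an
    instance moves from Pending to Alerting."""
    if not is_firing:
        return InstanceState("Normal", None)
    started = current.firing_since if current.firing_since is not None else now
    if now - started >= for_seconds:
        return InstanceState("Alerting", started)
    return InstanceState("Pending", started)

# Example with [For] = 300s: Pending at 0s and 120s, Alerting once 300s have passed.
s = InstanceState()
for t in (0, 120, 300):
    s = transition(s, True, now=t, for_seconds=300)
    print(t, s.state)
```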

Additional

The rule stage handles rendering of templates in alert rule annotations, whereas the Alert Manager renders contact point templates (which may reference annotations that were already rendered before arriving at the Alert Manager).

Alert Manager Stage

The Alert Manager component handles:

  • Sending alert instances from their associated alert rules to the correct notification policy and contact points.
  • Grouping of alert instances, based on the [Group by] label matchers
  • Deduplication of alert instances
  • Dispatching of notifications for alert instance groups based on the associated timings: [Group wait], [Group interval], [Repeat interval]
  • Rendering of notification templates

The groups that have been formed can be viewed in Grafana's Alert Groups view.
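
A conceptual sketch of that grouping step (again, an illustration rather than the actual Alert Manager implementation): instances with identical values for the [Group by] labels end up in the same notification group.

```python
from collections import defaultdict

def group_alerts(instances, group_by):
    """Group alert instances by the values of the [Group by] labels."""
    groups = defaultdict(list)
    for instance in instances:
        key = tuple(instance["labels"].get(label, "") for label in group_by)
        groups[key].append(instance)
    return groups

instances = [
    {"labels": {"alertname": "HighCPU", "host": "web-1"}},
    {"labels": {"alertname": "HighCPU", "host": "web-2"}},
    {"labels": {"alertname": "DiskFull", "host": "web-1"}},
]

# Grouping by alertname yields two groups: HighCPU (2 instances) and DiskFull (1).
print(dict(group_alerts(instances, ["alertname"])))
```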

Note on Timings

Explaining how the timings work together across the components in the context of Grafana managed alerts is a TODO. But https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html applies, since the Rule Stage above is like Prometheus and the Alert Manager stage is like Alertmanager.
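
As a rough worst-case illustration in the spirit of that article (the numbers are made-up example values, and this ignores NoData/Error handling and repeat intervals):

```python
# Worst case: the condition becomes true just after an evaluation, so it is
# first seen one interval later, must stay true for the [For] duration, is
# promoted to Alerting on the next evaluation after that, and then waits out
# [Group wait] before the first notification goes out.
evaluate_every = 60   # seconds, [Evaluate Every]
for_duration = 300    # seconds, [For]
group_wait = 30       # seconds, [Group wait]

worst_case = evaluate_every + for_duration + evaluate_every + group_wait
print(worst_case, "seconds")  # 450 seconds, i.e. 7.5 minutes
```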

Future Development around troubleshooting alerts

This is a lot of information for a user to collect.

Thinking forward, perhaps we can create an API endpoint that gets a good chunk of this information in one call (permission complications may exist, but it could also be a way to make calls that tie the system's components together for testing as well).

Some of this information could also be added to documentation (or maybe the UX), in particular if it fits with what Josh was saying about users understanding the components better in order to understand the expected alerting behavior.
