Kafka Alerting with kPow, Prometheus and Alertmanager

This article covers setting up alerting with kPow using Prometheus and Alertmanager.

Introduction

kPow was built from our own need to monitor Kafka clusters and related resources (e.g. Kafka Streams, Kafka Connect and Schema Registries).

Through kPow's user interface we can detect and even predict potential problems with Kafka such as:

  • Replicas that have gone out of sync
  • Consumer group assignments that are lagging above a certain threshold
  • Topic growth that will exceed a quota

But how can we alert teams as soon as these problems occur? kPow does not provide its own alerting functionality but instead integrates with Prometheus for a modern alerting solution.

Why don't we natively support alerting? We feel that a dedicated product like Prometheus is better suited to alerting than kPow itself, because your organization's alerting needs almost certainly go beyond Kafka. Managing alerting from a centralized service like Prometheus therefore makes sense.

Don't use Prometheus? Fear not, almost every major observability tool on the market today supports Prometheus metrics. For example, Grafana Cloud supports Prometheus alerts out of the box.

This article will demonstrate how to set up kPow with Prometheus and Alertmanager. We will provide useful example configuration to help you get started defining your own alerts for when things go wrong with your Kafka cluster.

Architecture

Here is the basic architecture of alerting with Prometheus:

[Figure: alerting architecture with kPow, Prometheus and Alertmanager]

Alerts are defined in Prometheus configuration. Prometheus pulls metrics from all client applications (including kPow) and evaluates the alerting rules against them. When a rule's condition is met, Prometheus pushes the alert to the Alertmanager service, which manages alerts through its pipeline of silencing, inhibition, grouping and notification. In practice this means Alertmanager takes care of deduplicating, grouping and routing alerts to the correct integration, such as Slack, email or Opsgenie.
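
For context, Prometheus is told where to push fired alerts via the alerting block of its own configuration. Below is a minimal sketch, assuming Alertmanager is reachable from Prometheus at the hostname alertmanager on Alertmanager's default port 9093 (adjust the target to match your deployment):

alerting:
  alertmanagers:
    - static_configs:
        # assumed hostname/port; use whatever Alertmanager is reachable at in your setup
        - targets: ['alertmanager:9093']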

About kPow's metrics

The unique thing about kPow as a product is that we calculate our own telemetry about your Kafka Cluster and related resources.

This has a ton of advantages:

  • No dependency on Kafka's own JMX metrics, which allows us to integrate with other Kafka-like systems such as Azure Event Hubs or Red Hat AMQ Streams. This also makes installation and configuration much more frictionless!
  • From our observations about your Kafka cluster, we calculate a wider range of Kafka metrics, including group and topic offset deltas! That is, we aggregate and compute metrics over time.
  • This same pattern applies to other supported resources such as Kafka Connect, Kafka Streams and Schema Registry metrics

[Figure: kPow's user interface, powered by our computed metrics]

Setup

We have provided a docker-compose.yml configuration that starts kPow, a 3-node Kafka cluster and Prometheus + Alertmanager. It can be found in the kpow-local repository on GitHub. If you are new to kPow, instructions on how to start a 14-day trial can also be found in the repo.

git clone https://github.com/operatr-io/kpow-local.git
cd kpow-local
vi local.env # add your LICENSE details, see kpow-local README.md
docker-compose up

Once the Docker Compose environment is running:

  • Alertmanager's web UI will be reachable on port 9001
  • Prometheus' web UI will be reachable on port 9090
  • kPow's web UI will be reachable on port 3000

The remainder of this tutorial will be based on the Docker Compose environment.

Prometheus configuration

A single instance of kPow can observe and monitor multiple Kafka clusters and related resources! This makes kPow a great aggregator for your entire Kafka deployment across multiple environments, as a single Prometheus endpoint served by kPow can provide metrics about all of your Kafka resources.

When kPow starts up, it logs the various Prometheus endpoints available.

These endpoints allow Prometheus to consume only a subset of metrics (e.g. metrics about a specific consumer group or resource).

To have Prometheus pull all metrics, add this entry to your scrape_configs:

scrape_configs:
  - job_name: 'kpow'
    metrics_path: '/metrics/v1'
    static_configs:
      - targets: ['kpow:3000']

Note: you will need to provide a reachable target. In this example kPow is reachable at kpow:3000 (Prometheus scrapes over HTTP by default, so no scheme is included in the target).

Within your Prometheus config, you will also need to specify the location of your rules file:

rule_files:
  - kpow-rules.yml

Our kpow-rules.yml file looks something like:

groups:
- name: Kafka
  rules:
  # Example rules in section below

We have a single alert group called Kafka. The collection of rules is explained in the next section.

The sample kpow-rules.yml and alertmanager.yml config can be found here. In this example, Alertmanager sends all fired alerts to a Slack webhook.
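
For reference, a minimal alertmanager.yml along these lines might look like the sketch below; the webhook URL and channel are placeholders rather than the values used in the sample config:

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'id']

receivers:
  - name: 'slack-notifications'
    slack_configs:
      # Placeholder Slack webhook URL and channel - substitute your own
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#kafka-alerts'
        send_resolved: true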

kPow Metric Structure

A glossary of available Prometheus metrics from kPow can be found here.

All kPow metrics follow a similar labelling convention:

  • domain - the category of metric (for example cluster, connect, streams)
  • id - the unique identifier of the category (for example Kafka Cluster ID)
  • target - the identifier of the metric (for example consumer group, topic name etc)
  • env - an optional, human-readable label for the resource (for example an environment name like Trade Book Staging)

For example, the metric:

group_state{domain="cluster",id="6Qw4099nSuuILkCkWC_aNw",target="tx_partner_group4",env="Trade_Book__Staging_",} 4.0 1619060220000

This relates to the Kafka cluster with id 6Qw4099nSuuILkCkWC_aNw (labelled Trade Book Staging) and the consumer group tx_partner_group4.
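
These labels make it easy to scope PromQL queries (and alert expressions) to a particular environment, cluster or resource. A small sketch using the label values from the example above:

# All consumer group states reported for the "Trade_Book__Staging_" environment
group_state{domain="cluster", env="Trade_Book__Staging_"}

# The state of a single consumer group on a specific cluster
group_state{id="6Qw4099nSuuILkCkWC_aNw", target="tx_partner_group4"}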

Example Prometheus Rules

The remainder of this section will provide example Prometheus rules for common alerting scenarios.

Alerting when a Consumer Group is unhealthy

- alert: UnhealthyConsumer
  expr: group_state == 0 or group_state == 1 or group_state == 2
  for: 5m
  annotations:
    summary: "Consumer {{ $labels.target }} is unhealthy"
    description:  "The Consumer Group {{ $labels.target }} has gone into {{ $labels.state }} for cluster {{ $labels.id }}"

Here, the group_state metric from kPow is exposed as a gauge whose value is the ordinal of the ConsumerGroupState enum. The expr tests whether any consumer group has entered the DEAD, EMPTY or UNKNOWN state.

The for clause causes Prometheus to wait for a certain duration (in this case 5 minutes) between first encountering a new expression output vector element and counting an alert as firing for that element.

The annotations section then provides a human-readable alert description that states which consumer group has entered an unhealthy state. The group_state metric also carries a state label containing the human-readable value of the state (e.g. STABLE).

Alerting when a Kafka Connect task is unhealthy

Similar to our consumer group configuration, we can alert when we detect a connector task has gone into an ERROR state.

- alert: UnhealthyConnectorTask
  expr: connect_connector_task_state != 1
  for: 5m
  annotations:
    summary: "Connect task {{ $labels.target }} is unhealthy"
    description:  "The Connector task {{ $labels.target }} has entered an unhealthy state for cluster {{ $labels.id }}"

- alert: UnhealthyConnector
  expr: connect_connector_state != 1
  for: 5m
  annotations:
    summary: "Connector {{ $labels.target }} is unhealthy"
    description:  "The Connector {{ $labels.target }} has entered an unhealthy state for cluster {{ $labels.id }}"

Here we have configured two alerts: one that fires if an individual connector task enters an error state, and one that fires if the connector itself enters an error state. The value 1 represents the RUNNING state.

Alerting when a consumer group is lagging above a threshold

In this example, Prometheus will fire an alert if any consumer group's lag exceeds 5000 messages for more than 5 minutes.

We can configure a similar alert for host_offset_lag to monitor individual lagging hosts, or even broker_offset_lag for lag behind brokers; a sketch of the host-level variant follows the rule below.

- alert: LaggingConsumerGroup
  expr: group_offset_lag > 5000
  for: 5m
  annotations:
    summary: "Consumer group {{ $labels.target }} is lagging"
    description:  "Consumer group {{ $labels.target }} is lagging for cluster {{ $labels.id }}"

Alerting when the kPow instance is down

- alert: KpowDown
  expr: up{job="kpow"} == 0
  for: 1m
  annotations:
    summary: "kPow is down"
    description:  "kPow instance {{ $labels.target }} has been down for more than 1 minute."

Conclusion

This article has demonstrated how you can build out a modern alerting system with kPow and Prometheus.

Source code for the configuration, including a demo docker-compose.yml of the setup, can be found here.

As more and more observability services support Prometheus metrics, similar integrations with services such as Grafana Cloud or New Relic are possible. All of these services provide equally compelling alerting solutions.

What's even more exciting for us personally is Amazon's Managed Service for Prometheus, which is currently in preview. This service looks set to make Prometheus monitoring of containerized applications at scale easy!

While Prometheus metrics are what kPow exposes for data egress today, please get in touch if you would like alternative metric egress formats such as webhooks or even a JMX connection - we'd love to know your use case!
