
Monitoring Consul and Consul Service Mesh on OpenShift

HashiCorp Support Recommended Practices


1 — Why monitor?

Consul is a distributed system that relies on Raft consensus, RPC-based discovery, and (optionally) an Envoy dataplane. In production, continuous telemetry gives early warning of:

  • Control-plane risk – leader churn, slow Raft commits, memory leaks, failing ACL checks.
  • Dataplane bottlenecks – retry storms, listener errors, xDS update failures.
  • Resource exhaustion – goroutine explosions, rising BoltDB freelist, DNS query spikes.

The guidance below uses Prometheus + Alertmanager (the default stack on OpenShift) but mirrors the same metric names you’ll see if you ingest via Telegraf-InfluxDB or ship to Grafana Cloud.


2 — Architecture at a glance

```
Consul agents (servers & clients)  ─┐
                                    │  /metrics (Prometheus format)
Envoy proxies (sidecars, gateways) ─┘
        │
        ▼
Prometheus Operator (OpenShift) ──▶  Alertmanager  ──▶  Slack / PagerDuty
        │                                ▲
        └──▶  Grafana dashboards ────────┘
```

Use the kube-rbac-proxy sidecar if you need RBAC-scoped endpoints.
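
Before Prometheus can scrape anything, the agents and sidecars must expose telemetry. A minimal sketch of the consul-k8s Helm values that turn this on (key names follow the consul-k8s chart; verify them against your chart version):

```yaml
# values.yaml (consul-k8s Helm chart): telemetry sketch
global:
  metrics:
    enabled: true                    # expose Consul and sidecar metrics for scraping
    enableAgentMetrics: true         # serve agent metrics at /v1/agent/metrics
    agentMetricsRetentionTime: "1m"  # sets prometheus_retention_time on the agents
connectInject:
  metrics:
    defaultEnabled: true             # merge Envoy metrics into injected pods by default
```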


3 — Key metrics & PromQL queries

| Layer | Metric (Prometheus name) | PromQL example | Why it matters |
|---|---|---|---|
| Raft health | `consul_raft_leader_last_contact_seconds` | `max(consul_raft_leader_last_contact_seconds)` | Spikes > 1 s = network partitions or CPU stall |
| | `consul_raft_commit_time_seconds` (histogram) | `histogram_quantile(0.95, rate(consul_raft_commit_time_seconds_bucket[5m]))` | 95th percentile > 100 ms signals disk I/O or quorum latency |
| KVS / Txn latency | `consul_kvs_apply_time_seconds` | `rate(consul_kvs_apply_time_seconds_sum[5m]) / rate(consul_kvs_apply_time_seconds_count[5m])` | Rising mean shows storage or scheduler backlog |
| RPC saturation | `consul_client_rpc` | `sum by(method)(rate(consul_client_rpc[1m]))` | Sudden burst = service registration flood |
| Runtime gauges | `consul_runtime_num_goroutines` | `max(consul_runtime_num_goroutines{job="consul-servers"})` | Linear climb = goroutine leak |
| | `consul_runtime_alloc_bytes` | `max_over_time(consul_runtime_alloc_bytes[30m])` | Memory footprint trend |
| BoltDB | `consul_raft_boltdb_freelist_bytes` | `rate(consul_raft_boltdb_freelist_bytes[5m])` | Large freelist → plan BoltDB maintenance or a WAL migration |
| DNS / discovery | `consul_dns_domain_query_total` | `rate(consul_dns_domain_query_total[1m])` | Triage query storms |
| ACL | `consul_acl_blocked_service_registration_total` | `rate(consul_acl_blocked_service_registration_total[5m])` | Detect policy gaps that break deployments |
| Autopilot | `consul_autopilot_healthy` (0/1) | `min(consul_autopilot_healthy)` | 0 indicates failed redundancy checks |
| Dataplane (Envoy) | `envoy_cluster_upstream_rq{job="sidecar"}` | `sum(rate(envoy_cluster_upstream_rq{response_code!~"2.."}[1m]))` | Non-2xx > baseline → upstream failures |
| | `envoy_listener_downstream_cx_active` | `max_over_time(envoy_listener_downstream_cx_active[5m])` | Sudden drop = listener crash/rotation |
| | `envoy_server_live` | `sum by(pod)(envoy_server_live)` | 0 = proxy not healthy / xDS rejected |
| | `envoy_cluster_update_success` | `sum(rate(envoy_cluster_update_success[5m])) / sum(rate(envoy_cluster_update_attempt[5m]))` | < 1 indicates frequent config rejects |
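
The longer expressions above are cheaper to dashboard and alert on if they are pre-computed. A sketch of Prometheus recording rules for two of them (the `consul:*` rule names are illustrative, not standard):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: consul-recording-rules
  namespace: monitoring
spec:
  groups:
  - name: consul.recording
    interval: 30s
    rules:
    # 95th-percentile Raft commit time (same expression as the table above)
    - record: consul:raft_commit_time_seconds:p95
      expr: histogram_quantile(0.95, rate(consul_raft_commit_time_seconds_bucket[5m]))
    # mean KV apply latency over 5m
    - record: consul:kvs_apply_time_seconds:mean5m
      expr: rate(consul_kvs_apply_time_seconds_sum[5m]) / rate(consul_kvs_apply_time_seconds_count[5m])
```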

Consul WAL LogStore (Raft) – Key metrics & PromQL queries

Note: The Raft WAL LogStore is enabled by default on new clusters running Consul v1.20.x or later.
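
For older clusters still on BoltDB, the backend is selected in the server configuration. A sketch of opting in, with the online verifier enabled, via the Helm chart's `server.extraConfig` (JSON keys follow Consul's `raft_logstore` documentation; confirm against your Consul version):

```yaml
# values.yaml fragment: opt a server into the WAL backend with verification
server:
  extraConfig: |
    {
      "raft_logstore": {
        "backend": "wal",
        "verification": {
          "enabled": true,
          "interval": "60s"
        }
      }
    }
```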

| Category | Metric / Panel | PromQL query | Purpose / Why it matters |
|---|---|---|---|
| Checksum integrity | Read-side checksum failures | `increase(consul_raft_logstore_verifier_read_checksum_failures[5m])` | Detects on-disk corruption; any increase should trigger an immediate page and a revert to BoltDB. |
| | Write-side checksum failures | `increase(consul_raft_logstore_verifier_write_checksum_failures[5m])` | Catches in-flight (network/software) corruption between leader → follower; still demands prompt investigation. |
| Commit latency | 95th-percentile commit time | `histogram_quantile(0.95, rate(consul_raft_commitTime_bucket[5m]))` | WAL should be as fast as or faster than BoltDB; rising values imply follower lag or an I/O regression. |
| Follower disk flush | 95th-percentile AppendEntries.storeLogs | `histogram_quantile(0.95, rate(consul_raft_rpc_appendEntries_storeLogs_bucket[5m]))` | Measures follower disk write cost; WAL-enabled followers should not be slower than BoltDB peers. |
| Leader→follower replication | 95th-percentile AppendEntries.rpc | `histogram_quantile(0.95, rate(consul_raft_replication_appendEntries_rpc_bucket[5m]))` | A high delta vs. follower storeLogs reveals Raft RPC queuing unrelated to the backend that still impacts commit time. |
| Log compaction | 95th-percentile compactLogs | `histogram_quantile(0.95, rate(consul_raft_compactLogs_bucket[5m]))` | WAL must not increase compaction latency; spikes could indicate fragmentation or slow fsync. |
| Leader disk flush | 95th-percentile leader.dispatchLog | `histogram_quantile(0.95, rate(consul_raft_leader_dispatchLog_bucket[5m]))` | Only relevant when a WAL-enabled server is leader; should match or beat historical BoltDB numbers. |

Consul Control-plane – Key metrics & PromQL queries

| Category | Metric / Panel | PromQL query | Purpose / Why it matters |
|---|---|---|---|
| Raft health | 95th-percentile commit time | `histogram_quantile(0.95, rate(consul_raft_commitTime_bucket[5m]))` | Detects disk/network latency that slows consensus; >100 ms sustained indicates risk of write stalls or split-brain. |
| | Commits per 5 min | `rate(consul_raft_apply[5m])` | Measures write throughput; a sudden drop can signal leader unavailability or storage pausing. |
| | Leader last-contact | `max(consul_raft_leader_lastContact)` | Time since followers heard from the leader; values >1 s warn of network partitions or CPU starvation. |
| | Election events | `rate(consul_raft_state_candidate[1m])` | Frequent elections reveal control-plane instability that can cascade to clients. |
| Cluster safety | Autopilot healthy flag | `min(consul_autopilot_healthy)` | Boolean verdict; 0 means redundancy checks failed (e.g., too few voters). |
| DNS load | DNS queries/s | `rate(consul_dns_domain_query_count[5m])` | A high query rate may indicate service-discovery loops or misconfigured stubs. |
| | 95th-percentile DNS latency | `histogram_quantile(0.95, rate(consul_dns_domain_query_bucket[5m]))` | Tracks resolution latency experienced by workloads; spikes precede timeouts. |
| KV store | KV applies/s | `rate(consul_kvs_apply_count[5m])` | Surges often correlate with application redeploys or excessive leader writes. |
| | 95th-percentile KV latency | `histogram_quantile(0.95, rate(consul_kvs_apply_bucket[5m]))` | Elevated latency points to BoltDB pressure or Raft write congestion. |
| ACL activity | ACL resolves/s | `rate(consul_acl_ResolveToken_count[5m])` | Growth reflects authentication load; unexpected spikes may precede 403s. |
| | 95th-percentile ACL latency | `histogram_quantile(0.95, rate(consul_acl_ResolveToken_bucket[5m]))` | Slow token resolution delays every RPC and catalog write. |
| Catalog churn | Register + deregister rate | `rate(consul_catalog_register_count[5m]) + rate(consul_catalog_deregister_count[5m])` | Measures service volatility; high churn can saturate Raft and DNS. |
| | 95th-percentile catalog op time | `histogram_quantile(0.95, rate(consul_catalog_register_bucket[5m]))` | Prolonged operations hint at storage contention or heavy watch load. |

Consul Dataplane – Key metrics & PromQL queries

| Category | Metric / Panel | PromQL query | Purpose / Why it matters |
|---|---|---|---|
| Availability | Live Envoy instances | `sum(envoy_server_live{app=~"$service"})` | Drops expose pod crashes, liveness-probe failures, or xDS misconfiguration. |
| Traffic quality | Request success rate | `sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4\|5", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))` | A falling ratio (<99%) highlights upstream errors before users see them. |
| | Failed requests | `sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4\|5", consul_destination_service=~"$service"}[10m])) by (local_cluster)` | Pinpoints which service/cluster is producing 4xx/5xx bursts. |
| | Requests per second | `sum(rate(envoy_http_downstream_rq_total{service=~"$service", envoy_http_conn_manager_prefix="public_listener"}[5m])) by (service)` | Workload baseline; a sudden jump affects capacity and latency. |
| Cluster health | Unhealthy endpoints | `sum(envoy_cluster_membership_total{app=~"$service", envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_healthy{app=~"$service", envoy_cluster_name=~"$cluster"})` | Any non-zero value = a potentially unhealthy pod or failing check in the upstream cluster. |
| | All clusters healthy? | `(sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"})) == bool 0` | Boolean check for alerting; simplifies dashboards. |
| Resource usage | Envoy heap size | `sum(envoy_server_memory_heap_size{app=~"$service"})` | Detects memory leaks in proxies; trending up plus OOMKills is a red flag. |
| | Allocated memory | `sum(envoy_server_memory_allocated{app=~"$service"})` | Correlate with heap size to verify allocator effectiveness. |
| | Average uptime | `avg(envoy_server_uptime{app=~"$service"})` | Frequent restarts reset uptime; a good early crash signal. |
| Kubernetes pod health | CPU throttled seconds | `rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])` | CPU throttling increases latency even when Envoy appears idle. |
| | Memory usage % of limit | `100 * max(container_memory_working_set_bytes{namespace=~"$namespace"} / on(container,pod) label_replace(kube_pod_container_resource_limits{resource="memory"},"pod","$1","exported_pod","(.+)")) by (pod)` | Alerts operators before an OOM-kill terminates proxies. |
| | CPU usage % of limit | `100 * max(rate(container_cpu_usage_seconds_total{namespace=~"$namespace"}[5m]) / on(container,pod) label_replace(kube_pod_container_resource_limits{resource="cpu"},"pod","$1","exported_pod","(.+)")) by (pod)` | Helps right-size CPU requests/limits for sidecars. |
| Connections | Active upstream connections | `sum(envoy_cluster_upstream_cx_active{app=~"$service", envoy_cluster_name=~"$cluster"}) by (app, envoy_cluster_name)` | Rising open connections alongside high latency points to connection leaks or retries. |
| | Active downstream connections | `sum(envoy_http_downstream_cx_active{app=~"$service"})` | Indicates live client load; a sudden drop hints at listener failure. |
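
The success-rate expression is the longest query in this table, so it is a good candidate for a recording rule. A sketch (the `$service` variable is a Grafana template with no meaning to Prometheus, so the rule aggregates by destination service instead; the rule name is illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: envoy-recording-rules
  namespace: monitoring
spec:
  groups:
  - name: envoy.recording
    rules:
    # per-destination request success ratio over 10m
    - record: envoy:upstream_rq_success:ratio10m
      expr: |
        sum by (consul_destination_service) (irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4|5"}[10m]))
        /
        sum by (consul_destination_service) (irate(envoy_cluster_upstream_rq_xx[10m]))
```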

4 — Alert templates (PrometheusRule)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: consul-production-alerts
  namespace: monitoring
spec:
  groups:
  - name: consul.controlplane
    rules:
    - alert: ConsulLeaderUnreachable
      expr: max(consul_raft_leader_last_contact_seconds) > 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Consul leader not contacted in >1 s"
        description: |
          Raft leader lastContact has exceeded 1 second for {{ $labels.instance }}.
          Check network latency, CPU throttle, or disk stall.
    - alert: ConsulExcessiveGoroutines
      expr: max(consul_runtime_num_goroutines{job="consul-servers"}) > 10000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Possible goroutine leak on Consul servers"
  - name: consul.envoy
    rules:
    - alert: EnvoyHigh5xx
      expr: sum(rate(envoy_cluster_upstream_rq{response_code=~"5.."}[5m])) > 10
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "High 5xx responses observed in sidecars"
        runbook: https://developer.hashicorp.com/consul/docs/observe/grafana/dataplane
```

Threshold tuning: start with the values above, then baseline against your own 95th-percentile behavior for a week.
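
To complete the pipeline shown in section 2, route firing alerts by severity. A minimal Alertmanager sketch (receiver names, the Slack webhook, and the PagerDuty routing key are placeholders):

```yaml
# alertmanager.yaml sketch: page on critical, notify Slack otherwise
route:
  receiver: slack-default
  routes:
  - matchers:
    - severity="critical"
    receiver: pagerduty-oncall
receivers:
- name: slack-default
  slack_configs:
  - api_url: https://hooks.slack.com/services/REPLACE_ME  # placeholder webhook
    channel: '#consul-alerts'
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: REPLACE_ME  # placeholder PagerDuty Events v2 key
```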



Consul Control-plane alerts

| Alert name | PromQL expression (template) | For | Severity | Why it matters |
|---|---|---|---|---|
| ConsulLeaderUnreachable | `max(consul_raft_leader_lastContact_seconds) > 1` | 2m | critical | Followers haven't heard from the leader in >1 s ⇒ risk of split-brain. |
| ConsulFrequentElections | `rate(consul_raft_state_candidate[1m]) > 0.1` | 3m | warning | Elections >6/min show instability (network, CPU, or I/O throttling). |
| ConsulHighRaftCommitLatency | `histogram_quantile(0.95, rate(consul_raft_commitTime_seconds_bucket[5m])) > 0.1` | 5m | warning | 95th-percentile commit >100 ms → disk or network latency delaying writes. |
| ConsulRaftFollowerLag | `max(consul_raft_replication_appendEntries_rpc_seconds{quantile="0.99"}) > 0.15` | 5m | warning | Followers >150 ms behind the leader; watch for network jitter. |
| ConsulAutopilotUnhealthy | `min(consul_autopilot_healthy) == 0` | 1m | critical | Autopilot reports redundancy or peer-set issues. |
| ConsulACLBlockedServiceRegistration | `sum(rate(consul_acl_blocked_service_registration_total[5m])) > 0` | 2m | warning | New services can't register; policies are mis-scoped. |
| ConsulHighKVLatency | `histogram_quantile(0.95, rate(consul_kvs_apply_time_seconds_bucket[5m])) > 0.05` | 5m | warning | 95th-percentile KV apply >50 ms → write congestion. |
| ConsulHighDNSLatency | `histogram_quantile(0.95, rate(consul_dns_domain_query_seconds_bucket[5m])) > 0.02` | 3m | warning | Lookups >20 ms begin to hit application timeouts. |
| ConsulHighDNSQueryRate | `sum(rate(consul_dns_domain_query_total[1m])) > 5e4` | 1m | info | Surging queries may indicate discovery loops. |
| ConsulExcessiveGoroutines | `max(consul_runtime_num_goroutines{job="consul-servers"}) > 10000` | 5m | warning | Linear growth → goroutine leak / runaway watch. |
| ConsulMemoryGrowth | `increase(consul_runtime_alloc_bytes{job="consul-servers"}[30m]) > 2e9` | 30m | warning | +2 GiB in 30 min: possible leak. |
| ConsulBoltDBFreelistGrowth | `rate(consul_raft_boltdb_freelist_bytes[15m]) > 1e7` | 15m | warning | Freelist growing fast → plan BoltDB maintenance or a WAL migration. |
| ConsulWALChecksumFailure | `increase(consul_raft_logstore_verifier_read_checksum_failures[5m]) > 0 or increase(consul_raft_logstore_verifier_write_checksum_failures[5m]) > 0` | 0m | critical | Data-integrity error; roll the node back to BoltDB & escalate. |
| ConsulCatalogChurnSpike | `rate(consul_catalog_register_total[1m]) + rate(consul_catalog_deregister_total[1m]) > 200` | 1m | info | Unusual registration churn (deploy wave or flapping health checks). |
| ConsulClientRPCBurst | `sum(rate(consul_client_rpc[1m])) > 2e4` | 1m | info | Detects RPC floods that can saturate servers. |
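
As an example of turning a table row into a deployable rule, here is ConsulWALChecksumFailure expressed in the section 4 PrometheusRule template (a sketch; merge it into your existing rule group):

```yaml
- alert: ConsulWALChecksumFailure
  expr: |
    increase(consul_raft_logstore_verifier_read_checksum_failures[5m]) > 0
    or increase(consul_raft_logstore_verifier_write_checksum_failures[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Raft WAL checksum failure detected"
    description: "Data-integrity error; revert the node to BoltDB and open a support ticket."
```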

Consul Dataplane (Envoy) alerts

| Alert name | PromQL expression (template) | For | Severity | Why it matters |
|---|---|---|---|---|
| EnvoyLowSuccessRate | `1 - (sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4\|5"}[5m])) / sum(irate(envoy_cluster_upstream_rq_xx[5m]))) > 0.01` | 5m | critical | <99% success → customer-visible errors. |
| EnvoyHigh5xx | `sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m])) > 10` | 3m | warning | Surging 5xx from upstream services. |
| EnvoyHigh4xx | `sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="4"}[5m])) > 50` | 3m | info | Many 4xx may hint at a bad client or mis-routed mesh traffic. |
| EnvoyUnhealthyClusters | `(sum(envoy_cluster_membership_total) - sum(envoy_cluster_membership_healthy)) > 0` | 2m | warning | Non-zero = unhealthy endpoints; a cluster with 0 healthy endpoints breaks routing. |
| EnvoyXDSUpdateFailure | `increase(envoy_cluster_update_rejected[5m]) > 0` | 0m | critical | Config pushes failing; pods serving stale routes. |
| EnvoyListenerErrorRate | `rate(envoy_listener_downstream_cx_destroy_local_with_active_rq[5m]) > 5` | 5m | warning | Listener resets during active requests → mesh latency / drops. |
| EnvoyMemoryLeak | `increase(envoy_server_memory_allocated[30m]) > 5e8` | 30m | warning | +500 MiB in half an hour → leak or runaway buffers. |
| EnvoyFrequentRestarts | `resets(envoy_server_uptime[30m]) > 1` | 30m | warning | Proxies crashing/restarting repeatedly. |
| EnvoyHighCPUThrottling | `sum(rate(container_cpu_cfs_throttled_seconds_total{container=~"^envoy.*"}[5m])) > 5` | 5m | warning | Throttling >5 s per 5 m; latency spikes expected. |
| EnvoyHighMemoryUtilization | `container_memory_working_set_bytes / on(pod,container) kube_pod_container_resource_limits{resource="memory"} > 0.9` | 5m | warning | ≥90% of limit; an OOM-kill is imminent. |
| EnvoyRetryStorm | `sum(rate(envoy_cluster_upstream_rq_retry[1m])) > 20` | 2m | info | Excessive automatic retries amplify latency and load. |
| EnvoyHighRequestLatency | `histogram_quantile(0.95, rate(envoy_cluster_upstream_rq_time_bucket[5m])) > 300` | 5m | warning | 95th percentile >300 ms (Envoy reports rq_time in milliseconds) → slowness upstream or in the network path. |
| EnvoyActiveCxDrop | `delta(envoy_http_downstream_cx_active[5m]) < -100` | 5m | critical | Sudden loss of ≥100 downstream connections; possible listener crash or route removal. |
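
The same treatment applied to the most critical dataplane row, EnvoyLowSuccessRate (a sketch in the section 4 template):

```yaml
- alert: EnvoyLowSuccessRate
  expr: |
    1 - (
      sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4|5"}[5m]))
      /
      sum(irate(envoy_cluster_upstream_rq_xx[5m]))
    ) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Mesh-wide Envoy request success rate below 99%"
```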

Tune thresholds to your baseline RPS and SLO; these templates assume mid-sized clusters in production.


5 — Scraping Consul & Envoy on OpenShift

5.1 — Consul agents

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: consul-agents
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: consul
  endpoints:
  - port: http
    interval: 30s
    path: /v1/agent/metrics
    params:
      format: [prometheus]        # ?format=prometheus
```

For server pods served over HTTPS, add a tlsConfig to the endpoint, as sketched below.
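
A sketch of that endpoint with TLS enabled (swap insecureSkipVerify for a mounted CA in production):

```yaml
  endpoints:
  - port: https
    scheme: https
    interval: 30s
    path: /v1/agent/metrics
    params:
      format: [prometheus]
    tlsConfig:
      insecureSkipVerify: true  # or mount the Consul CA and set caFile/serverName
```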

5.2 — Envoy sidecars & gateways

Expose Envoy’s admin port (e.g. 9901) via a headless service and add a ServiceMonitor:

```yaml
endpoints:
- port: admin
  interval: 30s
  path: /stats/prometheus       # Envoy serves Prometheus metrics here, not /metrics
  relabelings:
  - sourceLabels: [__metrics_path__]
    targetLabel: envoy
```
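
A sketch of the headless Service that fronts the admin port (the pod selector is illustrative; use a label your sidecar-injected pods actually share):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: envoy-admin
  namespace: consul
  labels:
    app: envoy-admin   # matched by the ServiceMonitor selector
spec:
  clusterIP: None      # headless: one scrape target per pod
  selector:
    app: my-mesh-app   # illustrative: label shared by sidecar-injected pods
  ports:
  - name: admin
    port: 9901
    targetPort: 9901
```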

6 — Grafana dashboards

  • HashiCorp’s “Consul Control Plane” dashboard JSON ID 10539
  • HashiCorp’s “Consul Dataplane (Envoy)” dashboard JSON ID 10540

Import via Dashboards → + Import or automate with the Grafana Operator Dashboard CR.
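
A sketch of the declarative import with the Grafana Operator (API group and fields follow grafana-operator v5; the instance-selector label is illustrative):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: consul-control-plane
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana   # illustrative label set on your Grafana CR
  grafanaCom:
    id: 10539               # "Consul Control Plane" ID from the list above
```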


7 — Runbook integration

Attach annotations.runbook URLs on each alert pointing to your internal runbook repo or the relevant HashiCorp docs, as shown on the EnvoyHigh5xx alert in section 4.


8 — Ongoing validation

  • Weekly WAL integrity audit – review the checksum-failure counters consul_raft_logstore_verifier_read_checksum_failures and consul_raft_logstore_verifier_write_checksum_failures. They must stay at 0; any increment warrants rolling the node back to BoltDB and opening a support ticket.
  • Monthly BoltDB maintenance – review freelist growth (consul_raft_boltdb_freelist_bytes) and plan a migration to the WAL LogStore if it keeps climbing.
  • Quarterly failover test – kill the leader pod; ensure ConsulLeaderUnreachable fires once and clears.
  • Blue/green upgrade – validate that Envoy xDS success rate stays > 99 %.

Conclusion & Calls to Action

| Urgent (next 2 weeks) | Long-term (quarterly) |
|---|---|
| Deploy the ServiceMonitors & PrometheusRule above. | Automate BoltDB maintenance & snapshot verification. |
| Import the control-plane & dataplane dashboards. | Expand dataplane metrics to include L7 latency histograms. |
| Tune alert thresholds to your baseline. | Integrate alert webhooks with on-call escalation tools. |

Need deeper help tailoring queries or debugging alert noise? HashiCorp Support can provide a focused workshop—just let us know.
