
Monitoring Consul and Consul Service Mesh on OpenShift

HashiCorp Support Recommended Practices


1 — Why monitor?

Consul is a distributed system that relies on Raft consensus, RPC-based discovery, and (optionally) an Envoy dataplane. In production, continuous telemetry gives early warning of:

  • Control-plane risk – leader churn, slow Raft commits, memory leaks, failing ACL checks.
  • Dataplane bottlenecks – retry storms, listener errors, xDS update failures.
  • Resource exhaustion – goroutine explosions, rising BoltDB freelist, DNS query spikes.

The guidance below uses Prometheus + Alertmanager (the default stack on OpenShift) but mirrors the same metric names you’ll see if you ingest via Telegraf-InfluxDB or ship to Grafana Cloud.


2 — Architecture at a glance

```
Consul agents (servers & clients)  ─┐
                                    │  /metrics (Prometheus format)
Envoy proxies (sidecars, gateways) ─┘
        │
        ▼
Prometheus Operator (OpenShift) ──▶  Alertmanager  ──▶  Slack / PagerDuty
        │                                ▲
        └──▶  Grafana dashboards ────────┘
```

Use the kube-rbac-proxy sidecar if you need RBAC-scoped endpoints.
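
Before Prometheus can scrape anything, the agents and sidecars must expose telemetry. A minimal sketch of the consul-k8s Helm values that turn this on (key names follow the consul-k8s chart; verify them against your chart version):

```yaml
# values.yaml (consul-k8s Helm chart): telemetry sketch
global:
  metrics:
    enabled: true                    # expose Consul and sidecar metrics for scraping
    enableAgentMetrics: true         # serve agent metrics at /v1/agent/metrics
    agentMetricsRetentionTime: "1m"  # sets prometheus_retention_time on the agents
connectInject:
  metrics:
    defaultEnabled: true             # merge Envoy metrics into injected pods by default
```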


3 — Key metrics & PromQL queries

| Layer | Metric (Prometheus name) | PromQL example | Why it matters |
|---|---|---|---|
| Raft health | `consul_raft_leader_last_contact_seconds` | `max(consul_raft_leader_last_contact_seconds)` | Spikes > 1 s = network partitions or CPU stall |
| | `consul_raft_commit_time_seconds` (histogram) | `histogram_quantile(0.95, rate(consul_raft_commit_time_seconds_bucket[5m]))` | 95th percentile > 100 ms signals disk I/O or quorum latency |
| KVS / Txn latency | `consul_kvs_apply_time_seconds` | `rate(consul_kvs_apply_time_seconds_sum[5m]) / rate(consul_kvs_apply_time_seconds_count[5m])` | Rising mean shows storage or scheduler backlog |
| RPC saturation | `consul_client_rpc` | `sum by(method)(rate(consul_client_rpc[1m]))` | Sudden burst = service registration flood |
| Runtime gauges | `consul_runtime_num_goroutines` | `max(consul_runtime_num_goroutines{job="consul-servers"})` | Linear climb = goroutine leak |
| | `consul_runtime_alloc_bytes` | `max_over_time(consul_runtime_alloc_bytes[30m])` | Memory footprint trend |
| BoltDB | `consul_raft_boltdb_freelist_bytes` | `rate(consul_raft_boltdb_freelist_bytes[5m])` | Large freelist → plan BoltDB maintenance or a WAL migration |
| DNS / discovery | `consul_dns_domain_query_total` | `rate(consul_dns_domain_query_total[1m])` | Triage query storms |
| ACL | `consul_acl_blocked_service_registration_total` | `rate(consul_acl_blocked_service_registration_total[5m])` | Detect policy gaps that break deployments |
| Autopilot | `consul_autopilot_healthy` (0/1) | `min(consul_autopilot_healthy)` | 0 indicates failed redundancy checks |
| Dataplane (Envoy) | `envoy_cluster_upstream_rq{job="sidecar"}` | `sum(rate(envoy_cluster_upstream_rq{response_code!~"2.."}[1m]))` | Non-2xx > baseline → upstream failures |
| | `envoy_listener_downstream_cx_active` | `max_over_time(envoy_listener_downstream_cx_active[5m])` | Sudden drop = listener crash/rotation |
| | `envoy_server_live` | `sum by(pod)(envoy_server_live)` | 0 = proxy not healthy / xDS rejected |
| | `envoy_cluster_update_success` | `sum(rate(envoy_cluster_update_success[5m])) / sum(rate(envoy_cluster_update_attempt[5m]))` | < 1 indicates frequent config rejects |
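
The longer expressions above are cheaper to dashboard and alert on if they are pre-computed. A sketch of Prometheus recording rules for two of them (the `consul:*` rule names are illustrative, not standard):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: consul-recording-rules
  namespace: monitoring
spec:
  groups:
  - name: consul.recording
    interval: 30s
    rules:
    # 95th-percentile Raft commit time (same expression as the table above)
    - record: consul:raft_commit_time_seconds:p95
      expr: histogram_quantile(0.95, rate(consul_raft_commit_time_seconds_bucket[5m]))
    # mean KV apply latency over 5m
    - record: consul:kvs_apply_time_seconds:mean5m
      expr: rate(consul_kvs_apply_time_seconds_sum[5m]) / rate(consul_kvs_apply_time_seconds_count[5m])
```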

Consul WAL LogStore (Raft) – Key metrics & PromQL queries

Note: The Raft WAL LogStore is enabled by default on new clusters running Consul v1.20.x or later.
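
For older clusters still on BoltDB, the backend is selected in the server configuration. A sketch of opting in, with the online verifier enabled, via the Helm chart's `server.extraConfig` (JSON keys follow Consul's `raft_logstore` documentation; confirm against your Consul version):

```yaml
# values.yaml fragment: opt a server into the WAL backend with verification
server:
  extraConfig: |
    {
      "raft_logstore": {
        "backend": "wal",
        "verification": {
          "enabled": true,
          "interval": "60s"
        }
      }
    }
```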

| Category | Metric / Panel | PromQL query | Purpose / Why it matters |
|---|---|---|---|
| Checksum integrity | Read-side checksum failures | `increase(consul_raft_logstore_verifier_read_checksum_failures[5m])` | Detects on-disk corruption; any increase should trigger an immediate page and a revert to BoltDB. |
| | Write-side checksum failures | `increase(consul_raft_logstore_verifier_write_checksum_failures[5m])` | Catches in-flight (network/software) corruption between leader → follower; still demands prompt investigation. |
| Commit latency | 95th-percentile commit time | `histogram_quantile(0.95, rate(consul_raft_commitTime_bucket[5m]))` | WAL should be as fast as or faster than BoltDB; rising values imply follower lag or an I/O regression. |
| Follower disk flush | 95th-percentile AppendEntries.storeLogs | `histogram_quantile(0.95, rate(consul_raft_rpc_appendEntries_storeLogs_bucket[5m]))` | Measures follower disk write cost; WAL-enabled followers should not be slower than BoltDB peers. |
| Leader→follower replication | 95th-percentile AppendEntries.rpc | `histogram_quantile(0.95, rate(consul_raft_replication_appendEntries_rpc_bucket[5m]))` | A high delta vs. follower storeLogs reveals Raft RPC queuing unrelated to the backend that still impacts commit time. |
| Log compaction | 95th-percentile compactLogs | `histogram_quantile(0.95, rate(consul_raft_compactLogs_bucket[5m]))` | WAL must not increase compaction latency; spikes could indicate fragmentation or slow fsync. |
| Leader disk flush | 95th-percentile leader.dispatchLog | `histogram_quantile(0.95, rate(consul_raft_leader_dispatchLog_bucket[5m]))` | Only relevant when a WAL-enabled server is leader; should match or beat historical BoltDB numbers. |

Consul Control-plane – Key metrics & PromQL queries

| Category | Metric / Panel | PromQL query | Purpose / Why it matters |
|---|---|---|---|
| Raft health | 95th-percentile commit time | `histogram_quantile(0.95, rate(consul_raft_commitTime_bucket[5m]))` | Detects disk/network latency that slows consensus; >100 ms sustained indicates risk of write stalls or split-brain. |
| | Commits per 5 min | `rate(consul_raft_apply[5m])` | Measures write throughput; a sudden drop can signal leader unavailability or storage pausing. |
| | Leader last-contact | `max(consul_raft_leader_lastContact)` | Time since followers heard from the leader; values >1 s warn of network partitions or CPU starvation. |
| | Election events | `rate(consul_raft_state_candidate[1m])` | Frequent elections reveal control-plane instability that can cascade to clients. |
| Cluster safety | Autopilot healthy flag | `min(consul_autopilot_healthy)` | Boolean verdict; 0 means redundancy checks failed (e.g., too few voters). |
| DNS load | DNS queries/s | `rate(consul_dns_domain_query_count[5m])` | A high query rate may indicate service-discovery loops or misconfigured stubs. |
| | 95th-percentile DNS latency | `histogram_quantile(0.95, rate(consul_dns_domain_query_bucket[5m]))` | Tracks resolution latency experienced by workloads; spikes precede timeouts. |
| KV store | KV applies/s | `rate(consul_kvs_apply_count[5m])` | Surges often correlate with application redeploys or excessive leader writes. |
| | 95th-percentile KV latency | `histogram_quantile(0.95, rate(consul_kvs_apply_bucket[5m]))` | Elevated latency points to BoltDB pressure or Raft write congestion. |
| ACL activity | ACL resolves/s | `rate(consul_acl_ResolveToken_count[5m])` | Growth reflects authentication load; unexpected spikes may precede 403s. |
| | 95th-percentile ACL latency | `histogram_quantile(0.95, rate(consul_acl_ResolveToken_bucket[5m]))` | Slow token resolution delays every RPC and catalog write. |
| Catalog churn | Register + deregister rate | `rate(consul_catalog_register_count[5m]) + rate(consul_catalog_deregister_count[5m])` | Measures service volatility; high churn can saturate Raft and DNS. |
| | 95th-percentile catalog op time | `histogram_quantile(0.95, rate(consul_catalog_register_bucket[5m]))` | Prolonged operations hint at storage contention or heavy watch load. |

Consul Dataplane – Key metrics & PromQL queries

| Category | Metric / Panel | PromQL query | Purpose / Why it matters |
|---|---|---|---|
| Availability | Live Envoy instances | `sum(envoy_server_live{app=~"$service"})` | Drops expose pod crashes, liveness-probe failures, or xDS misconfiguration. |
| Traffic quality | Request success rate | `sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4\|5", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))` | A falling ratio (<99%) highlights upstream errors before users see them. |
| | Failed requests | `sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4\|5", consul_destination_service=~"$service"}[10m])) by (local_cluster)` | Pinpoints which service/cluster is producing 4xx/5xx bursts. |
| | Requests per second | `sum(rate(envoy_http_downstream_rq_total{service=~"$service", envoy_http_conn_manager_prefix="public_listener"}[5m])) by (service)` | Workload baseline; a sudden jump affects capacity and latency. |
| Cluster health | Unhealthy endpoints | `sum(envoy_cluster_membership_total{app=~"$service", envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_healthy{app=~"$service", envoy_cluster_name=~"$cluster"})` | Any non-zero value = a potentially unhealthy pod or failing check in the upstream cluster. |
| | All clusters healthy? | `(sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"})) == bool 0` | Boolean check for alerting; simplifies dashboards. |
| Resource usage | Envoy heap size | `sum(envoy_server_memory_heap_size{app=~"$service"})` | Detects memory leaks in proxies; trending up plus OOMKills is a red flag. |
| | Allocated memory | `sum(envoy_server_memory_allocated{app=~"$service"})` | Correlate with heap size to verify allocator effectiveness. |
| | Average uptime | `avg(envoy_server_uptime{app=~"$service"})` | Frequent restarts reset uptime; a good early crash signal. |
| Kubernetes pod health | CPU throttled seconds | `rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])` | CPU throttling increases latency even when Envoy appears idle. |
| | Memory usage % of limit | `100 * max(container_memory_working_set_bytes{namespace=~"$namespace"} / on(container,pod) label_replace(kube_pod_container_resource_limits{resource="memory"},"pod","$1","exported_pod","(.+)")) by (pod)` | Alerts operators before an OOM-kill terminates proxies. |
| | CPU usage % of limit | `100 * max(rate(container_cpu_usage_seconds_total{namespace=~"$namespace"}[5m]) / on(container,pod) label_replace(kube_pod_container_resource_limits{resource="cpu"},"pod","$1","exported_pod","(.+)")) by (pod)` | Helps right-size CPU requests/limits for sidecars. |
| Connections | Active upstream connections | `sum(envoy_cluster_upstream_cx_active{app=~"$service", envoy_cluster_name=~"$cluster"}) by (app, envoy_cluster_name)` | Rising open connections alongside high latency points to connection leaks or retries. |
| | Active downstream connections | `sum(envoy_http_downstream_cx_active{app=~"$service"})` | Indicates live client load; a sudden drop hints at listener failure. |
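
The success-rate expression is the longest query in this table, so it is a good candidate for a recording rule. A sketch (the `$service` variable is a Grafana template with no meaning to Prometheus, so the rule aggregates by destination service instead; the rule name is illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: envoy-recording-rules
  namespace: monitoring
spec:
  groups:
  - name: envoy.recording
    rules:
    # per-destination request success ratio over 10m
    - record: envoy:upstream_rq_success:ratio10m
      expr: |
        sum by (consul_destination_service) (irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4|5"}[10m]))
        /
        sum by (consul_destination_service) (irate(envoy_cluster_upstream_rq_xx[10m]))
```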

4 — Alert templates (PrometheusRule)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: consul-production-alerts
  namespace: monitoring
spec:
  groups:
  - name: consul.controlplane
    rules:
    - alert: ConsulLeaderUnreachable
      expr: max(consul_raft_leader_last_contact_seconds) > 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Consul leader not contacted in >1 s"
        description: |
          Raft leader lastContact has exceeded 1 second for {{ $labels.instance }}.
          Check network latency, CPU throttle, or disk stall.
    - alert: ConsulExcessiveGoroutines
      expr: max(consul_runtime_num_goroutines{job="consul-servers"}) > 10000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Possible goroutine leak on Consul servers"
  - name: consul.envoy
    rules:
    - alert: EnvoyHigh5xx
      expr: sum(rate(envoy_cluster_upstream_rq{response_code=~"5.."}[5m])) > 10
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "High 5xx responses observed in sidecars"
        runbook: https://developer.hashicorp.com/consul/docs/observe/grafana/dataplane
```

Threshold tuning: start with the values above, then baseline against your own 95th-percentile behavior for a week.
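
To complete the pipeline shown in section 2, route firing alerts by severity. A minimal Alertmanager sketch (receiver names, the Slack webhook, and the PagerDuty routing key are placeholders):

```yaml
# alertmanager.yaml sketch: page on critical, notify Slack otherwise
route:
  receiver: slack-default
  routes:
  - matchers:
    - severity="critical"
    receiver: pagerduty-oncall
receivers:
- name: slack-default
  slack_configs:
  - api_url: https://hooks.slack.com/services/REPLACE_ME  # placeholder webhook
    channel: '#consul-alerts'
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: REPLACE_ME  # placeholder PagerDuty Events v2 key
```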



Consul Control-plane alerts

| Alert name | PromQL expression (template) | For | Severity | Why it matters |
|---|---|---|---|---|
| ConsulLeaderUnreachable | `max(consul_raft_leader_lastContact_seconds) > 1` | 2m | critical | Followers haven't heard from the leader in >1 s ⇒ risk of split-brain. |
| ConsulFrequentElections | `rate(consul_raft_state_candidate[1m]) > 0.1` | 3m | warning | Elections >6/min show instability (network, CPU, or I/O throttling). |
| ConsulHighRaftCommitLatency | `histogram_quantile(0.95, rate(consul_raft_commitTime_seconds_bucket[5m])) > 0.1` | 5m | warning | 95th-percentile commit >100 ms → disk or network latency delaying writes. |
| ConsulRaftFollowerLag | `max(consul_raft_replication_appendEntries_rpc_seconds{quantile="0.99"}) > 0.15` | 5m | warning | Followers >150 ms behind the leader; watch for network jitter. |
| ConsulAutopilotUnhealthy | `min(consul_autopilot_healthy) == 0` | 1m | critical | Autopilot reports redundancy or peer-set issues. |
| ConsulACLBlockedServiceRegistration | `sum(rate(consul_acl_blocked_service_registration_total[5m])) > 0` | 2m | warning | New services can't register; policies are mis-scoped. |
| ConsulHighKVLatency | `histogram_quantile(0.95, rate(consul_kvs_apply_time_seconds_bucket[5m])) > 0.05` | 5m | warning | 95th-percentile KV apply >50 ms → write congestion. |
| ConsulHighDNSLatency | `histogram_quantile(0.95, rate(consul_dns_domain_query_seconds_bucket[5m])) > 0.02` | 3m | warning | Lookups >20 ms begin to hit application timeouts. |
| ConsulHighDNSQueryRate | `sum(rate(consul_dns_domain_query_total[1m])) > 5e4` | 1m | info | Surging queries may indicate discovery loops. |
| ConsulExcessiveGoroutines | `max(consul_runtime_num_goroutines{job="consul-servers"}) > 10000` | 5m | warning | Linear growth → goroutine leak / runaway watch. |
| ConsulMemoryGrowth | `increase(consul_runtime_alloc_bytes{job="consul-servers"}[30m]) > 2e9` | 30m | warning | +2 GiB in 30 min: possible leak. |
| ConsulBoltDBFreelistGrowth | `rate(consul_raft_boltdb_freelist_bytes[15m]) > 1e7` | 15m | warning | Freelist growing fast → plan BoltDB maintenance or a WAL migration. |
| ConsulWALChecksumFailure | `increase(consul_raft_logstore_verifier_read_checksum_failures[5m]) > 0 or increase(consul_raft_logstore_verifier_write_checksum_failures[5m]) > 0` | 0m | critical | Data-integrity error; roll the node back to BoltDB & escalate. |
| ConsulCatalogChurnSpike | `rate(consul_catalog_register_total[1m]) + rate(consul_catalog_deregister_total[1m]) > 200` | 1m | info | Unusual registration churn (deploy wave or flapping health checks). |
| ConsulClientRPCBurst | `sum(rate(consul_client_rpc[1m])) > 2e4` | 1m | info | Detects RPC floods that can saturate servers. |
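
As an example of turning a table row into a deployable rule, here is ConsulWALChecksumFailure expressed in the section 4 PrometheusRule template (a sketch; merge it into your existing rule group):

```yaml
- alert: ConsulWALChecksumFailure
  expr: |
    increase(consul_raft_logstore_verifier_read_checksum_failures[5m]) > 0
    or increase(consul_raft_logstore_verifier_write_checksum_failures[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Raft WAL checksum failure detected"
    description: "Data-integrity error; revert the node to BoltDB and open a support ticket."
```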

Consul Dataplane (Envoy) alerts

| Alert name | PromQL expression (template) | For | Severity | Why it matters |
|---|---|---|---|---|
| EnvoyLowSuccessRate | `1 - (sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4\|5"}[5m])) / sum(irate(envoy_cluster_upstream_rq_xx[5m]))) > 0.01` | 5m | critical | <99% success → customer-visible errors. |
| EnvoyHigh5xx | `sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m])) > 10` | 3m | warning | Surging 5xx from upstream services. |
| EnvoyHigh4xx | `sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="4"}[5m])) > 50` | 3m | info | Many 4xx may hint at a bad client or mis-routed mesh traffic. |
| EnvoyUnhealthyClusters | `(sum(envoy_cluster_membership_total) - sum(envoy_cluster_membership_healthy)) > 0` | 2m | warning | Non-zero = unhealthy endpoints; a cluster with 0 healthy endpoints breaks routing. |
| EnvoyXDSUpdateFailure | `increase(envoy_cluster_update_rejected[5m]) > 0` | 0m | critical | Config pushes failing; pods serving stale routes. |
| EnvoyListenerErrorRate | `rate(envoy_listener_downstream_cx_destroy_local_with_active_rq[5m]) > 5` | 5m | warning | Listener resets during active requests → mesh latency / drops. |
| EnvoyMemoryLeak | `increase(envoy_server_memory_allocated[30m]) > 5e8` | 30m | warning | +500 MiB in half an hour → leak or runaway buffers. |
| EnvoyFrequentRestarts | `resets(envoy_server_uptime[30m]) > 1` | 30m | warning | Proxies crashing/restarting repeatedly. |
| EnvoyHighCPUThrottling | `sum(rate(container_cpu_cfs_throttled_seconds_total{container=~"^envoy.*"}[5m])) > 5` | 5m | warning | Throttling >5 s per 5 m; latency spikes expected. |
| EnvoyHighMemoryUtilization | `container_memory_working_set_bytes / on(pod,container) kube_pod_container_resource_limits{resource="memory"} > 0.9` | 5m | warning | ≥90% of limit; an OOM-kill is imminent. |
| EnvoyRetryStorm | `sum(rate(envoy_cluster_upstream_rq_retry[1m])) > 20` | 2m | info | Excessive automatic retries amplify latency and load. |
| EnvoyHighRequestLatency | `histogram_quantile(0.95, rate(envoy_cluster_upstream_rq_time_bucket[5m])) > 300` | 5m | warning | 95th percentile >300 ms (Envoy reports rq_time in milliseconds) → slowness upstream or in the network path. |
| EnvoyActiveCxDrop | `delta(envoy_http_downstream_cx_active[5m]) < -100` | 5m | critical | Sudden loss of ≥100 downstream connections; possible listener crash or route removal. |
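
The same treatment applied to the most critical dataplane row, EnvoyLowSuccessRate (a sketch in the section 4 template):

```yaml
- alert: EnvoyLowSuccessRate
  expr: |
    1 - (
      sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4|5"}[5m]))
      /
      sum(irate(envoy_cluster_upstream_rq_xx[5m]))
    ) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Mesh-wide Envoy request success rate below 99%"
```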

Tune thresholds to your baseline RPS and SLO; these templates assume mid-sized clusters in production.


5 — Scraping Consul & Envoy on OpenShift

5.1 — Consul agents

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: consul-agents
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: consul
  endpoints:
  - port: http
    interval: 30s
    path: /v1/agent/metrics
    params:
      format: [prometheus]        # ?format=prometheus
```

For server pods served over HTTPS, add a tlsConfig to the endpoint, as sketched below.
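
A sketch of that endpoint with TLS enabled (swap insecureSkipVerify for a mounted CA in production):

```yaml
  endpoints:
  - port: https
    scheme: https
    interval: 30s
    path: /v1/agent/metrics
    params:
      format: [prometheus]
    tlsConfig:
      insecureSkipVerify: true  # or mount the Consul CA and set caFile/serverName
```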

5.2 — Envoy sidecars & gateways

Expose Envoy’s admin port (e.g. 9901) via a headless service and add a ServiceMonitor:

```yaml
endpoints:
- port: admin
  interval: 30s
  path: /stats/prometheus       # Envoy serves Prometheus metrics here, not /metrics
  relabelings:
  - sourceLabels: [__metrics_path__]
    targetLabel: envoy
```
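
A sketch of the headless Service that fronts the admin port (the pod selector is illustrative; use a label your sidecar-injected pods actually share):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: envoy-admin
  namespace: consul
  labels:
    app: envoy-admin   # matched by the ServiceMonitor selector
spec:
  clusterIP: None      # headless: one scrape target per pod
  selector:
    app: my-mesh-app   # illustrative: label shared by sidecar-injected pods
  ports:
  - name: admin
    port: 9901
    targetPort: 9901
```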

6 — Grafana dashboards

  • HashiCorp’s “Consul Control Plane” dashboard JSON ID 10539
  • HashiCorp’s “Consul Dataplane (Envoy)” dashboard JSON ID 10540

Import via Dashboards → + Import or automate with the Grafana Operator Dashboard CR.
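
A sketch of the declarative import with the Grafana Operator (API group and fields follow grafana-operator v5; the instance-selector label is illustrative):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: consul-control-plane
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana   # illustrative label set on your Grafana CR
  grafanaCom:
    id: 10539               # "Consul Control Plane" ID from the list above
```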


7 — Runbook integration

Attach annotations.runbook URLs on each alert pointing to your internal runbook repo or the relevant HashiCorp docs, as shown on the EnvoyHigh5xx alert in section 4.


8 — Ongoing validation

  • Weekly WAL integrity audit – review the checksum-failure counters consul_raft_logstore_verifier_read_checksum_failures and consul_raft_logstore_verifier_write_checksum_failures. They must stay at 0; any increment warrants rolling the node back to BoltDB and opening a support ticket.
  • Monthly BoltDB maintenance – review freelist growth (consul_raft_boltdb_freelist_bytes) and plan a migration to the WAL LogStore if it keeps climbing.
  • Quarterly failover test – kill the leader pod; ensure ConsulLeaderUnreachable fires once and clears.
  • Blue/green upgrade – validate that Envoy xDS success rate stays > 99 %.

Conclusion & Calls to Action

| Urgent (next 2 weeks) | Long-term (quarterly) |
|---|---|
| Deploy the ServiceMonitors & PrometheusRule above. | Automate BoltDB maintenance & snapshot verification. |
| Import the control-plane & dataplane dashboards. | Expand dataplane metrics to include L7 latency histograms. |
| Tune alert thresholds to your baseline. | Integrate alert webhooks with on-call escalation tools. |

Need deeper help tailoring queries or debugging alert noise? HashiCorp Support can provide a focused workshop—just let us know.
