HashiCorp Support Recommended Practices
Consul is a distributed system that relies on Raft consensus, RPC‐based discovery, and (optionally) an Envoy dataplane. In production, continuous telemetry gives early warning of:
- Control-plane risk – leader churn, slow Raft commits, memory leaks, failing ACL checks.
- Dataplane bottlenecks – retry storms, listener errors, xDS update failures.
- Resource exhaustion – goroutine explosions, rising BoltDB freelist, DNS query spikes.
The guidance below uses Prometheus + Alertmanager (the default stack on OpenShift), but the same metric names apply if you ingest via Telegraf/InfluxDB or ship to Grafana Cloud.
```
Consul agents (servers & clients) ─┐
                                   │  /metrics (Prometheus format)
Envoy proxies (sidecars, gateways) ┘
                 │
                 ▼
Prometheus Operator (OpenShift) ──› Alertmanager ──› Slack / PagerDuty
                 │                       ▲
                 └─› Grafana dashboards ─┘
```
Use the kube-rbac-proxy sidecar if you need RBAC-scoped endpoints.
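Both scrape targets in the diagram assume the agents actually expose Prometheus-format metrics. A minimal consul-k8s Helm values sketch for that is shown below; the key names assume a recent chart version, so verify them against your chart before applying.

```yaml
# values.yaml (consul-k8s Helm chart) – a sketch; confirm keys against your chart version
global:
  metrics:
    enabled: true                     # expose Consul metrics for Prometheus scraping
    enableAgentMetrics: true          # serve /v1/agent/metrics on servers and clients
    agentMetricsRetentionTime: "1m"   # retention window for the Prometheus endpoint
```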
| Layer | Metric (Prometheus name) | PromQL example | Why it matters |
|---|---|---|---|
| Raft health | `consul_raft_leader_last_contact_seconds` | `max(consul_raft_leader_last_contact_seconds)` | Spikes > 1 s = network partitions or CPU stall |
| Raft health | `consul_raft_commit_time_seconds` (histogram) | `histogram_quantile(0.95, rate(consul_raft_commit_time_seconds_bucket[5m]))` | 95th > 100 ms signals disk I/O or quorum latency |
| KVS / Txn latency | `consul_kvs_apply_time_seconds` | `rate(consul_kvs_apply_time_seconds_sum[5m]) / rate(consul_kvs_apply_time_seconds_count[5m])` | Rising mean shows storage or scheduler backlog |
| RPC saturation | `consul_client_rpc` | `sum by(method)(rate(consul_client_rpc[1m]))` | Sudden burst = service registration flood |
| Runtime gauges | `consul_runtime_num_goroutines` | `max(consul_runtime_num_goroutines{job="consul-servers"})` | Linear climb = goroutine leak |
| Runtime gauges | `consul_runtime_alloc_bytes` | `max_over_time(consul_runtime_alloc_bytes[30m])` | Memory footprint trend |
| BoltDB | `consul_raft_boltdb_freelist_bytes` | `rate(consul_raft_boltdb_freelist_bytes[5m])` | Large freelist → run autopilot -cleanup-boltdb |
| DNS / discovery | `consul_dns_domain_query_total` | `rate(consul_dns_domain_query_total[1m])` | Triage query storms |
| ACL | `consul_acl_blocked_service_registration_total` | `rate(consul_acl_blocked_service_registration_total[5m])` | Detect policy gaps that break deployments |
| Autopilot | `consul_autopilot_healthy` (0/1) | `min(consul_autopilot_healthy)` | 0 indicates failed redundancy checks |
| Dataplane (Envoy) | `envoy_cluster_upstream_rq{job="sidecar"}` | `sum(rate(envoy_cluster_upstream_rq{response_code!~"2.."}[1m]))` | Non-2xx > baseline → upstream failures |
| Dataplane (Envoy) | `envoy_listener_downstream_cx_active` | `max_over_time(envoy_listener_downstream_cx_active[5m])` | Sudden drop = listener crash/rotation |
| Dataplane (Envoy) | `envoy_server_live` | `sum by(pod)(envoy_server_live)` | 0 = proxy not healthy / xDS rejected |
| Dataplane (Envoy) | `envoy_cluster_update_success` | `sum(rate(envoy_cluster_update_success[5m])) / sum(rate(envoy_cluster_update_attempt[5m]))` | < 1 indicates frequent config rejects |
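If several dashboards chart the same expressions from this table, recording rules keep the queries cheap and consistent. The sketch below pre-computes two of the table's queries; the rule and group names are placeholders.

```yaml
# Recording rules for two queries from the table above – a sketch; names are placeholders
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: consul-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: consul.records
      interval: 30s
      rules:
        - record: consul:raft_commit_time_seconds:p95
          expr: histogram_quantile(0.95, rate(consul_raft_commit_time_seconds_bucket[5m]))
        - record: consul:kvs_apply_time_seconds:mean5m
          expr: >-
            rate(consul_kvs_apply_time_seconds_sum[5m])
            / rate(consul_kvs_apply_time_seconds_count[5m])
```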
Note: the Raft WAL LogStore backend is enabled by default on new clusters running Consul >= v1.20.x.
| Category | Metric / Panel | PromQL Query | Purpose / Why it matters |
|---|---|---|---|
| Checksum integrity | Read-side checksum failures | `increase(consul_raft_logstore_verifier_read_checksum_failures[5m])` | Detects on-disk corruption; any increase should trigger an immediate page and a revert to BoltDB. |
| Checksum integrity | Write-side checksum failures | `increase(consul_raft_logstore_verifier_write_checksum_failures[5m])` | Catches in-flight (network/software) corruption between leader and follower; still demands prompt investigation. |
| Commit latency | 95th percentile commit time | `histogram_quantile(0.95, rate(consul_raft_commitTime_bucket[5m]))` | WAL should be the same or faster than BoltDB; rising values imply follower lag or I/O regression. |
| Follower disk flush | 95th percentile AppendEntries.storeLogs | `histogram_quantile(0.95, rate(consul_raft_rpc_appendEntries_storeLogs_bucket[5m]))` | Measures follower disk write cost; WAL-enabled followers should not be slower than BoltDB peers. |
| Leader→follower replication | 95th percentile AppendEntries.rpc | `histogram_quantile(0.95, rate(consul_raft_replication_appendEntries_rpc_bucket[5m]))` | A high delta vs. follower storeLogs reveals Raft RPC queuing issues that are unrelated to the backend but still impact commit time. |
| Log compaction | 95th percentile compactLogs | `histogram_quantile(0.95, rate(consul_raft_compactLogs_bucket[5m]))` | WAL must not increase compaction latency; spikes could indicate fragmentation or slow fsync. |
| Leader disk flush | 95th percentile leader.dispatchLog | `histogram_quantile(0.95, rate(consul_raft_leader_dispatchLog_bucket[5m]))` | Only relevant when a WAL-enabled server is leader; should match or beat historical BoltDB numbers. |
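The `consul_raft_logstore_verifier_*` counters above are only emitted when online log verification is enabled alongside the WAL backend. The sketch below shows that server configuration passed through consul-k8s Helm values; the `server.extraConfig` wrapper is an assumption about how you deploy, while the `raft_logstore` stanza itself is standard agent configuration.

```yaml
# Helm values sketch – WAL backend plus online log verification, so the
# verifier checksum counters are emitted; tune the interval to your environment
server:
  extraConfig: |
    {
      "raft_logstore": {
        "backend": "wal",
        "verification": {
          "enabled": true,
          "interval": "60s"
        }
      }
    }
```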
| Category | Metric / Panel | PromQL Query | Purpose / Why it matters |
|---|---|---|---|
| Raft health | 95th percentile commit time | `histogram_quantile(0.95, rate(consul_raft_commitTime_bucket[5m]))` | Detects disk / network latency that slows consensus; > 100 ms sustained indicates risk of write-stall or split-brain. |
| Raft health | Commits per 5 min | `rate(consul_raft_apply[5m])` | Measures write throughput; a sudden drop can signal leader unavailability or storage pausing. |
| Raft health | Leader last-contact | `max(consul_raft_leader_lastContact)` | Time since followers heard from the leader; values > 1 s warn of network partitions or CPU starvation. |
| Raft health | Election events | `rate(consul_raft_state_candidate[1m])` | Frequent elections reveal control-plane instability that can cascade to clients. |
| Cluster safety | Autopilot healthy flag | `min(consul_autopilot_healthy)` | Boolean verdict; 0 means redundancy checks failed (e.g., too few voters). |
| DNS load | DNS queries / s | `rate(consul_dns_domain_query_count[5m])` | High query rate may indicate service-discovery loops or misconfigured stubs. |
| DNS load | 95th percentile DNS latency | `histogram_quantile(0.95, rate(consul_dns_domain_query_bucket[5m]))` | Tracks resolution latency experienced by workloads; spikes precede timeouts. |
| KV store | KV applies / s | `rate(consul_kvs_apply_count[5m])` | Surges often correlate with application redeploys or excessive leader writes. |
| KV store | 95th percentile KV latency | `histogram_quantile(0.95, rate(consul_kvs_apply_bucket[5m]))` | Elevated latency points to BoltDB pressure or Raft write congestion. |
| ACL activity | ACL resolves / s | `rate(consul_acl_ResolveToken_count[5m])` | Growth reflects authentication load; unexpected spikes may precede 403s. |
| ACL activity | 95th percentile ACL latency | `histogram_quantile(0.95, rate(consul_acl_ResolveToken_bucket[5m]))` | Slow token resolution delays every RPC and catalog write. |
| Catalog churn | Register + deregister rate | `rate(consul_catalog_register_count[5m]) + rate(consul_catalog_deregister_count[5m])` | Measures service volatility; high churn can saturate Raft and DNS. |
| Catalog churn | 95th percentile catalog op time | `histogram_quantile(0.95, rate(consul_catalog_register_bucket[5m]))` | Prolonged operations hint at storage contention or heavy watch load. |
| Category | Metric / Panel | PromQL Query | Purpose / Why it matters |
|---|---|---|---|
| Availability | Live Envoy instances | `sum(envoy_server_live{app=~"$service"})` | Drops expose pod crashes, liveness-probe failures, or xDS misconfiguration. |
| Traffic quality | Request success rate | `sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4\|5", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))` | A falling ratio (< 99 %) highlights upstream errors before users see them. |
| Traffic quality | Failed requests | `sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4\|5", consul_destination_service=~"$service"}[10m])) by (local_cluster)` | Pinpoints which service/cluster is producing 4xx/5xx bursts. |
| Traffic quality | Requests per second | `sum(rate(envoy_http_downstream_rq_total{service=~"$service", envoy_http_conn_manager_prefix="public_listener"}[5m])) by (service)` | Workload baseline; a sudden jump affects capacity and latency. |
| Cluster health | Unhealthy endpoints | `sum(envoy_cluster_membership_total{app=~"$service", envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_healthy{app=~"$service", envoy_cluster_name=~"$cluster"})` | Any non-zero value = a potentially unhealthy pod or failing check in the upstream cluster. |
| Cluster health | All clusters healthy? | `(sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"})) == bool 0` | Boolean check for alerting; simplifies dashboards. |
| Resource usage | Envoy heap size | `sum(envoy_server_memory_heap_size{app=~"$service"})` | Detects memory leaks in proxies; a rising trend plus OOMKills is a red flag. |
| Resource usage | Allocated memory | `sum(envoy_server_memory_allocated{app=~"$service"})` | Correlate with heap size to verify GC effectiveness. |
| Resource usage | Average uptime | `avg(envoy_server_uptime{app=~"$service"})` | Frequent restarts reset uptime; a good early crash signal. |
| Kubernetes pod health | CPU throttled seconds | `rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])` | CPU throttling increases latency even when Envoy appears idle. |
| Kubernetes pod health | Memory usage % of limit | `100 * max(container_memory_working_set_bytes{namespace=~"$namespace"} / on(container,pod) label_replace(kube_pod_container_resource_limits{resource="memory"},"pod","$1","exported_pod","(.+)")) by (pod)` | Alerts operators before an OOM-kill terminates proxies. |
| Kubernetes pod health | CPU usage % of limit | `100 * max(rate(container_cpu_usage_seconds_total{namespace=~"$namespace"}[5m]) / on(container,pod) label_replace(kube_pod_container_resource_limits{resource="cpu"},"pod","$1","exported_pod","(.+)")) by (pod)` | Helps right-size CPU requests/limits for sidecars. |
| Connections | Active upstream connections | `sum(envoy_cluster_upstream_cx_active{app=~"$service", envoy_cluster_name=~"$cluster"}) by (app, envoy_cluster_name)` | Rising open connections alongside high latency point to connection leaks or retries. |
| Connections | Active downstream connections | `sum(envoy_http_downstream_cx_active{app=~"$service"})` | Indicates live client load; a sudden drop hints at listener failure. |
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: consul-production-alerts
  namespace: monitoring
spec:
  groups:
    - name: consul.controlplane
      rules:
        - alert: ConsulLeaderUnreachable
          expr: max(consul_raft_leader_last_contact_seconds) > 1
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Consul leader not contacted in >1 s"
            description: |
              Raft leader lastContact has exceeded 1 second for {{ $labels.instance }}.
              Check network latency, CPU throttle, or disk stall.
        - alert: ConsulExcessiveGoroutines
          expr: max(consul_runtime_num_goroutines{job="consul-servers"}) > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Possible goroutine leak on Consul servers"
    - name: consul.envoy
      rules:
        - alert: EnvoyHigh5xx
          expr: sum(rate(envoy_cluster_upstream_rq{response_code=~"5.."}[5m])) > 10
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx responses observed in sidecars"
            runbook: https://developer.hashicorp.com/consul/docs/observe/grafana/dataplane
```
Threshold tuning: start with the values above, then baseline against your own 95th percentiles for a week.
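The severity labels in these rules only pay off if Alertmanager routes them differently. A minimal routing sketch follows; the receiver names, Slack channel, and credentials are placeholders for your own integrations.

```yaml
# alertmanager.yaml – routing sketch; receivers and credentials are placeholders
route:
  receiver: slack-default                  # catch-all for warning / info
  group_by: [alertname]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-default
    slack_configs:
      - channel: "#consul-alerts"                  # placeholder channel
        api_url: https://hooks.slack.com/services/REPLACE_ME
```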
The tables below give a fuller, categorized catalog of Prometheus alerts for the Consul control plane and the Envoy dataplane, building on the rule examples above.
| Alert name | PromQL expression (template) | For | Severity | Why it matters |
|---|---|---|---|---|
| ConsulLeaderUnreachable | `max(consul_raft_leader_last_contact_seconds) > 1` | 2 m | critical | Followers haven't heard from the leader in > 1 s ⇒ risk of split-brain. |
| ConsulFrequentElections | `rate(consul_raft_state_candidate[1m]) > 0.1` | 3 m | warning | Elections > 6/min show instability (network, CPU, or I/O throttling). |
| ConsulHighRaftCommitLatency | `histogram_quantile(0.95, rate(consul_raft_commitTime_seconds_bucket[5m])) > 0.1` | 5 m | warning | 95th commit > 100 ms → disk or network latency delaying writes. |
| ConsulRaftFollowerLag | `max(consul_raft_replication_appendEntries_rpc_seconds{quantile="0.99"}) > 0.15` | 5 m | warning | Followers > 150 ms behind the leader; watch for network jitter. |
| ConsulAutopilotUnhealthy | `min(consul_autopilot_healthy) == 0` | 1 m | critical | Autopilot reports redundancy or peer-set issues. |
| ConsulACLBlockedServiceRegistration | `sum(rate(consul_acl_blocked_service_registration_total[5m])) > 0` | 2 m | warning | New services can't register; policies are mis-scoped. |
| ConsulHighKVLatency | `histogram_quantile(0.95, rate(consul_kvs_apply_time_seconds_bucket[5m])) > 0.05` | 5 m | warning | 95th KV apply > 50 ms → write congestion. |
| ConsulHighDNSLatency | `histogram_quantile(0.95, rate(consul_dns_domain_query_seconds_bucket[5m])) > 0.02` | 3 m | warning | Look-ups > 20 ms begin to hit application timeouts. |
| ConsulHighDNSQueryRate | `sum(rate(consul_dns_domain_query_total[1m])) > 5e4` | 1 m | info | Surging queries may indicate discovery loops. |
| ConsulExcessiveGoroutines | `max(consul_runtime_num_goroutines{job="consul-servers"}) > 10000` | 5 m | warning | Linear growth → goroutine leak / runaway watch. |
| ConsulMemoryGrowth | `increase(consul_runtime_alloc_bytes{job="consul-servers"}[30m]) > 2e9` | 30 m | warning | +2 GiB in 30 min: possible leak. |
| ConsulBoltDBFreelistGrowth | `rate(consul_raft_boltdb_freelistBytes[15m]) > 1e7` | 15 m | warning | Freelist growing fast → run BoltDB cleanup. |
| ConsulWALChecksumFailure | `increase(consul_raft_logstore_verifier_read_checksum_failures[5m]) > 0 or increase(consul_raft_logstore_verifier_write_checksum_failures[5m]) > 0` | 0 m | critical | Data-integrity error; roll the node back to BoltDB and escalate. |
| ConsulCatalogChurnSpike | `rate(consul_catalog_register_total[1m]) + rate(consul_catalog_deregister_total[1m]) > 200` | 1 m | info | Unusual registration churn (deploy wave or flapping health checks). |
| ConsulClientRPCBurst | `sum(rate(consul_client_rpc[1m])) > 2e4` | 1 m | info | Detects RPC floods that can saturate servers. |
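As a sketch of how these rows map onto the PrometheusRule format shown earlier, two entries are spelled out below; a "For" of 0 m simply means omitting `for:` so the alert fires immediately.

```yaml
# Two rows from the table above expressed as rule entries – a sketch
- alert: ConsulAutopilotUnhealthy
  expr: min(consul_autopilot_healthy) == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Consul Autopilot reports failed redundancy checks"
- alert: ConsulWALChecksumFailure          # no "for:" – page on the first increment
  expr: >-
    increase(consul_raft_logstore_verifier_read_checksum_failures[5m]) > 0
    or increase(consul_raft_logstore_verifier_write_checksum_failures[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Raft WAL checksum failure: revert the node to BoltDB and escalate"
```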
| Alert name | PromQL expression (template) | For | Severity | Why it matters |
|---|---|---|---|---|
| EnvoyLowSuccessRate | `1 - ( sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4\|5"}[5m])) / sum(irate(envoy_cluster_upstream_rq_xx[5m])) ) > 0.01` | 5 m | critical | < 99 % success → customer-visible errors. |
| EnvoyHigh5xx | `sum(rate(envoy_cluster_upstream_rq_5xx[5m])) > 10` | 3 m | warning | Surging 5xx from upstream services. |
| EnvoyHigh4xx | `sum(rate(envoy_cluster_upstream_rq_4xx[5m])) > 50` | 3 m | info | Many 4xx may hint at a bad client or mis-routed mesh traffic. |
| EnvoyUnhealthyClusters | `(sum(envoy_cluster_membership_total) - sum(envoy_cluster_membership_healthy)) > 0` | 2 m | warning | Any cluster with 0 healthy endpoints breaks routing. |
| EnvoyXDSUpdateFailure | `increase(envoy_cluster_update_rejected[5m]) > 0` | 0 m | critical | Config pushes failing; pods are serving stale routes. |
| EnvoyListenerErrorRate | `rate(envoy_listener_downstream_cx_destroy_local_with_active_rq[5m]) > 5` | 5 m | warning | Listener resets during active requests → mesh latency / drops. |
| EnvoyMemoryLeak | `increase(envoy_server_memory_allocated[30m]) > 5e8` | 30 m | warning | +500 MiB in half an hour → leak or runaway buffers. |
| EnvoyFrequentRestarts | `changes(envoy_server_uptime[30m]) > 1` | 30 m | warning | Proxies crashing/restarting. |
| EnvoyHighCPUThrottling | `sum(rate(container_cpu_cfs_throttled_seconds_total{container=~"^envoy.*"}[5m])) > 5` | 5 m | warning | Throttling > 5 s per 5 m; latency spikes expected. |
| EnvoyHighMemoryUtilization | `container_memory_working_set_bytes / on(pod,container) kube_pod_container_resource_limits{resource="memory"} > 0.9` | 5 m | warning | ≥ 90 % of limit; OOM-kill imminent. |
| EnvoyRetryStorm | `sum(rate(envoy_cluster_upstream_rq_retry[1m])) > 20` | 2 m | info | Excessive automatic retries amplify latency/load. |
| EnvoyHighRequestLatency | `histogram_quantile(0.95, rate(envoy_cluster_upstream_rq_time_bucket[5m])) > 0.3` | 5 m | warning | 95th > 300 ms → slowness in the upstream or network path. |
| EnvoyActiveCxDrop | `delta(envoy_http_downstream_cx_active[5m]) < -100` | 5 m | critical | Sudden loss of ≥ 100 downstream connections; possible listener crash or route removal. |
Tune thresholds to your baseline RPS and SLO; these templates assume mid-sized clusters in production.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: consul-agents
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: consul
  endpoints:
    - port: http
      interval: 30s
      path: /v1/agent/metrics
      params:
        format: [prometheus]   # appends ?format=prometheus
```
For server pods, add `tlsConfig.insecureSkipVerify: true` to the endpoint if you use HTTPS.
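A sketch of that endpoint entry when the agent API is served over HTTPS; the port name is an assumption, and the path/params mirror the manifest above.

```yaml
endpoints:
  - port: https                # assumption: the port name exposed by your server service
    scheme: https
    interval: 30s
    path: /v1/agent/metrics
    params:
      format: [prometheus]
    tlsConfig:
      insecureSkipVerify: true # or reference a CA bundle instead of skipping verification
```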
Expose Envoy's admin port (e.g. `9901`) via a headless service and add a `ServiceMonitor`:
```yaml
endpoints:
  - port: admin
    interval: 30s
    relabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: envoy
```
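For completeness, a headless Service sketch backing that ServiceMonitor; the name, namespace, and pod selector are placeholders, so point the selector at your sidecar-injected pods.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: envoy-admin              # placeholder name, matched by the ServiceMonitor selector
  namespace: consul              # placeholder namespace
  labels:
    app: envoy-admin
spec:
  clusterIP: None                # headless: one endpoint per proxy pod
  selector:
    app: my-app                  # placeholder: select the pods running Envoy sidecars
  ports:
    - name: admin
      port: 9901
      targetPort: 9901
```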
- HashiCorp's "Consul Control Plane" dashboard, JSON ID 10539
- HashiCorp's "Consul Dataplane (Envoy)" dashboard, JSON ID 10540

Import via Dashboards → + Import, or automate with the Grafana Operator Dashboard CR.
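A sketch of the operator route, assuming grafana-operator v5 and that the `grafanaCom.id` import field is available in your CRD version; verify the field names against your installed operator before applying.

```yaml
apiVersion: grafana.integreatly.org/v1beta1   # assumption: grafana-operator v5 API group
kind: GrafanaDashboard
metadata:
  name: consul-control-plane
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana        # placeholder: must match the labels on your Grafana CR
  grafanaCom:
    id: 10539                    # "Consul Control Plane" dashboard ID from the list above
```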
Point each alert's `annotations.runbook` URL at your internal runbook repo or the HashiCorp docs:
- Raft instability → https://developer.hashicorp.com/consul/docs/architecture/raft
- Envoy 5xx burst → https://developer.hashicorp.com/consul/docs/observe/grafana/dataplane
- Weekly WAL integrity audit – review the checksum-failure counters `consul_raft_logstore_verifier_read_checksum_failures` and `consul_raft_logstore_verifier_write_checksum_failures`. They must stay at 0; any increment warrants rolling the node back to BoltDB and opening a support ticket.
- Monthly BoltDB defrag – schedule `consul operator autopilot cleanup-boltdb`.
- Quarterly failover test – kill the leader pod; ensure `ConsulLeaderUnreachable` fires once and clears.
- Blue/green upgrade – validate that the Envoy xDS success rate stays > 99 %.
| Urgent (next 2 weeks) | Long-term (quarterly) |
|---|---|
| • Deploy the ServiceMonitors & PrometheusRule above. • Import the control-plane & dataplane dashboards. • Tune alert thresholds to your baseline. | • Automate BoltDB cleanup & snapshot verification. • Expand dataplane metrics to include L7 latency histograms. • Integrate alert webhooks with on-call escalation tools. |
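For the snapshot-verification item in the long-term column, a CronJob sketch is shown below; the image tag, service address, token secret, and schedule are placeholders, while `consul snapshot save` and `consul snapshot inspect` are standard CLI commands.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: consul-snapshot-verify
  namespace: consul                           # placeholder namespace
spec:
  schedule: "0 3 * * *"                       # nightly; adjust to your backup window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: hashicorp/consul:1.20    # placeholder tag; match your cluster version
              env:
                - name: CONSUL_HTTP_ADDR
                  value: http://consul-server.consul.svc:8500   # placeholder address
                - name: CONSUL_HTTP_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: consul-snapshot-token               # placeholder secret
                      key: token
              command:
                - /bin/sh
                - -c
                - |
                  consul snapshot save /tmp/backup.snap
                  consul snapshot inspect /tmp/backup.snap
```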
Need deeper help tailoring queries or debugging alert noise? HashiCorp Support can provide a focused workshop—just let us know.