Loki stops ingesting logs from promtail. The symptom is a steady stream of HTTP 499 errors from the gateway component, which is based on nginx. Ex:
10.194.148.169 - - [17/Feb/2022:21:08:28 +0000] 499 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"
10.194.90.195 - - [17/Feb/2022:21:08:28 +0000] 499 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"
A normal push results in a 204 response, ex:
10.194.72.8 - - [17/Feb/2022:21:08:28 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"
The issue persists until the gateway pod is restarted or a SIGHUP is sent to nginx to reload its config.
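As a stop-gap, either recovery action can be triggered with kubectl (a sketch; the namespace monitoring and deployment name loki-loki-distributed-gateway are assumptions about this install):
# Option 1: reload the nginx config in place (equivalent to sending SIGHUP to the master process).
kubectl -n monitoring exec deploy/loki-loki-distributed-gateway -c nginx -- nginx -s reload

# Option 2: restart the gateway pods entirely.
kubectl -n monitoring rollout restart deployment/loki-loki-distributed-gateway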
Nginx maintains "persistent" UDP connections to the kube-dns service (every query originates from the same source port). This causes the corresponding conntrack table entries to persist indefinitely. When the pods backing kube-dns are restarted these flows are never expired, so nginx continues to send DNS queries to pod IPs that no longer exist until it is restarted or reloaded with SIGHUP. Those queries time out, causing nginx to stop proxying to the loki distributor service.
This is a known issue: kubernetes/kubernetes#26309
Solution: Add a dnsmasq sidecar to nginx that handles kube-dns pod restarts correctly, and configure nginx to query the sidecar instead of kube-dns.
The Loki gateway is an optional component when running Loki in "microservices mode". It provides a convenient, single entry point to all of the Loki microservices: distributor, querier, and query-frontend.
The nginx config looks like the following (not the complete config; only the relevant parts are shown here):
http {
  resolver kube-dns.kube-system.svc.cluster.local;

  server {
    listen 8080;

    location = / {
      return 200 'OK';
      auth_basic off;
    }

    location = /api/prom/push {
      proxy_pass http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /api/prom/tail {
      proxy_pass http://loki-loki-distributed-querier.monitoring.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }

    location ~ /api/prom/.* {
      proxy_pass http://loki-loki-distributed-query-frontend.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /loki/api/v1/push {
      proxy_pass http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /loki/api/v1/tail {
      proxy_pass http://loki-loki-distributed-querier.monitoring.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }

    location ~ /loki/api/.* {
      proxy_pass http://loki-loki-distributed-query-frontend.monitoring.svc.cluster.local:3100$request_uri;
    }
  }
}
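A quick sanity check of this routing (a sketch; the deployment name and namespace are assumed, and the paths will return 401 instead if basic auth is enabled):
# Forward the container port from the config above and poke two of the routed locations.
kubectl -n monitoring port-forward deploy/loki-loki-distributed-gateway 8080:8080 &
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/                    # "/" location, expect 200
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/loki/api/v1/labels  # proxied to the query-frontend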
The problematic endpoint is the most active one, used by the promtail instances to send logs to the distributor:
location = /loki/api/v1/push {
  proxy_pass http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
}
The address loki-loki-distributed-distributor.monitoring.svc.cluster.local is resolved via kube-dns in the kube-system namespace:
resolver kube-dns.kube-system.svc.cluster.local;
And resolves to:
;; ANSWER SECTION:
loki-loki-distributed-distributor.monitoring.svc.cluster.local. 5 IN A 172.20.245.174
There is a 5 second TTL and nginx attempts to re-resolve this address every ~5 seconds.
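That answer can be reproduced from any pod in the cluster; a throwaway debug pod works, for example (the dnsutils image here is just one convenient choice):
# One-off pod with dig installed; query the distributor Service record and note the 5s TTL.
kubectl run dns-debug --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- \
  dig loki-loki-distributed-distributor.monitoring.svc.cluster.local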
The issue manifests after the coredns pods backing the kube-dns service are restarted. Restarts of these pods are generally rare and usually occur during node failure/restart (auto-repair), node pool scale up/down, and node upgrades, all of which result in drain/replace.
The problem occurs because of two behaviors, one from nginx and one from kubernetes:
- Nginx uses the same source port for every DNS request.
- Kubernetes uses DNAT to route/rewrite packets destined for a service's ClusterIP to one of the backing pods' IPs, which results in an entry being created in the conntrack table.
Due to the frequency of DNS lookups, and the frequency of the /loki/api/v1/push endpoint being hit by the promtail instances, the conntrack entries never expire.
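If you want to see the offending flow directly, the conntrack table can be inspected on the node hosting the gateway pod (a sketch; it assumes the conntrack CLI is installed on the node or available in a privileged debug pod):
# List UDP/53 conntrack entries originating from the gateway pod IP (placeholder shown).
conntrack -L -p udp --dport 53 --src <gateway-pod-ip>
# The tell-tale sign is a single long-lived entry whose destination is a coredns pod IP
# that no longer exists.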
A more detailed explanation is covered in this thread: kubernetes/kubernetes#26309 (comment)
Re-creating the issue is straightforward (example commands are sketched after this list):
- Start tailing the logs of the loki gateway pod. You should see a steady stream of 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-" entries in the logs indicating successful log submissions from promtail pods.
- Restart the coredns pods in the kube-system namespace.
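A minimal sketch of those two steps, assuming coredns runs as a deployment named coredns in kube-system and the gateway lives in the monitoring namespace:
# Terminal 1: tail the gateway access log.
kubectl -n monitoring logs -f deploy/loki-loki-distributed-gateway

# Terminal 2: bounce the coredns pods backing the kube-dns service.
kubectl -n kube-system rollout restart deployment/coredns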
After a few seconds the stream of successful 204 responses in the gateway pod will be replaced by 499 errors.
The 499 indicates that the promtail clients have timed out and closed their connections. If the promtail timeout is increased, or the nginx resolver_timeout is reduced from its default value of 30s to something lower such as 5s, the errors become 502s instead, because nginx itself now times out waiting for a valid DNS response and closes the connection.
The errors will persist until nginx is restarted or sent a SIGHUP, causing it to reload its config and create a new UDP connection to its DNS resolver.
The issue can be tricky to observe outside of nginx. For example, if you were to start a new pod and repeatedly query kube-dns with dig …
while true; do dig loki-loki-distributed-distributor.monitoring.svc.cluster.local ; sleep 0.5 ; done
… you would not observe any failures, because each dig invocation would create a new UDP "connection" with a new source port and a new entry in the conntrack table pointing at one of the currently active coredns pods.
So, the issue comes down to nginx reusing the same UDP flow. Can we configure nginx in a way that causes it to use a new source port on each query?
Several possible solutions were attempted:
Nginx is able to proxy DNS and can use this proxy internally, ex:
Simple:
stream {
  server {
    listen 127.0.0.1:8053 udp;
    proxy_pass kube-dns.kube-system.svc.cluster.local:53;
  }
}
http {
  resolver 127.0.0.1:8053;
}
Using upstream and multiple entries:
stream {
  upstream dns {
    server kube-dns.kube-system.svc.cluster.local:53 max_fails=1 max_conns=1 fail_timeout=30s;
    server kube-dns.kube-system.svc.cluster.local:53 backup;
  }
  server {
    listen 127.0.0.1:8053 udp;
    proxy_pass dns;
  }
}
Result: Nope. Does not help. Unfortunately nginx continues to maintain persistent, same-source-port connections to the upstream DNS servers.
Nginx respects the 5 second DNS TTL and re-attempts lookups every 5 seconds. DNAT conntrack entries will eventually expire, so reducing the frequency of lookups from nginx may give the kernel enough time to expire the conntrack entries.
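For reference, the kernel's idle timeouts for UDP conntrack entries can be checked on the node; an entry that sees traffic more often than these values never expires, which is exactly what the frequent lookups cause:
# Default is typically 30s for ordinary UDP entries and longer for "stream" (assured) flows.
sysctl net.netfilter.nf_conntrack_udp_timeout
sysctl net.netfilter.nf_conntrack_udp_timeout_stream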
nginx.conf:
resolver kube-dns.kube-system.svc.cluster.local valid=120s;
Result: Nope. Does not help. Nginx will continue to hold the same source port. We can observe this with netstat in the nginx container:
$ netstat -anpW | grep udp
udp 0 0 10.8.184.73:52376 10.8.118.202:53 ESTABLISHED 9/nginx: worker pro
In this example the 10.8.118.202 IP is the address of a former coredns pod after we have restarted coredns. These queries will never succeed. Nginx must be restarted or SIGHUP'd to become functional again.
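A quick way to confirm the destination is stale is to compare it with the live coredns pod IPs (assuming the usual k8s-app=kube-dns label):
# The stale IP from the netstat output (10.8.118.202) should not appear in this list.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide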
A livenessProbe could detect the issue and cause the kubelet to restart nginx.
How to construct the probe? All Loki components expose a simple /ready endpoint when run in microservices mode. Since the distributor is the most frequently hit component, we proxy to its /ready endpoint to detect failure.
Add a new location in the nginx config:
location = /healthz {
  proxy_pass http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100/ready;
}
In the gateway Deployment:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: "/healthz"
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
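The probe endpoint can be exercised by hand before relying on it; this sketch assumes curl is available in the gateway image (otherwise use wget or a port-forward) and that the container is named nginx:
# Expect 200 while the distributor resolves; expect a 5xx once nginx's DNS is wedged.
kubectl -n monitoring exec deploy/loki-loki-distributed-gateway -c nginx -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz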
Result: This works, but it feels incorrect and brittle. It would tie the availability of the entire Loki stack to the availability of a single component, the distributor. If the distributor fails then the gateway would end up in a restart loop, making the querier and query-frontend components unavailable as well.
Additionally, the loki-distributed helm chart does not expose a configurable livenessProbe in the gateway deployment template. The maintainers would probably accept a patch to allow setting a livenessProbe, since there is already a configurable readinessProbe.
Based on this suggestion: kubernetes/kubernetes#26309 (comment)
Add a sidecar to the gateway deployment and configure nginx to send queries through the sidecar. As long as the sidecar creates a new UDP connection/source port on each query we should not run into the issue:
spec:
  containers:
    - name: dnsmasq
      image: "janeczku/go-dnsmasq:release-1.0.7"
      args:
        - --listen
        - "127.0.0.1:8053"
        - --hostsfile=/etc/hosts
        - --enable-search
        - --verbose
nginx.conf:
resolver 127.0.0.1:8053;
Result: SUCCESS! With this configuration the dnsmasq sidecar handles restarts of the coredns pods correctly.
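To verify the behavior after another coredns restart (a sketch; the container names dnsmasq and nginx are assumptions about the patched Deployment):
# The sidecar runs with --verbose, so forwarded queries and their chosen upstreams appear here;
# they should keep succeeding across a coredns restart.
kubectl -n monitoring logs -f deploy/loki-loki-distributed-gateway -c dnsmasq

# nginx should now only hold a UDP socket to the local sidecar (127.0.0.1:8053).
kubectl -n monitoring exec deploy/loki-loki-distributed-gateway -c nginx -- netstat -anpW | grep udp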
Because the TTL on the DNS records is only 5 seconds, there is a brief period during a coredns restart when DNS queries from dnsmasq to kube-dns will fail. We can work around this by forcing nginx to cache the results for a longer period. Because the Service IPs rarely change, this is a safe trade-off to avoid temporary blips:
resolver 127.0.0.1:8053 valid=60s;
The helm chart does not provide a way to inject a sidecar into the gateway deployment. It's possible that the upstream helm maintainers would accept a patch that optionally adds the dnsmasq sidecar as well.
- Attempt to find a way to inject a dnsmasq sidecar, possibly using kustomize on top of the rendered helm chart (a rough sketch follows below).
- Submit an issue to the grafana/helm-charts repo. Suggest a new config option for either (a) adding a pre-configured dnsmasq sidecar or (b) making a generic sidecar container option so any type of sidecar can be added.
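A rough sketch of the kustomize option, assuming the release is named loki, the chart is grafana/loki-distributed, and the rendered gateway Deployment is called loki-loki-distributed-gateway in the monitoring namespace (all assumptions about this environment):
# Render the chart, overlay a strategic-merge patch that adds the dnsmasq sidecar, apply the result.
# Note this sidesteps helm's ownership of the release, so it is only a sketch.
helm template loki grafana/loki-distributed -n monitoring > rendered.yaml

cat > gateway-dnsmasq-patch.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-loki-distributed-gateway   # assumed name of the rendered gateway Deployment
spec:
  template:
    spec:
      containers:
        - name: dnsmasq
          image: "janeczku/go-dnsmasq:release-1.0.7"
          args: ["--listen", "127.0.0.1:8053", "--hostsfile=/etc/hosts", "--enable-search", "--verbose"]
EOF

cat > kustomization.yaml <<'EOF'
resources:
  - rendered.yaml
patchesStrategicMerge:
  - gateway-dnsmasq-patch.yaml
EOF

kubectl kustomize . | kubectl apply -n monitoring -f -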
You may want to look at https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#socket-loadbalancer-bypass-in-pod-namespace, saves you a sidecar.