@joemiller
Last active February 26, 2024 16:57
Loki log ingestion issue

Loki stops ingesting logs from promtail. The symptom is a stream of HTTP 499 errors from the gateway component, which is based on nginx. For example:

10.194.148.169 - - [17/Feb/2022:21:08:28 +0000]  499 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"
10.194.90.195 - - [17/Feb/2022:21:08:28 +0000]  499 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"

A normal log message results in 204, ex:

10.194.72.8 - - [17/Feb/2022:21:08:28 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"

The issue persists until the gateway pod is restarted or nginx is sent a SIGHUP to reload its config.

TL;DR

Nginx maintains a "persistent" UDP connection to the kube-dns service: every query originates from the same source port, so the corresponding conntrack table entry persists indefinitely. When the pods backing kube-dns are restarted, these flows are never expired, and nginx continues to send DNS queries to pod IPs that no longer exist until it is restarted or reloaded with SIGHUP. These queries time out, causing nginx to stop proxying to the Loki distributor service.

This is a known issue: kubernetes/kubernetes#26309

Solution: Add a dnsmasq sidecar to the gateway pod that handles kube-dns pod restarts correctly, and configure nginx to query the sidecar instead of kube-dns.

Loki Gateway

The Loki gateway is an optional component when running Loki in "microservices mode". It provides a convenient, single entry point to the Loki microservices: distributor, querier, and query-frontend.

The nginx config looks like the following. (Not the complete config, only the relevant parts are shown here.)

http {
  resolver kube-dns.kube-system.svc.cluster.local;

  server {
    listen             8080;

    location = / {
      return 200 'OK';
      auth_basic off;
    }

    location = /api/prom/push {
      proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /api/prom/tail {
      proxy_pass       http://loki-loki-distributed-querier.monitoring.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }

    location ~ /api/prom/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /loki/api/v1/push {
      proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /loki/api/v1/tail {
      proxy_pass       http://loki-loki-distributed-querier.monitoring.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }

    location ~ /loki/api/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.monitoring.svc.cluster.local:3100$request_uri;
    }
  }
}

The problematic endpoint is the most active one used by the promtail instances to send logs to the distributor:

    location = /loki/api/v1/push {
      proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

The address loki-loki-distributed-distributor.monitoring.svc.cluster.local is resolved via kube-dns in the kube-system namespace:

  resolver kube-dns.kube-system.svc.cluster.local;

And resolves to:

;; ANSWER SECTION:
loki-loki-distributed-distributor.monitoring.svc.cluster.local. 5 IN A 172.20.245.174

There is a 5 second TTL and nginx attempts to re-resolve this address every ~5 seconds.

Root Cause

The issue manifests after the coredns pods backing the kube-dns service are restarted. Restarts of these pods are generally rare and usually occur during node fail/restart (auto-repair), node pool scale up/down, and node upgrades which result in drain/replace.

The problem occurs because of two behaviors, one from nginx and one from kubernetes:

  1. Nginx uses the same source port for every DNS request.
  2. Kubernetes uses DNAT to route/re-write packets destined for a service's ClusterIP to one of the backing pods' IPs, which results in an entry being created in the conntrack table.

Due to the frequency of DNS lookups, and the frequency of the /loki/api/v1/push endpoint being hit by the promtail instances, the conntrack entries never expire.
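To illustrate behavior 1 (this sketch is not from the original writeup; the resolver address is a stand-in), here is a minimal Python demonstration of the difference between nginx's long-lived resolver socket and dig's one-socket-per-query behavior:

```python
import socket

RESOLVER = ("127.0.0.1", 53)  # stand-in for the kube-dns ClusterIP

# nginx-style: one long-lived connected UDP socket. The source port (and
# therefore the conntrack 5-tuple) stays fixed for the life of the process.
nginx_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
nginx_sock.connect(RESOLVER)  # UDP connect() binds an ephemeral port; no packet is sent
pinned_port = nginx_sock.getsockname()[1]

# dig-style: a fresh socket per query, so each query gets a new source port,
# and hence a new conntrack entry pointing at a currently live coredns pod.
dig_socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(3)]
for s in dig_socks:
    s.connect(RESOLVER)
dig_ports = [s.getsockname()[1] for s in dig_socks]

print("nginx-style source port (fixed):", pinned_port)
print("dig-style source ports (fresh):", dig_ports)

for s in [nginx_sock, *dig_socks]:
    s.close()
```

Every send from the nginx-style socket reuses the pinned source port, so the kernel keeps refreshing the same DNAT conntrack entry instead of creating a new one.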

A more detailed explanation is covered in this thread: kubernetes/kubernetes#26309 (comment)

Re-creating

Re-creating the issue is straightforward:

  1. Start tailing logs of the loki gateway pod. You should see a steady stream of 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-" in the logs indicating successful log submission from promtail pods.
  2. Restart the coredns pods in the kube-system namespace.

After a few seconds the stream of successful 204 responses in the gateway pod will be replaced by 499 errors.

The 499 indicates that the promtail clients have timed out and closed their connections. If the promtail timeout is increased, or nginx's resolver_timeout is reduced from its default of 30s to something lower such as 5s, the errors become 502s: nginx itself times out waiting for a valid DNS response and closes the connection.

The errors will persist until nginx is restarted or sent a SIGHUP causing it to reload its config and create a new UDP connection to its DNS resolver.

The issue can be tricky to observe outside of nginx. For example, if you were to start a new pod and repeatedly query kube-dns with dig:

while true; do dig loki-loki-distributed-distributor.monitoring.svc.cluster.local ; sleep 0.5 ; done

… you would not observe any failures because each dig invocation would create a new UDP "connection" with a new source port and a new entry in the conntrack table to one of the currently active coredns pods.

Fixes

So, the issue comes down to nginx reusing the same UDP flow. Can we configure nginx in a way that causes it to use a new source port on each query?

Several possible solutions were attempted:

1. Configure nginx to create its own DNS proxy

Nginx is able to proxy DNS and can use this proxy internally, ex:

Simple:

stream {
  server {
    listen 127.0.0.1:8053 udp;
    proxy_pass kube-dns.kube-system.svc.cluster.local:53;
  }
}

http {
  resolver 127.0.0.1:8053;
}

Using upstream and multiple entries:

stream {
  upstream dns {
    server kube-dns.kube-system.svc.cluster.local:53 max_fails=1 max_conns=1 fail_timeout=30s;
    server kube-dns.kube-system.svc.cluster.local:53 backup;
  }
  server {
    listen 127.0.0.1:8053 udp;
    proxy_pass dns;
  }
}

Result: Nope. Does not help.

Unfortunately nginx continues to maintain persistent, same source-port, connections to the upstream dns servers.

2. Override nginx DNS caching to reduce query frequency

Nginx respects the 5 second DNS TTL and re-attempts lookups every 5 seconds.

DNAT conntrack entries will eventually expire. Reducing the frequency of lookups from nginx may provide enough time for the kernel to expire the conntrack entries.

nginx.conf:

  resolver kube-dns.kube-system.svc.cluster.local valid=120s;

Result: Nope. Does not help. Nginx will continue to hold the same source port. We can observe this with netstat in the nginx container:

$ netstat -anpW  | grep udp
udp        0      0 10.8.184.73:52376                                   10.8.118.202:53                                     ESTABLISHED 9/nginx: worker pro

In this example the 10.8.118.202 IP is the address of a former coredns pod after we have restarted coredns. These queries will never succeed. Nginx must be restarted or SIGHUP'd to become functional again.

3. livenessProbe to detect failure and auto-restart

A livenessProbe could detect the issue and cause the kubelet to restart nginx.

How to construct the probe? All Loki components expose a simple /ready endpoint when run in microservices mode. Since the distributor is the most frequently hit component, we proxy to its /ready endpoint to detect failure.

Add a new location in nginx config:

location = /healthz {
  proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100/ready;
}

In the gateway Deployment:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: "/healthz"
            port: http
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10

Result: This works, but it feels incorrect and brittle. It ties the availability of the entire Loki stack to a single component, the distributor. If the distributor fails, the gateway ends up in a restart loop, making the `querier` and `query-frontend` components unavailable.

Additionally, the loki-distributed helm chart does not expose a configurable livenessProbe in the gateway deployment template. The maintainers would probably accept a patch to allow setting a livenessProbe, since there is already a configurable readinessProbe.

4. Add dnsmasq (proxy/forwarder) sidecar

Based on this suggestion: kubernetes/kubernetes#26309 (comment)

Add a sidecar to the gateway deployment and configure nginx to send queries through the sidecar. As long as the sidecar creates a new UDP connection/src-port on each query we should not run into the issue:

spec:
  containers:
    - name: dnsmasq
      image: "janeczku/go-dnsmasq:release-1.0.7"
      args:
        - --listen
        - "127.0.0.1:8053"
        - --hostsfile=/etc/hosts
        - --enable-search
        - --verbose

nginx.conf:

resolver 127.0.0.1:8053;

Result: SUCCESS! With this configuration the dnsmasq sidecar handles restarts of the coredns pods correctly.

Because the TTL on the DNS records is only 5 seconds, there is a brief period when DNS queries from dnsmasq to kube-dns will fail. We can work around this by forcing nginx to cache the results for a longer period. Because the Service IPs rarely change, this is a safe trade-off to avoid temporary blips:

resolver 127.0.0.1:8053 valid=60s;

The helm chart does not provide a way to inject a sidecar into the gateway deployment.

It's possible that the upstream helm maintainers would accept a patch that optionally adds the dnsmasq sidecar as well.

Next Steps

  1. Attempt to find a way to inject a dnsmasq sidecar. Possibly using kustomize on top of the rendered helm chart?
  2. Submit an issue to the grafana/helm-charts repo. Suggest a new config option for either (a) adding a pre-configured dnsmasq sidecar or (b) making a generic sidecar container option for any type of sidecar to be added.
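As a sketch of step 1 (hypothetical and untested; the deployment name loki-loki-distributed-gateway and the rendered-output filename are assumptions, and the dnsmasq args mirror the sidecar shown above), a kustomize strategic-merge patch layered on the rendered chart might look like:

```yaml
# kustomization.yaml -- layered on top of the rendered helm output
resources:
  - loki-rendered.yaml   # output of `helm template ...` (assumed filename)

patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: loki-loki-distributed-gateway   # assumed; verify against your release
      spec:
        template:
          spec:
            containers:
              - name: dnsmasq
                image: "janeczku/go-dnsmasq:release-1.0.7"
                args:
                  - --listen
                  - "127.0.0.1:8053"
                  - --hostsfile=/etc/hosts
                  - --enable-search
```

A strategic-merge patch on `spec.template.spec.containers` merges by container name, so this adds the dnsmasq container alongside the existing nginx container rather than replacing it.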
@Pionerd commented Feb 26, 2024

@joemiller Great, great analysis! I've been hitting this one for some time now, but I couldn't figure out what exactly was going on. Can you please update this gist with any follow-up actions / tickets that you initiated? If necessary, I'm willing to write a contribution to the grafana helm chart. Cheers!

@joemiller (Author) commented

@Pionerd I had forgotten all about this! I did some searching and perhaps I sent this PR and all was well after that: grafana/helm-charts#1272

@Pionerd commented Feb 26, 2024

Thanks. Any chance you are using cilium or another kube-proxy replacement? I did a fix in cilium today, hoping that would resolve this issue as well.

@joemiller (Author) commented

Yes, using cilium.
