
@joemiller
Last active November 28, 2024 16:17

Loki log ingestion issue

Loki stops ingesting logs from promtail. The symptom is a stream of HTTP 499 errors in the gateway component, which is based on nginx. Ex:

10.194.148.169 - - [17/Feb/2022:21:08:28 +0000]  499 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"
10.194.90.195 - - [17/Feb/2022:21:08:28 +0000]  499 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"

A normal log message results in 204, ex:

10.194.72.8 - - [17/Feb/2022:21:08:28 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-"

The issue persists until the gateway pod is restarted or nginx is sent a SIGHUP to reload its config.

TL;DR

Nginx maintains "persistent" UDP connections to the kube-dns service (every query originates from the same source port). This causes conntrack table entries to persist indefinitely. When the pods backing kube-dns are restarted, these flows are never expired: nginx continues to send DNS queries to pod IPs that no longer exist until it is restarted or reloaded with SIGHUP. These queries time out, causing nginx to stop proxying to the loki distributor service.

This is a known issue: kubernetes/kubernetes#26309

Solution: Add a dnsmasq sidecar to nginx that handles kube-dns pod restarts correctly and configure nginx to query the sidecar instead of kube-dns.

Loki Gateway

The Loki gateway is an optional component when using Loki in "microservices mode". It provides a convenient, single entry point to all of the Loki microservices: distributor, querier, query-frontend.

The nginx config looks like the following. (Not the complete config, only the relevant parts are shown here.)

http {
  resolver kube-dns.kube-system.svc.cluster.local;

  server {
    listen             8080;

    location = / {
      return 200 'OK';
      auth_basic off;
    }

    location = /api/prom/push {
      proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /api/prom/tail {
      proxy_pass       http://loki-loki-distributed-querier.monitoring.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }

    location ~ /api/prom/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /loki/api/v1/push {
      proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

    location = /loki/api/v1/tail {
      proxy_pass       http://loki-loki-distributed-querier.monitoring.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }

    location ~ /loki/api/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.monitoring.svc.cluster.local:3100$request_uri;
    }
  }
}

The problematic endpoint is the most active one used by the promtail instances to send logs to the distributor:

    location = /loki/api/v1/push {
      proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100$request_uri;
    }

The address loki-loki-distributed-distributor.monitoring.svc.cluster.local is resolved via kube-dns in the kube-system namespace:

  resolver kube-dns.kube-system.svc.cluster.local;

And resolves to:

;; ANSWER SECTION:
loki-loki-distributed-distributor.monitoring.svc.cluster.local. 5 IN A 172.20.245.174

There is a 5 second TTL and nginx attempts to re-resolve this address every ~5 seconds.
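The record and its TTL can be checked from inside the cluster with dig; a sketch, assuming a throwaway pod (the dnsutils image is only a suggestion, any image with dig works):

kubectl run -it --rm dnsutils --image=tutum/dnsutils --restart=Never -- \
  dig +noall +answer loki-loki-distributed-distributor.monitoring.svc.cluster.local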

Root Cause

The issue manifests after the coredns pods backing the kube-dns service are restarted. Restarts of these pods are generally rare and usually occur during node fail/restart (auto-repair), node pool scale up/down, and node upgrades which result in drain/replace.

The problem occurs because of two behaviors, one from nginx and one from kubernetes:

  1. Nginx uses the same source port for every DNS request.
  2. Kubernetes uses DNAT to route/re-write packets destined for a service's ClusterIP to one of the backing pods' IPs, which results in an entry being created in the conntrack table.

Due to the frequency of DNS lookups, and the frequency with which /loki/api/v1/push is hit by the promtail instances, the conntrack entries never expire.
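One way to observe the stuck flow directly is to list the UDP/53 conntrack entries on the node running the gateway pod. This is a sketch, assuming the conntrack CLI (conntrack-tools) is available on the node:

# on the node hosting the nginx gateway pod
conntrack -L -p udp --dport 53
# look for an entry whose source port matches nginx's UDP socket and whose
# DNAT'd destination is a coredns pod IP that no longer exists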

A more detailed explanation is covered in this thread: kubernetes/kubernetes#26309 (comment)

Re-creating

Re-creating the issue is straightforward:

  1. Start tailing logs of the loki gateway pod. You should see a steady stream of 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/" "-" in the logs indicating successful log submission from promtail pods.
  2. Restart the coredns pods in the kube-system namespace.
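For reference, the two steps might look like this with kubectl (the deployment names are assumptions and may differ per cluster):

kubectl -n monitoring logs -f deploy/loki-loki-distributed-gateway   # step 1: watch for 204s
kubectl -n kube-system rollout restart deployment coredns            # step 2: restart coredns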

After a few seconds the stream of successful 204 responses in the gateway pod will be replaced by 499 errors.

The 499 indicates that the promtail clients have timed out and closed their connections. If the promtail timeout is increased, or the nginx resolver_timeout is reduced from its default of 30s to something lower such as 5s, the errors become 502s instead, because nginx itself times out waiting for a valid DNS response and closes the connection.
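For reference, that resolver_timeout tuning is a one-line addition in nginx.conf (shown only to illustrate the 502 behavior, not as a fix):

  resolver         kube-dns.kube-system.svc.cluster.local;
  resolver_timeout 5s;   # default is 30s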

The errors will persist until nginx is restarted or sent a SIGHUP causing it to reload its config and create a new UDP connection to its DNS resolver.
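Either action can be done with kubectl; a sketch, assuming the deployment is named loki-loki-distributed-gateway and the container is named nginx:

kubectl -n monitoring rollout restart deployment loki-loki-distributed-gateway
# or reload nginx in place, which re-opens its UDP socket to the resolver
kubectl -n monitoring exec deploy/loki-loki-distributed-gateway -c nginx -- nginx -s reload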

The issue can be tricky to observe outside of nginx. For example, if you were to start a new pod and repeatedly query kube-dns with dig

while true; do dig loki-loki-distributed-distributor.monitoring.svc.cluster.local ; sleep 0.5 ; done

… you would not observe any failures because each dig invocation would create a new UDP "connection" with a new source port and a new entry in the conntrack table to one of the currently active coredns pods.

Fixes

So, the issue comes down to nginx reusing the same UDP flow. Can we configure nginx in a way that causes it to use a new source port on each query?

Several possible solutions were attempted:

1. Configure nginx to create its own DNS proxy

Nginx is able to proxy DNS and can use this proxy internally, ex:

Simple:

stream {
  server {
    listen 127.0.0.1:8053 udp;
    proxy_pass kube-dns.kube-system.svc.cluster.local:53;
  }
}

http {
  resolver 127.0.0.1:8053;
}

Using upstream and multiple entries:

stream {
  upstream dns {
    server kube-dns.kube-system.svc.cluster.local:53 max_fails=1 max_conns=1 fail_timeout=30s;
    server kube-dns.kube-system.svc.cluster.local:53 backup;
  }
  server {
    listen 127.0.0.1:8053 udp;
    proxy_pass dns;
  }
}

Result: Nope. Does not help.

Unfortunately, nginx continues to maintain persistent, same-source-port connections to the upstream DNS servers.

2. Override nginx DNS caching to reduce query frequency

Nginx respects the 5 second DNS TTLs and re-attempts lookups every 5 seconds.

DNAT conntrack entries will eventually expire. Reducing the frequency of lookups from nginx may provide enough time for the kernel to expire the conntrack entries.

nginx.conf:

  resolver kube-dns.kube-system.svc.cluster.local valid=120s;

Result: Nope. Does not help. Nginx will continue to hold the same source port. We can observe this with netstat in the nginx container:

$ netstat -anpW  | grep udp
udp        0      0 10.8.184.73:52376                                   10.8.118.202:53                                     ESTABLISHED 9/nginx: worker pro

In this example the 10.8.118.202 IP is the address of a former coredns pod after we have restarted coredns. These queries will never succeed. Nginx must be restarted or SIGHUP'd to become functional again.

3. livenessProbe to detect failure and auto-restart

A livenessProbe could detect the issue and cause the kubelet to restart nginx.

How to construct the probe? All Loki components expose a simple /ready endpoint when run in microservices mode. Since the distributor is the most frequently hit component we proxy to its /ready endpoint to detect failure.

Add a new location in nginx config:

location = /healthz {
  proxy_pass       http://loki-loki-distributed-distributor.monitoring.svc.cluster.local:3100/ready;
}

In the gateway Deployment:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: "/healthz"
            port: http
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10

Result: This works, but it feels incorrect and brittle. It ties the availability of the entire Loki stack to the availability of a single component, the distributor. If the distributor fails, the gateway ends up in a restart loop, making the querier and query-frontend components unavailable as well.

Additionally, the loki-distributed helm chart does not expose a configurable livenessProbe in the gateway deployment template. The maintainers would probably accept a patch to allow setting a livenessProbe, since there is already a configurable readinessProbe.

4. Add dnsmasq (proxy/forwarder) sidecar

Based on this suggestion: kubernetes/kubernetes#26309 (comment)

Add a sidecar to the gateway deployment and configure nginx to send queries through the sidecar. As long as the sidecar creates a new UDP connection/src-port on each query we should not run into the issue:

spec:
  containers:
    - name: dnsmasq
      image: "janeczku/go-dnsmasq:release-1.0.7"
      args:
        - --listen
        - "127.0.0.1:8053"
        - --hostsfile=/etc/hosts
        - --enable-search
        - --verbose

nginx.conf:

resolver 127.0.0.1:8053;

Result: SUCCESS! With this configuration the dnsmasq sidecar handles restarts of the coredns pods correctly.

Because the TTL on the DNS records is only 5 seconds, there is a brief window in which DNS queries from dnsmasq to kube-dns will fail. We can work around this by forcing nginx to cache the results for a longer period. Because the Service IPs rarely change, this is a safe trade-off to avoid temporary blips:

resolver 127.0.0.1:8053 valid=60s;

The helm chart does not provide a way to inject a sidecar into the gateway deployment.

It's possible that the upstream helm maintainers would accept a patch that optionally adds the dnsmasq sidecar as well.

Next Steps

  1. Attempt to find a way to inject a dnsmasq sidecar. Possibly using kustomize on top of the rendered helm chart?
  2. Submit an issue to the grafana/helm-charts repo. Suggest a new config option for either (a) adding a pre-configured dnsmasq sidecar or (b) making a generic sidecar container option for any type of sidecar to be added.
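A rough sketch of item 1 using kustomize on top of the rendered chart, assuming the rendered output is saved to loki-rendered.yaml and the gateway Deployment is named loki-loki-distributed-gateway (both names are assumptions):

# kustomization.yaml
resources:
  - loki-rendered.yaml

patches:
  - target:
      kind: Deployment
      name: loki-loki-distributed-gateway
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: loki-loki-distributed-gateway
      spec:
        template:
          spec:
            containers:
              - name: dnsmasq
                image: janeczku/go-dnsmasq:release-1.0.7
                args: ["--listen", "127.0.0.1:8053", "--hostsfile=/etc/hosts", "--enable-search"]

The nginx config in the gateway would still need the resolver 127.0.0.1:8053 change shown above so that queries go through the sidecar.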
@joemiller
Author

@Pionerd I had forgotten all about this! I did some searching and perhaps I sent this PR and all was well after that: grafana/helm-charts#1272

@Pionerd

Pionerd commented Feb 26, 2024

Thanks. Any chance you are using cilium or another kube-proxy replacement? I did a fix in cilium today, hoping that would resolve this issue as well.

@joemiller
Author

Yes, using cilium.

@Pionerd

Pionerd commented Feb 26, 2024

@Raboo

Raboo commented Sep 12, 2024

@Pionerd

So setting .socketLB.hostNamespaceOnly: true fixes this issue?

@Pionerd

Pionerd commented Sep 12, 2024

Yes, for us it did.

@Raboo

Raboo commented Sep 12, 2024

Seems to have worked, thank you :)

@dvrkn

dvrkn commented Nov 28, 2024

@joemiller great job!

I was also thinking about a dirty hack with liveness checks and a gateway restart, and then found this.

@Pionerd thanks, tried it with cilium, will test and update
