Orchestrator OOM investigation

Summary:

When we run a rolling restart on our orchestrator statefulset, the pod that was previously the master gets stuck in a crash loop.

Findings so far:

  • The pod that gets stuck in a crash loop seems to be the node that used to be the master.
  • When you delete the pod, it somehow gets out of the crash loop.
  • Using pprof, Leo tracked the crash to within the martini web framework (used by orchestrator - Profile map, problem region), while writing the response.
  • More specifically, the X-Forwarded-For header on the proxied request (which is supposed to list the client IP plus the IP of each proxy the request has passed through) accumulates an insane number of IPs, which eventually OOMs the process. A snippet:
[martini] Started GET /api/discover/172.18.23.33/3306 for 172.18.23.33,
172.18.41.92, 172.18.41.92, 172.18.41.92, 172.18.41.92, 172.18.41.92, 
...<snip>
  • We think this happens via a feature introduced in orchestrator whereby healthy followers transparently proxy incoming requests to the master.
  • We think that feature causes disaster because of how the proxy is set up. From the martini log entry above, it looks like the server is proxying queries to itself, i.e. it does httputil.NewSingleHostReverseProxy(url) where the url is its own IP address. When it then sends a request through that proxy to its own IP, it receives the request again, appends its own IP to X-Forwarded-For again, proxies again, and the loop of death continues (see the sketch after this list).
  • The weird bit is that the master, when restarted, correctly decides that it's not the leader, and yet when it looks up the URL of the current master it somehow gets itself.
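
To make the hypothesized mechanism concrete, here is a minimal, self-contained Go sketch (not orchestrator's actual code; the port and path are made up) of a server whose "leader URL" points back at itself. Each hop through httputil.ReverseProxy appends the caller's IP to X-Forwarded-For, so the header grows by one entry per hop, which is exactly the shape of the martini log line above.

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // The "leader URL" this node computed wrongly points back at itself.
    self, err := url.Parse("http://127.0.0.1:3000")
    if err != nil {
        log.Fatal(err)
    }
    proxy := httputil.NewSingleHostReverseProxy(self)

    http.HandleFunc("/api/", func(w http.ResponseWriter, r *http.Request) {
        // Roughly what martini's logger surfaces as the list of client IPs.
        log.Printf("X-Forwarded-For: %s", r.Header.Get("X-Forwarded-For"))
        // ReverseProxy appends the caller's IP to X-Forwarded-For and forwards
        // the request straight back into this same handler, so each request
        // spawns another and memory grows until an OOM kill or a timeout.
        proxy.ServeHTTP(w, r)
    })

    log.Fatal(http.ListenAndServe(":3000", nil))
}

A single curl http://127.0.0.1:3000/api/anything is enough to start the loop; every hop adds another copy of the same IP, matching the repeating 172.18.41.92 entries in the snippet above.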

Threads of investigation:

  • Reproduction: is it just rolling restarts or will restarting the master pod just once cause the same issue?
    • Result: Just a regular old delete on the master will trigger the issue, though it won't always OOM. But it will always take up a disproportionately large amount of memory. I don't know what stops it eventually - perhaps a timeout.
  • Is the `cannot determine raft state` message a red herring or related somehow?
    • Result: it's a red herring.
  • Why does the broken leader work correctly on the next restart? How is the second restart better / different from the first (rolling) one that broke the leader? Is it the state of the other raft participants?
  • Who is initiating the request that eventually snowballs into hell?
  • Is there a race condition between `israftenabled` / `isleader` / `getleader` / the proxy? Should it lock `LeaderURI` in that function body? Why is `orcraft.LeaderURI` stale or flat out wrong? (A sketch of a possible guard follows this list.)
    • Ominous looking stuff here saying "You must not access unhealthy raft members, i.e. nodes that are isolated from the quorum". Unclear if this also involves e.g. /raft-health. Theoretically it should be smart enough to reject API calls if it's isolated.
    • If the node is outside the quorum (broken, ...) it wouldn't even know who the leader is (it can do programmatic work based on configuration to ask who the leader is). here
    • That does leave a small window between network partition and whenever my readiness check picks up that the node is unhealthy, where some request could feasibly come in to be proxied. In that case, I would expect a 503 or something similar. here
    • [calculation of the leader URL] Is done independently on each node, once, upon startup, and then advertised to the followers any time the node turns to be the leader. here
  • How does the `RaftEnabled` stuff work here? What does a bare return from `raftReverseProxy` entail?
  • Is the `ERROR near "ON"` message a red herring or related?
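
As a hedged sketch of the guard hinted at above (this is not orchestrator's actual raftReverseProxy; the names selfURI, leaderURI and guardedLeaderProxy are assumptions for illustration): before proxying, bail out with a 503 when the advertised leader URI is empty or resolves back to this node. That covers both the "outside the quorum, leader unknown" case and the self-proxy loop, even if the stale orcraft.LeaderURI race itself remains unexplained.

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

// guardedLeaderProxy wraps the follower-side proxy path: selfURI is this
// node's own advertised URL, leaderURI returns whatever the node currently
// believes the leader's URL to be (possibly empty or stale).
func guardedLeaderProxy(selfURI string, leaderURI func() string) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        leader := leaderURI()
        if leader == "" {
            // Leader unknown (e.g. isolated from the quorum): refuse rather than guess.
            http.Error(w, "leader unknown; refusing to proxy", http.StatusServiceUnavailable)
            return
        }
        if leader == selfURI {
            // The advertised leader is this very node, but we are on the
            // follower proxy path: forwarding would loop back into this handler.
            http.Error(w, "leader URI points at this node; refusing to proxy", http.StatusServiceUnavailable)
            return
        }
        target, err := url.Parse(leader)
        if err != nil {
            http.Error(w, "unparseable leader URI", http.StatusServiceUnavailable)
            return
        }
        // Build the proxy per request so a leader change is picked up immediately.
        httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
    })
}

func main() {
    // Placeholder wiring: in reality selfURI would come from this node's
    // advertised address and leaderURI from the raft state (orcraft.LeaderURI).
    h := guardedLeaderProxy("http://10.0.0.1:3000", func() string { return "http://10.0.0.2:3000" })
    log.Fatal(http.ListenAndServe(":3000", h))
}

This is only a mitigation sketch; it does not answer why LeaderURI ends up pointing at the restarted ex-master in the first place.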

Files:

  • orchestrator-logs-full.log.gz: full logs from `stern -t -l blah=orchestrator`
  • orchestrator-logs-shortlines-hostnamescleaned.log.gz: the same logs with the buggy lines chopped at 250 characters and IPs replaced by hostnames, to make them easier to follow
  • kube-events.txt: Event history of orchestrator-4 dying and coming back, and getting OOM killed
$ kubectl get pods -l component=orchestrator -w
NAME             READY   STATUS              RESTARTS   AGE
orchestrator-0   1/1     Running             0          57m
orchestrator-1   1/1     Running             0          1d
orchestrator-2   1/1     Running             0          1h
orchestrator-3   1/1     Running             1          44m
orchestrator-4   1/1     Running             0          1h
orchestrator-4   1/1     Terminating         0          1h
orchestrator-4   0/1     Terminating         0          1h
orchestrator-4   0/1     Terminating         0          1h
orchestrator-4   0/1     Terminating         0          1h
orchestrator-4   0/1     Pending             0          0s
orchestrator-4   0/1     Pending             0          0s
orchestrator-4   0/1     ContainerCreating   0          0s
orchestrator-4   0/1     Running             0          8s
orchestrator-4   1/1     Running             0          16s
orchestrator-4   0/1     Running             1          1m
orchestrator-4   1/1     Running             1          1m
orchestrator-4   0/1     OOMKilled           1          2m
orchestrator-4   0/1     Running             2          2m
orchestrator-4   1/1     Running             2          2m
cat ./orchestrator-logs-full.log.gz | gunzip - | cut -c -250 | sed 's/172.19.137.134/orchestrator-0/g; s/172.19.204.83/orchestrator-1/g; s/172.19.234.191/orchestrator-2/g; s/172.19.236.152/orchestrator-3/g; s/172.19.184.166/orchestrator-4/g' | gzip - > orchestrator-logs-shortlines-hostnamescleaned.log.gz