When we run a rolling restart on our orchestrator StatefulSet, the node that was previously the master gets stuck in a crash loop.
- The pod that gets stuck in a crash loop seems to be the node that used to be the master.
- When you delete the pod, it somehow gets out of the crash loop.
- Using pprof, Leo tracked the crash to inside the martini web framework (used by orchestrator; see profile map, problem region), while writing the response.
- More specifically, the `X-Forwarded-For` request header (which is supposed to accumulate the IP address of each proxy the request passes through) accumulates an insane number of IPs, which eventually OOMs the process. A snippet:
```
[martini] Started GET /api/discover/172.18.23.33/3306 for 172.18.23.33,
172.18.41.92, 172.18.41.92, 172.18.41.92, 172.18.41.92, 172.18.41.92,
...<snip>
```
- The way we think this could happen is via a feature introduced in orchestrator that makes healthy followers transparently proxy incoming requests to the master.
- The way we think the above feature causes disaster is in how the proxy is set up. From the martini log entry above, the server appears to be proxying queries to itself, i.e. it's doing `httputil.NewSingleHostReverseProxy(url)` where the url is its own IP address. When it then sends a request through that proxy to its own IP, it receives it yet again, appends its own IP again to `X-Forwarded-For`, proxies again, and continues the loop of death.
- The weird bit is that the master, when restarted, correctly decides that it's not the leader, and yet when it looks up the URL of the correct master, it somehow gets itself.
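The loop above can be reproduced in miniature. This is a hypothetical sketch, not orchestrator's code: a test server reverse-proxies each request back to its own address (capped at a hop count so the demo terminates), and `X-Forwarded-For` gains one copy of the client IP per hop, just like in the log snippet.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// simulateSelfProxy spins up a test server that, for the first maxHops
// requests, reverse-proxies the request back to its own address (the
// suspected bug: the leader URL resolving to the node itself). The
// default ReverseProxy director appends the client IP to X-Forwarded-For
// on every hop; in orchestrator the loop never terminates and the header
// grows until the process OOMs. Here we stop after maxHops and return
// the accumulated header.
func simulateSelfProxy(maxHops int32) string {
	var hops int32
	got := make(chan string, 1)
	var srv *httptest.Server
	srv = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hops, 1) <= maxHops {
			// Proxy the request back to ourselves.
			u, _ := url.Parse(srv.URL)
			httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
			return
		}
		got <- r.Header.Get("X-Forwarded-For")
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}
	return <-got
}

func main() {
	// Four self-proxy hops -> four copies of 127.0.0.1 in the header.
	fmt.Println(simulateSelfProxy(4))
}
```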
- Reproduction: is it just rolling restarts, or will restarting the master pod just once cause the same issue?
- Result: a plain delete of the master pod will trigger the issue, though it won't always OOM. It will always take up a disproportionately large amount of memory, though. I don't know what eventually stops it; perhaps a timeout.
- Is the `cannot determine raft state` message a red herring or related somehow?
- Result: it's a red herring.
- Why does the broken leader work correctly on the next restart? How is the second restart better / different from the first (rolling) one that broke the leader? Is it the state of the other raft participants?
- Who is initiating the request that eventually snowballs into hell?
- Is there a race condition between `israftenabled` / `isleader` / `getleader` / the proxy? Should it `luri.lock()` in that function body? Why is `orcraft.LeaderURI` stale or flat out wrong?
- Ominous looking stuff here saying "You must not access unhealthy raft members, i.e. nodes that are isolated from the quorum". Unclear if this also involves e.g. `/raft-health`. Theoretically it should be smart enough to reject API calls if it's isolated. If the node is outside the quorum (broken, ...) it wouldn't even know who the leader is (though it can do programmatic work based on configuration to ask who the leader is).
- here: That does leave a small window between the network partition and whenever my readiness check picks up that the node is unhealthy, during which some request could feasibly come in to be proxied. In that case, I would expect a 503 or something similar.
- here: [calculation of the leader URL] is done independently on each node, once, upon startup, and then advertised to the followers any time the node becomes the leader.
- How does the `RaftEnabled` stuff work here? What does a bare `return` from `raftReverseProxy` entail?
- Is `ERROR near "ON"` a red herring or related?
- orchestrator-logs-full.log.gz: full logs from `stern -t -l blah=orchestrator`
- orchestrator-logs-shortlines-hostnamescleaned.log.gz: the buggy lines chopped at 250 characters, with IPs replaced by hostnames to make them easier to follow
- kube-events.txt: Event history of orchestrator-4 dying and coming back, and getting OOM killed