Skip to content

Instantly share code, notes, and snippets.

Created November 14, 2014 22:25
Show Gist options
  • Save armon/9fa0542d7123f4c10e12 to your computer and use it in GitHub Desktop.
Save armon/9fa0542d7123f4c10e12 to your computer and use it in GitHub Desktop.
The initial observed cluster behavior:
1) Constant churn of nodes between Failed and Alive
2) Message bus saturated (~150 updates/sec)
3) Subset of cluster affected
4) Some nodes that are flapping don't exist! (Node dead, or agent down)
One immediate question is how the cluster remained in an unstable
state. We expect that the cluster should converge and return to
a quiet state after some time. However, there was a bug in the
low level SWIM implementation (memberlist library).
Bug fixed in
Memberlist typically makes use of an "Incarnation" or logical
clock value to track the causality of updates. This allows message reply
or old messages to be handled properly. Each known node has a corresponding
incarnation number. The aliveMsg and deadMsg code properly handles
checking for old incarnation numbers. The known nodes are stored in a
"nodeMap" and a "nodes" list. The map provides for O(1) lookups, while
the list is used for various iterators that require random iteration
with time-bounded promises.
The bug in deadMsg is that it removes a node from the "nodeMap" after
re-transmiting the deadMsg. The "nodes" list is cleared seperately
periodically. In the aliveMsg handler, we ONLY check the "nodeMap" for
a potentially known node, not the "node" list. As a result, we "forget"
the incarnation number of a known node.
This leads to the possibility of the following interleaving:
1) {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
2) {Dead, Foo, 1} => Node Foo marked dead, current incarnation
3) Replay {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
4) ...
Basically this bug allows a node to transition from the dead -> alive
state improperly (the incarnation number is invalid and should be rejected).
If this bug has existed then, why was it just triggered? In most cases,
the event bus is not saturated, meaning given time one of dead or alive
is able to propogate faster, and the system converges. However, in a saturated
state, the messages may be sitting in queues.
This leads to the possibility that Node A queue looks like:
{Alive, Foo, 1}
While Node B queue looks like:
{Dead, Foo, 1}
Given a large enough cluster, the interleaving of the two messages
generates a new set of updates. E.g. any node receiving dead will begin
to publish that, and any node receiving alive will publish that. Given
that the size of the cluster is large enough and the message bus is saturated,
there will always* be 2 nodes publishing conflicting information. This
prevents the system from returning to a stable state.
This explains why the system was never able to leave the unstable
situation, but what triggered the conditions to begin with? The first
step is that the message bus must be saturated for this to happen.
This means we need a thundering herd of either A) Failures or B) Joins.
I don't have access to enough log data to determine what it was,
but a temporary blip in a switch could take down enough servers to cause A,
while a lot of nodes starting Consul at once could start B. Alternatively,
a CM system or orchestration tool could restart many agents at once.
While the bus was saturated, nodes needed to stop running / fail. This would
introduce both Dead messages and Alive messages for the corresponding nodes
circulating. Now you have all the ingredients to trigger the bug :)
The fix is relatively simple, all we do is NOT clear the node from the
"nodeMap". This allows aliveMsg to properly de-duplicate. This avoid
the broadcast storm and breaks the constant cycling. The nodeMap and node
list are eventually cleared, but that is done long enough later to break
the cycle.
To fix the running cluster is a bit more challenging. Unfortunately,
due to the nature of the bug, if there are conflicting messages queued
on any node, they can restart the broadcast storm. The best approach is to
simple stop and start the agents, allowing the queues to reset. Of course,
to prevent the thundering herd again, it's best to limit the number of nodes
starting to < 50/sec.
Releases after 0.4.1 should fix this issue, and after a period of rapid
churn should return to a stable / quiet state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment