The initial observed cluster behavior:
1) Constant churn of nodes between Failed and Alive
2) Message bus saturated (~150 updates/sec)
3) Subset of cluster affected
4) Some nodes that are flapping don't exist! (Node dead, or agent down)
One immediate question is how the cluster remained in an unstable
state. We expect the cluster to converge and return to
a quiet state after some time. However, there was a bug in the
low-level SWIM implementation (the memberlist library).
Bug fixed in https://github.com/hashicorp/memberlist/commit/63ef41a08f845463ae968b58ca4927666ccc1f4e
Memberlist makes use of an "Incarnation" number, a logical
clock value, to track the causality of updates. This allows message replays
or old messages to be handled properly. Each known node has a corresponding
incarnation number. The aliveMsg and deadMsg code properly handles
checking for old incarnation numbers. The known nodes are stored in a
"nodeMap" and a "nodes" list. The map provides O(1) lookups, while
the list is used for various iterators that require random iteration
with time-bounded promises.
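As a rough sketch (not the actual memberlist source; the type and field
names here are illustrative), that internal state looks roughly like:

package sketch

// Simplified view of the per-node state tracked by memberlist.
type nodeInfo struct {
	Name        string
	Incarnation uint32 // logical clock ordering alive/dead updates for this node
	Dead        bool
}

// Known nodes are tracked twice: a map for O(1) lookups when handling
// alive/dead messages, and a list used for randomized, bounded iteration.
type members struct {
	nodeMap map[string]*nodeInfo
	nodes   []*nodeInfo
}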
The bug in deadMsg is that it removes a node from the "nodeMap" after
re-transmitting the deadMsg. The "nodes" list is cleared separately,
periodically. In the aliveMsg handler, we ONLY check the "nodeMap" for
a potentially known node, not the "nodes" list. As a result, we "forget"
the incarnation number of a known node.
This leads to the possibility of the following interleaving:
1) {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
2) {Dead, Foo, 1} => Node Foo marked dead at the current Incarnation 1
3) Replay {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
4) ...
Basically this bug allows a node to transition from the dead -> alive
state improperly (the incarnation number is stale and should be rejected).
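A minimal, runnable sketch (not the actual memberlist code; the names and
checks are simplified) replays this interleaving against toy handlers,
where removeOnDead models the buggy deletion from the nodeMap:

package main

import "fmt"

type member struct {
	incarnation uint32
	dead        bool
}

type cluster struct {
	nodeMap      map[string]*member
	removeOnDead bool // true models the pre-fix deadMsg behavior
}

// aliveMsg accepts an alive update only if the node is unknown or the
// incarnation is strictly newer than what we already know.
func (c *cluster) aliveMsg(name string, inc uint32) bool {
	if m, ok := c.nodeMap[name]; ok && inc <= m.incarnation {
		return false // old or replayed update: reject
	}
	c.nodeMap[name] = &member{incarnation: inc}
	return true
}

// deadMsg marks a known node dead; with the bug it then forgets the node
// entirely, erasing its incarnation number.
func (c *cluster) deadMsg(name string, inc uint32) bool {
	m, ok := c.nodeMap[name]
	if !ok || inc < m.incarnation || m.dead {
		return false
	}
	m.dead = true
	if c.removeOnDead {
		delete(c.nodeMap, name) // the bug: Foo's incarnation is forgotten
	}
	return true
}

func main() {
	c := &cluster{nodeMap: map[string]*member{}, removeOnDead: true}
	fmt.Println("1) alive Foo 1 accepted:", c.aliveMsg("Foo", 1)) // true
	fmt.Println("2) dead  Foo 1 accepted:", c.deadMsg("Foo", 1))  // true
	fmt.Println("3) replay alive Foo 1 accepted:", c.aliveMsg("Foo", 1))
	// With removeOnDead=true, step 3 prints true (an improper dead -> alive
	// flap); with removeOnDead=false it prints false, since Incarnation 1
	// is not newer than what the nodeMap still remembers.
}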
If this bug has always existed, why was it only triggered now? In most cases,
the event bus is not saturated, meaning that given time, one of dead or alive
is able to propagate faster, and the system converges. However, in a saturated
state, the messages may sit in queues for a long time.
This leads to the possibility that Node A's queue looks like:
{Alive, Foo, 1}
While Node B's queue looks like:
{Dead, Foo, 1}
Given a large enough cluster, the interleaving of the two messages
generates a new set of updates. E.g. any node receiving the dead will begin
to publish that, and any node receiving the alive will publish that. Given
that the cluster is large enough and the message bus is saturated,
there will (practically) always be 2 nodes publishing conflicting information. This
prevents the system from returning to a stable state.
This explains why the system was never able to leave the unstable
situation, but what triggered the conditions to begin with? The first
requirement is that the message bus must be saturated for this to happen.
This means we need a thundering herd of either A) failures or B) joins.
I don't have access to enough log data to determine which it was,
but a temporary blip in a switch could take down enough servers to cause A,
while a lot of nodes starting Consul at once could cause B. Alternatively,
a configuration management or orchestration tool could restart many agents at once.
While the bus was saturated, nodes needed to stop running or fail. This would
put both Dead messages and Alive messages for the corresponding nodes
into circulation. Now you have all the ingredients to trigger the bug :)
The fix is relatively simple: all we do is NOT clear the node from the
"nodeMap". This allows aliveMsg to properly de-duplicate. This avoids
the broadcast storm and breaks the constant cycling. The nodeMap and nodes
list are eventually cleared, but that is done long enough later to break
the cycle.
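Reusing the simplified types from the sketch above, the fixed handler
amounts to just marking the member dead while keeping its entry (and its
incarnation) in the nodeMap; the later periodic cleanup is omitted here:

// deadMsgFixed marks a known member dead but keeps it in nodeMap, so a
// replayed alive at the same incarnation is still seen as stale and rejected.
func (c *cluster) deadMsgFixed(name string, inc uint32) bool {
	m, ok := c.nodeMap[name]
	if !ok || inc < m.incarnation || m.dead {
		return false
	}
	m.dead = true // entry and incarnation remain in nodeMap
	return true
}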
Fixing the running cluster is a bit more challenging. Unfortunately,
due to the nature of the bug, if there are conflicting messages queued
on any node, they can restart the broadcast storm. The best approach is to
simply stop and start the agents, allowing the queues to reset. Of course,
to prevent another thundering herd, it's best to limit the number of nodes
starting to < 50/sec.
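As an illustration only (the host list and restartAgent below are
hypothetical placeholders, not a real tool), a pacing loop for those
restarts could look like:

package main

import (
	"fmt"
	"time"
)

// restartAgent is a placeholder; in practice this would invoke whatever
// CM or orchestration tooling restarts the Consul agent on the host.
func restartAgent(host string) {
	fmt.Println("restarting agent on", host)
}

func main() {
	hosts := []string{"node1", "node2", "node3"} // hypothetical host list
	ticker := time.NewTicker(time.Second / 50)   // cap restarts at 50/sec
	defer ticker.Stop()
	for _, h := range hosts {
		<-ticker.C
		restartAgent(h)
	}
}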
Releases after 0.4.1 should fix this issue, and after a period of rapid
churn the cluster should return to a stable, quiet state.