armon/gist:9fa0542d7123f4c10e12

## gistfile1.txt
The initial observed cluster behavior:
1) Constant churn of nodes between Failed and Alive
2) Message bus saturated (~150 updates/sec)
3) Subset of cluster affected
4) Some nodes that are flapping don't exist! (Node dead, or agent down)

One immediate question is how the cluster remained in an unstable
state. We expect that the cluster should converge and return to
a quiet state after some time. However, there was a bug in the
low level SWIM implementation (memberlist library).

Bug fixed in https://github.com/hashicorp/memberlist/commit/63ef41a08f845463ae968b58ca4927666ccc1f4e

Memberlist typically makes use of an "Incarnation" or logical
clock value to track the causality of updates. This allows message reply
or old messages to be handled properly. Each known node has a corresponding
incarnation number. The aliveMsg and deadMsg code properly handles
checking for old incarnation numbers. The known nodes are stored in a
"nodeMap" and a "nodes" list. The map provides for O(1) lookups, while
the list is used for various iterators that require random iteration
with time-bounded promises.

The bug in deadMsg is that it removes a node from the "nodeMap" after
re-transmiting the deadMsg. The "nodes" list is cleared seperately
periodically. In the aliveMsg handler, we ONLY check the "nodeMap" for
a potentially known node, not the "node" list. As a result, we "forget"
the incarnation number of a known node.

This leads to the possibility of the following interleaving:

1) {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
2) {Dead, Foo, 1} => Node Foo marked dead, current incarnation
3) Replay {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
4) ...

Basically this bug allows a node to transition from the dead -> alive
state improperly (the incarnation number is invalid and should be rejected).

If this bug has existed then, why was it just triggered? In most cases,
the event bus is not saturated, meaning given time one of dead or alive
is able to propogate faster, and the system converges. However, in a saturated
state, the messages may be sitting in queues.

This leads to the possibility that Node A queue looks like:
{Alive, Foo, 1}

While Node B queue looks like:
{Dead, Foo, 1}

Given a large enough cluster, the interleaving of the two messages
generates a new set of updates. E.g. any node receiving dead will begin
to publish that, and any node receiving alive will publish that. Given
that the size of the cluster is large enough and the message bus is saturated,
there will always* be 2 nodes publishing conflicting information. This
prevents the system from returning to a stable state.

This explains why the system was never able to leave the unstable
situation, but what triggered the conditions to begin with? The first
step is that the message bus must be saturated for this to happen.
This means we need a thundering herd of either A) Failures or B) Joins.

I don't have access to enough log data to determine what it was,
but a temporary blip in a switch could take down enough servers to cause A,
while a lot of nodes starting Consul at once could start B. Alternatively,
a CM system or orchestration tool could restart many agents at once.

While the bus was saturated, nodes needed to stop running / fail. This would
introduce both Dead messages and Alive messages for the corresponding nodes
circulating. Now you have all the ingredients to trigger the bug :)

The fix is relatively simple, all we do is NOT clear the node from the
"nodeMap". This allows aliveMsg to properly de-duplicate. This avoid
the broadcast storm and breaks the constant cycling. The nodeMap and node
list are eventually cleared, but that is done long enough later to break
the cycle.

To fix the running cluster is a bit more challenging. Unfortunately,
due to the nature of the bug, if there are conflicting messages queued
on any node, they can restart the broadcast storm. The best approach is to
simple stop and start the agents, allowing the queues to reset. Of course,
to prevent the thundering herd again, it's best to limit the number of nodes
starting to < 50/sec.

Releases after 0.4.1 should fix this issue, and after a period of rapid
churn should return to a stable / quiet state.
	The initial observed cluster behavior:
	1) Constant churn of nodes between Failed and Alive
	2) Message bus saturated (~150 updates/sec)
	3) Subset of cluster affected
	4) Some nodes that are flapping don't exist! (Node dead, or agent down)

	One immediate question is how the cluster remained in an unstable
	state. We expect that the cluster should converge and return to
	a quiet state after some time. However, there was a bug in the
	low level SWIM implementation (memberlist library).

	Bug fixed in https://github.com/hashicorp/memberlist/commit/63ef41a08f845463ae968b58ca4927666ccc1f4e

	Memberlist typically makes use of an "Incarnation" or logical
	clock value to track the causality of updates. This allows message reply
	or old messages to be handled properly. Each known node has a corresponding
	incarnation number. The aliveMsg and deadMsg code properly handles
	checking for old incarnation numbers. The known nodes are stored in a
	"nodeMap" and a "nodes" list. The map provides for O(1) lookups, while
	the list is used for various iterators that require random iteration
	with time-bounded promises.

	The bug in deadMsg is that it removes a node from the "nodeMap" after
	re-transmiting the deadMsg. The "nodes" list is cleared seperately
	periodically. In the aliveMsg handler, we ONLY check the "nodeMap" for
	a potentially known node, not the "node" list. As a result, we "forget"
	the incarnation number of a known node.

	This leads to the possibility of the following interleaving:

	1) {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
	2) {Dead, Foo, 1} => Node Foo marked dead, current incarnation
	3) Replay {Alive, Foo, 1} => Node Foo marked alive with Incarnation 1
	4) ...

	Basically this bug allows a node to transition from the dead -> alive
	state improperly (the incarnation number is invalid and should be rejected).

	If this bug has existed then, why was it just triggered? In most cases,
	the event bus is not saturated, meaning given time one of dead or alive
	is able to propogate faster, and the system converges. However, in a saturated
	state, the messages may be sitting in queues.

	This leads to the possibility that Node A queue looks like:
	{Alive, Foo, 1}

	While Node B queue looks like:
	{Dead, Foo, 1}

	Given a large enough cluster, the interleaving of the two messages
	generates a new set of updates. E.g. any node receiving dead will begin
	to publish that, and any node receiving alive will publish that. Given
	that the size of the cluster is large enough and the message bus is saturated,
	there will always* be 2 nodes publishing conflicting information. This
	prevents the system from returning to a stable state.

	This explains why the system was never able to leave the unstable
	situation, but what triggered the conditions to begin with? The first
	step is that the message bus must be saturated for this to happen.
	This means we need a thundering herd of either A) Failures or B) Joins.

	I don't have access to enough log data to determine what it was,
	but a temporary blip in a switch could take down enough servers to cause A,
	while a lot of nodes starting Consul at once could start B. Alternatively,
	a CM system or orchestration tool could restart many agents at once.

	While the bus was saturated, nodes needed to stop running / fail. This would
	introduce both Dead messages and Alive messages for the corresponding nodes
	circulating. Now you have all the ingredients to trigger the bug :)

	The fix is relatively simple, all we do is NOT clear the node from the
	"nodeMap". This allows aliveMsg to properly de-duplicate. This avoid
	the broadcast storm and breaks the constant cycling. The nodeMap and node
	list are eventually cleared, but that is done long enough later to break
	the cycle.

	To fix the running cluster is a bit more challenging. Unfortunately,
	due to the nature of the bug, if there are conflicting messages queued
	on any node, they can restart the broadcast storm. The best approach is to
	simple stop and start the agents, allowing the queues to reset. Of course,
	to prevent the thundering herd again, it's best to limit the number of nodes
	starting to < 50/sec.

	Releases after 0.4.1 should fix this issue, and after a period of rapid
	churn should return to a stable / quiet state.