@fulmicoton
Last active September 29, 2022 00:44
Leader election on top of Scuttlebutt
-----------------------------------------------
The following makes it possible to cheaply obtain leader election
on top of a Cassandra-like membership layer using scuttlebutt + phi accrual detection.
# TL;DR about Scuttlebutt and phi accrual detection
Scuttlebutt is an anti-entropy gossip algorithm.
It makes it possible for every node in a cluster to
expose a KV state, and the algorithm ensures that eventually
all nodes will know about all of the other nodes' states.
One way to get membership and failure detection on top of
scuttlebutt is to have nodes emit a periodically changing
heartbeat as part of their KV state.
Each node then decides locally whether it considers another node
to be alive or not, using phi-accrual failure detection.
The idea is to look at the stream of events demonstrating the peer
node's liveness: for instance, receiving a message from the node, or
receiving an update about the node's state (usually a change in its heartbeat).
If no event is received for more than a given threshold T, the node
is considered dead.
Phi-accrual detection is all about dynamically defining what this threshold
should be, based on the timestamps of past events.
Such a scheme is used in Cassandra for instance.
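As a concrete illustration, here is a minimal phi-accrual-style detector sketch in Python. All names are illustrative, and the exponential inter-arrival model is a simplification (the original technique fits a normal distribution to the inter-arrival window):

```python
import math

class PhiAccrualDetector:
    """Sketch of a phi-accrual failure detector.

    Keeps a sliding window of heartbeat inter-arrival intervals and
    returns phi = -log10(probability that the silence observed so far
    is compatible with the node still being alive)."""

    def __init__(self, window_size=100):
        self.intervals = []          # recent inter-arrival intervals (seconds)
        self.last_heartbeat = None   # timestamp of the last observed event
        self.window_size = window_size

    def heartbeat(self, now):
        # Record an event proving the peer's liveness.
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window_size:]
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean_interval = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(interval > elapsed), assuming exponential inter-arrival times.
        p_alive = math.exp(-elapsed / mean_interval)
        return -math.log10(max(p_alive, 1e-300))

detector = PhiAccrualDetector()
for t in range(10):
    detector.heartbeat(float(t))   # one heartbeat per second
```

A caller would declare the node dead once `phi` crosses some threshold (Cassandra's default is 8): a short silence yields a low phi, a long one a high phi.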
# Leader election: assumptions and problem setting
The idea here is to tweak scuttlebutt + phi accrual detection
to make leader election possible as well.
Assumptions:
- Computers have access to a monotonic clock whose frequency never drifts by more than 500 ppm.
This value is actually part of the POSIX standard.
- Non-Byzantine setting
- The number of nodes is assumed to be known by everyone.
Non-assumptions:
- Asynchronous setting: messages can be lost or arrive with an arbitrary delay.
- Nodes do not need to be fully connected.
The algorithm will eventually elect a leader, and we ensure that at all times
there are strictly 0 or 1 active leaders in the cluster.
# A weird clock
Each node keeps a local view of the cluster members and has
a strategy to keep track of who its favorite leader is.
This strategy will be explained in the next section.
Again, the notion of favorite-leader is local to each node.
Instead of an auto-incremented counter, nodes expose, as their heartbeat, the following triplet:
- favorite-leader # their favorite-leader
- elapsed-since-start # local monotonic timestamp since node startup time
- favorite-leader-clock-lower-bound # A lower bound of their favorite leader clock
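The triplet can be pictured as a small record. This is a sketch: the field names simply mirror the list above, and the sample values are made up.

```python
from dataclasses import dataclass

@dataclass
class Heartbeat:
    favorite_leader: str                       # uuid of this node's favorite leader
    elapsed_since_start: float                 # local monotonic seconds since startup
    favorite_leader_clock_lower_bound: float   # lower bound of the favorite leader's clock

# Example heartbeat (all values illustrative).
hb = Heartbeat(favorite_leader="node-uuid-42",
               elapsed_since_start=12.5,
               favorite_leader_clock_lower_bound=11.9)
```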
The last point is tricky.
Each node N maintains its own monotonic clock C(N), but also maintains
a biased clock C_N(N') for each of the other nodes N'.
The clocks C_N(N') are meant to be lagging estimates of the monotonic clock of N', C(N').
In other words, we want to build C_N(N') such that it maintains the following invariant:
at all times, C_N(N') < C_N'(N').
We do that as follows.
The first time we receive the heartbeat of N', we extract its elapsed-since-start value
and we set
`C_N(N') := received-since-start-value`
**Important aside:**
Note that this initial estimate can lag quite a bit.
If the heartbeat was sent directly by N', it lags by the network latency.
If it was gossiped indirectly, it may lag by several seconds.
Between two interactions, we update this clock using our own monotonic clock, but we
advance it slower, with a bias of 1000 ppm. This accounts for the drift
of our monotonic clock AND the drift of N's monotonic clock.
Upon a new interaction, we just take the best of the two estimates:
`C_N(N') := max(C_N(N'), received-since-start-value)`.
favorite-leader-clock-lower-bound is precisely C_N(favorite_leader).
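The clock update rules above can be sketched as follows. This is a simplification; the 1000 ppm bias matches the drift budget stated earlier (500 ppm for our clock plus 500 ppm for the peer's).

```python
class LaggingClock:
    """Sketch of the lower-bound estimate C_N(N') of a peer's clock.

    Advancing the estimate 1000 ppm slower than our own monotonic clock
    keeps it a lower bound of C(N') even in the worst drift case."""

    BIAS = 1000e-6  # 1000 ppm

    def __init__(self, received_since_start_value, local_now):
        # First heartbeat received from N': initialize the estimate.
        self.estimate = received_since_start_value
        self.local_last_update = local_now

    def tick(self, local_now):
        # Between interactions: advance 1000 ppm slower than real time.
        delta = local_now - self.local_last_update
        self.estimate += delta * (1.0 - self.BIAS)
        self.local_last_update = local_now

    def observe(self, received_since_start_value, local_now):
        # New interaction: keep the best of the two estimates.
        self.tick(local_now)
        self.estimate = max(self.estimate, received_since_start_value)
```

`favorite-leader-clock-lower-bound` is then just the current `estimate` for the favorite leader.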
# A favorite leader
Let's now describe how a node decides who its favorite leader should be.
Each node is associated with an immutable uuid. Upon initialization, or
when it is time to pick a new favorite-leader (the previous one having been locally detected
as dead), we just use a bully-like strategy and simply pick the
node with the highest uuid that still seems alive.
If a node has a favorite leader that seems alive, it sticks to it
until that favorite leader is locally detected as dead.
When a node detects its current favorite leader as faulty, it does not update its favorite leader
right away. It only schedules doing so after a period P.
During that period, it stops updating `elapsed-since-start`.
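The selection and switch-delay logic above can be sketched like this (uuids are plain strings here, purely for illustration):

```python
def pick_favorite_leader(alive_uuids):
    # Bully-like strategy: the alive node with the highest uuid wins.
    return max(alive_uuids)

def earliest_switch_instant(detected_dead_at, P):
    # After detecting the favorite leader as dead, the actual switch is
    # only scheduled a full period P later; during that window the node
    # also freezes its `elapsed-since-start` heartbeat field.
    return detected_dead_at + P
```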
# Am I the leader?
With just the heartbeats of the other nodes, a node can detect whether it is the leader or
not.
Assume we are node N.
We can look at all of the nodes N' declaring ourselves (N) as their favorite leader.
We can then read the last `C_N'(N)`. That value is lagging, but we know that it is a lower bound
of `C_N(N)` at the instant when they last declared us as favorite leader.
We assume there is a period P that is preset on all nodes.
If we have a majority of nodes N' for which `[C_N(N) - C_N'(N)] * (1 + 500/1000000) < P`, then N can safely identify
itself as the leader.
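The majority check can be sketched as follows. This is a sketch: the shape of `views` (last known heartbeat fields per peer) is an assumption, not part of the gist.

```python
DRIFT = 500e-6  # 500 ppm, the assumed worst-case clock drift

def am_i_the_leader(my_uuid, my_clock, views, cluster_size, P):
    """Return True if a strict majority of the cluster recently
    declared us as their favorite leader.

    `views` maps each peer uuid to a pair
    (favorite_leader, favorite_leader_clock_lower_bound)."""
    fresh_supporters = 0
    for peer, (favorite_leader, leader_clock_lower_bound) in views.items():
        if favorite_leader != my_uuid:
            continue
        # Lower bound lag, inflated by the worst-case drift.
        lag = my_clock - leader_clock_lower_bound
        if lag * (1.0 + DRIFT) < P:
            fresh_supporters += 1
    return fresh_supporters > cluster_size // 2
```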
These conditions are sufficient to ensure that there are never two leaders in the cluster.
Why?
If there were two leaders L1 and L2, both would have a view of the cluster
satisfying the majority condition above.
That implies there exists a node N whose state, as seen by both L1 and L2, looks like this:
- L1 is the favorite-leader, at version L1:V1, with V1 emitted less than P seconds ago;
- L2 is the favorite-leader, at version L2:V2, with V2 emitted less than P seconds ago.
In other words, within `]now - P, now]`, node N declared both L1 and L2 as favorite-leader.
This is impossible, since by design a node cannot change its favorite-leader within P.
** It is possible to greatly improve the performance of this solution
by twisting scuttlebutt a bit and letting a node N tamper with the heartbeat of a node N',
replacing the `elapsed-since-start` with C_N(N'). It should help improve the overall accuracy of
C_N(N') for all (N, N').
# Choice of P
The choice of P is critical.
a) After a leader truly fails, by design, this algorithm will leave the cluster leaderless for at least P.
b) On the other hand, if P is too short, the cluster will remain leaderless forever.
How short can P be? The question is strongly related to the threshold after which the phi-accrual detection
considers a node dead. If a node switches away from its favorite leader, how long does it take for another
node to learn about the switch?
For a cluster of 100 nodes, with a gossip interval of 1 sec, we probably end up with P ~= 15s.
A failed leader would then leave the cluster leaderless for about 30s.
A simple way to bring that number down is to reduce the number of nodes taking part
in the leader election.
Such nodes would then gossip in priority with the other leader-election nodes,
and do so with a higher frequency. The majority threshold would also be considerably lower.
================================================================
Cons:
---------------------
- P is probably relatively large, leaving the cluster leaderless for several seconds.
- a practical `P` probably grows as the cluster gets larger.
Some of the cons can be solved by introducing first-class citizen nodes and second-class
citizen nodes. Only first-class citizen nodes take part in the quorum used to elect a leader.
The leader is in charge of deciding who becomes a first class citizen and who does not.
Pros:
--------------------
- easy to code
- very little configuration
- is robust to network link failures without any refinement.
- generally speaking, it is not up to a node to decide when to trigger an election:
the election is, in a sense, entirely continuous.
@Horusiath commented Dec 30, 2021:
It looks good: the base (phi-accrual detector + raw bully algorithm) is also the basis of Akka Cluster as well as Cassandra. Nonetheless, you should verify it via e.g. TLA+. My doubt would be the risk of a double leader: if two leader candidates are disconnected from each other but can see the other nodes, they could both gather the majority of timestamps needed to assume themselves leader. It's quite unlikely but seems possible. In Raft, such a scenario is prevented by the votes cast by each node in each term.

@fulmicoton (author) replied:
> If two leader candidates are disconnected from each other but can see other nodes, they could both gather majority timestamps needed to assume themselves as leaders.

I think this is strictly impossible, provided all the hypotheses above hold (monotonic clocks with a frequency divergence < 500 ppm).
But then one problem I have is that the algorithm requires all nodes to know the global number of nodes... or at least to be given a threshold M they can consider a majority.
