@fulmicoton
Last active September 29, 2022 00:44
Leader election on top of Scuttlebutt
-----------------------------------------------
The following makes it possible to cheaply obtain leader election
on top of a Cassandra-like membership layer using scuttlebutt + phi accrual detection.
# TL;DR about Scuttlebutt and phi accrual detection
Scuttlebutt is an anti-entropy gossip algorithm.
It makes it possible for every node in a cluster to
expose a KV state, and the algorithm ensures that eventually
all nodes will know about all of the other nodes' states.
One way to get membership and failure detection on top of
scuttlebutt is to have nodes emit a periodically changing
heartbeat as part of their KV state.
Each node then decides locally whether it considers another node
to be alive or not, using phi-accrual failure detection.
The idea is to look at the stream of events demonstrating the peer
node's liveness: for instance, receiving a message from the node, or
receiving an update about the node's state (usually a change in its heartbeat).
If no event is received for more than a given threshold T, the node
is considered dead.
Phi-accrual detection is all about dynamically defining what this threshold
should be, based on the timestamps of past events.
Such a scheme is used in Cassandra for instance.
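As a concrete illustration, here is a minimal phi-accrual-style detector sketch in Python. All names are illustrative, and the exponential inter-arrival model is a simplification (the original technique fits a normal distribution to the inter-arrival window):

```python
import math

class PhiAccrualDetector:
    """Sketch of a phi-accrual failure detector.

    Keeps a sliding window of heartbeat inter-arrival intervals and
    returns phi = -log10(probability that the silence observed so far
    is compatible with the node still being alive)."""

    def __init__(self, window_size=100):
        self.intervals = []          # recent inter-arrival intervals (seconds)
        self.last_heartbeat = None   # timestamp of the last observed event
        self.window_size = window_size

    def heartbeat(self, now):
        # Record an event proving the peer's liveness.
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window_size:]
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean_interval = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(interval > elapsed), assuming exponential inter-arrival times.
        p_alive = math.exp(-elapsed / mean_interval)
        return -math.log10(max(p_alive, 1e-300))

detector = PhiAccrualDetector()
for t in range(10):
    detector.heartbeat(float(t))   # one heartbeat per second
```

A caller would declare the node dead once `phi` crosses some threshold (Cassandra's default is 8): a short silence yields a low phi, a long one a high phi.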
# Leader election: assumptions and problem setting
The idea here is to tweak scuttlebutt + phi accrual detection
to make leader election possible as well.
Assumptions:
- Computers have access to a monotonic clock whose frequency never drifts by more than 500 ppm.
This value is actually part of the POSIX standard.
- Non-Byzantine setting
- The number of nodes is assumed to be known by everyone.
Non-assumptions:
- Asynchronous setting: messages can be lost or arrive with an arbitrary delay.
- Nodes do not need to be fully connected.
The algorithm will eventually elect a leader, and we ensure that at all times
there are strictly 0 or 1 active leaders in the cluster.
# A weird clock
Each node keeps a local view of the cluster members and has
a strategy to keep track of who its favorite leader is.
This strategy will be explained in the next section.
Again, the notion of favorite-leader is local to each node.
Instead of an auto-incremented counter, nodes expose, as their heartbeat, the following triplet:
- favorite-leader # their favorite-leader
- elapsed-since-start # local monotonic timestamp since node startup time
- favorite-leader-clock-lower-bound # A lower bound of their favorite leader clock
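The triplet can be pictured as a small record. This is a sketch: the field names simply mirror the list above, and the sample values are made up.

```python
from dataclasses import dataclass

@dataclass
class Heartbeat:
    favorite_leader: str                       # uuid of this node's favorite leader
    elapsed_since_start: float                 # local monotonic seconds since startup
    favorite_leader_clock_lower_bound: float   # lower bound of the favorite leader's clock

# Example heartbeat (all values illustrative).
hb = Heartbeat(favorite_leader="node-uuid-42",
               elapsed_since_start=12.5,
               favorite_leader_clock_lower_bound=11.9)
```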
The last point is tricky.
Each node N maintains its own monotonic clock C(N), but also maintains
a biased clock C_N(N') for each of the other nodes N'.
The clocks C_N(N') are meant to be lagging estimates of the monotonic clock of N', C(N').
In other words, we want to build C_N(N') such that it maintains the following invariant:
at all times, C_N(N') < C_N'(N').
We do that as follows.
The first time we receive the heartbeat of N', we extract its elapsed-since-start value
and we set
`C_N(N') := received-since-start-value`
**Important aside:**
Note that this initial estimate can lag quite a bit.
If the heartbeat was sent directly by N', it lags by the network latency.
If it was gossiped indirectly, it may lag by several seconds.
Between two interactions, we update this clock using our own monotonic clock, but we
advance it slower, with a bias of 1000 ppm. This accounts for the drift
of our monotonic clock AND the drift of N's monotonic clock.
Upon a new interaction, we just take the best of the two estimates:
`C_N(N') := max(C_N(N'), received-since-start-value)`.
favorite-leader-clock-lower-bound is precisely C_N(favorite_leader).
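The clock update rules above can be sketched as follows. This is a simplification; the 1000 ppm bias matches the drift budget stated earlier (500 ppm for our clock plus 500 ppm for the peer's).

```python
class LaggingClock:
    """Sketch of the lower-bound estimate C_N(N') of a peer's clock.

    Advancing the estimate 1000 ppm slower than our own monotonic clock
    keeps it a lower bound of C(N') even in the worst drift case."""

    BIAS = 1000e-6  # 1000 ppm

    def __init__(self, received_since_start_value, local_now):
        # First heartbeat received from N': initialize the estimate.
        self.estimate = received_since_start_value
        self.local_last_update = local_now

    def tick(self, local_now):
        # Between interactions: advance 1000 ppm slower than real time.
        delta = local_now - self.local_last_update
        self.estimate += delta * (1.0 - self.BIAS)
        self.local_last_update = local_now

    def observe(self, received_since_start_value, local_now):
        # New interaction: keep the best of the two estimates.
        self.tick(local_now)
        self.estimate = max(self.estimate, received_since_start_value)
```

`favorite-leader-clock-lower-bound` is then just the current `estimate` for the favorite leader.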
# A favorite leader
Let's now describe how a node decides who its favorite leader should be.
Each node is associated with an immutable uuid. Upon initialization, or
when it is time to pick a new favorite-leader (the previous one having been locally detected
as dead), we just use a bully-like strategy and simply pick the
node with the highest uuid that still seems alive.
If a node has a favorite leader that seems alive, it sticks to it
until that favorite leader is locally detected as dead.
When a node detects its current favorite leader as faulty, it does not update its favorite leader
right away. It only schedules doing so after a period P.
During that period, it stops updating `elapsed-since-start`.
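The selection and switch-delay logic above can be sketched like this (uuids are plain strings here, purely for illustration):

```python
def pick_favorite_leader(alive_uuids):
    # Bully-like strategy: the alive node with the highest uuid wins.
    return max(alive_uuids)

def earliest_switch_instant(detected_dead_at, P):
    # After detecting the favorite leader as dead, the actual switch is
    # only scheduled a full period P later; during that window the node
    # also freezes its `elapsed-since-start` heartbeat field.
    return detected_dead_at + P
```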
# Am I the leader?
With just the heartbeats of the other nodes, a node can detect whether it is the leader or
not.
Assume we are node N.
We can look at all of the nodes N' declaring ourselves (N) as their favorite leader.
We can then read the last `C_N'(N)`. That value is lagging, but we know that it is a lower bound
of `C_N(N)` at the instant when they last declared us as favorite leader.
We assume there is a period P that is preset on all nodes.
If we have a majority of nodes N' for which `[C_N(N) - C_N'(N)] * (1 + 500/1000000) < P`, then N can safely identify
itself as the leader.
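The majority check can be sketched as follows. This is a sketch: the shape of `views` (last known heartbeat fields per peer) is an assumption, not part of the gist.

```python
DRIFT = 500e-6  # 500 ppm, the assumed worst-case clock drift

def am_i_the_leader(my_uuid, my_clock, views, cluster_size, P):
    """Return True if a strict majority of the cluster recently
    declared us as their favorite leader.

    `views` maps each peer uuid to a pair
    (favorite_leader, favorite_leader_clock_lower_bound)."""
    fresh_supporters = 0
    for peer, (favorite_leader, leader_clock_lower_bound) in views.items():
        if favorite_leader != my_uuid:
            continue
        # Lower bound lag, inflated by the worst-case drift.
        lag = my_clock - leader_clock_lower_bound
        if lag * (1.0 + DRIFT) < P:
            fresh_supporters += 1
    return fresh_supporters > cluster_size // 2
```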
These conditions are sufficient to ensure that there are never two leaders in the cluster.
Why?
If there were two leaders L1 and L2, both would have a view of the cluster
satisfying the majority condition above.
That implies there exists a node N whose state, as seen by both L1 and L2, looks like this:
- L1 is the favorite-leader, at version L1:V1, with V1 emitted less than P seconds ago;
- L2 is the favorite-leader, at version L2:V2, with V2 emitted less than P seconds ago.
In other words, within `]now - P, now]`, node N declared both L1 and L2 as favorite-leader.
This is impossible, since by design a node cannot change its favorite-leader within P.
** It is possible to greatly improve the performance of this solution
by twisting scuttlebutt a bit and letting a node N tamper with the heartbeat of a node N',
replacing the `elapsed-since-start` with C_N(N'). It should help improve the overall accuracy of
C_N(N') for all (N, N').
# Choice of P
The choice of P is critical.
a) After a leader truly fails, by design, this algorithm will leave the cluster leaderless for at least P.
b) On the other hand, if P is too short, the cluster will remain leaderless forever.
How short can P be? The question is strongly related to the threshold after which the phi-accrual detection
considers a node dead. If a node switches away from its favorite leader, how long does it take for another
node to learn about the switch?
For a cluster of 100 nodes, with a gossip interval of 1 sec, we probably end up with P ~= 15s.
A failed leader would then leave the cluster leaderless for about 30s.
A simple way to bring that number down is to reduce the number of nodes taking part
in the leader election.
Such nodes would then gossip in priority with the other leader-election nodes,
and do so with a higher frequency. The majority threshold would also be considerably lower.
================================================================
Cons:
---------------------
- P is probably relatively large, leaving the cluster leaderless for several seconds.
- a practical `P` probably grows as the cluster gets larger.
Some of the cons can be solved by introducing first-class citizen nodes and second-class
citizen nodes. Only first-class citizen nodes take part in the quorum used to elect a leader.
The leader is in charge of deciding who becomes a first class citizen and who does not.
Pros:
--------------------
- easy to code
- very little configuration
- is robust to network link failures without any refinement.
- generally speaking, it is not up to a node to decide when to trigger an election:
the election is, in a sense, entirely continuous.
@Horusiath commented Dec 30, 2021:
It looks good: the base (phi-accrual detector + raw bully algorithm) is also the basis of Akka Cluster as well as Cassandra. Nonetheless, you should verify it via e.g. TLA+. My doubt would be the risk of a double leader: if two leader candidates are disconnected from each other but can see the other nodes, they could both gather the majority of timestamps needed to assume themselves leader. It's quite unlikely but seems possible. In Raft, such a scenario is prevented by the votes cast by each node in each term.

@fulmicoton (author) replied:
> If two leader candidates are disconnected from each other but can see other nodes, they could both gather majority timestamps needed to assume themselves as leaders.

I think this is strictly impossible, provided all the hypotheses above hold (monotonic clocks with a frequency divergence < 500 ppm).
But then one problem I have is that the algorithm requires all nodes to know the global number of nodes... or at least to be given a threshold M they can consider a majority.
