As Kyle brought up, Consul at the moment has a single known case of a potential inconsistency (Could be unknown cases lurking). Currently Consul works by electing a leader, who "leases" the position for LeaderLeaseTimeout interval. At each interval, it checks that a quorum of nodes still believes it to be the leader. At the same time, if a follower does not hear from the leader within randomInterva(HeartbeatTimeout, 2 * HeartbeatTimeout), it will start a new election.
There is a subtle case, where an existing leader is partitioned, HeartbeatTimeout is reached causing a new election and new leader. At this point there are 2 leaders. There is no split-brain as the old leader will be unable to commit any new writes. However, until the LeaderLeaseTimeout expires the old leader MAY process reads, which could be stale if the new leader has processed any writes.
As it stands, LeaderLeaseTimeout == HeartbeatTimeout. This means it is very difficult to trigger this bug, since you need a partition and some very unfortunate timing. In almost every case, the old leader has stepped down before an election is started, and even more likely before the new election is over.
Nonetheless, it is an issue that I'd like to address. I am proposing the following changes:
-
For the default reads, continue to rely on leader leasing. It has strong consistency but for this very rare caveat. For most clients, this is not an issue and is a sane default. It can be improved on however by the following:
-
Switch to monotonic clocks. This avoids clock skew from impacting timers. Makes it less likely that a follower will "jump the gun" with an early election.
-
Change defaults so that HeartbeatTimeout >> LeaderLeaseTimeout. This will make it very improbably that the heartbeat is exceeded before a leader steps down. Not impossible, just highly unlikely. It also makes it less likely that a slow IO causes a new round of elections.
-
-
Add a new "?consistent" flag. This flag requires that reads be consistent, no caveats. This is done by doing a heartbeat from the leader to a quorum of followers to check it is still the leader. This means the old leader will step down before processing the read, removing the possibility of a stale read. It does add a network round trip per read, and thus it will not be the default.
-
Add a new "?stale" flag. This flag enables the read to be serviced by any follower. This means values can be arbitrarily stale, with no bound. In the general case, this should be bounded by CommitTimeout which is 80msec. It enables horizontal read scalability. Since it requires a developer to reason about carefully, also not a default. Even more useful, this allows reads even while a quorum is unavailable. Currently no reads are possible without a leader.
I think this three prong approach makes some trade offs. The default is not completely consistent, but in the 99.999% case it is. In that extremely rare case, if it is totally unacceptable for a client (honestly they need to re-architect, or they will have a bad time) they can use the ?consistent flag. For the clients that care more about read scaling, they can use the ?stale flag as long as they understand the limitations
@aphyr Yeah I only meant that its the only known case. Of course there are unknown unknowns :) In terms of the pause, I agree. We try hard for low pause (one driver behind LMDB and off-heap storage). But not sure if the pause will affect this stale read, since the heartbeat will have expired once it unpauses causing a step down. But I will document this nonetheless.