As Kyle brought up, Consul at the moment has a single known case of a potential inconsistency (Could be unknown cases lurking). Currently Consul works by electing a leader, who "leases" the position for LeaderLeaseTimeout interval. At each interval, it checks that a quorum of nodes still believes it to be the leader. At the same time, if a follower does not hear from the leader within randomInterva(HeartbeatTimeout, 2 * HeartbeatTimeout), it will start a new election.
There is a subtle case, where an existing leader is partitioned, HeartbeatTimeout is reached causing a new election and new leader. At this point there are 2 leaders. There is no split-brain as the old leader will be unable to commit any new writes. However, until the LeaderLeaseTimeout expires the old leader MAY process reads, which could be stale if the new leader has processed any writes.
As it stands, LeaderLeaseTimeout == HeartbeatTimeout. This means it is very difficult to trigger this bug, since you need a partition and some very unfortunate timing. In almost every case, the old leader has stepped down before an election is started, and even more likely before the new election is over.
Nonetheless, it is an issue that I'd like to address. I am proposing the following changes:
-
For the default reads, continue to rely on leader leasing. It has strong consistency but for this very rare caveat. For most clients, this is not an issue and is a sane default. It can be improved on however by the following:
-
Switch to monotonic clocks. This avoids clock skew from impacting timers. Makes it less likely that a follower will "jump the gun" with an early election.
-
Change defaults so that HeartbeatTimeout >> LeaderLeaseTimeout. This will make it very improbably that the heartbeat is exceeded before a leader steps down. Not impossible, just highly unlikely. It also makes it less likely that a slow IO causes a new round of elections.
-
-
Add a new "?consistent" flag. This flag requires that reads be consistent, no caveats. This is done by doing a heartbeat from the leader to a quorum of followers to check it is still the leader. This means the old leader will step down before processing the read, removing the possibility of a stale read. It does add a network round trip per read, and thus it will not be the default.
-
Add a new "?stale" flag. This flag enables the read to be serviced by any follower. This means values can be arbitrarily stale, with no bound. In the general case, this should be bounded by CommitTimeout which is 80msec. It enables horizontal read scalability. Since it requires a developer to reason about carefully, also not a default. Even more useful, this allows reads even while a quorum is unavailable. Currently no reads are possible without a leader.
I think this three prong approach makes some trade offs. The default is not completely consistent, but in the 99.999% case it is. In that extremely rare case, if it is totally unacceptable for a client (honestly they need to re-architect, or they will have a bad time) they can use the ?consistent flag. For the clients that care more about read scaling, they can use the ?stale flag as long as they understand the limitations
Unless you're very confident there are no other cases, I might phrase this as "Consul has a known case of inconsistency." It's a new system; I have a hunch you'll find other issues too. ;-)
To be honest, I'm not sure how likely it is that Consul can guarantee the absence of pauses on the order of one second. I've got JVM processes in production which routinely pause for 10 seconds, and up to minutes every few days. I've heard horror stories about Go's GC behavior as well, but it's one of those things that really depends on allocation pattern and workload. You may want to quantify this with stress tests. Might be fine, but timeouts in non-realtime languages tends to be one of those things that comes back to bite you. Ask the ElasticSearch team, haha. ;-)
That said, I think this is a good set of tradeoffs: most reads see up-to-date values, explicitly opt in to stale reads for lower latency and higher availability, and explicitly opt in for expensive reads guaranteeing linearizability. Sounds like a solid plan to me!