
@armon /consul.md
Last active Jul 12, 2016


Consul Consistency

As Kyle brought up, Consul at the moment has a single known case of potential inconsistency (there could be unknown cases lurking). Currently Consul works by electing a leader, who "leases" the position for the LeaderLeaseTimeout interval. At each interval, it checks that a quorum of nodes still believes it to be the leader. At the same time, if a follower does not hear from the leader within randomInterval(HeartbeatTimeout, 2 * HeartbeatTimeout), it will start a new election.

There is a subtle case where an existing leader is partitioned and HeartbeatTimeout is reached, causing a new election and a new leader. At this point there are 2 leaders. There is no split-brain, as the old leader will be unable to commit any new writes. However, until the LeaderLeaseTimeout expires, the old leader MAY process reads, which could be stale if the new leader has processed any writes.

As it stands, LeaderLeaseTimeout == HeartbeatTimeout. This means it is very difficult to trigger this bug, since you need a partition and some very unfortunate timing. In almost every case, the old leader has stepped down before an election is started, and even more likely before the new election is over.

Nonetheless, it is an issue that I'd like to address. I am proposing the following changes:

  • For the default reads, continue to rely on leader leasing. It has strong consistency except for this very rare caveat. For most clients this is not an issue, and it is a sane default. It can be improved, however, by the following:

    • Switch to monotonic clocks. This avoids clock skew from impacting timers. Makes it less likely that a follower will "jump the gun" with an early election.

    • Change defaults so that HeartbeatTimeout >> LeaderLeaseTimeout. This will make it very improbable that the heartbeat is exceeded before a leader steps down. Not impossible, just highly unlikely. It also makes it less likely that slow I/O causes a new round of elections.

  • Add a new "?consistent" flag. This flag requires that reads be consistent, no caveats. This is done by doing a heartbeat from the leader to a quorum of followers to check it is still the leader. This means the old leader will step down before processing the read, removing the possibility of a stale read. It does add a network round trip per read, and thus it will not be the default.

  • Add a new "?stale" flag. This flag enables the read to be serviced by any follower. This means values can be arbitrarily stale, with no bound. In the general case, this should be bounded by CommitTimeout, which is 80msec. It enables horizontal read scalability. Since it requires a developer to reason about carefully, it is also not a default. Even more usefully, this allows reads even while a quorum is unavailable. Currently no reads are possible without a leader.

I think this three-pronged approach makes reasonable trade-offs. The default is not completely consistent, but in the 99.999% case it is. In that extremely rare case, if stale data is totally unacceptable for a client (honestly, they need to re-architect, or they will have a bad time), they can use the ?consistent flag. Clients that care more about read scaling can use the ?stale flag, as long as they understand the limitations.

@aphyr
aphyr commented Apr 18, 2014

Consul has a single (known) case of a potential inconsistency.

Unless you're very confident there are no other cases, I might phrase this as "Consul has a known case of inconsistency." It's a new system; I have a hunch you'll find other issues too. ;-)

99.999% case it is

To be honest, I'm not sure how likely it is that Consul can guarantee the absence of pauses on the order of one second. I've got JVM processes in production which routinely pause for 10 seconds, and up to minutes every few days. I've heard horror stories about Go's GC behavior as well, but it's one of those things that really depends on allocation pattern and workload. You may want to quantify this with stress tests. Might be fine, but timeouts in non-realtime languages tend to be one of those things that comes back to bite you. Ask the ElasticSearch team, haha. ;-)

That said, I think this is a good set of tradeoffs: most reads see up-to-date values, explicitly opt in to stale reads for lower latency and higher availability, and explicitly opt in for expensive reads guaranteeing linearizability. Sounds like a solid plan to me!

@evanphx
evanphx commented Apr 18, 2014

I agree, it sounds like a good toolkit of options for users to decide what degree of consistency they want. I like the ?stale option because it gives people an out if they're doing something that requires low latency and they can tolerate a more stale value. One question about ?stale: if a partition happens and a follower is in the minority half, will ?stale still work? Or will it know that it is part of a leader-less group and reject all read requests?

@armon
Owner
armon commented Apr 18, 2014

@aphyr Yeah, I only meant that it's the only known case. Of course there are unknown unknowns :) In terms of the pause, I agree. We try hard for low pause (one driver behind LMDB and off-heap storage). But I'm not sure the pause will affect this stale read, since the heartbeat will have expired once it unpauses, causing a step down. But I will document this nonetheless.

@armon
Owner
armon commented Apr 18, 2014

@evanphx In my head ?stale indicates that any follower should just directly read from their local FSM to serve the request. This means the read will always be serviced, but if you are partitioned it will be more stale than otherwise.

@evanphx
evanphx commented Apr 18, 2014

@armon I'd suggest, in that case, that you include an additional X- HTTP header that indicates the number of milliseconds since the leader was seen. That at least gives the client the ability to know the degree that the value is stale. You could also include another header that indicates whether or not the follower believes the cluster to be healthy.

@armon
Owner
armon commented Apr 18, 2014

@evanphx Great suggestion. That is a good way for clients to bound staleness.

@kellabyte

Like the others mentioned, I like the ability for users to choose whether stale reads are acceptable to them.

About GC pauses, there are a lot of issues that can cause pausing. Even if you put things off heap, you're still susceptible to all kinds of other sources. The bigger the cluster, the more crazy stuff that happens. When your cluster has thousands of drives for example, disk related pauses are something that becomes much more frequent.

@aphyr
aphyr commented Apr 19, 2014

since the heartbeat will have expired once it unpauses causing a step down.

Yeah, it's a race between the timeout thread and any threads servicing requests. No good way to avoid that, to my knowledge.

In my head ?stale indicates that any follower should just directly read from their local FSM to serve the request.

I like that idea; often "some value, any value, even old" is better than nothing in service discovery.

an additional X- HTTP header that indicates the number of milliseconds since the leader was seen.

Yeah, that sounds really useful.

@matthiasg

Is Consul tested with Jepsen on a regular basis, or was this a one-time test?
