@armon
Last active June 28, 2021 07:29

Consul Consistency

As Kyle brought up, Consul at the moment has a single known case of a potential inconsistency (there could be unknown cases lurking). Currently Consul works by electing a leader, which "leases" the position for the LeaderLeaseTimeout interval. At each interval, it checks that a quorum of nodes still believes it to be the leader. At the same time, if a follower does not hear from the leader within randomInterval(HeartbeatTimeout, 2 * HeartbeatTimeout), it will start a new election.
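To make the mechanics concrete, here is a minimal Go sketch of this kind of leasing scheme. All names (leaderLoop, checkQuorum, startElection, etc.) and the wiring in main are hypothetical stand-ins for illustration, not Consul's actual implementation:

```go
// Minimal sketch of leader leasing with randomized follower election
// timeouts. All names here are hypothetical, not Consul's real code.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	// Mirrors the current arrangement discussed below, where
	// LeaderLeaseTimeout == HeartbeatTimeout.
	LeaderLeaseTimeout = 1000 * time.Millisecond
	HeartbeatTimeout   = 1000 * time.Millisecond
)

// leaderLoop re-verifies the lease every LeaderLeaseTimeout: if a quorum of
// followers no longer acknowledges this node, it steps down instead of
// continuing to service reads.
func leaderLoop(checkQuorum func() bool, stepDown func()) {
	ticker := time.NewTicker(LeaderLeaseTimeout)
	defer ticker.Stop()
	for range ticker.C {
		if !checkQuorum() {
			stepDown()
			return
		}
	}
}

// followerLoop starts an election if no heartbeat arrives within a random
// interval in [HeartbeatTimeout, 2*HeartbeatTimeout); the jitter reduces
// the chance of split votes.
func followerLoop(heartbeats <-chan struct{}, startElection func()) {
	for {
		timeout := HeartbeatTimeout + time.Duration(rand.Int63n(int64(HeartbeatTimeout)))
		select {
		case <-heartbeats:
			// Heard from the leader: loop and reset the randomized timer.
		case <-time.After(timeout):
			startElection()
			return
		}
	}
}

func main() {
	// Simulate a partition: the follower never receives heartbeats and the
	// leader's quorum checks start failing.
	heartbeats := make(chan struct{})
	go leaderLoop(func() bool { return false }, func() { fmt.Println("old leader: stepping down") })
	go followerLoop(heartbeats, func() { fmt.Println("follower: starting election") })
	time.Sleep(3 * time.Second)
}
```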

There is a subtle case where an existing leader is partitioned and HeartbeatTimeout is reached, causing a new election and a new leader. At this point there are 2 leaders. There is no split-brain, as the old leader will be unable to commit any new writes. However, until the LeaderLeaseTimeout expires, the old leader MAY process reads, which could be stale if the new leader has processed any writes.

As it stands, LeaderLeaseTimeout == HeartbeatTimeout. This means it is very difficult to trigger this bug, since you need a partition and some very unfortunate timing. In almost every case, the old leader has stepped down before an election is started, and it is even more likely to have stepped down before the new election is over.
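As a back-of-the-envelope illustration of why the equality leaves no safety margin, here is a tiny Go sketch with made-up numbers (these are not Consul's actual defaults):

```go
// Back-of-the-envelope view of the race window, using made-up numbers.
package main

import (
	"fmt"
	"time"
)

func main() {
	heartbeatTimeout := 1 * time.Second

	// After a partition at t=0, a follower may start an election as early
	// as t = HeartbeatTimeout, while the old leader keeps serving
	// lease-based reads until roughly t = LeaderLeaseTimeout. The margin
	// below is how much earlier the lease lapses than the earliest election.
	slack := func(lease time.Duration) time.Duration {
		return heartbeatTimeout - lease
	}

	// Today: zero margin, so the ordering comes down to unfortunate timing.
	fmt.Println("lease == heartbeat:   margin =", slack(heartbeatTimeout))
	// Proposed: a comfortable margin; the old leader steps down well before
	// any follower can even begin an election.
	fmt.Println("lease == heartbeat/2: margin =", slack(heartbeatTimeout/2))
}
```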

Nonetheless, it is an issue that I'd like to address. I am proposing the following changes:

  • For the default reads, continue to rely on leader leasing. It provides strong consistency except for this very rare caveat. For most clients this is not an issue, and it is a sane default. It can, however, be improved by the following:

    • Switch to monotonic clocks. This prevents clock skew from affecting timers, making it less likely that a follower will "jump the gun" with an early election.

    • Change the defaults so that HeartbeatTimeout >> LeaderLeaseTimeout. This makes it very improbable that the heartbeat timeout is exceeded before a leader steps down. Not impossible, just highly unlikely. It also makes it less likely that slow I/O causes a new round of elections.

  • Add a new "?consistent" flag. This flag requires that reads be consistent, no caveats. This is done by doing a heartbeat from the leader to a quorum of followers to check it is still the leader. This means the old leader will step down before processing the read, removing the possibility of a stale read. It does add a network round trip per read, and thus it will not be the default.

  • Add a new "?stale" flag. This flag enables the read to be serviced by any follower. This means values can be arbitrarily stale, with no bound. In the general case, this should be bounded by CommitTimeout which is 80msec. It enables horizontal read scalability. Since it requires a developer to reason about carefully, also not a default. Even more useful, this allows reads even while a quorum is unavailable. Currently no reads are possible without a leader.

I think this three-pronged approach makes reasonable trade-offs. The default is not completely consistent, but in the 99.999% case it is. In that extremely rare case, if stale reads are totally unacceptable for a client (honestly they need to re-architect, or they will have a bad time), they can use the ?consistent flag. Clients that care more about read scaling can use the ?stale flag, as long as they understand the limitations.

armon commented Apr 18, 2014

@aphyr Yeah, I only meant that it's the only known case. Of course there are unknown unknowns :) In terms of the pause, I agree. We try hard for low pause (one driver behind LMDB and off-heap storage). But I'm not sure the pause will affect this stale read, since the heartbeat will have expired once it unpauses, causing a step-down. But I will document this nonetheless.

armon commented Apr 18, 2014

@evanphx In my head ?stale indicates that any follower should just directly read from their local FSM to serve the request. This means the read will always be serviced, but if you are partitioned it will be more stale than otherwise.

evanphx commented Apr 18, 2014

@armon I'd suggest, in that case, that you include an additional X- HTTP header that indicates the number of milliseconds since the leader was seen. That at least gives the client the ability to know the degree to which the value is stale. You could also include another header that indicates whether or not the follower believes the cluster to be healthy.
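A sketch of what that could look like on the follower is below, assuming a hypothetical lastContact() helper and made-up header names. (Consul's HTTP API did eventually ship this idea as the X-Consul-LastContact and X-Consul-KnownLeader headers.)

```go
// Sketch: expose staleness metadata on ?stale reads. The helper names and
// header names here are made up for illustration.
package main

import (
	"fmt"
	"log"
	"net/http"
	"strconv"
	"sync/atomic"
	"time"
)

// lastHeartbeat holds the time of the last leader contact (UnixNano); a real
// implementation would update it from the Raft layer on every heartbeat.
var lastHeartbeat atomic.Int64

func lastContact() time.Duration {
	return time.Since(time.Unix(0, lastHeartbeat.Load()))
}

func handleStaleRead(w http.ResponseWriter, r *http.Request) {
	ms := lastContact().Milliseconds()
	// Milliseconds since this follower last heard from the leader: the
	// client can use this to bound how stale the value might be.
	w.Header().Set("X-Last-Contact", strconv.FormatInt(ms, 10))
	// Hypothetical health rule: healthy if we heard from a leader recently.
	w.Header().Set("X-Known-Leader", strconv.FormatBool(ms < 2000))
	fmt.Fprintln(w, "value-from-local-fsm")
}

func main() {
	lastHeartbeat.Store(time.Now().UnixNano())
	http.HandleFunc("/v1/kv/example", handleStaleRead)
	log.Fatal(http.ListenAndServe(":8500", nil))
}
```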

armon commented Apr 18, 2014

@evanphx Great suggestion. That is a good way for clients to bound staleness.

@kellabyte

As the others mentioned, I like giving users the ability to choose whether stale reads are acceptable to them.

About GC pauses: there are a lot of issues that can cause pausing. Even if you put things off heap, you're still susceptible to all kinds of other sources. The bigger the cluster, the more crazy stuff happens. When your cluster has thousands of drives, for example, disk-related pauses become much more frequent.

aphyr commented Apr 19, 2014

since the heartbeat will have expired once it unpauses, causing a step-down.

Yeah, it's a race between the timeout thread and any threads servicing requests. No good way to avoid that, to my knowledge.

In my head ?stale indicates that any follower should just directly read from their local FSM to serve the request.

I like that idea; often "some value, any value, even old" is better than nothing in service discovery.

an additional X- HTTP header that indicates the number of milliseconds since the leader was seen.

Yeah, that sounds really useful.

@matthiasg

Is Consul tested with Jepsen on a regular basis, or was this a one-time test?
