
@armon /consul.md
Last active Jul 12, 2016


Consul Consistency

As Kyle brought up, Consul at the moment has a single known case of potential inconsistency (there could be unknown cases lurking). Currently Consul works by electing a leader, who "leases" the position for the LeaderLeaseTimeout interval. At each interval, it checks that a quorum of nodes still believes it to be the leader. At the same time, if a follower does not hear from the leader within randomInterval(HeartbeatTimeout, 2 * HeartbeatTimeout), it will start a new election.

There is a subtle case where an existing leader is partitioned and HeartbeatTimeout is reached, causing a new election and a new leader. At this point there are 2 leaders. There is no split-brain, as the old leader will be unable to commit any new writes. However, until the LeaderLeaseTimeout expires, the old leader MAY process reads, which could be stale if the new leader has processed any writes.

As it stands, LeaderLeaseTimeout == HeartbeatTimeout. This means it is very difficult to trigger this bug, since you need a partition and some very unfortunate timing. In almost every case, the old leader has stepped down before an election is started, and even more likely before the new election is over.

Nonetheless, it is an issue that I'd like to address. I am proposing the following changes:

  • For the default reads, continue to rely on leader leasing. It has strong consistency except for this very rare caveat. For most clients this is not an issue, and it is a sane default. It can be improved, however, by the following:

    • Switch to monotonic clocks. This avoids clock skew from impacting timers. Makes it less likely that a follower will "jump the gun" with an early election.

    • Change defaults so that HeartbeatTimeout >> LeaderLeaseTimeout. This will make it very improbable that the heartbeat is exceeded before a leader steps down. Not impossible, just highly unlikely. It also makes it less likely that slow I/O causes a new round of elections.

  • Add a new "?consistent" flag. This flag requires that reads be consistent, no caveats. This is done by doing a heartbeat from the leader to a quorum of followers to check it is still the leader. This means the old leader will step down before processing the read, removing the possibility of a stale read. It does add a network round trip per read, and thus it will not be the default.

  • Add a new "?stale" flag. This flag enables the read to be serviced by any follower. This means values can be arbitrarily stale, with no bound. In the general case, this should be bounded by CommitTimeout, which is 80msec. It enables horizontal read scalability. Since it requires a developer to reason about carefully, it is also not a default. Even more usefully, this allows reads even while a quorum is unavailable. Currently no reads are possible without a leader.

I think this three-pronged approach makes reasonable trade-offs. The default is not completely consistent, but in the 99.999% case it is. In that extremely rare case, if stale data is totally unacceptable for a client (honestly, they need to re-architect, or they will have a bad time), they can use the ?consistent flag. Clients that care more about read scaling can use the ?stale flag, as long as they understand the limitations.

@aphyr
aphyr commented Apr 18, 2014

Consul has a single (known) case of a potential inconsistency.

Unless you're very confident there are no other cases, I might phrase this as "Consul has a known case of inconsistency." It's a new system; I have a hunch you'll find other issues too. ;-)

99.999% case it is

To be honest, I'm not sure how likely it is that Consul can guarantee the absence of pauses on the order of one second. I've got JVM processes in production which routinely pause for 10 seconds, and up to minutes every few days. I've heard horror stories about Go's GC behavior as well, but it's one of those things that really depends on allocation pattern and workload. You may want to quantify this with stress tests. Might be fine, but timeouts in non-realtime languages tend to be one of those things that comes back to bite you. Ask the ElasticSearch team, haha. ;-)

That said, I think this is a good set of tradeoffs: most reads see up-to-date values, explicitly opt in to stale reads for lower latency and higher availability, and explicitly opt in for expensive reads guaranteeing linearizability. Sounds like a solid plan to me!

@evanphx
evanphx commented Apr 18, 2014

I agree, it sounds like a good toolkit of options for users to decide what degree of consistency they want. I like the ?stale option because it gives people an out if they're doing something that requires low latency and they can tolerate a more stale value. One question about ?stale: if a partition happens and a follower is in the minority half, will ?stale still work? Or will it know that it is part of a leader-less group and reject all read requests?

@armon
Owner
armon commented Apr 18, 2014

@aphyr Yeah, I only meant that it's the only known case. Of course there are unknown unknowns :) In terms of the pause, I agree. We try hard for low pause (one driver behind LMDB and off-heap storage). But I'm not sure the pause will affect this stale read, since the heartbeat will have expired once it unpauses, causing a step down. But I will document this nonetheless.

@armon
Owner
armon commented Apr 18, 2014

@evanphx In my head ?stale indicates that any follower should just directly read from their local FSM to serve the request. This means the read will always be serviced, but if you are partitioned it will be more stale than otherwise.

@evanphx
evanphx commented Apr 18, 2014

@armon I'd suggest, in that case, that you include an additional X- HTTP header that indicates the number of milliseconds since the leader was seen. That at least gives the client the ability to know the degree that the value is stale. You could also include another header that indicates whether or not the follower believes the cluster to be healthy.

@armon
Owner
armon commented Apr 18, 2014

@evanphx Great suggestion. That is a good way for clients to bound staleness.

@kellabyte

Like the others mentioned, I like the ability for users to choose whether stale reads are acceptable to them.

About GC pauses, there are a lot of issues that can cause pausing. Even if you put things off heap, you're still susceptible to all kinds of other sources. The bigger the cluster, the more crazy stuff that happens. When your cluster has thousands of drives for example, disk related pauses are something that becomes much more frequent.

@aphyr
aphyr commented Apr 19, 2014

since the heartbeat will have expired once it unpauses causing a step down.

Yeah, it's a race between the timeout thread and any threads servicing requests. No good way to avoid that, to my knowledge.

In my head ?stale indicates that any follower should just directly read from their local FSM to serve the request.

I like that idea; often "some value, any value, even old" is better than nothing in service discovery.

an additional X- HTTP header that indicates the number of milliseconds since the leader was seen.

Yeah, that sounds really useful.

@matthiasg

Is Consul tested with Jepsen on a regular basis, or was this a one-time test?
