Skip to content

Instantly share code, notes, and snippets.

@KlausTrainer
Created September 30, 2010 20:58
Show Gist options
  • Save KlausTrainer/605314 to your computer and use it in GitHub Desktop.
Save KlausTrainer/605314 to your computer and use it in GitHub Desktop.
Consistency in BigCouch

Consistency in BigCouch

In a BigCouch cluster, there are N replicas, who can either be in the state alive or dead. The case of all replicas being alive is defined as the absence of network partitions, i.e., no message is lost. Network partitions are modeled as total failure of one or more replica. For instance, take a set of network partitions P. For each network partition Pi, all replicas that are located in a different network partition than Pi, are defined as dead.

When there's a read or write request, the request first hits one of the replicas, which is referred to as the leader. The leader starts a voting round, i.e., it forwards the request to all replicas. When the leader receives a response from a replica, it's said that the replica has voted. A response may either contain a document version, or some error code indicating e.g. that the document was not found, or the revision number of a newly created document version.

If at least R or rather W replicas agree (i.e., they gave the same vote), it's said that they've formed a quorum.

In general, it can be said that strict consistency is guaranteed if an absolute majority of replicas (i.e., N / 2 + 1) agree, otherwise only eventual consistency is guaranteed.

Write Operations

A write operation succeeds, if there's a quorum.

In the case that a write operation creates a new document version and W >= N / 2 + 1, strict consistency can be guaranteed for a subsequent read operation on that version (see below). Otherwise, only eventual consistency can be guaranteed.

Read Operations

A read operation succeeds, if there's a quorum.

In the case that R >= N / 2 + 1, strict consistency is guaranteed, otherwise only eventual consistency.

Anti-Entropy Mechanism

In the case that there's no uniform consensus, and either all replicas have voted or there's an absolute majority, a "winning" document version is determined. The winning version is the one an absolute majority of replicas have voted for, or, if there's no absolute majority, the one with the highest revision number (by ASCII order). Note that it's also possible that the majority votes that no version of the document is available. In that case, the algorithm aborts. Otherwise, the winning version is written to those replicas who don't already have it.

To summarize the algorithm:

  1. Start a voting round
  2. try to form a quorum
  3. wait until there's a quorum, or until all replicas have voted, or until there's a timeout
  4. if there's a timeout, return with a timeout error, if there's a quorum, return the version the quorum agrees upon, if all replicas have voted but haven't formed a quorum, determine the winning version and return it to the client
  5. if necessary (i.e., there's no uniform consensus), and if possible (i.e., either there's an absolute majority that votes for a version that actually exists, or all replicas have voted), write the winning version to those replicas who don't have it.
@kocolosk
Copy link

kocolosk commented Oct 1, 2010

Thanks for this precise writeup Klaus. I don't think BigCouch should delete the conflicting version during the read repair. It should take advantage of CouchDB's ability to store multiple edit branches of a document and allow the user to resolve the conflict. Eventually CouchDB may gain the ability to automatically apply certain conflict resolution strategies, but the feature is not BigCouch-specific; it would be useful for any multi-master CouchDB deployment.

This writeup defines a quorum as a majority of replicas, but in fact BigCouch has some additional flexibility. BigCouch allows a quorum to be defined on a per-request level, and it can take any value from 1 to N. Applications may use this feature to opt out of consistency in exchange for lower latencies. It's just semantics, but I think it's important not to presume that quorum == majority.

By the way, a small nit. You wrote that "A write operation succeeds, if at least R replicas commit" -- you meant W instead of R, right?

Finally, about strict consistency. I don't think strict consistency can be guaranteed in the presence of variable read quorum settings and read repair. Consider the case of N=3, W=2, R=3. Concurrent updates to the same document occur; one update achieves the quorum while the other update only succeeds on 1 replica. A subsequent read request with R=3 will initiate read repair. The read repair will have a 50% chance of selecting the version which was saved on only 1 replica as the winning revision, thus breaking strict consistency. It's an important point because one feature which is not currently in BigCouch but which I do have planned is an automatic anti-entropy mechanism based on CouchDB replication. If the shard copies replicate amongst themselves an update which was only saved on one copy will ultimately make it to all copies (though it may not be the winning revision).

@KlausTrainer
Copy link
Author

Thanks for your review, Adam.

Let me start with the nit-pick: yes, you're right. "A write operation succeeds, if at least R replicas commit" s/R/W/ -- I've just corrected it.

As of the current implementation that's in my repo, conflicting versions get only deleted in a few specific cases. It can only be done on a minority of replicas (i.e., on only one replica, if N=3), and should ensure that all replicas return the same version. However, it is not key to the anti-entropy mechanism, and you're right that it shouldn't be default behavior to automatically delete any version.

Concerning your critique regarding my assertion about strict consistency and the anti-entropy mechanism: You're right with what you describe in the last paragraph. However, the implementation is slightly different. You made the false assumption that the algorithm wasn't deterministic in which version it would choose as the winning one. Anyway, to be less mistakable, I'll try to edit the "Anti-Entropy Mechanism" paragraph to make that more clear.
  
The point is that the winning version will either be the one supported by the majority, or if there's no majority, the one with the highest revision number. In the latter case, any read operation with R < N is nondeterministic. Only if you ask all N replicas, a deterministic decision about the winning version based on the highest revision number can be made. If the cluster is inconsistent, consistency will be restored deterministically. Note that an assertion about a cluster being inconsistent can only be made if both all N replicas have voted and there's no majority. According to this, the read repair will only get into action if either there's a majority but no uniform consensus (whereupon the cluster is provable consistent), or if the cluster is provable inconsistent.

This writeup defines a quorum as a majority of replicas, but in fact BigCouch has some additional flexibility. BigCouch allows a quorum to be defined on a per-request level, and it can take any value from 1 to N. Applications may use this feature to opt out of consistency in exchange for lower latencies. It's just semantics, but I think it's important not to presume that quorum == majority.

Yeah, you're right. I've just looked up the paper about Weighted Voting for Replicated Data and verified that quorum is not necessarily equal to majority ;). Better just call it majority.
  
The biggest difference to what's described in the Dynamo paper is that the current solution I propose won't return multiple versions, which is necessary to be API-compliant with CouchDB.

  
In my current implementation, if e.g. N=5 and R=2 and two replicas return different versions, the algorithm tries to get another vote from the third replica, and so on, until N / 2 + 1 replicas agree, or every replica has voted, or until there's a timeout. Thinking about that, I see that the voting round should better stop as soon as N / 2 + 1 or R identical replies have been found.

@KlausTrainer
Copy link
Author

I've changed code and documentation according to Adam's suggestions.

See here the according commit: http://github.com/KlausTrainer/bigcouch/commit/6ee7e0be20098e6c763d
...and here fabric_doc_open.erl as its whole: http://github.com/KlausTrainer/bigcouch/blob/6ee7e0be20098e6c763d/apps/fabric/src/fabric_doc_open.erl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment