In a BigCouch cluster, there are N replicas, each of which is either alive or dead. The case of all replicas being alive is defined as the absence of network partitions, i.e., no message is lost. A network partition is modeled as the total failure of one or more replicas: given a set of network partitions P, for each partition Pi all replicas located in a partition other than Pi are defined as dead.
When a read or write request arrives, it first hits one of the replicas, which is referred to as the leader. The leader starts a voting round, i.e., it forwards the request to all replicas. When the leader receives a response from a replica, that replica is said to have voted. A response may contain a document version, an error code indicating e.g. that the document was not found, or the revision number of a newly created document version.
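The voting round can be sketched roughly as follows. This is a simplified illustration, not BigCouch's actual implementation; the `Replica` class and `voting_round` helper are hypothetical names, and replicas here respond synchronously.

```python
# Minimal sketch of a voting round: the leader forwards the request to
# all replicas and tallies their votes. Names are illustrative only.
from collections import Counter

class Replica:
    """Toy replica that always answers with its stored revision."""
    def __init__(self, rev):
        self.rev = rev

    def respond(self, request):
        return self.rev  # could also be an error code in a fuller model

def voting_round(replicas, quorum_size, request):
    """Collect votes until quorum_size replicas agree, or all have voted."""
    votes = Counter()
    for replica in replicas:
        vote = replica.respond(request)  # a version, revision, or error
        votes[vote] += 1
        if votes[vote] >= quorum_size:
            return vote  # a quorum has formed on this vote
    return None  # all replicas voted, but no quorum formed

replicas = [Replica("2-b"), Replica("2-b"), Replica("1-a")]
print(voting_round(replicas, quorum_size=2, request="GET doc"))  # "2-b"
```

With `quorum_size=3` the same replicas never agree unanimously, so the round ends with all votes cast and no quorum.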
If at least R (for a read) or W (for a write) replicas agree, i.e., they gave the same vote, they are said to have formed a quorum.
In general, strict consistency is guaranteed if an absolute majority of replicas (i.e., N / 2 + 1) agree; otherwise only eventual consistency is guaranteed.
A write operation succeeds if there's a quorum. If the write creates a new document version and W >= N / 2 + 1, strict consistency can be guaranteed for a subsequent read operation on that version (see below). Otherwise, only eventual consistency can be guaranteed.
A read operation succeeds if there's a quorum. If R >= N / 2 + 1, strict consistency is guaranteed; otherwise only eventual consistency.
In the case that there's no uniform consensus, and either all replicas have voted or there's an absolute majority, a "winning" document version is determined. The winning version is the one an absolute majority of replicas have voted for or, if there's no absolute majority, the one with the highest revision number (by ASCII order). Note that the majority may also vote that no version of the document is available; in that case, the algorithm aborts. Otherwise, the winning version is written to those replicas that don't already have it.
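The winning-version rule can be sketched like this. The function and the `NOT_FOUND` sentinel are illustrative assumptions, not BigCouch code; revisions are modeled as plain strings so that "highest by ASCII order" is simply `max()`.

```python
# Sketch of the winning-version rule: absolute majority first, then
# highest revision number by ASCII order. Names are illustrative only.
from collections import Counter

NOT_FOUND = object()  # a "no version available" vote

def winning_version(votes, n):
    """votes: one entry per replica that voted, either a revision string
    or NOT_FOUND. Returns the winning revision, or None on abort."""
    tally = Counter(votes)
    rev, count = tally.most_common(1)[0]
    if count >= n // 2 + 1:
        # An absolute majority agrees; if it agrees the document is
        # unavailable, the algorithm aborts (modeled here as None).
        return None if rev is NOT_FOUND else rev
    # No absolute majority: the highest revision (ASCII order) wins.
    revisions = [v for v in votes if v is not NOT_FOUND]
    return max(revisions) if revisions else None

print(winning_version(["2-b", "2-b", "1-a"], n=3))  # "2-b" (majority)
print(winning_version(["2-a", "2-b", "1-c"], n=3))  # "2-b" (highest by ASCII)
```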
To summarize the algorithm:
- Start a voting round.
- Try to form a quorum.
- Wait until there's a quorum, until all replicas have voted, or until there's a timeout.
- If there's a timeout, return a timeout error; if there's a quorum, return the version the quorum agrees upon; if all replicas have voted but haven't formed a quorum, determine the winning version and return it to the client.
- If necessary (i.e., there's no uniform consensus), and if possible (i.e., either an absolute majority votes for a version that actually exists, or all replicas have voted), write the winning version to those replicas that don't have it.
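The steps above can be sketched end to end for the read path, under simplifying assumptions: synchronous replicas, no timeouts, and revisions as plain strings. All names are illustrative, not BigCouch's actual API.

```python
# End-to-end sketch of the summarized read algorithm, including read
# repair. Simplified: synchronous replicas, no timeout handling.
from collections import Counter

class Replica:
    """Toy replica holding a single revision string."""
    def __init__(self, rev):
        self.rev = rev

def handle_read(replicas, r):
    n = len(replicas)
    votes = []
    tally = Counter()
    # Voting round: collect votes until a quorum forms or all have voted.
    for replica in replicas:
        vote = replica.rev
        votes.append(vote)
        tally[vote] += 1
        if tally[vote] >= r:
            return vote  # quorum: return the agreed-upon version
    # All replicas voted without forming a quorum: determine the winner.
    rev, count = tally.most_common(1)[0]
    winner = rev if count >= n // 2 + 1 else max(votes)  # ASCII order
    # Read repair: write the winner to replicas that don't have it.
    for replica in replicas:
        if replica.rev != winner:
            replica.rev = winner
    return winner

replicas = [Replica("2-a"), Replica("2-b"), Replica("1-c")]
print(handle_read(replicas, r=2))  # "2-b": no quorum, highest revision wins
```

After the call, all three replicas hold `"2-b"`, because the read repair wrote the winning version to the two replicas that lacked it.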
Thanks for this precise writeup, Klaus. I don't think BigCouch should delete the conflicting version during the read repair. It should take advantage of CouchDB's ability to store multiple edit branches of a document and allow the user to resolve the conflict. Eventually CouchDB may gain the ability to automatically apply certain conflict resolution strategies, but the feature is not BigCouch-specific; it would be useful for any multi-master CouchDB deployment.
This writeup defines a quorum as a majority of replicas, but in fact BigCouch has some additional flexibility. BigCouch allows a quorum to be defined on a per-request level, and it can take any value from 1 to N. Applications may use this feature to opt out of consistency in exchange for lower latencies. It's just semantics, but I think it's important not to presume that quorum == majority.
By the way, a small nit. You wrote that "A write operation succeeds, if at least R replicas commit" -- you meant W instead of R, right?

Finally, about strict consistency. I don't think strict consistency can be guaranteed in the presence of variable read quorum settings and read repair. Consider the case of N=3, W=2, R=3. Concurrent updates to the same document occur; one update achieves the quorum while the other update only succeeds on 1 replica. A subsequent read request with R=3 will initiate read repair. The read repair will have a 50% chance of selecting the version which was saved on only 1 replica as the winning revision, thus breaking strict consistency. It's an important point because one feature which is not currently in BigCouch but which I do have planned is an automatic anti-entropy mechanism based on CouchDB replication. If the shard copies replicate amongst themselves, an update which was only saved on one copy will ultimately make it to all copies (though it may not be the winning revision).
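The 50% figure in this scenario can be illustrated with a small sketch, assuming CouchDB-style revision IDs of the form `<generation>-<hash>` and a deterministic tie-break: higher generation wins, and equal generations fall back to comparing the hash suffix, which is content-derived and effectively random between the two concurrent updates. The helper name is hypothetical.

```python
# Sketch of the N=3, W=2, R=3 scenario: both concurrent updates produce
# a generation-2 revision, so the (effectively random) hash suffix
# decides the winner. The comparison rule here is a simplified model.
def couch_winner(rev_a, rev_b):
    """Higher generation wins; equal generations fall back to ASCII order."""
    gen = lambda rev: int(rev.split("-", 1)[0])
    if gen(rev_a) != gen(rev_b):
        return rev_a if gen(rev_a) > gen(rev_b) else rev_b
    return max(rev_a, rev_b)

quorum_rev   = "2-aaa"  # the update that achieved W=2
minority_rev = "2-zzz"  # the update saved on only 1 replica
print(couch_winner(quorum_rev, minority_rev))  # "2-zzz": the minority wins here
```

Since the hash suffixes are independent of how many replicas hold each revision, the minority revision wins the tie-break about half the time.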