Skip to content

Instantly share code, notes, and snippets.

@KlausTrainer
Created September 30, 2010 20:58
Show Gist options
  • Save KlausTrainer/605314 to your computer and use it in GitHub Desktop.
Save KlausTrainer/605314 to your computer and use it in GitHub Desktop.
Consistency in BigCouch

Consistency in BigCouch

In a BigCouch cluster, there are N replicas, who can either be in the state alive or dead. The case of all replicas being alive is defined as the absence of network partitions, i.e., no message is lost. Network partitions are modeled as total failure of one or more replica. For instance, take a set of network partitions P. For each network partition Pi, all replicas that are located in a different network partition than Pi, are defined as dead.

When there's a read or write request, the request first hits one of the replicas, which is referred to as the leader. The leader starts a voting round, i.e., it forwards the request to all replicas. When the leader receives a response from a replica, it's said that the replica has voted. A response may either contain a document version, or some error code indicating e.g. that the document was not found, or the revision number of a newly created document version.

If at least R or rather W replicas agree (i.e., they gave the same vote), it's said that they've formed a quorum.

In general, it can be said that strict consistency is guaranteed if an absolute majority of replicas (i.e., N / 2 + 1) agree, otherwise only eventual consistency is guaranteed.

Write Operations

A write operation succeeds, if there's a quorum.

In the case that a write operation creates a new document version and W >= N / 2 + 1, strict consistency can be guaranteed for a subsequent read operation on that version (see below). Otherwise, only eventual consistency can be guaranteed.

Read Operations

A read operation succeeds, if there's a quorum.

In the case that R >= N / 2 + 1, strict consistency is guaranteed, otherwise only eventual consistency.

Anti-Entropy Mechanism

In the case that there's no uniform consensus, and either all replicas have voted or there's an absolute majority, a "winning" document version is determined. The winning version is the one an absolute majority of replicas have voted for, or, if there's no absolute majority, the one with the highest revision number (by ASCII order). Note that it's also possible that the majority votes that no version of the document is available. In that case, the algorithm aborts. Otherwise, the winning version is written to those replicas who don't already have it.

To summarize the algorithm:

  1. Start a voting round
  2. try to form a quorum
  3. wait until there's a quorum, or until all replicas have voted, or until there's a timeout
  4. if there's a timeout, return with a timeout error, if there's a quorum, return the version the quorum agrees upon, if all replicas have voted but haven't formed a quorum, determine the winning version and return it to the client
  5. if necessary (i.e., there's no uniform consensus), and if possible (i.e., either there's an absolute majority that votes for a version that actually exists, or all replicas have voted), write the winning version to those replicas who don't have it.
@KlausTrainer
Copy link
Author

I've changed code and documentation according to Adam's suggestions.

See here the according commit: http://github.com/KlausTrainer/bigcouch/commit/6ee7e0be20098e6c763d
...and here fabric_doc_open.erl as its whole: http://github.com/KlausTrainer/bigcouch/blob/6ee7e0be20098e6c763d/apps/fabric/src/fabric_doc_open.erl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment