Skip to content

Instantly share code, notes, and snippets.

@stonegao
Forked from pozorvlak/crdt.md
Created September 4, 2013 10:02
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save stonegao/6435075 to your computer and use it in GitHub Desktop.
Save stonegao/6435075 to your computer and use it in GitHub Desktop.

CvRDTs are as general as they can be

What are you talking about, and why should I care?

Now that we live in the Big Data, Web 3.14159 era, lots of people want to build databases that are too big to fit on a single machine. But there's a problem in the form of the CAP theorem, which states that if your network ever partitions (a machine goes down, or part of the network loses its connection to the rest) then you can keep consistency (all machines return the same answer to queries) or availability (all machines can be read from or written to), but not both. In simple terms, you can either stop and wait until the partition heals before accepting any writes (so-called CP systems, for "consistent and partition-tolerant") or carry on as best you can, knowing that different parts of the system will drift out of sync with each other (so-called AP systems, for "available and partition tolerant"). You can also go for CA, but then your database is effectively restricted to a single machine: real computers crash, and real networks go down.

OK, but what the hell's a CRDT?

One workaround for the CAP theorem, which I believe was pioneered at Amazon but which is now widespread, is to go for "eventual consistency": maintain an AP system, but ensure that every update to a given datum is eventually propagated to every node that needs to know about it. But it turns out that this is hard to do accurately. A rather clever approach to achieving eventual consistency, invented by Baquero and Moura in the late Nineties for use in disconnected mobile systems, is to build the eventual consistency into the datatypes themselves. You define an ordering of possible states, pass around full states between nodes, and merge two states by taking their least upper bound (LUB); the properties of the LUB operation ensure that, provided the network is reliable enough to transmit every update eventually to every node at least once, then eventual consistency will be achieved. These are now known as Convergent Replicated Datatypes, or CvRDTs. Another approach, CmRDTs (for Commutative Replicated Datatypes), is based on passing around update operations and proving that concurrent updates commute; this approach requires less network traffic, since operations are typically smaller than full states, but it places stronger demands on the reliability of the network layer (updates can't be duplicated, and must arrive in an order compatible with causality). CvRDTs and CmRDTs are collectively known as CRDTs. The relevant paper here is this technical report from INRIA, which defines CvRDTs and CmRDTs, proves an equivalence theorem between them (which works in the obvious, inefficient way: the "state" of a datum is represented as an ordered sequence of updates from the initial state, and updates which have been received but which cannot yet be applied are held until their precursors are received). It proves that CvRDTs achieve eventual consistency in the presence of networks that eventually deliver every message at least once and that CmRDTs achieve eventual consistency in the presence of networks that deliver every message exactly once in an order consistent with causality, and gives a bunch of examples.

Yeah, yeah, get to the maths

For about a year now, I've been pointing anyone who'd listen at Robin Houston's excellent post On Editing Text, which sketches an approach to distributed version control based on colimits (see this recent paper for a more detailed treatment). Colimits generalise least-upper-bounds and have the same desirable uniqueness properties, but they are operation-based: in fact, least-upper-bounds are the special case of colimits in which the only allowable operation from state a to state b is "assert that a ≤ b". So, I thought, might there be a third class of CRDTs based on colimits, with the network-traffic needs of CmRDTs and the network-reliability needs of CvRDTs?

As you will have gathered from the title of this post, the answer is "alas, no". I hereby present a proof. We will model a node in our database as a semiautomaton, ie as a device that receives a sequence of messages from other nodes and applies each one to its current state to determine its new state. Let Q be the set of states, q0 be the initial state, M be the set of legal messages, and ev : List M -> Q be the evaluation function, ie ev ms is the state obtained by applying the sequence of messages ms to q0. I'll use Haskell notation throughout, in which f x means "the function f applied to x", square brackets are used for lists and ++ denotes list concatenation. Since ev is the evaluation function of a semiautomaton, it must have the following property:

  • amnesia: ev (xs ++ ys) = ev (zs ++ ys) if ev xs = ev zs.

Intuitively, the amnesia condition says that behaviour only depends on the current state, not the path taken to reach it.

In order to achieve eventual consistency on a network that can arbitrarily duplicate and reorder messages, the semiautomaton (Q, q0, M, ev) must satisfy the following conditions:

  • commutativity: ev [x, y] = ev [y, x] (since the network may reorder messages)
  • idempotence: ev [x, x] = ev [x] (since the network may duplicate messages)

We define a monoid structure on Q', the subset of Q reachable from q0, as follows:

  • q0 is the identity element
  • (ev xs) * (ev ys) = ev (xs ++ ys).

Lemma 1 (Q', *, q0) is a well-defined monoid.
Proof suppose ev xs = ev xs' and ev ys = ev ys'. We must show that (ev xs) * (ev ys) = (ev xs') * (ev ys').

 (ev xs') * (ev ys')
     = ev (xs' ++ ys') 
     = ev (xs ++ ys')     -- by amnesia
     = ev (ys' ++ xs)     -- by commutativity
     = ev (ys ++ xs)      -- by amnesia
     = ev (xs ++ ys)      -- by commutativity
     = (ev xs) * (ev ys)

Hence * is well-defined. It inherits associativity from ++, and q0 is an identity for * since q0 = ev [] and [] is an identity for ++. So (Q', *, q0) is a monoid. QED.

Note that this is not the same as the transition monoid commonly encountered in the theory of semiautomata, though no doubt the construction above is standard and related to the transition monoid in some way I wasn't able to find out in twenty minutes on Wikipedia. Henceforth I'll abuse notation in the conventional way and just call our monoid Q'.

We now need to prove a couple of simple results about Q' to set it up for what is to come.

Lemma 2 Q' is idempotent and commutative.
Proof Let x = ev xs and y = ev ys be elements of Q'. We must show that x*x = x and x*y = y*x.

x*x
    = ev (xs ++ xs)
    = ev xs     -- by idempotence and commutativity of ev; exercise)
    = x

x* y
    = ev (xs ++ ys)
    = ev (ys ++ xs)
    = y * x

QED.

Now for the meat of the proof.

Lemma 3 Let (A, +) be an idempotent commutative semigroup. For all a, b in A, say that a ≤ b iff a + b = b. Then (A,≤) is a partially ordered set, and a + b is the least upper bound of {a, b} with respect to ≤.
Proof To show that (A, ≤) is a partially ordered set, we must show that ≤ is reflexive and transitive.
Reflexivity: Let a be an element of A. Then a + a = a by idempotence, so a ≤ a as required.
Transitivity: Let a, b and c be elements of A such that a ≤ b and b ≤ c. We must show that a ≤ c.

a + c
    = a + (b + c)
    = (a + b) + c
    = b + c
    = c.

So a ≤ c, as required. Hence (A, ≤) is a poset, as required.

a + b is an upper bound of {a, b}: let c = a + b. Then

a + c
    = a + (a + b)
    = (a + a) + b
    = a + b
    = c

so a ≤ c, and similarly

b + c
    = b + (a + b)
    = b + (b + a)
    = (b + b) + a
    = b + a
    = a + b
    = c

so c is an upper bound for {a, b}.

a + b is a least upper bound for {a, b}: Let c = a + b as before, and let d be another upper bound for {a, b}. We must show that c ≤ d.

c + d
    = (a + b) + d
    = a + (b + d)
    = a + d
    = d

so c ≤ d, as required. QED.

Corollary 4 Every idempotent commutative semigroup is a semilattice.

Corollary 4 is almost certainly a standard exercise in order theory textbooks, but it was easier for me to prove the result than it was to find a reference.

Putting together Lemma 1 and 2, we see that (Q', q0, *) is an idempotent commutative monoid, and hence (Q', *) is an idempotent commutative semigroup. Hence by Corollary 4, Q' is a semilattice. In the terminology of the INRIA report, therefore, Q' is a CvRDT.

So we may conclude that every CRDT that can achieve eventual consistency over a network that can arbitrarily reorder and duplicate messages is a CvRDT. This is the converse of Proposition 2.1 from the INRIA report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment