stonegao/crdt.md

## crdt.md

      
    Raw
  

              crdt.md
            
          
    CvRDTs are as general as they can be

What are you talking about, and why should I care?

Now that we live in the Big Data, Web 3.14159 era, lots of people want to build
databases that are too big to fit on a single machine. But there's a problem in
the form of the CAP theorem, which states that if your network ever
partitions (a machine goes down, or part of the network loses its connection to
the rest) then you can keep consistency (all machines return the same answer to
queries) or availability (all machines can be read from or written to), but not
both. In simple terms, you can either stop and wait until the partition heals
before accepting any writes (so-called CP systems, for "consistent and
partition-tolerant") or carry on as best you can, knowing that different parts
of the system will drift out of sync with each other (so-called AP systems, for
"available and partition tolerant"). You can also go for CA, but then your
database is effectively restricted to a single machine: real computers crash,
and real networks go down.
OK, but what the hell's a CRDT?

One workaround for the CAP theorem, which I believe was pioneered at Amazon but
which is now widespread, is to go for "eventual consistency": maintain an AP
system, but ensure that every update to a given datum is eventually propagated
to every node that needs to know about it. But it turns out that this is hard to
do accurately. A rather clever approach to achieving eventual consistency,
invented by Baquero and Moura in the late Nineties for use in disconnected
mobile systems, is to build the eventual consistency into the datatypes
themselves. You define an ordering of possible states, pass around full states
between nodes, and merge two states by taking their least upper bound (LUB); the
properties of the LUB operation ensure that, provided the network is reliable
enough to transmit every update eventually to every node at least once, then
eventual consistency will be achieved. These are now known as Convergent
Replicated Datatypes, or CvRDTs.  Another approach, CmRDTs (for Commutative
Replicated Datatypes), is based on passing around update operations and proving
that concurrent updates commute; this approach requires less network traffic,
since operations are typically smaller than full states, but it places
stronger demands on the reliability of the network layer (updates can't be
duplicated, and must arrive in an order compatible with causality). CvRDTs and
CmRDTs are collectively known as CRDTs. The relevant paper here is this
technical report from
INRIA, which defines
CvRDTs and CmRDTs, proves an equivalence theorem between them (which works in
the obvious, inefficient way: the "state" of a datum is represented as an
ordered sequence of updates from the initial state, and updates which have been
received but which cannot yet be applied are held until their precursors are
received). It proves that CvRDTs achieve eventual consistency in the presence of
networks that eventually deliver every message at least once and that CmRDTs
achieve eventual consistency in the presence of networks that deliver every
message exactly once in an order consistent with causality, and gives a bunch of
examples.
Yeah, yeah, get to the maths

For about a year now, I've been pointing anyone who'd listen at Robin Houston's
excellent post On Editing
Text, which sketches
an approach to distributed version control based on colimits (see this recent
paper for a
more detailed treatment). Colimits generalise least-upper-bounds and have the
same desirable uniqueness properties, but they are operation-based: in fact,
least-upper-bounds are the special case of colimits in which the only allowable
operation from state a to state b is "assert that a ≤ b". So, I thought,
might there be a third class of CRDTs based on colimits, with the
network-traffic needs of CmRDTs and the network-reliability needs of CvRDTs?
As you will have gathered from the title of this post, the answer is "alas,
no". I hereby present a proof. We will model a node in our database as a semiautomaton, ie as a device
that receives a sequence of messages from other nodes and applies each one to
its current state to determine its new state. Let Q be the set of states, q0 be
the initial state, M be the set of legal messages, and ev : List M -> Q be the
evaluation function, ie ev ms is the state obtained by applying the
sequence of messages ms to q0. I'll use Haskell notation throughout, in which
f x means "the function f applied to x", square brackets are used for lists
and ++ denotes list concatenation. Since ev is the evaluation function of a
semiautomaton, it must have the following property:

amnesia: ev (xs ++ ys) = ev (zs ++ ys) if ev xs = ev zs.

Intuitively, the amnesia condition says that behaviour only depends on the
current state, not the path taken to reach it.
In order to achieve eventual consistency on a network that can arbitrarily duplicate and reorder messages, the semiautomaton (Q, q0, M, ev) must satisfy the following conditions:

commutativity: ev [x, y] = ev [y, x] (since the network may reorder messages)
idempotence: ev [x, x] = ev [x] (since the network may duplicate messages)

We define a monoid structure on Q', the subset of Q reachable from q0, as
follows:

q0 is the identity element
(ev xs) * (ev ys) = ev (xs ++ ys).

Lemma 1 (Q', *, q0) is a well-defined monoid.

Proof suppose ev xs = ev xs' and ev ys = ev ys'. We must show that (ev xs) * (ev ys) = (ev xs') * (ev ys').
 (ev xs') * (ev ys')
     = ev (xs' ++ ys') 
     = ev (xs ++ ys')     -- by amnesia
     = ev (ys' ++ xs)     -- by commutativity
     = ev (ys ++ xs)      -- by amnesia
     = ev (xs ++ ys)      -- by commutativity
     = (ev xs) * (ev ys)

Hence *  is well-defined. It inherits associativity from ++, and q0 is an identity for * since q0 = ev [] and [] is an identity for ++. So (Q', *, q0) is a monoid. QED.
Note that this is not the same as the transition monoid commonly encountered in the theory of semiautomata, though no doubt the construction above is standard and related to the transition monoid in some way I wasn't able to find out in twenty minutes on Wikipedia. Henceforth I'll abuse notation in the conventional way and just call our monoid Q'.
We now need to prove a couple of simple results about Q' to set it up for what is to come.
Lemma 2 Q' is idempotent and commutative.

Proof Let x  = ev xs and y = ev ys be elements of Q'. We must show that x*x = x and x*y = y*x.
x*x
    = ev (xs ++ xs)
    = ev xs     -- by idempotence and commutativity of ev; exercise)
    = x

x* y
    = ev (xs ++ ys)
    = ev (ys ++ xs)
    = y * x

QED.
Now for the meat of the proof.
Lemma 3 Let (A, +) be an idempotent commutative semigroup. For all a, b in A, say that a ≤ b iff a + b = b. Then (A,≤) is a partially ordered set, and a + b is the least upper bound of {a, b} with respect to ≤.

Proof To show that (A, ≤) is a partially ordered set, we must show that ≤ is reflexive and transitive.

Reflexivity: Let a be an element of A. Then a + a = a by idempotence, so a ≤ a as required.

Transitivity: Let a, b and c be elements of A such that a ≤ b and b ≤ c. We must show that a ≤ c.
a + c
    = a + (b + c)
    = (a + b) + c
    = b + c
    = c.

So a ≤ c, as required. Hence (A, ≤) is a poset, as required.
a + b is an upper bound of {a, b}: let c = a + b. Then
a + c
    = a + (a + b)
    = (a + a) + b
    = a + b
    = c

so a ≤ c, and similarly
b + c
    = b + (a + b)
    = b + (b + a)
    = (b + b) + a
    = b + a
    = a + b
    = c

so c is an upper bound for {a, b}.
a + b is a least upper bound for {a, b}: Let c = a + b as before, and let d be another upper bound for {a, b}. We must show that c ≤ d.
c + d
    = (a + b) + d
    = a + (b + d)
    = a + d
    = d

so c ≤ d, as required. QED.
Corollary 4 Every idempotent commutative semigroup is a semilattice.
Corollary 4 is almost certainly a standard exercise in order theory textbooks, but it was easier for me to prove the result than it was to find a reference.
Putting together Lemma 1 and 2, we see that (Q', q0, *) is an idempotent
commutative monoid, and hence (Q', *) is an idempotent commutative semigroup.
Hence by Corollary 4, Q' is a semilattice. In the terminology of the INRIA
report, therefore, Q' is a CvRDT.
So we may conclude that every CRDT that can achieve eventual consistency over a network that can arbitrarily reorder and duplicate messages is a CvRDT. This is the converse of Proposition 2.1 from the INRIA report.