Skip to content

Instantly share code, notes, and snippets.


Peter Bailis pbailis

View GitHub Profile

Distributed concurrency in a state of extreme turbulence. More than 20 concurrency control algorithms have been proposed for DDBMSs, and several have been, or are being, implemented. These algorithms are usually complex, hard to understand, and difficult to prove correct (indeed, many are incorrect). Because they are described in different terminologies and make different assumptions about the underlying DDBMS environment, it is difficult to compare the many proposed algorithms, even in qualitative terms. Naturally each author proclaims his or her approach as best, but there is little compelling evidence to support the claims...

> After studying the large number of proposed algorithms, we find that they are compositions of only a few subalgorithms. In fact, the subalgorithms used by all practical DDBMS concurrency control algorithms are variations of just two basic techniques: two-phase locking and timestamp ordering; thus the state of the art is far more coherent than a review of the liter

pbailis /
Last active Feb 15, 2016
Reproducing (un)reproducibility results

edit: see

Not deserving of a full post, but nonetheless worth writing about: @ongardie, @aalevy, and a few others on Twitter were surprised by the number of papers that were flagged as "not reproducible" according to the recent study at Digging deeper, it appeared that 1.) "code builds" is the standard for reproducibility in this study and that 2.) many broken builds were the result of missing dependencies on the researchers' systems.

I tried reproducing a few of the authors' "unreproducible" results. It's hard to vet 600+ research code repositories, but, with a little effort (< ~10 minutes each?), I was able to get all of the following to actually build (on Ubuntu 13.10). This doesn't inspire confidence in the reproducibility of the study results.


pbailis / gist:8279494
Last active Jan 2, 2016
serializable but not linearizable
View gist:8279494
T1: w(x=1)
T2: r(x=null)
@ wall clock time 1, T1 begins and commits
@ wall clock time 2, T2 begins and commits
NL is serializable: NL is equivalent to executing T2;T1
NL is not linearizable: T2 should have read x=1
pbailis / gist:5660980
Last active Apr 27, 2020
Assorted distributed database readings
View gist:5660980

Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.

For textbooks, "Introduction to Reliable and Secure Distributed Programming" is a superb introduction to distributed computing from a formal perspective; it's really not about "programming" or "engineering" but about distributed system fundamentals like consensus, distributed registers, and broadcast. Used in Berkeley's Distributed Computing course (and HT to @lalithsuresh) Book Site

Notes from courses like Lorenzo Alvisi's Distributed Computing class can be great.

There are a bunch of classics on causality, [Paxos](ht

pbailis /
Last active Apr 15, 2018
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

###Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine

Spark--in memory dataflow

pbailis /
Last active Jul 23, 2019
Randomized load balancing comparison

RapGenius has an interesting post about Heroku's randomized load balancing, complaining about how random placement degrades performance compared to prior omniscient approaches. RapGenius ran some simulations, including an experiments with a "Choice of Two" method:

Choice of two routing is the naive method from before, with the twist that when you assign a request to a random dyno, if that dyno is already busy then you reassign the request to a second random dyno, with no regard for whether the second dyno is busy

This differs subtly but substantially from the standard "Power of Two Choices" randomized load balancing:

each [request] is placed in the least loaded of d >= 2 [Dynos] chosen independently and uniformly at random

Take a look at the difference in queue lengths below, for 200 Dynos, 100

pbailis / gist:3978273
Created Oct 30, 2012
Cassandra YCSB benchmarking by value size
View gist:3978273
CASS_HOST=#host ip here
for threads in 10 20 40 50
for size in 10 1000 10000
# reset Cassandra on remote host here; something like
ssh ubuntu@$CASS_HOST "pkill -9 java; rm -rf /mnt/md0/cassandra/; cassandra &; disown"