Skip to content

Instantly share code, notes, and snippets.

View pbailis's full-sized avatar

Peter Bailis pbailis

View GitHub Profile

Distributed concurrency in a state of extreme turbulence. More than 20 concurrency control algorithms have been proposed for DDBMSs, and several have been, or are being, implemented. These algorithms are usually complex, hard to understand, and difficult to prove correct (indeed, many are incorrect). Because they are described in different terminologies and make different assumptions about the underlying DDBMS environment, it is difficult to compare the many proposed algorithms, even in qualitative terms. Naturally each author proclaims his or her approach as best, but there is little compelling evidence to support the claims...

> After studying the large number of proposed algorithms, we find that they are compositions of only a few subalgorithms. In fact, the subalgorithms used by all practical DDBMS concurrency control algorithms are variations of just two basic techniques: two-phase locking and timestamp ordering; thus the state of the art is far more coherent than a review of the liter

pbailis /
Last active February 15, 2016 12:18
Reproducing (un)reproducibility results

edit: see

Not deserving of a full post, but nonetheless worth writing about: @ongardie, @aalevy, and a few others on Twitter were surprised by the number of papers that were flagged as "not reproducible" according to the recent study at Digging deeper, it appeared that 1.) "code builds" is the standard for reproducibility in this study and that 2.) many broken builds were the result of missing dependencies on the researchers' systems.

I tried reproducing a few of the authors' "unreproducible" results. It's hard to vet 600+ research code repositories, but, with a little effort (< ~10 minutes each?), I was able to get all of the following to actually build (on Ubuntu 13.10). This doesn't inspire confidence in the reproducibility of the study results.


pbailis / gist:8279494
Last active April 9, 2023 05:07
serializable but not linearizable
T1: w(x=1)
T2: r(x=null)
@ wall clock time 1, T1 begins and commits
@ wall clock time 2, T2 begins and commits
NL is serializable: NL is equivalent to executing T2;T1
NL is not linearizable: T2 should have read x=1
pbailis / gist:5660980
Last active April 27, 2020 11:46
Assorted distributed database readings

Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.

For textbooks, "Introduction to Reliable and Secure Distributed Programming" is a superb introduction to distributed computing from a formal perspective; it's really not about "programming" or "engineering" but about distributed system fundamentals like consensus, distributed registers, and broadcast. Used in Berkeley's Distributed Computing course (and HT to @lalithsuresh) Book Site

Notes from courses like Lorenzo Alvisi's Distributed Computing class can be great.

There are a bunch of classics on causality, [Paxos](ht

pbailis /
Last active April 15, 2018 08:54
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

###Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine

Spark--in memory dataflow

pbailis /
Last active July 23, 2019 12:57
Randomized load balancing comparison

RapGenius has an interesting post about Heroku's randomized load balancing, complaining about how random placement degrades performance compared to prior omniscient approaches. RapGenius ran some simulations, including an experiments with a "Choice of Two" method:

Choice of two routing is the naive method from before, with the twist that when you assign a request to a random dyno, if that dyno is already busy then you reassign the request to a second random dyno, with no regard for whether the second dyno is busy

This differs subtly but substantially from the standard "Power of Two Choices" randomized load balancing:

each [request] is placed in the least loaded of d >= 2 [Dynos] chosen independently and uniformly at random

Take a look at the difference in queue lengths below, for 200 Dynos, 100

pbailis / gist:3978273
Created October 30, 2012 04:21
Cassandra YCSB benchmarking by value size
CASS_HOST=#host ip here
for threads in 10 20 40 50
for size in 10 1000 10000
# reset Cassandra on remote host here; something like
ssh ubuntu@$CASS_HOST "pkill -9 java; rm -rf /mnt/md0/cassandra/; cassandra &; disown"