Peter Bailis pbailis

## bernstein.md

      
Created November 3, 2015 02:15

> Distributed concurrency control...is in a state of extreme turbulence. More than 20 concurrency control algorithms have been proposed for DDBMSs, and several have been, or are being, implemented. These algorithms are usually complex, hard to understand, and difficult to prove correct (indeed, many are incorrect). Because they are described in different terminologies and make different assumptions about the underlying DDBMS environment, it is difficult to compare the many proposed algorithms, even in qualitative terms. Naturally each author proclaims his or her approach as best, but there is little compelling evidence to support the claims...

> After studying the large number of proposed algorithms, we find that they are compositions of only a few subalgorithms. In fact, the subalgorithms used by all practical DDBMS concurrency control algorithms are variations of just two basic techniques: two-phase locking and timestamp ordering; thus the state of the art is far more coherent than a review of the liter

  
## reproducibility.md

      
Last active February 15, 2016 12:18

Reproducing (un)reproducibility results

Edit: see http://cs.brown.edu/~sk/Memos/Examining-Reproducibility/

Not deserving of a full post, but nonetheless worth writing about: @ongardie, @aalevy, and a few others on Twitter were surprised by the number of papers flagged as "not reproducible" by the recent study at http://reproducibility.cs.arizona.edu. Digging deeper, it appears that 1) "code builds" is the standard for reproducibility in this study, and 2) many broken builds were the result of missing dependencies on the researchers' systems.

I tried reproducing a few of the authors' "unreproducible" results. It's hard to vet 600+ research code repositories, but with a little effort (under ~10 minutes each?), I was able to get all of the following to actually build (on Ubuntu 13.10). This doesn't inspire confidence in the reproducibility of the study's results.

Peter
pbailis@cs.berkeley.edu

["Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors" in TOCS](http://reproducibility.cs.ariz


## gist:8279494
T1: w(x=1)
T2: r(x=null)

SCHEDULE NL:
@ wall clock time 1, T1 begins and commits
@ wall clock time 2, T2 begins and commits

NL is serializable: NL is equivalent to executing T2;T1
NL is not linearizable: T2 should have read x=1
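
The distinction in schedule NL can be checked mechanically. The following is a minimal sketch, not a general checker: it assumes single-operation transactions on a key-value store and uses a hypothetical dictionary encoding (`begin`, `commit`, `op`, and so on are names made up for illustration). It treats "linearizable" the way the note does, as a serial order that also respects wall-clock order; for transactions this property is often called strict serializability.

```python
from itertools import permutations

# Hypothetical encoding of schedule NL: one operation per transaction,
# with wall-clock begin/commit times taken from the note above.
T1 = {"name": "T1", "op": "w", "key": "x", "value": 1, "begin": 1, "commit": 1}
T2 = {"name": "T2", "op": "r", "key": "x", "value": None, "begin": 2, "commit": 2}


def replay_matches(order):
    """Run the transactions serially in `order`; True if every read
    returns the value it observed in the original schedule."""
    store = {}
    for t in order:
        if t["op"] == "w":
            store[t["key"]] = t["value"]
        elif store.get(t["key"]) != t["value"]:
            return False
    return True


def respects_real_time(order):
    """True if no transaction is ordered before one that had already
    committed by the time it began."""
    for i, earlier in enumerate(order):
        for later in order[i + 1:]:
            if later["commit"] < earlier["begin"]:
                return False
    return True


def serializable(txns):
    # Some serial order explains the observed reads.
    return any(replay_matches(o) for o in permutations(txns))


def linearizable(txns):
    # Some serial order explains the reads AND agrees with wall-clock order.
    return any(replay_matches(o) and respects_real_time(o)
               for o in permutations(txns))


if __name__ == "__main__":
    print("serializable:", serializable([T1, T2]))
    print("linearizable:", linearizable([T1, T2]))
```

For NL, the only serial order that explains `r(x=null)` is T2;T1, and that order contradicts the wall-clock times, so the sketch reports serializable but not linearizable.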

## gist:5660980

      
Last active April 27, 2020 11:46

Assorted distributed database readings

Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.

For textbooks, "Introduction to Reliable and Secure Distributed Programming" is a superb introduction to distributed computing from a formal perspective; it's really not about "programming" or "engineering" but about distributed system fundamentals like consensus, distributed registers, and broadcast. Used in Berkeley's Distributed Computing course (and HT to @lalithsuresh). Book Site

Notes from courses like Lorenzo Alvisi's Distributed Computing class can be great.

There are a bunch of classics on causality, [Paxos](ht

  
## list.md

      
Last active April 15, 2018 08:54

Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

### Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine

http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Spark--in-memory dataflow

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

  
## rapgenius-balancing.md

      
Last active July 23, 2019 12:57

Randomized load balancing comparison

RapGenius has an interesting post about Heroku's randomized load balancing, complaining about how random placement degrades performance compared to prior omniscient approaches. RapGenius ran some simulations, including an experiment with a "Choice of Two" method:

> Choice of two routing is the naive method from before, with the twist that when you assign a request to a random dyno, if that dyno is already busy then you reassign the request to a second random dyno, with no regard for whether the second dyno is busy

This differs subtly but substantially from the standard "Power of Two Choices" randomized load balancing:

> each [request] is placed in the least loaded of d >= 2 [Dynos] chosen independently and uniformly at random
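
To make the difference between the two quoted policies concrete, here is a rough sketch of a simulation. It is deliberately simplified and is not the simulation used in this post or by RapGenius: requests only arrive and never complete (a static balls-into-bins model), "busy" is taken to mean "has at least one queued request", and the function name, policy labels, and default parameters are all made up for illustration.

```python
import random


def simulate(policy, n_dynos=200, n_requests=2000, seed=0):
    """Place requests one at a time and return the final queue length of
    each dyno. Static model: requests are never served or removed."""
    rng = random.Random(seed)
    queues = [0] * n_dynos
    for _ in range(n_requests):
        first = rng.randrange(n_dynos)
        if policy == "random":
            target = first
        elif policy == "choice_of_two":
            # RapGenius variant: if the first dyno is busy, retry once and
            # accept the second dyno without looking at its load.
            target = rng.randrange(n_dynos) if queues[first] > 0 else first
        elif policy == "power_of_two":
            # Standard variant (d = 2): sample two dynos independently and
            # send the request to the less loaded of the two.
            second = rng.randrange(n_dynos)
            target = first if queues[first] <= queues[second] else second
        else:
            raise ValueError(f"unknown policy: {policy}")
        queues[target] += 1
    return queues


if __name__ == "__main__":
    for policy in ("random", "choice_of_two", "power_of_two"):
        queues = simulate(policy)
        print(policy, "max queue length:", max(queues))
```

The key design difference is the comparison in the `power_of_two` branch: the request goes to whichever sampled dyno is less loaded, whereas `choice_of_two` never inspects the second dyno's queue.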

Take a look at the difference in queue lengths below, for 200 Dynos, 100

  
## gist:3978273
CASS_HOST=""   # host IP here
sec=90
records=100000

for threads in 10 20 40 50
do
    for size in 10 1000 10000
    do
        # reset Cassandra on remote host here; something like
        ssh ubuntu@"$CASS_HOST" "pkill -9 java; rm -rf /mnt/md0/cassandra/; cassandra & disown"
        # ... (remainder of the loop body not shown)
    done
done