@pbailis
Last active April 27, 2020 11:46
Assorted distributed database readings

Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.

For textbooks, "Introduction to Reliable and Secure Distributed Programming" is a superb introduction to distributed computing from a formal perspective; it's really not about "programming" or "engineering" but about distributed system fundamentals like consensus, distributed registers, and broadcast. It is used in Berkeley's Distributed Computing course (HT to @lalithsuresh). Book Site

Notes from courses like Lorenzo Alvisi's Distributed Computing class can be great.

There are a bunch of classics on causality, Paxos (and more practical takes on Paxos), and distributed snapshots.

Edit: aside from these below, Alex Feinberg and Henry Robinson's lists at this Quora post contain a bunch of good practically-oriented but theoretically grounded papers.

Practical databases:

  • Consistency in Partitioned Networks PDF ACM A nice, practical discussion of techniques database systems can employ to ensure consistency under partitions. This 1985 survey predates CAP by over a decade but is well-written and summarizes several important ideas.
  • Megastore: Providing Scalable, Highly Available Storage for Interactive Services PDF Megastore gives a reasonable example of a Paxos-based database architecture.
  • Consistency Tradeoffs in Modern Distributed Database System Design PDF IEEE is a great paper from Daniel Abadi reminding us that, aside from behavior during failures, highly available ("AP") systems also achieve low latency.
  • Remnants of the Bayou project survive in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is The Bayou Architecture: Support for Data Sharing among Mobile Users. Also good, and definitely more practically oriented, is Managing update conflicts in Bayou, a weakly connected replicated storage system. Optimistic Replication PDF ACM is a great survey of similar techniques.

More formal stuff:

  • Unreliable failure detectors for reliable distributed systems PDF ACM A very theoretical but highly celebrated paper relating the problems of failure detection and consensus; together with The Weakest Failure Detector for Solving Consensus PDF ACM, it makes for a great if tough tutorial on failure detectors (though you may be better off reading a textbook).
  • (Even better, A short introduction to failure detectors for asynchronous distributed systems PS.GZ ACM)
  • The Byzantine Generals Problem PDF ACM introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
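
To give a feel for what the failure detector papers above are about, here is a minimal sketch of a timeout-based detector that lengthens a peer's timeout whenever it turns out to have suspected that peer wrongly. This is the usual textbook trick for getting an "eventually accurate" detector under partial synchrony, not code from any of the papers; the class name, method names, and parameters are all made up for illustration, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatFailureDetector:
    """Illustrative timeout-based failure detector with adaptive timeouts.

    All names here are hypothetical; this is a sketch, not a library API.
    """

    def __init__(self, peers, initial_timeout=1.0, backoff=2.0):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_seen = {p: 0.0 for p in peers}
        self.suspected = set()
        self.backoff = backoff

    def heartbeat(self, peer, now):
        """Record a heartbeat from `peer` received at time `now`."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            # We suspected a live peer: trust it again and wait longer next time.
            self.suspected.discard(peer)
            self.timeout[peer] *= self.backoff

    def check(self, now):
        """Suspect every peer whose heartbeat is overdue; return the suspects."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)
```

A crashed peer stops sending heartbeats and stays suspected forever, while a merely slow peer gets a longer timeout each time it is wrongly suspected, so false suspicions eventually stop if message delays are bounded (even by an unknown bound).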