meetme2meat/gist:ed1223b7f9675243b9e0da082726fe2e

## gistfile1.md

      
Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.
For textbooks, "Introduction to Reliable and Secure Distributed Programming" is a superb introduction to distributed computing from a formal perspective; it's less about "programming" or "engineering" than about distributed system fundamentals like consensus, distributed registers, and broadcast. It is used in Berkeley's Distributed Computing course (HT to @lalithsuresh). (Book Site)
Notes from courses like Lorenzo Alvisi's Distributed Computing class can be great.
There are a bunch of classics on causality, Paxos (and more practical takes on Paxos), and distributed snapshots.
Edit: aside from the papers below, Alex Feinberg and Henry Robinson's lists at this Quora post contain a bunch of good, practically oriented but theoretically grounded papers.
Practical databases:

- "Consistency in Partitioned Networks" (PDF, ACM): A nice, practical discussion of techniques database systems can employ to ensure consistency under partitions. This survey predates CAP by more than a decade but is well-written and summarizes several important ideas.
- "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (PDF): Megastore gives a reasonable example of a Paxos-based database architecture.
- "Consistency Tradeoffs in Modern Distributed Database System Design" (PDF, IEEE): A great paper from Daniel Abadi reminding us that, aside from their behavior during failures, highly available ("AP") systems also achieve low latency.
- Remnants of the Bayou project show up in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is "The Bayou Architecture: Support for Data Sharing among Mobile Users". Also good is "Managing update conflicts in Bayou, a weakly connected replicated storage system", definitely a more practically oriented paper. "Optimistic Replication" (PDF, ACM) is a great survey of similar techniques.

More formal stuff:

- "Unreliable failure detectors for reliable distributed systems" (PDF, ACM): A very theoretical but highly celebrated paper relating the problems of failure detection and consensus. Together with "The Weakest Failure Detector for Solving Consensus" (PDF, ACM), it makes for a great, if tough, tutorial on failure detectors (you may be better off reading a textbook).
  (Even better: "A short introduction to failure detectors for asynchronous distributed systems" (PS.GZ, ACM).)
- "The Byzantine Generals Problem" (PDF, ACM): Introduces the problem of Byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story).