Distributed systems for fun and profit

  • Latency: The state of being latent; delay, a period between the initiation of something and its occurrence.
  • Latent: From Latin latens, latentis, present participle of lateo ("lie hidden"). Existing or present but concealed or inactive.
  • The other key point based on this definition is that if nothing happens, there is no "latent period". A system in which data doesn't change doesn't (or shouldn't) have a latency problem.
  • One can often gain performance by exposing more details about the internals of the system.
  • Systems which hide these kinds of details are easier to understand (since they act more like a single unit, with fewer details to think about), while systems that expose more real-world details may be more performant (because they correspond more closely to reality).
  • In the end, the ideal system meets both programmer needs (clean semantics) and business needs (availability/consistency/latency).
  • Only one consistency model for replication - strong consistency - allows you to program as if the underlying data was not replicated. Other consistency models expose some internals of the replication to the programmer (see the sketch after this list).
  • Abstractions, fundamentally, are fake. Every situation is unique, as is every node. But abstractions make the world manageable.
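A minimal sketch of that replication point, under hypothetical names (a toy `Replica` class, nothing from the book): with strong consistency a write is acknowledged only once every replica has it, so the programmer can act as if there were a single copy; with a weaker model the existence of replicas leaks through stale reads.

```python
# Toy model: two replicas of a key-value store (illustrative only).

class Replica:
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def strong_write(replicas, key, value):
    # Acknowledge only after every replica has the value: any later read,
    # from any replica, sees it, so the replication stays invisible.
    for r in replicas:
        r.write(key, value)

def eventual_write(replicas, key, value):
    # Acknowledge after one replica; the others would converge later
    # (asynchronous replication / anti-entropy is elided here).
    replicas[0].write(key, value)

replicas = [Replica(), Replica()]

strong_write(replicas, "x", 1)
assert all(r.read("x") == 1 for r in replicas)   # no stale read possible

eventual_write(replicas, "x", 2)
print(replicas[1].read("x"))  # prints 1: the stale read exposes the replicas
```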

A System Model

  • Some implications of a distributed system:
    • each node executes a program concurrently
    • any information about global state is potentially out of date
    • messages can be delayed or lost, independently of node failure; it is not easy to distinguish network failure from node failure (see the sketch after this list)
    • clocks are not synchronized across nodes
  • A robust system model is one that makes the weakest assumptions: any algorithm written for such a system is very tolerant of different environments, since it makes very few and very weak assumptions.
  • Another alternative is to assume that nodes can fail by misbehaving in any arbitrary way. This is known as Byzantine fault tolerance. Byzantine faults are rarely handled in real world commercial systems, because algorithms resilient to arbitrary faults are more expensive to run and more complex to implement. I will not discuss them here.
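A small hedged sketch of the "network failure vs node failure" point: in practice the only tool a node has is a timeout, and a timeout fires identically whether the peer crashed, the message was lost, or the reply is merely slow. The helper name below is illustrative, not from the text.

```python
import socket

def peer_seems_down(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Try to open a TCP connection to a peer within timeout_s seconds.

    Returning True only means "no answer in time": the peer may have
    crashed, the network may have dropped or delayed packets, or the
    peer may simply be slow. The caller cannot tell which.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return False   # got an answer: the peer is reachable right now
    except OSError:
        return True        # crash? partition? slowness? unknowable from here
```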

Two impossibility results

  • The first impossibility result, known as the FLP impossibility result, is particularly relevant to people who design distributed algorithms.
  • The second - the CAP theorem - is a related result that is more relevant to practitioners; people who need to choose between different system designs but who are not directly concerned with the design of algorithms.

The FLP impossibility result

  • It is assumed that nodes can only fail by crashing, that the network is reliable, and that the typical timing assumptions of the asynchronous system model hold: e.g. there are no bounds on message delay.
  • "there does not exist a (deterministic) algorithm for the consensus problem in an asynchronous system subject to failures, even if messages can never be lost, at most one process may fail, and it can only fail by crashing (stopping executing)"
  • This impossibility result is important because it highlights that assuming the asynchronous system model leads to a tradeoff: algorithms that solve the consensus problem must either give up safety or liveness when the guarantees regarding bounds on message delivery do not hold.
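A rough illustration of that tradeoff (not the FLP proof itself; the names are hypothetical): with no bound on message delay there is no "correct" timeout, so a process waiting on a peer's message either waits indefinitely (preserving safety but risking liveness) or gives up and decides on partial information (risking safety).

```python
import queue

def collect_vote(inbox: queue.Queue, wait_forever: bool):
    """Toy model: wait for a peer's vote in an asynchronous system.

    - wait_forever=True  preserves safety but may block forever (liveness lost)
    - wait_forever=False may return without the vote (safety is now at risk)
    """
    try:
        return inbox.get(timeout=None if wait_forever else 1.0)
    except queue.Empty:
        return None  # the caller proceeds on a guess: this is where safety can break
```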

The CAP theorem

  • We get three different system types:
    • CA (consistency + availability). Examples include full strict quorum protocols, such as two-phase commit.
    • CP (consistency + partition tolerance). Examples include majority quorum protocols in which minority partitions are unavailable, such as Paxos.
    • AP (availability + partition tolerance). Examples include protocols using conflict resolution, such as Dynamo.
  • The CA and CP system designs both offer the same consistency model: strong consistency. The only difference is that a CA system cannot tolerate any node failures; a CP system can tolerate up to f faults given 2f+1 nodes in a non-Byzantine failure model (see the quorum sketch after this list).
  • A CA system does not distinguish between node failures and network failures, and hence must stop accepting writes everywhere to avoid introducing divergence (multiple copies).
  • Four conclusions that should be drawn from the CAP theorem:
    • many system designs used in early distributed relational database systems did not take into account partition tolerance (e.g. they were CA designs).
    • there is a tension between strong consistency and high availability during network partitions
      • Strong consistency guarantees require us to give up availability during a partition.
      • How can we work around this? By strengthening the assumptions (assume no partitions) or by weakening the guarantees.
      • If "consistency" is defined as something less than "all nodes see the same data at the same time" then we can have both availability and some (weaker) consistency guarantee.
    • there is a tension between strong consistency and performance in normal operation.
    • if we do not want to give up availability during a network partition, we need to explore whether consistency models other than strong consistency are workable for our purposes.
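A quorum-arithmetic sketch of the "2f+1" point and the availability tradeoff above (the helper names are hypothetical, not from any particular system): a majority quorum of n // 2 + 1 nodes tolerates f = (n - 1) // 2 failures, and the minority side of a partition cannot form a quorum, so it must refuse writes to stay consistent.

```python
def majority_quorum(n: int) -> int:
    """Smallest number of nodes that constitutes a majority of n."""
    return n // 2 + 1

def faults_tolerated(n: int) -> int:
    """With majority quorums, up to f = (n - 1) // 2 nodes may fail."""
    return (n - 1) // 2

def partition_side_available(side_size: int, n: int) -> bool:
    """A side of a network partition can keep serving strongly consistent
    writes only if it still contains a majority quorum."""
    return side_size >= majority_quorum(n)

n = 5                                 # 2f+1 nodes with f = 2
assert faults_tolerated(n) == 2
assert majority_quorum(n) == 3
# A 3/2 partition: the 3-node side keeps accepting writes, the 2-node
# side must refuse them (giving up availability to preserve consistency).
assert partition_side_available(3, n) is True
assert partition_side_available(2, n) is False
```

With n = 5 this is exactly the CP behaviour described above: the majority partition stays consistent and available, while the minority partition gives up availability rather than diverge.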