@MattJOlson
Last active January 23, 2020 03:12
Notes on "Distributed Systems and the End of the API"

https://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/

Distributed systems

"A distributed system is one where a machine I've never heard of can cause my program to fail" - Leslie Lamport

Cute, but "my program" is kind of quaint now, isn't it?

"...multiple processes that must communicate to perform work" - hmm, let me tell you about disk controllers. Oh wait, that's your point... almost: the post considers an air-gapped computer nondistributed, but whatever.

There's probably something about latency and error rates in here. SLOs! A sufficiently fast and reliable networked dependency is indistinguishable from local, and vice versa.

"...distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distsyses and the networks that connect them." Plausible approach: address the semantics, model them (TLA+? Hillel-infected), and incorporate them into the architectural model.

Training: we aren't trained to take this into account. "Praaagmatism"

APIs

"An API is the set of names we interact with in our programming languages and libraries"

Oh no, naming things

"When people started hooking up computers over networks, it was natural to want to carry along this notion of using language as a way of naming things we interact with programmatically." Oh snap, I see some formalisms on the horizon.

"client.send(data); doesn't inform us at all of the semantics of the operation we're attempting"

  • Performance bounds
  • Pre/postconditions
  • Sad path handling
  • ...all undefined
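The complaint is that the call site hides all of this. Purely as a sketch, here is what a send wrapper might look like if the signature itself admitted timeouts, retries, and duplicates — every name here (`SendResult`, `SendError`, `idempotency_key`) is hypothetical, not from the post:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class SendError(Enum):
    TIMEOUT = auto()     # no response within the deadline
    PARTITION = auto()   # peer unreachable
    REJECTED = auto()    # peer answered, but refused the payload

@dataclass
class SendResult:
    ok: bool
    error: Optional[SendError] = None
    attempts: int = 1    # network round-trips actually paid for

def send(data: bytes, *, timeout_s: float = 1.0,
         idempotency_key: Optional[str] = None) -> SendResult:
    # Stub: a real implementation would do network IO and could return
    # any of the SendError cases above. The point is that the
    # performance bound (timeout_s), the duplication story
    # (idempotency_key), and the sad paths all live in the contract.
    return SendResult(ok=True, attempts=1)
```

Compare `client.send(data)`: same operation, but every question the post raises now has a place to live in the types.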

The API Problem

Basically, RPC -> SOAP -> REST -> whatever are all fundamentally equivalent, synchronous request/response things:

  • A request/response lifecycle
  • Synchronous
  • Presume point-to-point topology (client/server, two-party)
  • Ops are imperative (mutable data, side effects)
  • Few constraints on acceptable data models

Ok, I'm excited about this blog post now

Blames this all on "the programming language heritage" of APIs, which (along with billion-dollar-mistake shitlangs in general) "...are a fatally bad mismatch for the job of supporting the communications between actors in a distsys".

Fixes, e.g. HTTP APIs offering async "202 Accepted -> wait for my signal" interfaces, are "nothing more than idioms and best practices, not substantive solutions".
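For concreteness, the "202 Accepted -> poll for status" idiom looks roughly like this. This is an in-memory stand-in with made-up endpoints, since the post doesn't specify an implementation:

```python
import itertools

_jobs = {}                 # in-memory stand-in for server-side job state
_ids = itertools.count(1)

def submit(payload):
    """POST /jobs: accept the work, hand back a status location."""
    job_id = next(_ids)
    _jobs[job_id] = {"ticks": 2}   # pretend the work takes two polls
    return 202, f"/jobs/{job_id}"

def poll(location):
    """GET /jobs/{id}: the client has to keep asking; nothing pushes."""
    job = _jobs[int(location.rsplit("/", 1)[1])]
    if job["ticks"] > 0:
        job["ticks"] -= 1
        return 200, {"state": "pending"}
    return 200, {"state": "done", "result": "ok"}
```

It works, but the retry cadence, the lifetime of the status URL, and what a lost poll means are all still convention, which is the post's point.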

APIs: Sisyphean programmer convenience

Loving this post even more, great phrasing

Devs value "isomorphism between function/method calls in a typical imperative shitlang, and the various manifestations of network APIs". "...nominal concerns are paramount, but network and other operational semantics, failure modes, notions of causality and consistency are entirely unaccounted for".

Even better, "RPC sucks, but use our native client library to make REST-over-HTTP look like RPC" prevails, which makes it "...impossible to know which calls will result in strictly local computation, and which will incur one or many network operations and all the complications that entails." Fucking rekt. (If only the type of the function would tell you if it needs to do network IO... never mind)

The API: An anachronism

Aside from the other complaints listed above, "...APIs provide no way to naturally talk about anything other than two-party client/server communication". Yes, you can build any topology you want from single edges. No, that's not a feature.

Acknowledge the network or fail

The problem with APIs as typically used is that they don't acknowledge the network being a bitch. (Probably this should include the networks currently implemented on motherboards.)

Start -- start! -- by understanding how networks fail, and how that affects your code:

  • Partitions
  • Latency, everything from variable to permanent loss of connection
  • Reordered and repeated messages
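Repeated messages in particular have a standard mitigation worth sketching: at-least-once delivery plus server-side dedupe gets you effectively-once effects. A toy simulation (all names hypothetical):

```python
import random

random.seed(7)   # deterministic run for the example

seen = set()     # server-side dedupe table, keyed by message id
applied = []     # effects actually applied, in arrival order

def server_receive(msg_id, op):
    """Duplicates arrive; dedupe makes the effect at-most-once, so
    at-least-once delivery + dedupe = effectively-once."""
    if msg_id in seen:
        return "dup-ignored"
    seen.add(msg_id)
    applied.append(op)
    return "applied"

def flaky_send(msg_id, op, max_attempts=5):
    """Retry until acked; the network may drop the ack, so the same
    message can reach the server more than once."""
    for attempt in range(max_attempts):
        server_receive(msg_id, op)     # message arrives
        if random.random() < 0.5:      # ack lost -> client retries
            continue
        return attempt + 1             # acked
    return max_attempts
```

Note the effect is order-of-first-arrival, so this handles duplication but not reordering; that's what the CRDT machinery later is for.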

"...the networks used to reach the people and devices and vendors at the edges of your system are almost never owned and operated by you, yet your system is subject to their failures as well"

YAML will not save you.

Consistency decisions affect everything

Such a lost opportunity for "Consistency Rules Everything Around Me"

Common cases of consistency control:

  • Strict linearizability, global total ordering, jeeeeezus
  • Causal consistency, partial order of all operations, "read your own writes"
  • Eventual consistency, eat trash, be free

CAP theorem makes its first appearance. Summarizing: Less consensus means more apparent uptime, more consensus means more "stop sending traffic, we need to figure this out"

Network APIs are zero help, they push it all onto the developer/architect. Similarly, JSON and friends are of no help resolving concurrent changes. Put together, "concurrent actors moving state around ... have no generally applicable way of resolving conflicting concurrent changes. This forces programmers to regularly reimplement such resolution mechanisms, or more commonly to rely completely upon centralized backend DBs". Barf.

What do we want?

Different tools that make building and reasoning about distributed systems simpler and easier than it is today.

In any distsys, we want two things:

  • Communication - ability to share data among components
  • Computation - ability to consume and transform that data

Claim: Everything else is incidental.

Desired features of building blocks:

  • Allow us to control/apply capabilities as we please
  • Actively prevent us from committing obvious reasoning errors
  • Guard rails: architected such that large classes of errors are impossible

Yeah, that's ambitious.

We've been here before

Analogy: "we used to use C, now we use Java!" Um. Focus on the progression: abstracted memory management leading to memory safety. We "cannot keep dragging along this notion that one can [work] at the level of spitting data out of sockets and expect to build robust and understandable distributed systems on top of such primitive primitives."

In-place mutation gets a mention as bad! About goddamn time.

Need to abstract away from:

  • Accommodating network failure modes (by hand)
  • Implementing consensus mechanisms (by hand)
  • Controlling consistency levels (by hand)

Appeal to authority

Leslie Lamport: "Time, Clocks and the Ordering of Events in a Distributed System".

Communication between nodes is a "physics problem" - the physics of communicating between separate actors, and the progression of time that implies, sets limits on what is and isn't possible. (Seems too low-level to be meaningful at the distsys level? People thought relativistic effects on electrons were negligible too, but apparently they're vital! https://blogs.sciencemag.org/pipeline/archives/2019/10/11/copernicium-is-a-strange-element-indeed)

There's a bit of "pragmatic" "clever" bashing I don't need to repeat.

Sound approaches nearing practicality

  • CALM theorem - consistency as logical monotonicity (I kinda understand that title)
  • CRDTs

"Both of these approaches constrain the types of operations that your system can perform in order to ensure convergence over time of changes to data shared by uncoordinated concurrent actors, and to eliminate network failure modes as a source of error." Whew. Let's break that down.

  • Constrain the types of operations that your system can perform

    You can't just do random bullshit across a network and expect to have a good time. If you want to do random bullshit, you need to confine it.

  • In order to ensure convergence over time

    Eventual consistency: eat trash, be free. If you want strong consistency, again, you need to confine it.

  • Of changes to data shared by uncoordinated concurrent actors

    Shared data is, once again, the problem. If you can arrange to not share your data, you are more free to do your random bullshit.

  • And to eliminate network failure modes as a source of error

    That's kind of out of nowhere, but "network failure modes" are broadly where all these constraints come from.

CALM has its roots in temporal logic, CRDT is applied semilattices. Let's talk about semilattices!

Bounded join semilattices

Let's do definitions, I fucking love definitions.

A lattice is a poset where, for any two members in the set, there exists both a least upper bound (a value "greater than or equal to" both members, the join) and a greatest lower bound (a value "less than or equal to" both members, the meet). If a poset admits only one of these, it's a semilattice.

If a lattice has an absolute maximum (top) or minimum (bottom), it's bounded.

CRDTs are based on bounded-join semilattices. Nat is a bounded-join semilattice on max, Set is a bounded-join semilattice on union. Set is a good example because unions never lose information; you can make concurrent modifications to a (logically shared) structure, and those modifications will always yield the same value when joined. (Is this associativity? Oh wait, that's next.)

Join and meet operations must satisfy three properties:

  • Associativity (yup)
  • Commutativity (whoa)
  • Idempotence (sure)

Obviously these are helpful for reasoning about even a single-node system, but in distsys commutativity laughs in the face of request ordering and idempotence laughs in the face of request duplication. And "as long as liveness is preserved [...], then semilattice semantics ensure that" operations initiated concurrently will converge without conflict.
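The three laws are cheap to spot-check on the two examples above (max on naturals, union on sets). A quick sketch:

```python
# max on naturals and union on sets are both bounded-join semilattices
# (bottoms: 0 and the empty set); spot-check the three laws.
def check_aci(join, a, b, c):
    assert join(join(a, b), c) == join(a, join(b, c))  # associative
    assert join(a, b) == join(b, a)                    # commutative
    assert join(a, a) == a                             # idempotent

check_aci(max, 3, 7, 5)
check_aci(frozenset.union,
          frozenset({1}), frozenset({2}), frozenset({1, 3}))
```

These are concrete spot checks, not proofs, but they make the "reordered and duplicated messages don't matter" claim tangible: any grouping, order, or repetition of joins lands on the same value.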

Not all operations are associative, commutative, and idempotent; fortunately, only the primitive operations need to be. You can build useful operations out of these primitives. - https://hal.inria.fr/file/index/docid/555588/filename/techreport.pdf

Data models are everything

You can build CRDTs around an awful lot of data types, including trees. That makes me wonder if there are CRDTs for algebraic data types or GADTs, and the answer is probably "yes" and almost certainly "it's possible" (but note the "almost").

CRDTs are replicated, which abstracts the network-layer crap and gives rise to this enraging "I would simply" sentence:

Instead of working out a way to distill your model data into JSON or XML so it can be shared with other actors in your system, you simply add the data in question to a CRDT, and let its replication implementation carry it abroad, with exactly zero loss of representational fidelity.
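The canonical starter CRDT makes this concrete: a grow-only counter (G-Counter), whose merge is exactly a semilattice join, elementwise max over per-replica counts. A minimal sketch (mine, not the post's):

```python
# Minimal state-based G-Counter: each replica only increments its own
# slot; merge takes the elementwise max, so no increment is ever lost.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                 # replica id -> count

    def increment(self, n=1):
        me = self.replica_id
        self.counts[me] = self.counts.get(me, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Semilattice join: elementwise max. Associative, commutative,
        idempotent, so replicas converge regardless of merge order."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

Two replicas can increment concurrently, exchange state in either order (or repeatedly), and both end up reading the same value — no JSON distillation, no conflict resolution code.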

Semilattices encourage immutability, oh look this is event sourcing.

Operations are cast as data, not actions. "You can copy them, route them, reorder them, manipulate them, apply programs to them" - some LISPy shit here.

"People familiar with message queues will think this is very natural" - yep, convergent evolution.
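A tiny illustration of ops-as-data: because set-add operations commute and are idempotent, every delivery order — duplicates included — replays to the same state. (The op encoding here is made up for the demo.)

```python
from itertools import permutations

# Operations as plain data: each op says what to add to a shared set.
ops = [{"op": "add", "value": v} for v in ("a", "b", "c")]

def apply_ops(oplist):
    """Replay a log of ops against an empty state."""
    state = set()
    for op in oplist:
        if op["op"] == "add":
            state.add(op["value"])
    return state

# Every ordering of the ops, with each op delivered twice, converges.
states = {frozenset(apply_ops(p)) for p in permutations(ops + ops)}
assert states == {frozenset({"a", "b", "c"})}
```

Copying, routing, and reordering a log of values like this is safe precisely because the ops were constrained to be ACI; with arbitrary imperative ops, the same experiment would fan out into many distinct final states.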

Pick a programming model

CRDTs are available as libraries for a bunch of platforms. "Best of both worlds"? Maybe... could also be a patch over flawed thinking. Bloom is mentioned as the CALM theorem research language, where the compiler can tell you where your program goes non-monotonic. Feels like linear or affine types, which I should also research.

Related practices:

With constraints come costs

What do we need to give up to get our CRDTs?

  • Well, we need to make sure our primitives are ACI. Use property testing.
  • Next, we need to maximize the safe footprint that communicates via CRDTs and strangle the parts of the system that communicate by imperative APIs. This favours rewrites over incremental refactors.
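A property-testing library like Hypothesis is the right tool for the ACI check; as a stdlib-only sketch of the same idea, sample random states and assert the laws hold for the merge primitive (here, the G-Counter-style map-of-max merge):

```python
import random

random.seed(0)   # deterministic run for the example

def merge(a, b):
    """Join for a map-of-max state (dicts of non-negative counts)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def random_state():
    return {k: random.randint(0, 9) for k in random.sample("abcde", 3)}

# Poor man's property test: 200 random triples against the three laws.
for _ in range(200):
    a, b, c = random_state(), random_state(), random_state()
    assert merge(merge(a, b), c) == merge(a, merge(b, c))  # associative
    assert merge(a, b) == merge(b, a)                      # commutative
    assert merge(a, a) == a                                # idempotent
```

Random sampling won't prove the laws, but it catches the usual sins (order-dependent merges, merges that sum instead of max) fast and cheaply.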