Skip to content

Instantly share code, notes, and snippets.

@adohe-zz
Created May 7, 2015 06:18
Show Gist options
  • Save adohe-zz/2ce41edec784796faff7 to your computer and use it in GitHub Desktop.
Save adohe-zz/2ce41edec784796faff7 to your computer and use it in GitHub Desktop.
Notes on Distributed Systems
1. Distributed systems are different because they fail often. What sets distributed systems engineering apart is the probability of failure and, worse, the probability of partial failure. Networked systems fail more than systems that exist on only a single machine and that failures tend to be partial instead of total. Design for failure.
2. Writing robust distributed systems costs more than writing robust single-machine systems. There are failure conditions that are difficult to replicate on a single machine. Distributed systems tend to need actual, not simulated, distribution to flush out their bugs. Simulation is, of course, very useful.
3. Robust, open source distributed systems are much less common than robust, single-machine systems.
4. Coordination is very hard. Avoid coordination machines wherever possible. This is often describled as "horizontal scalability". The real trick of horizontal scalability is independence - being able to get data to machines such that communication and consensus between those machines is kept to a minimum. Every time two machines have to agree on something, the service is harder to implement. Information has an upper limit to the speed it can travel, and networked communication is flakier than you think.
5. If you can fit your problem in memory, it's probably trivial. To a distributed systems engineer, problems that are local to one machine are easy.
6. "It's slow" is the hardest problem you'll ever debug. "It's slow" is hard, in part, because the problem statement doesn't provide many clues to location of the flaw. Partial failures, ones that don't show up on the graphs you usually look up, are lurking in a dark corner.
7. Implement backpressure throughout your system. Backpressure is the signaling of failure from a serving system to the requesting system and how the requesting system handles those failures to prevent overloading itself and the serving system. Designing for backpressure means bounding resource utilization during times of overload and times of system failure. This is one of the basic building blocks of creating a robust distributed system. Common versions include dropping new messages on the floor (and incrementing a metric) if the system's resources are already over-scheduled, and shipping errors back to users when the system determines it will be unable to finish the request in a given amount of time. Timeouts and exponential back-offs on connections and requests to other systems are also useful.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment