ardiereally/SystemDesignQuickRef.md

## SystemDesignQuickRef.md

      
System Design Cheatsheet


Picking the right architecture = Picking the right battles + Managing trade-offs

Basic Steps

1) Clarify and agree on the scope of the system


Use cases

Who is going to use it?
How are they going to use it?

How would you use it?


Which features might be required?


Constraints

What's in scope, what's out of scope?
Scale of the system (Requests per sec, data written, data read, storage required).
Special system requirements (multi-threading, read-heavy or write-heavy workload).


2) High level architecture


Sketch the important components and the connections between them, but don't go into the details yet.

Split the system into domains
Connect the domains (maintaining loose coupling, single responsibility, etc.).
Design the component for each domain (service, storage, CDN etc.).


3) Component Design


Component interface (APIs, messages etc.)
Object-oriented design for functionality.

Map features to modules.
Design each module:

Class Design

Single Responsibility, YAGNI, Open-Closed
Liskov Substitution, Interface Segregation
Dependency Inversion


Package Cohesion

Release-Reuse Equivalence, Common Closure, Common Reuse


Inter-Package Coupling

Acyclic Dependencies, Stable Dependencies, Stable Abstractions


Data model

What kind of data are we storing?
How is it going to be accessed?
Is the data continuous or quantized?

DBs work best with quantized data
Continuous data (like latitude-longitude) can be quantized by the application
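
One way an application might quantize latitude-longitude before storing it is to snap each coordinate to a fixed grid cell, so that nearby points share an indexable integer key. This is a minimal sketch; the `quantize` function and the `cell_deg` cell size are hypothetical names and values, not part of the original notes.

```python
import math


def quantize(lat: float, lon: float, cell_deg: float = 0.01) -> tuple[int, int]:
    """Map a continuous coordinate to an integer grid cell.

    cell_deg is an assumed cell size in degrees; 0.01 degrees of
    latitude is roughly 1.1 km.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


# Two nearby points fall into the same cell, so an equality lookup
# on the cell key finds both of them.
cell_a = quantize(37.7749, -122.4194)
cell_b = quantize(37.7751, -122.4199)
```

The trade-off is precision versus index friendliness: a smaller cell keeps more detail but produces more distinct keys.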


Database schema design.

Which pieces of data do we need to store? How are they identified? How are they searched?
What's the primary key?
What's the shard-key?
How do we search the tables?
Are there any foreign keys or references?
What other indexes might we need?
How do we do cleanup?
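
The schema questions above can be made concrete with a small example. The sketch below uses Python's built-in `sqlite3` module; the `users` and `orders` tables, their columns, and the retention cutoff are illustrative assumptions only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,      -- primary key: how rows are identified
        email   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL REFERENCES users(user_id),  -- foreign key
        created_at  INTEGER NOT NULL,     -- Unix seconds
        total_cents INTEGER NOT NULL
    );
    -- Secondary index for the expected search: a user's orders by time.
    CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
    """
)

conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 1, 100, 500), (2, 1, 200, 700)],
)

# Cleanup: delete rows older than an assumed retention cutoff.
deleted = conn.execute("DELETE FROM orders WHERE created_at < ?", (150,)).rowcount
remaining = conn.execute(
    "SELECT order_id FROM orders WHERE user_id = ?", (1,)
).fetchall()
```

In a sharded deployment, `user_id` would be a plausible shard key here because the main query already filters on it.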


Message schema design

What information will the consumer need to do its job?
Tracing information
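
A message schema that carries both the consumer's payload and tracing information could look like the sketch below. The `OrderPlaced` event and all of its field names are hypothetical; the point is that the message is self-contained and carries a trace ID plus a schema version.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class OrderPlaced:
    # Payload: everything the consumer needs without calling back.
    order_id: str
    user_id: str
    total_cents: int
    # Tracing: propagated from the request that produced the event.
    trace_id: str
    # Versioning and de-duplication aids.
    schema_version: int = 1
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def encode(msg: OrderPlaced) -> str:
    return json.dumps(asdict(msg), sort_keys=True)


def decode(raw: str) -> OrderPlaced:
    return OrderPlaced(**json.loads(raw))


msg = OrderPlaced(order_id="o-1", user_id="u-7", total_cents=1250, trace_id="trace-abc")
round_tripped = decode(encode(msg))
```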


4) Decisions & trade-offs


Single point of failure? → Distributed Systems
Request rate too high? → Load Balancing
Too much data for one machine to store? → Sharding & Replication
Queries are expensive? Need to reduce load on DB? → Caching
Need to serve lots of static content? → CDNs
Need to decouple components? → Message Queues
Need to handle data center failures? → Multi-region
Does the client need to do too much polling? → Long-polling & WebSockets
What's the workload like? → CPU-intensive vs. IO-intensive
How much memory does the system need? → Memory Usage
Don't know what the bottleneck might be? → Monitoring & Alerting

5) Scaling your abstract design


Vertical scaling

Should you even consider it?


Horizontal scaling

How should your fleet be deployed?


Caching

Application caching
Database caching
Standalone


Load balancing

At which layer?
What algorithm should be used for balancing?
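
Two common answers to the algorithm question are round-robin and least-connections. The classes below are a simplified sketch with hypothetical names; real load balancers also handle health checks, weights, and concurrency.

```python
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def acquire(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


rr = RoundRobin(["a", "b", "c"])
rr_order = [rr.pick() for _ in range(4)]

lc = LeastConnections(["a", "b"])
picked = [lc.acquire(), lc.acquire()]
lc.release("b")            # "b" finishes its request first
next_pick = lc.acquire()   # so "b" is now the least loaded
```

Round-robin suits uniform, short requests; least-connections tends to behave better when request durations vary widely.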


Database replication

What's the cost?
Should you split reads from writes?


Database partitioning & sharding

On what basis should you partition?
Do you need sharding? If so, on what should you shard?
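
If you do shard, one widely used way to map keys to shards is consistent hashing, which limits how many keys move when a shard is added. The ring below is a minimal sketch; `HashRing`, the shard names, and the choice of 64 virtual nodes are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash (unlike built-in hash(), which varies between processes).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes: int = 64):
        # Each shard is placed at several points on the ring (virtual nodes)
        # to even out the key distribution.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring3 = HashRing(["s0", "s1", "s2"])
ring4 = HashRing(["s0", "s1", "s2", "s3"])
keys = [f"user:{i}" for i in range(1000)]
# Keys whose owner changed after adding s3; each one moves to s3 only.
moved = [k for k in keys if ring3.shard_for(k) != ring4.shard_for(k)]
```

With plain `hash(key) % shard_count`, by contrast, most keys would change shards whenever the shard count changes.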


Loose coupling via message queues

Which components need to talk to each other?
Domain events
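
The decoupling idea can be shown in miniature with an in-process queue: the producer publishes domain events and never calls the consumer directly. This is only a sketch using the standard library's `queue.Queue` as a stand-in for a real broker; the event names are hypothetical.

```python
import queue
import threading

events: queue.Queue = queue.Queue(maxsize=100)  # bounded, so a slow consumer blocks producers
processed = []


def consumer() -> None:
    while True:
        event = events.get()
        if event is None:  # sentinel: shut down
            events.task_done()
            break
        processed.append(event["type"])
        events.task_done()


worker = threading.Thread(target=consumer)
worker.start()

# The producer only knows the event schema, not who consumes it.
events.put({"type": "OrderPlaced", "order_id": "o-1"})
events.put({"type": "OrderShipped", "order_id": "o-1"})
events.put(None)
worker.join()
```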


CDNs

Push vs Pull


Background jobs via Scheduler

Detecting & fixing data corruption
Garbage Collection


Key topics for designing a system


Concurrency

Threads, deadlock, and starvation.
Shared nothing
Parallelize algorithms.
CSP, Actor model
Consistency and coherence.
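
As one concrete illustration of the deadlock item: two threads that each take two locks in opposite order can block each other forever, and a common remedy is to always acquire locks in a consistent global order. The sketch below assumes a hypothetical `Account` class and orders locks by `account_id`.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    if src is dst:
        return
    # Always lock the lower account_id first, whatever the transfer direction,
    # so two opposing transfers cannot each hold one lock and wait on the other.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        if src.balance < amount:
            raise ValueError("insufficient funds")
        src.balance -= amount
        dst.balance += amount


def run(src: Account, dst: Account, times: int) -> None:
    for _ in range(times):
        transfer(src, dst, 1)


alice, bob = Account(1, 1000), Account(2, 1000)
t1 = threading.Thread(target=run, args=(alice, bob, 500))
t2 = threading.Thread(target=run, args=(bob, alice, 500))
t1.start()
t2.start()
t1.join()
t2.join()
```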


Tools

Cloud
Databases & Queues
Caches
OS
Disks & SSDs
Containerization
Scheduler
REMEMBER LIMITS OF THE TOOLS


Estimation of Capacity

The best architecture is one where you can add or remove capacity on demand
Back-of-the-envelope calculation

Storage
Performance
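
A back-of-the-envelope calculation for storage and performance might look like the following. Every input number here is an assumption chosen for illustration (10 million daily users, 20 reads and 2 writes per user per day, 1 KB per write, 5-year retention, 3 replicas, peak load at 3x average); substitute the figures agreed during scoping.

```python
SECONDS_PER_DAY = 86_400

# Assumed inputs (hypothetical).
daily_active_users = 10_000_000
reads_per_user_per_day = 20
writes_per_user_per_day = 2
bytes_per_write = 1_000
retention_days = 5 * 365
replication_factor = 3
peak_factor = 3

# Performance: average and peak request rates.
write_qps = daily_active_users * writes_per_user_per_day / SECONDS_PER_DAY   # about 231
read_qps = daily_active_users * reads_per_user_per_day / SECONDS_PER_DAY     # about 2,315
peak_read_qps = read_qps * peak_factor                                       # about 6,900

# Storage: raw bytes written over the retention period, times replicas.
storage_bytes = (
    daily_active_users
    * writes_per_user_per_day
    * bytes_per_write
    * retention_days
    * replication_factor
)
storage_tb = storage_bytes / 1e12  # 109.5 TB
```

The read-to-write ratio of 10:1 in this example would point toward read replicas and caching rather than write sharding as the first scaling step.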


Availability & Reliability

Node failure → Health checks, Gossip-based detection & auto-healing
Network failure → Choose between availability & consistency (CAP theorem)
GC pauses or slow processes → Asynchrony, Timeouts
Split-brain → Resolvers, CRDTs, Last-write-wins
Load spikes → Auto-scaling
Cascading failures → Circuit breakers
Distributed mutual exclusion → Safety & liveness + Timeout & Fencing tokens
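
The circuit-breaker entry can be sketched as a small wrapper that fails fast after repeated errors and lets a trial call through once a cool-down has passed. `CircuitBreaker`, `CircuitOpenError`, and the thresholds below are hypothetical, and this version is not thread-safe.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, allow this one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast keeps a struggling dependency from tying up the caller's threads and connections, which is how cascading failures usually spread.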


Microservices Patterns

Eventual consistency
Backpressure
Optimistic concurrency (e.g. ETags), fencing tokens
Distributed transactions: two-phase commit, saga pattern, Dynamo model
Loose coupling: event sourcing, CQRS
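
Optimistic concurrency can be illustrated with a versioned write: a writer reads a value together with its version, and the write succeeds only if the version is unchanged. The in-memory `VersionedStore` below is a hypothetical sketch; a real store would have to perform the compare-and-write atomically.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the one the writer read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version: int) -> int:
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


store = VersionedStore()
version_a, _ = store.read("profile")   # writer A reads version 0
version_b, _ = store.read("profile")   # writer B reads version 0
store.write("profile", "from A", expected_version=version_a)  # succeeds
try:
    store.write("profile", "from B", expected_version=version_b)  # stale version
    conflict = False
except VersionConflict:
    conflict = True  # writer B must re-read and retry
```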


Security

Authentication & Authorization
Untrusted input validation
Least privilege
Sandboxing
Encryption (in-transit & at-rest)
Integrity (signatures)
Denial-of-Service → Rate-limiting
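
Rate-limiting is commonly implemented with a token bucket, which permits short bursts while capping the sustained rate. The sketch below is a single-process version with hypothetical names; `rate` is tokens added per second and `capacity` is the maximum burst.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: a burst of 2 is allowed, the third request is rejected.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
```

In a multi-node deployment the bucket state would typically live in a shared store, usually with one bucket per client or API key.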


Visibility

Log aggregation: Splunk
Application metrics: New Relic, Datadog, Prometheus
Distributed Tracing: request-ids, breadcrumb-trail
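
Request IDs are most useful when every log line for a request carries the same ID, so the aggregator can stitch the lines back together. One way to sketch this in Python is with `contextvars` and a `logging.Filter`; the logger name, the log format, and the `handle_request` function are assumptions for illustration.

```python
import contextvars
import io
import logging
import uuid

request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app.tracing_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(levelname)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.handlers = [handler]
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> None:
    # Reuse the caller's ID if one was passed along, otherwise start a new one.
    token = request_id.set(incoming_id or str(uuid.uuid4()))
    try:
        logger.info("charging card")
        logger.info("order stored")
    finally:
        request_id.reset(token)


stream = io.StringIO()
log = build_logger(stream)
handle_request(log, incoming_id="req-123")
lines = stream.getvalue().splitlines()
```

For the trail to work across services, the same ID must also be forwarded on outgoing calls, for example in a request header.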