@ardiereally
Last active July 21, 2021 18:50

System Design Cheatsheet

Picking the right architecture = Picking the right battles + Managing trade-offs

Basic Steps

1) Clarify and agree on the scope of the system

  • Use cases

    • Who is going to use it?
    • How are they going to use it?
      • How would you use it?
    • Which features might be required?
  • Constraints

    • What's in scope, what's out of scope?
    • Scale of the system (Requests per sec, data written, data read, storage required).
    • Special system requirements (multi-threading, read- or write-oriented).
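The scale questions above usually reduce to a little arithmetic. A minimal sketch, assuming hypothetical inputs (10 million daily users, 20 requests per user per day, 10% of requests writing 1 KB each, and a peak-to-average ratio of 2); none of these figures come from the list above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(daily_users, requests_per_user, write_ratio, bytes_per_write, peak_factor=2.0):
    """Back-of-the-envelope load and storage estimate from assumed inputs."""
    requests_per_day = daily_users * requests_per_user
    avg_rps = requests_per_day / SECONDS_PER_DAY
    writes_per_day = requests_per_day * write_ratio
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,  # assumed peak-to-average ratio
        "writes_per_day": writes_per_day,
        "storage_bytes_per_year": writes_per_day * bytes_per_write * 365,
    }


result = estimate(
    daily_users=10_000_000,
    requests_per_user=20,
    write_ratio=0.1,
    bytes_per_write=1_000,
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

With these assumed inputs the estimate comes to about 2,300 requests per second on average and roughly 7.3 TB of new data per year, which is already enough to decide whether a single machine is plausible.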

2) High level architecture

  • Sketch the important components and the connections between them, but don't go into the details yet.
    • Split the system into domains
    • Connect the domains (maintaining loose coupling, single responsibility, etc.).
    • Design the component for each domain (service, storage, CDN etc.).

3) Component Design

  • Component interface (APIs, messages etc.)
  • Object oriented design for functionalities.
    • Map features to modules.
    • Design each module:
      • Class Design
        • Single Responsibility, YAGNI, Open-Closed
        • Liskov Substitution, Interface Segregation
        • Dependency Inversion
      • Package Cohesion
        • Reuse-Release Equivalence, Common Closure, Common Reuse
      • Inter-Package Coupling
        • Acyclic Dependencies, Stable Dependencies, Stable Abstractions
  • Data model
    • What kind of data are we storing?
    • How is it going to be accessed?
    • Is the data continuous or quantized?
      • DBs work best with quantized data
      • Continuous data (like latitude-longitude) can be quantized by the application
  • Database schema design.
    • Which pieces of data do we need to store? How are they identified? How are they searched?
    • What's the primary key?
    • What's the shard-key?
    • How do we search the tables?
    • Are there any foreign keys or references?
    • What other indexes might we need?
    • How do we do cleanup?
  • Message schema design
    • What information will the consumer need to do its job?
    • Tracing information
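To make the data-model point about quantizing continuous data concrete, here is a minimal sketch that maps a latitude-longitude pair onto an integer grid cell. The function name `grid_cell` and the 0.01-degree cell size are assumptions chosen for illustration; a real system might use a geohash or a similar scheme instead.

```python
import math


def grid_cell(lat, lon, cell_degrees=0.01):
    """Map a continuous (lat, lon) pair to an integer (row, col) grid cell.

    cell_degrees is an assumed cell size; 0.01 degrees of latitude is
    roughly 1.1 km north-south.
    """
    row = math.floor((lat + 90.0) / cell_degrees)    # shift so rows start at 0
    col = math.floor((lon + 180.0) / cell_degrees)   # shift so columns start at 0
    return row, col


# Two nearby points fall into the same cell; a point about 1 km north does not.
a = grid_cell(37.7749, -122.4194)
b = grid_cell(37.7751, -122.4199)
c = grid_cell(37.7849, -122.4194)
print(a, b, c)
```

The resulting pair can be stored and indexed like any other discrete key, and "nearby" queries become equality lookups on a cell and its neighbors.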

4) Decisions & trade-offs

  • Single point of failure? → Distributed Systems
  • Request rate too high? → Load Balancing
  • Too much data for one machine to store? → Sharding & Replication
  • Queries are expensive? Need to reduce load on DB? → Caching
  • Need to serve lots of static content? → CDNs
  • Need to decouple components? → Message Queues
  • Need to handle data center failures? → Multi-region
  • Does the client need to do too much polling? → Long-polling & websockets
  • What's the workload like? → CPU-intensive vs. IO-intensive
  • How much memory does the system need? → Memory Usage
  • Don't know what the bottleneck might be? → Monitoring & Alerting
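As one illustration of the caching trade-off above, a minimal cache-aside sketch with time-based expiry. The class name `TTLCache` and the `loader` callback are hypothetical; a production setup would more likely use a shared cache service, and this sketch ignores eviction and concurrency.

```python
import time


class TTLCache:
    """Tiny in-process cache-aside helper with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit: skip the expensive query
        value = loader(key)         # cache miss or expired: go to the backing store
        self._entries[key] = (now + self.ttl, value)
        return value
```

The cost of this approach is staleness: a value can be up to `ttl_seconds` out of date, so the TTL is itself a consistency-versus-load decision.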

5) Scaling your abstract design

  • Vertical scaling
    • Should you even consider it?
  • Horizontal scaling
    • How should your fleet be deployed?
  • Caching
    • Application caching
    • Database caching
    • Standalone cache
  • Load balancing
    • At which layer?
    • What algorithm should be used for balancing?
  • Database replication
    • What's the cost?
    • Should you split reads from writes?
  • Database partitioning & sharding
    • On what basis should you partition?
    • Do you need sharding? If so, on what should you shard?
  • Loose coupling via message queues
    • Which components need to talk to each other?
    • Domain events
  • CDNs
    • Push vs Pull
  • Background jobs via Scheduler
    • Detecting & fixing data corruption
    • Garbage Collection
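For the sharding question above, one common way to pick a shard is consistent hashing, which keeps most keys in place when a node is added. The sketch below is an illustration under assumptions: the names `HashRing` and `node_for`, the use of SHA-256, and the 100 virtual nodes per shard are choices made here, not something the list prescribes.

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash (Python's built-in hash() of str varies per process)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Place each node on the ring at `vnodes` points to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["db-a", "db-b", "db-c"])
print(ring.node_for("user:42"))
```

When a fourth node joins, a key either stays on its old shard or moves to the new one; it never moves between two existing shards, which is what limits the amount of data to migrate.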

Key topics for designing a system

  1. Concurrency

    • Threads, deadlock, and starvation.
    • Shared nothing
    • Parallelize algorithms.
    • CSP, Actor model
    • Consistency and coherence.
  2. Tools

    • Cloud
    • Databases & Queues
    • Caches
    • OS
    • Disks & SSDs
    • Containerization
    • Scheduler
    • REMEMBER LIMITS OF THE TOOLS
  3. Estimation of Capacity

    • The best architecture is one where you can add or remove capacity on demand
    • Back-of-the-envelope calculation
      • Storage
      • Performance
  4. Availability & Reliability

    • Node failure → Health checks, Gossip-based detection & auto-healing
    • Network failure → Choose between availability & consistency (CAP theorem)
    • GC pauses or slow processes → Asynchrony, Timeouts
    • Split-brain → Resolvers, CRDTs, Last-write-wins
    • Load spikes → Auto-scaling
    • Cascading failures → Circuit breakers
    • Distributed mutual exclusion → Safety & liveness + Timeout & Fencing tokens
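The circuit-breaker entry above can be sketched as a small wrapper that fails fast after repeated errors and lets a trial call through after a cool-down. The class names, the threshold of 3 failures, and the 30-second reset are assumptions for illustration; the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("failing fast; dependency presumed unhealthy")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast protects the caller's threads and gives the struggling dependency room to recover, which is how the breaker stops one failure from cascading.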
  5. Microservices Patterns

    • Eventual consistency
    • Backpressure
    • Optimistic concurrency (ETags, version checks), fencing tokens
    • Distributed transactions: two-phase commit, saga pattern, Dynamo model
    • Loose coupling: event sourcing, CQRS
  6. Security

    • Authentication & Authorization
    • Untrusted input validation
    • Least privilege
    • Sandboxing
    • Encryption (in-transit & at-rest)
    • Integrity (signatures)
    • Denial-of-Service → Rate-limiting
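Rate-limiting, as in the last entry above, is often implemented with a token bucket. A minimal single-process sketch, assuming a hypothetical `TokenBucket` class; a distributed deployment would need shared state, which is out of scope here.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                    # caller should reject or delay the request
```

`capacity` bounds the burst size and `rate_per_second` bounds the sustained rate, so the two can be tuned independently per client or per endpoint.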
  7. Visibility

    • Log aggregation: Splunk
    • Application metrics: New Relic, Datadog, Prometheus
    • Distributed Tracing: request-ids, breadcrumb-trail
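The request-id idea in the tracing entry above can be sketched with the standard library alone: keep the id in a context variable and stamp it onto every log line. The names `request_id_var`, `RequestIdFilter`, and `handle_request` are hypothetical, and a real service would also forward the id on outgoing calls (for example in a header).

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addFilter(RequestIdFilter())
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
logger.addHandler(handler)


def handle_request(incoming_id=None):
    # Reuse the caller's id if one was sent; otherwise start a new trail.
    request_id = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("handling request")
        return request_id
    finally:
        request_id_var.reset(token)


handle_request("abc123")
```

Because every line carries the same id, a log aggregator can reassemble one request's trail across services by searching for that id.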