Picking the right architecture = Picking the right battles + Managing trade-offs
-
Use cases
- Who is going to use it?
- How are they going to use it?
- How would you use it?
- Which features might be required?
-
Constraints
- What's in scope, what's out of scope?
- Scale of the system (Requests per sec, data written, data read, storage required).
- Special system requirements (multi-threading, read or write oriented).
- Sketch the important components and the connections between them, but don't go into too much detail.
- Split the system into domains
- Connect the domains (maintaining loose coupling, single responsibility, etc.).
- Design the component for each domain (service, storage, CDN etc.).
- Component interface (APIs, messages etc.)
- Object-oriented design for functionality.
- Map features to modules.
- Design each module:
- Class Design
- Single Responsibility, YAGNI, Open-Closed
- Liskov Substitution, Interface Segregation
- Dependency Inversion
- Package Cohesion
- Release Reuse, Common Closure, Common Reuse
- Inter-Package Coupling
- Acyclic Dependencies, Stable Dependencies, Stable Abstractions
- Data model
- What kind of data are we storing?
- How is it going to be accessed?
- Is the data continuous or quantized?
- DBs work best with quantized data
- Continuous data (like latitude-longitude) can be quantized by the application
- Database schema design.
- Which pieces of data do we need to store? How are they identified? How are they searched?
- What's the primary key?
- What's the shard-key?
- How do we search the tables?
- Are there any foreign keys or references?
- What other indexes might we need?
- How do we do cleanup?
- Message schema design
- What information will the consumer need to do its job?
- Tracing information
- Single point of failure? → Distributed Systems
- Request rate too high? → Load Balancing
- Too much data for one machine to store? → Sharding & Replication
- Queries are expensive? Need to reduce load on DB? → Caching
- Need to serve lots of static content? → CDNs
- Need to decouple components? → Message Queues
- Need to handle data center failures? → Multi-region
- Does the client need to do too much polling? → Long-polling & websockets
- What's the workload like? → CPU-intensive vs. IO-intensive
- How much memory does the system need? → Memory Usage
- Don't know what the bottleneck might be? → Monitoring & Alerting
- Vertical scaling
- Should you even consider it?
- Horizontal scaling
- How should your fleet be deployed?
- Caching
- Application caching
- Database caching
- Standalone
- Load balancing
- At which layer?
- What algorithm should be used for balancing?
- Database replication
- What's the cost?
- Should you split reads from writes?
- Database partitioning & sharding
- On what basis should you partition?
- Do you need sharding? If so, on what should you shard?
- Loose coupling via message queues
- Which components need to talk to each other?
- Domain events
- CDNs
- Push vs Pull
- Background jobs via Scheduler
- Detecting & fixing data corruption
- Garbage Collection
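
The data-model point above, that continuous data such as latitude-longitude can be quantized by the application, can be sketched as follows. This is a minimal illustration only: the function names and the 0.01-degree cell size are assumptions, not something the notes prescribe.

```python
import math
from typing import Tuple


def quantize(lat: float, lon: float, cell_deg: float = 0.01) -> Tuple[int, int]:
    """Map a continuous coordinate to an integer grid cell (hypothetical helper).

    cell_deg is an assumed cell size in degrees; pick it from the
    precision the queries actually need.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def cell_key(lat: float, lon: float, cell_deg: float = 0.01) -> str:
    """Build a string key that can be indexed or used as a shard key."""
    row, col = quantize(lat, lon, cell_deg)
    return f"{row}:{col}"


# Two nearby points land in the same cell, so an exact-match lookup
# on the key finds both.
a = cell_key(37.7749, -122.4194)
b = cell_key(37.7741, -122.4199)
print(a, b, a == b)
```

A lookup for "things near here" then becomes an equality query on the cell key (plus its neighbouring cells), which an ordinary index handles well.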
-
Concurrency
- Threads, deadlock, and starvation.
- Shared nothing
- Parallelize algorithms.
- CSP, Actor model
- Consistency and coherence.
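
As a rough sketch of the CSP / shared-nothing style listed above: workers share no mutable state and communicate only through queues. The function name, the squaring workload, and the sentinel are assumptions made for illustration; in CPython the GIL means threads like these help with IO-bound work more than CPU-bound work.

```python
import queue
import threading


def run_pipeline(items, workers=4):
    """Square each item using workers that only communicate via queues."""
    tasks = queue.Queue()
    results = queue.Queue()
    stop = object()  # sentinel telling a worker to exit

    def worker():
        while True:
            item = tasks.get()
            if item is stop:
                break
            results.put(item * item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(stop)  # one sentinel per worker
    for t in threads:
        t.join()

    # All workers have exited, so the results queue is no longer changing.
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)


print(run_pipeline(range(5)))
```

Because no lock is ever held by user code, there is no lock ordering to get wrong; the queues are the only synchronization points.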
-
Tools
- Cloud
- Databases & Queues
- Caches
- OS
- Disks & SSDs
- Containerization
- Scheduler
- REMEMBER LIMITS OF THE TOOLS
-
Estimation of Capacity
- The best architecture is one where you can add or remove capacity on demand
- Back-of-the-envelope calculation
- Storage
- Performance
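
A back-of-the-envelope calculation for storage and request rate might look like the sketch below. Every input number is a made-up assumption for illustration (user count, writes per user, payload size, retention, peak factor); substitute your own.

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_users, writes_per_user, reads_per_write,
             bytes_per_write, retention_days, peak_factor=2.0):
    """Return rough request rates and storage for the assumed workload."""
    writes_per_day = daily_users * writes_per_user
    avg_write_rps = writes_per_day / SECONDS_PER_DAY
    avg_read_rps = avg_write_rps * reads_per_write
    storage_bytes = writes_per_day * bytes_per_write * retention_days
    return {
        "avg_write_rps": avg_write_rps,
        "avg_read_rps": avg_read_rps,
        # Peak is a guess: average multiplied by an assumed factor.
        "peak_write_rps": avg_write_rps * peak_factor,
        "storage_tb": storage_bytes / 1e12,
    }


# Hypothetical workload: 1M daily users, 10 writes each, 10 reads per
# write, 1 KB per write, kept for one year.
est = estimate(1_000_000, 10, 10, 1_000, 365)
for name, value in est.items():
    print(f"{name}: {value:,.2f}")
```

With these assumed inputs the result is roughly 116 writes/s on average, about 1,160 reads/s, and 3.65 TB of storage per year before replication and indexes.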
-
Availability & Reliability
- Node failure → Health checks, Gossip-based detection & auto-healing
- Network failure → Choose between availability & consistency (CAP theorem)
- GC pauses or slow processes → Asynchrony, Timeouts
- Split-brain → Resolvers, CRDTs, Last-write-wins
- Load spikes → Auto-scaling
- Cascading failures → Circuit breakers
- Distributed mutual exclusion → Safety & liveness + Timeout & Fencing tokens
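
For the "Cascading failures → Circuit breakers" item, a minimal single-threaded sketch is shown below. The class name, thresholds, and injectable clock are assumptions for illustration; production libraries add thread safety, metrics, and finer half-open handling.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps callers from piling up on a dependency that is already struggling, which is how one slow service otherwise drags down its callers.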
-
Microservices Patterns
- Eventual consistency
- Backpressure
- Optimistic concurrency, fencing tokens (etags)
- Distributed transactions: two-phase commit, saga pattern, Dynamo model
- Loose coupling: event sourcing, CQRS
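
Optimistic concurrency with a version number (the role an etag plays over HTTP) can be sketched as below. The in-memory store and its method names are hypothetical; a real datastore would perform the compare-and-set atomically on the server side.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); version 0 means the key does not exist."""
        return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        """Write only if nobody else has written since we read."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


store = VersionedStore()
v_a, _ = store.get("cart")          # writer A reads version 0
v_b, _ = store.get("cart")          # writer B reads version 0
store.put("cart", ["book"], v_a)    # A wins and bumps the version to 1
try:
    store.put("cart", ["pen"], v_b)  # B is stale and must re-read and retry
except VersionConflict as exc:
    print("conflict:", exc)
```

No lock is held between read and write; the cost of contention is paid only by the writer that loses the race.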
-
Security
- Authentication & Authorization
- Untrusted input validation
- Least privilege
- Sandboxing
- Encryption (in-transit & at-rest)
- Integrity (signatures)
- Denial-of-Service → Rate-limiting
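
One common way to implement the rate limiting mentioned above is a token bucket. The sketch below is a single-process illustration; the class name, parameters, and injectable clock are assumptions, and a distributed deployment would need shared state (for example in a cache).

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so tests can fake time
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Typically there is one bucket per client key (API token, IP address), so a single noisy client exhausts only its own budget.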
-
Visibility
- Log aggregation: Splunk
- Application metrics: New Relic, Datadog, Prometheus
- Distributed Tracing: request-ids, breadcrumb-trail
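
Request-id propagation can be sketched with the standard library alone. The `X-Request-Id` header name, the logger name, and the helper functions are assumptions for illustration; a real tracing system also records spans and timing.

```python
import contextvars
import logging
import sys
import uuid

# Holds the id of the request currently being handled in this context.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("request_id=%(request_id)s %(message)s")
    )
    logger = logging.getLogger("tracing_sketch")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(headers, logger):
    """Reuse the caller's request id if present, otherwise mint one."""
    rid = headers.get("X-Request-Id") or uuid.uuid4().hex
    token = request_id_var.set(rid)
    try:
        logger.info("handling request")
        # Pass the same id on to any downstream call.
        return {"X-Request-Id": rid}
    finally:
        request_id_var.reset(token)


log = build_logger(sys.stdout)
handle_request({"X-Request-Id": "abc123"}, log)
```

Once every service logs the same id, the log aggregator can reassemble the path a single request took through the system.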