System Design Cheatsheet
Step One: Framing The Problem a.k.a get the MVP
- Identify the use cases that are in scope
- Determine constraints based on scoped use cases
use case : the things your system needs to be do.
constraints : the things your system will have to consider to be able to do stuff
Break down features into simple tasks to get your use cases. E.g: url shortening can be broken down into redirection and shortening
Find some extra use cases to determine what's out of scope. E.g: high availability, analytics, custom urls
Notes: This is a clarifying step. Get as clear as you can on your scope.
- Determine what your system needs to handle in terms of traffic and data to clarify direction for abstract design
First solve for units per month, then per second
traffic : Requests to be handled
data : Information you need to store
Get some idea of how popular your system might be/can be assumed to be and guess at the ballpark number for overall traffic. Use the Scale numbers below to help.
Figure out the % of traffic that goes to each use case. Is it write-heavy or read-heavy? Chances are it will be read-heavy. E.g: 10% of traffic will be shortening, 90% will be redirection.
Calculate the numbers. Try to get rough quantities for each use case per month and per second. E.g: 100+ mil requests per month -> 400+ requests per second.
More calculating, now for data. Use your percentages for the ballpark, use knowledge of data format to calculate storage space. E.g: if 10% of requests are for shortening and we get 400 requests per second, that means 40 per second are new urls. If urls are made of chars, which are each 1 byte... so on so forth.
Helpful Numbers To Know:
- No. of Facebook users: ~1.3bil
- No. of Twitter users: ~650mil
- No. of tweets per day: 500mil
- 1 byte -> 8 bits
char-> 1 byte
int-> 4 bytes
- 1kb (kilobyte) -> 1000 bytes
- 1mb (megabyte) -> 1mil bytes
- 1gb (gigabyte) -> 10bil bytes
- 1tb (terabyte) -> 1000bil bytes or 1k bil bytes
Step Two: Abstract Design
- Identify the components needed to address all use cases
- Establish relationships between components
- (Optional) - Narrow down choices for technologies. Better to leave this for later, as it can lead you into a more granular rabbit hole than is helpful.
You will have two main layers:
Application/Service Layer - this is the layer serves up requests
Data Storage/Persistence Layer - handles all yo' data
- Start with one use case and figure out what services it needs to be able to work.
- Elaborate on how each service will work, and how it will interact with the data layer.
- Take this opportunity to clarify how the data layer should behave.
- Repeat with other use cases.
NOTE: Optional is presentation layer. Not a lot of people will ask for this.
Step Three(A): Review (and Choose Tech)
- Review current design and get feedback if possible or direction on where to go next
Go over what you have and clean up the drawings. Now depending on your interviewer, you can either fix what you have, work on more features, or pick a rudimentary tech stack. You might need to flesh out more specifically how the relationships work, what database you are using, etc.
Step Three(B): Bottlenecks
- Identify potential scaling issues
Going back to
data. Are you read-heavy or write-heavy? Are you bound by the requests(IO-bound) or by processing data (CPU-bound)?
What are the bottlenecks and single points of failure? I.e if something goes down, what is going to cause the most downtime?
If you've picked tech, what are their disadvantages?
NOTE: Add section on some useful tech and the problems they're good for here
Step Four: Scaling
- Prove that you can scale your design
Basic Concepts and the Problems They Solve:
- Redundancy: Having copies of data so you can back shit up. This solves the single point of failure and data loss.
- Distribution: Have more things doing your stuff concurrently. This solves a high amount of load and also availability/downtime problems.
- Caching: Storing stuff so you don't have to go so far to get it again. This solves for speed and high load.
** NOTE: Everything after this is just a draft. Will be cleaned up, but not as organized as it could be.**
Application Service Layer (DRAFT)
- Always have a load balancer, or multiple load balancers
- Load balancers distribute requests among several servers
- For persistent sessions, load balancers can stick some kind of encrypted server ID in the cookie and parse it on each request to determine if user should be sent to a particular server
- If you have two load balancers (should be enough), there are two architectures for high availability:
Active - Active: both LBs are active, share load and send heartbeats to each other. When one goes down, the other notices and assumes full responsibility for requests.
Active - Passive: one active LB and one passive LB. Active LB sends heartbeats to passive one. When it goes down, the other notices and promotes itself to active.
- Services: General rule of thumb is to split out services according to read/write.
- Why? Writes take significantly longer than reads and you can't cache writes.
- Reduce how much you touch the data storage layer if at all possible. Use caching.
- Caches can get full. Get around this by expiring objects (like sessions) or using an LRU cache.
- You can cache queries, values, even whole files (though you probably don't want to do that.)
- There are different types of caches.
- A Note on Partitioning: You can partition according to many things: Users names (not that helpful), but most (probably) useful at the macro scale is geographically. Think data centers in different regions.
- Keep in mind how you share state among servers.
Data Persistence Layer (DRAFT)
- Rule One: Do not shard unless you absolutely have to.
- Add an index to be able to look things up better in the table.
- An index is a data structure. A data structure. Built from database columns. It lets you look things up faster instead of going through everything. Often a B-Tree.
- Why don't we just index everything, wouldn't that be better? NOPE. That makes updating the database SUPER expensive, cause then we have to update the index too.
- So index smart.
- Replication - multiple databases. A way of automatic copying of data. REDUNDANCY!
Master - Slave: Master DB is the only one that can be written to. Writes trickle down to slaves DBs, which are read from. This allows reads to be distributed. Note it's still a single point of failure.
- However, if Master goes down, we can promote a Slave to Master until we fix it. High availability?
Master - Master: Two masters that can be written to, and they both have Slaves. Some synchronicity problems, but more writes.
- Partitioning: Different databases for different things?
- Partition smart. If we do it alphabetically for users, how many users have names starting with 'X'?
- Some effective segmentation strategies: Domain (Harvard students vs MIT students like Facebook did for the early days), Denormalized Data (User profiles, etc)
- Some downsides: Locality. It takes longer to go between databases, and what if you need info from a bunch of places? How might we be able to offset this?
- The same problems present themselves over and over again.
- High load, single point of failure, availability, synchronization of data/info
- The same techniques apply in many different ways.