@gernd
Last active October 18, 2022 23:49
My notes for the Harvard scalability lecture (https://www.youtube.com/watch?v=-W9F__D3oY4)

Scalability notes

Notes for the Harvard scalability lecture (https://www.youtube.com/watch?v=-W9F__D3oY4)

  • Things to look for in a hosting company
    • accessible in your country
    • access via SFTP (not plain FTP, which sends credentials unencrypted)
    • decide whether you need a physical machine of your own or whether a shared host is sufficient
    • VPS: your own virtual system (e.g. VMware), isolated from the other virtual systems running on the same hardware
    • Shared web host: the host is shared with other users

Vertical Scaling

  • get more RAM / processors (multicore) / faster HD for the server
  • constrained by the state-of-the-art technology available for a single machine

Horizontal Scaling

  • using a lot of cheaper hardware instead of using only a few high end machines
  • inbound HTTP requests from different clients are distributed amongst the available machines
  • Component performing this distribution is called Load Balancer
  • DNS Server for www.yourwebsite.com should return the IP address of the Load Balancer to make sure that every request is handled by it and load balancing is transparent for clients
  • the machines handling the requests can have private IP addresses; only the Load Balancer needs a public one
  • Decision for routing incoming HTTP requests could be made depending on
    • Load: which server is currently the least busy -> all servers have to be identical
    • Type of request: use the hostname to decide which server the request is routed to (one for images, one for HTML, one for JS...)
    • Round Robin: use DNS to return a different IP for every lookup; can be configured quite easily with e.g. BIND
  • Disadvantages of Round Robin via DNS:
    • one of the servers could get all the computationally expensive requests while the other servers only get requests for static files
    • DNS answers are often cached by clients (OSes, browsers) for as long as the TTL allows, so a client keeps hitting the same server
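
The two routing strategies above can be compared in a small simulation. This is only a sketch: the backend IPs, class names and the `finish` callback are made up for illustration and do not come from the lecture.

```python
from itertools import cycle

# Hypothetical private backend addresses; not taken from the lecture.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation, ignoring their load."""

    def __init__(self, backends):
        self._rotation = cycle(backends)

    def pick(self):
        return next(self._rotation)


class LeastBusyBalancer:
    """Tracks open requests per backend and picks the least busy one."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        # Ties resolve to the backend listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def finish(self, backend):
        # Called when a request completes, freeing capacity on that backend.
        self.active[backend] -= 1


rr = RoundRobinBalancer(BACKENDS)
rr_picks = [rr.pick() for _ in range(4)]

lb = LeastBusyBalancer(BACKENDS)
first = lb.pick()    # all idle, so the first backend is chosen
second = lb.pick()   # first backend is busy now
lb.finish(first)
third = lb.pick()    # first backend is idle again
```

Round robin needs no feedback from the servers, which is why it can even be done in DNS. The load-based variant needs to know when requests finish.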

Sessions

  • Load Balancing breaks sessions (each machine keeps its own sessions, saved as serialized text files in its own /tmp)
  • Consecutive requests may be handled by different machines, so the client ends up with a different session on each of them
  • Possible solution: only one server contains sessions, the others contain images ...
  • Problem with this solution: no redundancy for the PHP server, and it can only handle a certain amount of load
  • Another solution: One dedicated server is responsible for storing sessions
  • Load Balancer handles sessions <- Single Point of Failure
  • Solution: Load Balancer stores data on a RAID
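
A toy simulation of the problem and of the "dedicated session storage" fix. Everything here is hypothetical (`AppServer`, the session ID, the shared dict standing in for a session server); it only illustrates why per-machine session storage breaks behind a load balancer.

```python
from itertools import cycle


class AppServer:
    """A web server that keeps sessions in its own local storage by default."""

    def __init__(self, name, session_store=None):
        self.name = name
        # A local dict stands in for the per-machine session files in /tmp.
        self.sessions = session_store if session_store is not None else {}

    def login(self, session_id, user):
        self.sessions[session_id] = {"user": user}

    def current_user(self, session_id):
        session = self.sessions.get(session_id)
        return session["user"] if session else None


def simulate(servers):
    """Log in on one request, then read the session on the next one."""
    rotation = cycle(servers)  # round-robin stand-in for the load balancer
    next(rotation).login("abc123", "alice")
    return next(rotation).current_user("abc123")


# Each server has its own sessions: the second request lands elsewhere.
lost = simulate([AppServer("web1"), AppServer("web2")])

# Both servers use one shared store (a dict standing in for a session server).
shared = {}
kept = simulate([AppServer("web1", shared), AppServer("web2", shared)])
```

With local storage `lost` is `None`, with the shared store `kept` is `"alice"`. The shared store is of course the new single point of failure.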

RAID - Redundant array of independent disks

  • RAID 0: at least two harddrives of identical size. Data is striped across them. Motivation: each HD takes some time to write data, so waiting time can be reduced by writing one stripe of data to each disk in turn. No redundancy: if one HD fails, the data is lost
  • RAID 1: data is mirrored across two harddrives. Effect: a little performance overhead because data has to be written out twice, but the data is saved redundantly on two HDs. The RAID array can rebuild itself if one HD is damaged and has to be replaced
  • RAID 10: combination of RAID 0 and RAID 1 (striping across mirrored pairs). At least 4 HDs are needed
  • RAID 5: striping with parity, needs at least 3 HDs. Only one HD's worth of capacity is used for redundancy, the rest can be fully used. Survives the failure of one HD
  • RAID 6: like RAID 5, but two HDs' worth of capacity is used for redundancy, so it survives the failure of two HDs
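
How one disk's worth of redundancy can protect several data disks (the RAID 5 idea) is easiest to see with XOR parity. A minimal sketch, assuming a dedicated parity block for simplicity (real RAID 5 rotates parity across all disks); the helper name and block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per hypothetical data disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # stored on a fourth disk

# Simulate losing the second disk and rebuilding it from survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

`rebuilt` equals the lost block `b"BBBB"`, because XOR-ing the parity with all surviving blocks cancels them out and leaves the missing one.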

Load Balancer

  • can be implemented in hardware or in software
  • prices of hardware load balancers go up to $100,000

Shared Storage / Sticky sessions

  • cookies could contain all necessary information -> violates privacy / cookies have a limited size
  • -> use a DB to store session data on the server side
  • use several fileservers -> sync needed
  • client encodes "its" server in the URL/cookie for every request -> private IPs of the servers are exposed / can change
  • Solution: set a random ID in the cookie. The Load Balancer maps IDs to available servers -> the client doesn't know the actual server IP
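
The cookie-ID solution could look roughly like this. A sketch under assumptions: the cookie name `lb_id`, the class and the backend IPs are invented, and a real load balancer would also handle backends going away.

```python
import secrets


class StickyBalancer:
    """Maps an opaque cookie value to a backend so clients never see IPs."""

    COOKIE = "lb_id"  # hypothetical cookie name

    def __init__(self, backends):
        self.backends = list(backends)
        self.assignments = {}  # cookie value -> backend address
        self._next = 0

    def route(self, cookies):
        """Return (backend, cookies to send back) for one request."""
        token = cookies.get(self.COOKIE)
        if token not in self.assignments:
            # New or unknown client: issue a random ID and assign a backend.
            token = secrets.token_hex(8)
            self.assignments[token] = self.backends[self._next % len(self.backends)]
            self._next += 1
        return self.assignments[token], {self.COOKIE: token}


balancer = StickyBalancer(["10.0.0.1", "10.0.0.2"])
backend_a, cookies_a = balancer.route({})       # first visit, no cookie yet
backend_again, _ = balancer.route(cookies_a)    # same client returns
backend_b, cookies_b = balancer.route({})       # a different client
```

The returning client lands on the same backend again, and the cookie only carries a random token, never a server address.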

PHP

  • Interpreted languages are not as fast as compiled ones
  • PHP accelerators exist that precompile scripts into bytecode and cache it
  • similar to Python (which caches compiled bytecode in .pyc files)
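
The bytecode idea can be shown with Python itself, since the notes compare the two. This is only an analogy, not how a PHP accelerator is implemented; the expression and variable names are made up.

```python
import dis

SOURCE = "price * quantity"  # hypothetical expression, stands in for a script

# Parse and compile once; this is the step an accelerator avoids repeating.
bytecode = compile(SOURCE, "<example>", "eval")

# Reuse the cached code object for many "requests" without re-parsing.
totals = [eval(bytecode, {"price": 5, "quantity": n}) for n in (1, 2, 3)]

# The instruction names show that the source text is no longer needed.
opnames = [instruction.opname for instruction in dis.get_instructions(bytecode)]
```

`totals` is `[5, 10, 15]`. The source string is parsed only once, no matter how often the code object is evaluated.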

Caching

  • HTML
    • prerender sites so that rendering and DB interaction does not have to be performed upon every HTTP request
    • prerendered sites are cached as files
    • downsides: additional space needed
    • changing CSS/images/static elements requires regeneration of all cached sites
  • MySQL Caching
    • query cache can be enabled in my.cnf (MySQL 5.x; the query cache was removed in MySQL 8.0)
    • caches the results of identical queries
  • memcached
    • in-memory cache
    • results that are expensive to fetch (e.g. complex db queries) can be stored in RAM
    • cache can get so big that it does not fit in RAM anymore, so old objects have to be removed
    • objects can have an expiration time and can be garbage collected after a certain amount of time
    • every time the cache is hit, the object's expiration time can be increased
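
The expiration behaviour described above, as a small in-process sketch. This is not the memcached API; `ExpiringCache`, `expensive_query` and the fake clock are assumptions for illustration only.

```python
import time


class ExpiringCache:
    """In-memory cache whose entries expire; hits extend their lifetime."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example can fake time
        self._items = {}        # key -> (value, expires_at)

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]    # expired: drop it, report a miss
            return None
        self.set(key, value)        # cache hit: push the expiration out
        return value


def expensive_query(user_id):
    """Hypothetical stand-in for a slow database query."""
    return {"id": user_id, "name": "alice"}


now = [0.0]                         # fake clock for a deterministic demo
cache = ExpiringCache(ttl_seconds=10, clock=lambda: now[0])

cache.set("user:1", expensive_query(1))
now[0] = 8
hit = cache.get("user:1")           # still fresh; expiry moves to t=18
now[0] = 15
hit_again = cache.get("user:1")     # fresh only because the hit extended it
now[0] = 40
miss = cache.get("user:1")          # expired and evicted
```

Frequently used objects stay cached, while objects nobody asks for expire and free their memory.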

MySQL Storage Engines

  • e.g. InnoDB (supports transactions) vs. MyISAM (uses full table locks)
  • other engines: Memory, Archive, NDB
  • engine properties: locking granularity, MVCC support, geospatial data type support, available index types ...
  • Archive: tables are compressed -> need less space but increase query time

DB Replication

  • Master-Slave
    • master performs read/write actions
    • data is replicated to slaves
    • pro: backup (one slave can be promoted to new master if master dies)
    • pro: load balancing across slaves
    • pro: good topology for read heavy systems (calls can be delegated to slaves)
    • cons: downtime for writes until new master is promoted
  • Master-Master
    • all master nodes perform read/write actions
    • master <-> master and master <-> slave nodes are synchronized
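
A toy model of the Master-Slave topology: writes go to the master, reads are spread across the slaves, and a slave is promoted if the master dies. All names are hypothetical, and replication is done synchronously here for simplicity, which real setups usually do not do.

```python
from itertools import cycle


class Node:
    """A toy database node holding a key-value table."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class ReplicatedDB:
    """Sends writes to the master and spreads reads across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._readers = cycle(self.slaves)

    def write(self, key, value):
        self.master.rows[key] = value
        # Synchronous copy for simplicity; real replication is usually
        # asynchronous, so slaves can lag behind the master.
        for slave in self.slaves:
            slave.rows[key] = value

    def read(self, key):
        slave = next(self._readers)
        return slave.name, slave.rows.get(key)

    def promote(self):
        """Replace a dead master with the first slave."""
        self.master = self.slaves.pop(0)
        self._readers = cycle(self.slaves)


db = ReplicatedDB(Node("master"), [Node("slave1"), Node("slave2")])
db.write("user:1", "alice")
reads = [db.read("user:1") for _ in range(3)]

db.promote()                       # the master died; slave1 takes over
db.write("user:2", "bob")
new_master = db.master.name
after_failover = db.read("user:2")
```

Reads alternate between `slave1` and `slave2`. After the promotion, `slave1` accepts writes and `slave2` keeps serving reads.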

Load Balancing

  • Active-Active pattern
    • Several load balancers are active and can distribute packets
    • Send heartbeats to each other
    • If one load balancer stops receiving heartbeats, it assumes that the other instance is offline and takes over its traffic
  • Active-Passive pattern
    • Only one instance is active, passive instance listens for heartbeats
    • If no heartbeats are received from active instance anymore, passive instance promotes itself to active
  • Partitioning
    • Distribution of users across different servers (e.g. Facebook had one server for Harvard, one for MIT...)
    • Data Partitioning: Data is clustered e.g. by the first letter of the user's name
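
Partitioning by the first letter of the user's name could be sketched like this. The server names, the letter ranges and the fallback shard are assumptions, not something the lecture prescribes.

```python
# Hypothetical shard layout: name ranges mapped to database servers.
SHARDS = [
    ("a", "m", "db1.example.internal"),
    ("n", "z", "db2.example.internal"),
]
FALLBACK = "db3.example.internal"  # names that do not start with a-z


def shard_for(username):
    """Pick the server responsible for a user by the name's first letter."""
    first = username[:1].lower()
    for low, high, server in SHARDS:
        if low <= first <= high:
            return server
    return FALLBACK
```

For example, `shard_for("Alice")` returns `db1.example.internal` and `shard_for("zoe")` returns `db2.example.internal`. A fixed split like this is simple, but it can become unbalanced if many names start with the same letters.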