gernd/harvard_scalability_notes.md

## harvard_scalability_notes.md

      
    Scalability notes

Notes for the Harvard scalability lecture (https://www.youtube.com/watch?v=-W9F__D3oY4)

Things to look for in a hosting company
accessible in your country
access via SFTP (the S matters: not plain FTP)
decide whether you need a physical machine for yourself or whether a shared host is sufficient
VPS: own virtual system, e.g. VMware; the system is isolated from other virtual systems running on the same hardware
Shared Webhost: Host is shared with other users

Vertical Scaling


get more RAM / processors (multicore) / faster HD for the server
constrained by the state of the art technology available for a single machine

Horizontal Scaling


using a lot of cheaper hardware instead of using only a few high end machines
inbound HTTP requests from different clients are distributed amongst the available machines
Component performing this distribution is called Load Balancer
DNS Server for www.yourwebsite.com should return the IP address of the Load Balancer to make sure that
every request is handled by it and load balancing is transparent for clients
available machines handling the requests can have private IP addresses, only the Load Balancer needs a public one
Decision for routing incoming HTTP requests could be made depending on
Load: Which server is currently busy / least busy -> all servers have to be identical
Type of request: use the hostname to decide which server the request should be routed to (one for images, one for HTML,
one for JS...)
Round Robin: use DNS to return a different IP upon every request. Can be configured quite easily with e.g. BIND.
Disadvantages of Round Robin:
One of the servers could get all the computationally expensive requests while other servers
get only requests for static files
DNS answers are often cached by clients (OSes, browsers) for as long as the TTL value allows, so a client keeps hitting the same server
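
The two routing strategies above (rotation vs. current load) can be compared in a minimal sketch. The class names and the private IPs are hypothetical, and a real load balancer would track load from actual connections rather than from explicit `pick`/`done` calls.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation, ignoring their load."""

    def __init__(self, backends):
        self._rotation = cycle(backends)

    def pick(self):
        return next(self._rotation)


class LeastBusyBalancer:
    """Picks the backend with the fewest requests currently in flight."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1  # request starts
        return backend

    def done(self, backend):
        self.active[backend] -= 1  # request finished


backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical private IPs
rr = RoundRobinBalancer(backends)
print([rr.pick() for _ in range(4)])
```

Round robin needs no state about the backends, which is why it can be done in DNS; the least-busy variant needs feedback about finished requests, so it has to live in the load balancer itself.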

Sessions


Load Balancing breaks sessions (each machine saves its own sessions as serialized text files in its local /tmp)
Consecutive requests from one client can be handled by different machines, each with a different session
Possible solution: Only one server handles sessions (PHP), the others serve images ...
Problem with this solution: No redundancy for the PHP server, and it can only handle a certain amount of load
Another solution: One dedicated server is responsible for storing sessions
Load Balancer handles sessions <- Single Point of Failure
Solution: Load Balancer stores data on a RAID

RAID - Redundant array of independent disks


RAID 0: two hard drives of identical size. Data is striped across them (no redundancy). Motivation: each HD takes some time
to write data; the waiting time can be minimized by writing one stripe of the data to each disk
RAID 1: data is mirrored across two hard drives. Effect: a little performance overhead because the data has to be written out
twice, but it is saved redundantly on two HDs. The RAID array can rebuild itself if one HD is damaged and has to be replaced
RAID 10: combination of RAID 0 and RAID 1. At least 4 HDs are needed
RAID 5: data and parity are striped across at least 3 HDs. Only one HD's worth of capacity is used for redundancy, the rest can be fully used. One HD may fail.
RAID 6: like RAID 5, but two HDs' worth of capacity is used for redundancy, the rest can be fully used. Two HDs may fail.
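
How a parity-based level such as RAID 5 rebuilds a lost disk can be illustrated with XOR. This is a toy sketch on byte strings, not how a real controller lays out stripes; `xor_blocks` and `usable_disks` are hypothetical helper names.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def usable_disks(level, n):
    """Number of disks' worth of capacity left for data with n equal disks."""
    return {"0": n, "1": 1, "10": n // 2, "5": n - 1, "6": n - 2}[level]


data_a = b"scal"                     # block stored on disk 1
data_b = b"able"                     # block stored on disk 2
parity = xor_blocks(data_a, data_b)  # parity block stored on disk 3

# Disk 2 fails: its block is the XOR of everything that survived.
rebuilt_b = xor_blocks(data_a, parity)
print(rebuilt_b)
print(usable_disks("5", 4), usable_disks("6", 4))
```

The same XOR works for any number of data blocks, which is why one extra disk's worth of parity is enough to survive a single failure.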

Load Balancer


can be implemented in Hardware and Software
prices of hardware load balancers go up to $100,000

Shared Storage / Sticky sessions


cookies could contain all necessary information -> violates privacy/cookie has limited size
-> use a db to store session data on server side
use several fileservers -> sync needed
client encodes "its" server in the URL/cookie for every request -> private IPs of servers are exposed / can change.
Solution: Set an ID in the cookie. The Load Balancer maps IDs to available servers -> the client doesn't know the actual server IP
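
The cookie-ID approach can be sketched as follows. `StickyBalancer` and its in-memory dictionary are assumptions for illustration; a real load balancer would also have to handle backends that disappear.

```python
import secrets


class StickyBalancer:
    """Maps an opaque cookie ID to a backend so a client keeps its server."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.assignments = {}  # cookie ID -> backend (never sent to the client)
        self._next = 0

    def route(self, cookie_id=None):
        """Return (cookie_id, backend); issue a new ID for unknown clients."""
        if cookie_id not in self.assignments:
            cookie_id = secrets.token_hex(8)  # random, reveals nothing about the IP
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            self.assignments[cookie_id] = backend
        return cookie_id, self.assignments[cookie_id]


lb = StickyBalancer(["10.0.0.1", "10.0.0.2"])  # hypothetical private IPs
cookie, backend = lb.route()           # first request: no cookie yet
assert lb.route(cookie)[1] == backend  # later requests stick to the same backend
```

Because the mapping lives only in the load balancer, backends can be renumbered without invalidating cookies that clients already hold.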

PHP


Interpreted languages are not as fast as compiled ones
PHP accelerators exist that precompile scripts into bytecode and cache it
similar to how Python compiles source files to bytecode

Caching


HTML

prerender pages so that rendering and DB interaction do not have to be performed upon
every HTTP request
prerendered pages are cached as files on disk
downsides: additional space needed
changing CSS/images/static elements requires regeneration of all cached pages
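
A file-based page cache along these lines might look like the sketch below. `render_page` stands in for the expensive rendering and DB work; the cache directory, the hashing of the path into a file name and `invalidate_all` are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location
stats = {"renders": 0}


def render_page(path):
    """Stand-in for the expensive part: templates, DB queries, ..."""
    stats["renders"] += 1
    return f"<html><body>Page for {path}</body></html>"


def cached_page(path):
    """Serve the prerendered file if it exists, otherwise render and store it."""
    name = hashlib.sha256(path.encode()).hexdigest() + ".html"
    cache_file = CACHE_DIR / name
    if cache_file.exists():
        return cache_file.read_text()
    html = render_page(path)
    cache_file.write_text(html)
    return html


def invalidate_all():
    """A changed layout or stylesheet forces every cached page to be rebuilt."""
    for cache_file in CACHE_DIR.glob("*.html"):
        cache_file.unlink()


cached_page("/about")
cached_page("/about")    # second call is served from disk
print(stats["renders"])
```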


MySQL Caching

query cache can be enabled in my.cnf (the query cache was removed in MySQL 8.0)
caches the results of identical queries


memcached

in-memory cache
results that are expensive to fetch (e.g. complex db queries) can be stored in RAM
cache can get so big that it no longer fits in RAM, so old objects have to be evicted
objects can have an expiration time and can be garbage collected after a certain amount of time
every time the cache is hit, the object's expiration time can be extended
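
The expiration and eviction behaviour described above can be sketched with a small in-process cache. This is not memcached's API, only an illustration: `TTLCache`, its parameters and the injectable `clock` are assumptions, and least-recently-used eviction is one possible policy.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Bounded in-memory cache with expiration that is extended on every hit."""

    def __init__(self, max_items, ttl, clock=time.monotonic):
        self.max_items = max_items
        self.ttl = ttl
        self.clock = clock
        self._items = OrderedDict()  # key -> (value, expires_at), oldest first

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        # Cache hit: push the expiration time back and mark as recently used.
        self._items[key] = (value, self.clock() + self.ttl)
        self._items.move_to_end(key)
        return value


cache = TTLCache(max_items=1000, ttl=60)
if cache.get("user:42") is None:
    cache.set("user:42", {"name": "alice"})  # stand-in for an expensive DB query
print(cache.get("user:42"))
```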


MySQL Storage Engines


e.g. InnoDB (supports transactions) vs. MyISAM (uses full table locks)
other engines: Memory, Archive, NDB
engine properties: locking granularity, MVCC support, geospatial data type support, available index types ...
Archive: tables are compressed -> need less space but increase query time

DB Replication


Master-Slave

master performs read/write actions
data is replicated to slaves
pro: backup (one slave can be promoted to new master if master dies)
pro: load balancing across slaves
pro: good topology for read heavy systems (calls can be delegated to slaves)
cons: downtime for writes until new master is promoted
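
Delegating reads to slaves and promoting a slave when the master dies can be sketched like this. `ReplicatedRouter` and the node names are hypothetical, and treating every statement that starts with SELECT as a read is a deliberate simplification.

```python
from itertools import cycle


class ReplicatedRouter:
    """Sends writes to the master and spreads reads across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._reads = cycle(self.slaves)

    def node_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.slaves:
            return next(self._reads)  # load balancing across slaves
        return self.master            # writes always go to the master

    def promote(self):
        """Master died: the first slave becomes the new master."""
        self.master = self.slaves.pop(0)
        self._reads = cycle(self.slaves)


router = ReplicatedRouter("db-master", ["db-slave-1", "db-slave-2"])
print(router.node_for("SELECT * FROM users"))
print(router.node_for("UPDATE users SET name = 'x' WHERE id = 1"))
```

Until `promote` has run, writes have nowhere to go, which is the write downtime listed among the cons.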


Master-Master

all master nodes perform read/write actions
master <-> master and master <-> slave nodes are synchronized


Load Balancing


Active-Active pattern

Several load balancers are active and can distribute packets
Send heartbeats to each other
If one load balancer stops receiving heartbeats, it assumes that the other instance is offline


Active-Passive pattern

Only one instance is active, passive instance listens for heartbeats
If no heartbeats are received from active instance anymore, passive instance promotes itself to active
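
The passive side of this pattern can be sketched as a timeout check. `PassiveBalancer`, the timeout value and the injected `clock` are assumptions; a real setup would also have to take over the shared IP address after promoting itself.

```python
import time


class PassiveBalancer:
    """Listens for heartbeats and promotes itself when they stop arriving."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_heartbeat = clock()
        self.active = False

    def on_heartbeat(self):
        """Called whenever a heartbeat from the active instance arrives."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Run periodically; returns True once this instance is active."""
        silence = self.clock() - self.last_heartbeat
        if not self.active and silence > self.timeout:
            self.active = True  # the active instance is assumed to be offline
        return self.active


standby = PassiveBalancer(timeout=3.0)
standby.on_heartbeat()
print(standby.check())
```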


Partitioning

Distribution of users across different servers (e.g. Facebook had one server for Harvard, one for MIT...)
Data Partitioning: Data is clustered e.g. by the first letter of the user's name
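
Partitioning by the first letter of a name can be sketched as a lookup function. `shard_for` and the shard names are hypothetical; the sketch assumes a non-empty name and sends names that do not start with A-Z to the last shard.

```python
def shard_for(username, shards):
    """Map the first letter of a name onto one of the shards (A-Z split evenly)."""
    first = username.strip().upper()[0]
    if not "A" <= first <= "Z":
        return shards[-1]  # digits, umlauts, ...: catch-all shard
    index = (ord(first) - ord("A")) * len(shards) // 26
    return shards[index]


shards = ["db-a-m", "db-n-z"]  # hypothetical server names
print(shard_for("alice", shards))
print(shard_for("nina", shards))
```

Splitting the alphabet evenly does not split the users evenly, since some initial letters are far more common than others, so real partitions are usually sized from the data.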