Notes for the Harvard scalability lecture (https://www.youtube.com/watch?v=-W9F__D3oY4)
- Things to look for in a hosting company
- accessible in your country
- access via SFTP (the S matters: encrypted, unlike plain FTP)
- decide whether you need a dedicated physical machine or whether a shared host is sufficient
- VPS: own virtual system, e.g. via VMware; the system is isolated from other virtual systems running on the same hardware
- Shared Webhost: Host is shared with other users
- get more RAM / processors (multicore) / faster HD for the server
- constrained by the state of the art technology available for a single machine
- using a lot of cheaper hardware instead of using only a few high end machines
- inbound HTTP requests from different clients are distributed amongst the available machines
- Component performing this distribution is called Load Balancer
- DNS Server for www.yourwebsite.com should return the IP address of the Load Balancer to make sure that every request is handled by it and load balancing is transparent for clients
- available machines handling the requests can have private IP addresses, only the Load Balancer needs a public one
- Decision for routing incoming HTTP requests could be made depending on
- Load: Which server is currently busy / least busy -> all servers have to be identical
- Type of request: use the hostname to decide which server the request should be routed to (one for images, one for HTML, one for JS...)
- use DNS to return different IPs upon every request (Round Robin). Can be configured quite easily with e.g. BIND.
- Disadvantages:
- One of the servers could get all the computationally expensive requests while other servers get only requests for static files
- DNS answers are often cached by clients (OSes, browsers) for as long as their TTL values allow
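A round-robin setup in BIND can be sketched as a zone fragment like the following (domain and addresses are made up; 192.0.2.x is a reserved documentation range). BIND rotates the order of the A records between answers, and the low TTL limits how long clients cache a single answer:

```
; hypothetical zone fragment for round-robin DNS
www.yourwebsite.com.  60  IN  A  192.0.2.10
www.yourwebsite.com.  60  IN  A  192.0.2.11
www.yourwebsite.com.  60  IN  A  192.0.2.12
```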
- Load Balancing breaks sessions (PHP saves each session as a serialized text file in /tmp, so every machine has its own set of sessions)
- Successive requests can be handled by different machines and therefore see different sessions
- Possible solution: Only one server contains sessions, other contains images ...
- Problem with this solution: no redundancy for the PHP server, and it can only handle a certain amount of load
- Another solution: One dedicated server is responsible for storing sessions
- Load Balancer handles sessions <- Single Point of Failure
- Solution: Load Balancer stores data on a RAID
- RAID 0: two harddrives of identical size; data is striped across them. Motivation: each HD takes some time to write data, so waiting time can be reduced by writing alternating stripes of data to each disk
- RAID 1: data is mirrored across two harddrives. Effect: little performance overhead, although data has to be written out twice, and data is redundantly saved on two HDs. The RAID array can rebuild itself if one HD is damaged and has to be replaced
- RAID 10: combination of RAID 0 and RAID 1. 4 HDs are needed
- RAID 5: striping with parity; only one HD's worth of capacity is used for redundancy, the other HDs can be fully used
- RAID 6: two HDs' worth of capacity are used for redundancy, the other HDs can be fully used
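The capacity trade-offs of the levels above can be summarized in a small sketch (ignoring metadata overhead and controller details):

```python
def usable_capacity(level, disks, disk_size_tb):
    """Usable capacity of a RAID array, ignoring metadata overhead."""
    if level == "0":            # striping only, no redundancy
        return disks * disk_size_tb
    if level in ("1", "10"):    # mirroring: half the raw space
        return disks * disk_size_tb / 2
    if level == "5":            # one disk's worth of capacity holds parity
        return (disks - 1) * disk_size_tb
    if level == "6":            # two disks' worth of capacity hold parity
        return (disks - 2) * disk_size_tb
    raise ValueError("unknown RAID level: " + level)

# four 2 TB drives: RAID 10 keeps 4 TB usable, RAID 5 keeps 6 TB
```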
- can be implemented in Hardware and Software
- prices of hardware load balancers go up to $100,000
- cookies could contain all necessary session data -> privacy concerns / cookies have a limited size
- -> use a db to store session data on server side
- use several fileservers -> synchronization needed
- client encodes "its" server in the URL/cookie for every request -> private IPs of the servers are exposed / can change
- Solution: set an opaque ID in a cookie; the Load Balancer maps IDs to available servers -> the client doesn't learn the actual server IP
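That cookie-based mapping can be sketched like this (backend names and IPs are invented; a real balancer would also persist and expire the table):

```python
import random

BACKENDS = {"s1": "10.0.0.11", "s2": "10.0.0.12"}  # made-up private IPs
sessions = {}  # opaque cookie ID -> backend name, kept on the balancer

def route(cookie_id):
    """Return (cookie_id, backend IP); sticks a client to one backend."""
    if cookie_id not in sessions:
        cookie_id = format(random.getrandbits(64), "x")  # new opaque ID
        sessions[cookie_id] = random.choice(list(BACKENDS))
    return cookie_id, BACKENDS[sessions[cookie_id]]

cid, ip1 = route(None)   # first request: new cookie, some backend
_, ip2 = route(cid)      # later requests land on the same backend
```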
- Interpreted languages are not as fast as compiled ones
- PHP accelerators exist that precompile PHP into bytecode, similar to Python's .pyc caching
- HTML
- prerender sites so that rendering and DB interaction do not have to be performed upon every HTTP request
- prerendered sites are cached on a file basis
- downsides: additional space needed
- changing CSS/images/other static elements requires regenerating all cached sites
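A file-based page cache of this kind can be sketched as follows (paths and the render function are stand-ins, not the lecture's actual code):

```python
import hashlib, os, tempfile

CACHE_DIR = tempfile.mkdtemp()  # stand-in for the web server's cache dir

def render_page(url):
    """Pretend-expensive rendering (templating, DB queries, ...)."""
    return "<html><body>content for %s</body></html>" % url

def get_page(url):
    """Serve a prerendered copy from disk when one exists."""
    name = hashlib.md5(url.encode()).hexdigest() + ".html"
    path = os.path.join(CACHE_DIR, name)
    if os.path.exists(path):        # hit: no rendering, no DB work
        with open(path) as f:
            return f.read()
    html = render_page(url)         # miss: render once ...
    with open(path, "w") as f:      # ... and cache it on a file basis
        f.write(html)
    return html

# the downside from above: changing a shared template means deleting
# every file under CACHE_DIR so that pages get regenerated
```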
- MySQL Caching
- query cache can be enabled in my.cnf
- caches results of identical queries
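Enabling it looks roughly like this in my.cnf (the size is arbitrary; note the query cache only exists up to MySQL 5.7 and was removed in MySQL 8.0):

```
[mysqld]
query_cache_type = 1     # cache results of identical SELECT queries
query_cache_size = 64M   # memory reserved for cached result sets
```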
- memcached
- in-memory cache
- results that are expensive to fetch (e.g. complex db queries) can be stored in RAM
- cache can get so big that it does not fit in RAM
- objects can have an expiration time and can be garbage-collected after a certain amount of time
- every time the cache is hit, the object's expiration time can be extended
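The expiration behaviour can be sketched with a plain in-memory stand-in (not the real memcached client or protocol):

```python
import time

class TTLCache:
    """Cache-aside sketch in the spirit of memcached."""
    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl=60.0):
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key, ttl=60.0):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires < time.monotonic():   # expired: garbage-collect it
            del self._store[key]
            return None
        self.set(key, value, ttl)        # hit: extend the expiration
        return value

cache = TTLCache()
user = cache.get("user:42")
if user is None:                 # expensive path, e.g. a complex DB query
    user = {"id": 42, "name": "alice"}
    cache.set("user:42", user)
```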
- e.g. InnoDB (supports transactions) vs. MyISAM (uses full-table locks)
- other engines: Memory, Archive, NDB
- engine properties: locking granularity, MVCC support, geospatial data type support, available index types ...
- Archive: tables are compressed -> they need less space but queries take longer
- Master-Slave
- master performs read/write actions
- data is replicated to slaves
- pro: backup (one slave can be promoted to new master if master dies)
- pro: load balancing across slaves
- pro: good topology for read heavy systems (calls can be delegated to slaves)
- cons: downtime for writes until a new master is promoted
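Read/write splitting against such a topology can be sketched as follows (hostnames are made up; real routing would parse SQL more carefully):

```python
import random

MASTER = "db-master:3306"                      # made-up hostnames
SLAVES = ["db-slave1:3306", "db-slave2:3306"]

def pick_server(sql):
    """Send reads to some slave, everything else to the master."""
    if sql.lstrip().upper().startswith("SELECT"):
        return random.choice(SLAVES)
    return MASTER
```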
- Master-Master
- all master nodes perform read/write actions
- master <-> master and master <-> slave nodes are synchronized
- Active-Active pattern
- Several load balancers are active and can distribute packets
- Send heartbeats to each other
- If one load balancer stops receiving heartbeats, it assumes that the other instance is offline
- Active-Passive pattern
- Only one instance is active, passive instance listens for heartbeats
- If no heartbeats are received from active instance anymore, passive instance promotes itself to active
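The passive instance's promotion logic can be sketched like this (the timeout value is arbitrary):

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before failover

class PassiveBalancer:
    def __init__(self):
        self.active = False
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check(self):
        # active instance has gone silent -> promote ourselves
        # (a real setup would also take over a shared/virtual IP)
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.active = True

lb = PassiveBalancer()
lb.last_heartbeat -= 10.0   # simulate a dead active instance
lb.check()                  # -> promotes itself
```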
- Partitioning
- Several users are distributed to different servers (e.g. Facebook initially had one server for Harvard, one for MIT...)
- Data Partitioning: Data is clustered e.g. by the first letter of the user's name
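Both partitioning styles can be sketched as follows (shard and server names are invented):

```python
# data partitioning by the first letter of the username
def shard_by_letter(username):
    return "db1" if username[0].lower() <= "m" else "db2"

# user-group partitioning in the early-Facebook style
SCHOOL_SHARDS = {"harvard": "server-h", "mit": "server-m"}

def shard_by_school(school):
    return SCHOOL_SHARDS[school]
```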