####Basics####
- Memory Hierarchy - Tape, Disk, SSD, Memory, Cache
- Kryder's law
- Long Tail vs 80/20 rule
- Drawbacks of monolithic systems - Supercomputers
- Distributed Systems - Advantages & Problems
- CAP theorem - Consistency, Availability and Partition Tolerance
- PACELC - if (Partition) { Tradeoffs: Consistency vs Availability } else { Tradeoffs: Consistency vs Latency }
- Concurrency vs Parallelism
- Parallel vs Distributed computing
- Horizontal vs Vertical Scaling - Tradeoffs
- Network Computing Models: Client-Server, N-tier, P2P, Hybrid
- RPCs vs REVs vs Mobile Agents
- REST vs SOAP
- JSON vs XML
- Service-Oriented Architectures (SOA)
- HTTP verbs in REST calls for CRUD
- Distributed Computing Models: Parallel Algorithms in Shared Memory Systems, Parallel Algorithms in Message Passing Systems, and Distributed Algorithms in Message Passing Systems.
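
The "HTTP verbs in REST calls for CRUD" item above can be sketched as a small dispatch function. This is a minimal in-memory sketch, not a real HTTP server: the `handle` function, the `/notes` path layout, and the status codes chosen are illustrative assumptions.

```python
# Hypothetical in-memory resource store keyed by id; no real HTTP is involved.
store = {}
next_id = 1

def handle(verb, path, body=None):
    """Dispatch a (verb, path) pair to a CRUD operation on `store`.

    Paths are assumed to look like "/notes" or "/notes/<id>".
    Returns a (status_code, payload) tuple.
    """
    global next_id
    parts = path.strip("/").split("/")
    item_id = int(parts[1]) if len(parts) > 1 else None

    if verb == "POST" and item_id is None:        # Create
        item_id = next_id
        next_id += 1
        store[item_id] = body
        return 201, {"id": item_id}
    if verb == "GET" and item_id is None:         # Read (whole collection)
        return 200, dict(store)
    if item_id not in store and verb in ("GET", "DELETE"):
        return 404, None
    if verb == "GET":                             # Read (one item)
        return 200, store[item_id]
    if verb == "PUT" and item_id is not None:     # Update (full replace)
        store[item_id] = body
        return 200, store[item_id]
    if verb == "DELETE":                          # Delete
        del store[item_id]
        return 204, None
    return 405, None                              # verb not allowed on this path
```

In this sketch PUT replaces the whole item, so repeating the same PUT leaves the store in the same state, whereas repeating a POST creates a new item each time.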
####Cloud Computing Characteristics####
- On demand Self Service
- Ubiquitous network access
- Resource Pooling
- Rapid Elasticity
- Measured Service - Pay as you go - Utility computing
- Virtualized Resources - the illusion of infinite compute
####Cloud Computing Technology Enablers####
- Hardware - virtualization + multicore
- Internet Technologies - Web 2.0 + Web services + SOA
- Systems Management - Data center automation + autonomic computing
- Distributed Computing - Utility + Grid Computing
####Cloud Deployment Models####
- Public clouds - EC2, Azure
- Private clouds - Appliance
- Community clouds - OpenCirrus
- Hybrid clouds - Combination
- Public vs Private Clouds - Legal, Efficiency, Reliability of the cloud, Long term cost benefit
- Business drivers - Cheaper for small workloads, No setup and maintenance costs, Elasticity - high efficiency
- Public cloud problems - Security, Compliance, Vendor lock-in
####Cloud Service Models####
- IaaS - Physical hardware virtualization: server, storage, network - Can install any OS/software etc. EC2, Rackspace, RightScale, Azure
- PaaS - Middleware virtualization - handle to create a database - create and manage a web server (Web Role in Azure) - Hadoop cluster (EMR)
- SaaS - Application virtualization - Salesforce.com CRM, Web mail, Google docs
- From IaaS to PaaS to SaaS: decreasing flexibility, increasing automation
####IaaS - Compute and Storage as a service####
- Flexible - load any OS, install any Software, but management and maintenance overhead
- Storage as a Service
- S3, RDS, SimpleDB, EBS
- Files are objects - max of 5 GB per single PUT (up to 5 TB per object with multipart upload)
- Buckets - like directories
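- Consistency - originally eventual, so a read right after an overwrite could return stale data (S3 has offered strong read-after-write consistency since December 2020)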
- Default replication - 3 copies; Reduced Redundancy Storage - 2 copies for data protection
- Regions - geographies
- Versioning - like source code
- SimpleDB - Key value store
- Snapshot, replication, mirroring
- EC2 - instance (compute unit)
- PEM - configurations - web servers
- AMI - Amazon Machine Images
- Pre-built AMI or install software and save as AMI
- Availability Zone - virtual data center
- Elastic load balancer
- Instance Storage, EBS-storage, S3 storage
- VPC - virtual private cloud - configure own networking, routing etc., VPN to your enterprise
####SaaS####
- Application as a Service - Pay for what you use; no installation, runs in the browser on any platform
- Vendor benefits - no piracy, easier upgrades, efficient use of resources, lower costs because of lower support overhead
- Authentication vs Authorization
- OAuth 2.0 - Fine-grained authorization, revocable permissions, no password sharing.
- Access token, refresh token
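
The access token / refresh token split above can be illustrated with a toy token issuer. This is a hedged sketch only: `ToyAuthServer`, its method names, and the scope strings are hypothetical and do not follow the OAuth 2.0 wire protocol; time is passed in explicitly as `now` to keep the example deterministic.

```python
import secrets

class ToyAuthServer:
    """Hypothetical in-memory token issuer; not a real OAuth 2.0 implementation."""

    def __init__(self, access_ttl=3600):
        self.access_ttl = access_ttl
        self.access = {}    # access token -> (scopes, expiry time)
        self.refresh = {}   # refresh token -> scopes

    def grant(self, scopes, now):
        """Issue a token pair after the user has approved `scopes`."""
        refresh_token = secrets.token_urlsafe(16)
        self.refresh[refresh_token] = frozenset(scopes)
        return self._new_access(scopes, now), refresh_token

    def _new_access(self, scopes, now):
        token = secrets.token_urlsafe(16)
        self.access[token] = (frozenset(scopes), now + self.access_ttl)
        return token

    def refresh_access(self, refresh_token, now):
        """Exchange a refresh token for a new access token; None if revoked."""
        scopes = self.refresh.get(refresh_token)
        return None if scopes is None else self._new_access(scopes, now)

    def revoke(self, refresh_token):
        """The user withdraws consent: the refresh token stops working."""
        self.refresh.pop(refresh_token, None)

    def allowed(self, access_token, scope, now):
        """Authorization check: the token is unexpired and carries `scope`."""
        scopes, expiry = self.access.get(access_token, (frozenset(), 0))
        return now < expiry and scope in scopes
```

The client never sees the user's password, only tokens limited to the approved scopes. In this sketch revoking the refresh token does not cancel access tokens already issued; they simply expire after `access_ttl`.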
####Virtualization####
- Applications can be moved from one VM to another
- Load balancing + consolidation + isolation
- Software Virtualization
- System Virtualization - Hardware/OS virtualization; use case: testing
- Process Virtualization - OS/user-level virtualization, e.g. JVM, .NET
####System Virtualization####
- Native Hypervisors or Type 1 hypervisors
- Hosted Hypervisors or Type 2 hypervisors
- Hybrid Hypervisors - Paravirtualization or Binary Translation and Hardware-assisted Virtualization
####Trap and Emulate virtualization####
- Privilege level or protection ring in processors
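- Rings 0-3 - guests run deprivileged (ring 3 here). A privileged instruction traps to the hypervisor, which emulates it and keeps the state of every guest.
- Problems - performance: context switches + an extra layer of indirection
- Problems - correctness: trap and emulate only works if the sensitive instructions are a subset of the privileged instructions; a sensitive instruction that does not trap cannot be emulated.
- Example: popf in x86 is sensitive but does not trap in user mode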
- binary translation - runtime binary rewrite of sensitive instructions
- paravirtualization - rewrite the guest OS so that all sensitive calls go through hypervisor APIs (hypercalls); the rewrite is specific to each hypervisor
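
The popf problem above can be shown with a toy model. This is a sketch under stated assumptions, not a real CPU model: the instruction names (`cli`, `sti`, and the made-up `popf_if0`, standing for a popf that tries to clear the interrupt flag) and the functions `cpu_step` and `run_guest` are hypothetical.

```python
class Trap(Exception):
    """Raised when deprivileged guest code issues a privileged instruction."""

# Hypothetical toy instruction set. "cli" and "sti" are privileged and trap.
# "popf_if0" is sensitive but not privileged: outside ring 0 it is
# silently ignored instead of trapping.
PRIVILEGED = {"cli", "sti"}

def cpu_step(instr):
    """Toy CPU executing one guest instruction in a deprivileged ring."""
    if instr in PRIVILEGED:
        raise Trap(instr)
    # Non-privileged instructions (including popf_if0) run without trapping.

def run_guest(program):
    """Run a guest under trap and emulate; return its virtual interrupt flag."""
    virtual_if = 1                      # per-guest state kept by the hypervisor
    for instr in program:
        try:
            cpu_step(instr)
        except Trap as trap:
            # The hypervisor emulates the instruction on the virtual state.
            virtual_if = 0 if trap.args[0] == "cli" else 1
    return virtual_if
```

Here `run_guest(["cli"])` leaves the virtual interrupt flag at 0 because the hypervisor saw the trap, while `run_guest(["popf_if0"])` leaves it at 1: the guest believes interrupts are off, but the hypervisor never got the chance to emulate the change. Binary translation and paravirtualization both exist to close this gap.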
####Hardware-assisted virtualization####
#####Processor Virtualization#####
- VT-x - Vanderpool - only on x86 processors
- Intel - 2 processor modes - VMX root operation (hypervisor) and VMX non-root operation (guests)
#####Memory virtualization#####
- TLB, page tables
- Shadow page tables
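- Extended page tables (EPT) - one more level of indirection (guest physical to host physical) handled in hardware; the guest keeps full control of its own page tables
- VPID (Virtual Processor ID) - TLB entries are tagged with a virtual processor ID, so the TLB does not have to be flushed on every VM switch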
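
The two translation steps behind EPT, and the composed mapping a shadow page table holds, can be sketched with flat dictionaries. This is a simplification: real page tables are multi-level trees walked by hardware, and the table contents and the 4 KiB page size here are illustrative assumptions.

```python
PAGE = 4096  # assumed page size in bytes

def translate(addr, table):
    """Single lookup: page number -> frame number, with the offset preserved."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset

# Guest virtual page -> guest physical page (owned by the guest OS).
guest_page_table = {0: 5, 1: 2}
# Guest physical page -> host physical frame (owned by the hypervisor).
extended_page_table = {5: 9, 2: 7}

def guest_to_host(gva):
    """EPT-style walk: two translations for every guest virtual address."""
    gpa = translate(gva, guest_page_table)
    return translate(gpa, extended_page_table)

def build_shadow(guest_pt, ept):
    """Shadow page table: the hypervisor precomputes the composed mapping."""
    return {vpage: ept[gpage] for vpage, gpage in guest_pt.items()}
```

With a shadow table the hypervisor has to intercept every guest page-table update to keep the composed mapping current; with EPT the guest edits `guest_page_table` freely and the second step is applied on each walk.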
#####IO virtualization#####
- Interrupt remapping
- DMA remapping (zerocopy)
####Misc####
- Grid vs Cloud
- Data Modeling
####Sharding####
- Row sharding or Horizontal Partitioning
- Column sharding or Vertical Partitioning
- Round-Robin Partitioning
- Range Partitioning
- Hash Partitioning - Consistent Hashing
- Lookup-based Partitioning
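
Consistent hashing from the list above can be sketched as a sorted ring of hashed points. This is a minimal sketch: the class name `HashRing`, the use of MD5 as the hash, and 100 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes    # virtual points per node, to even out load
        self._hashes = []       # sorted ring positions
        self._owner = {}        # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove(self, node):
        self._hashes = [h for h in self._hashes if self._owner[h] != node]
        self._owner = {h: n for h, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the node owning the first ring position after the key's hash."""
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]
```

Removing a node only reassigns the keys that node owned; every other key keeps its shard. With plain `hash(key) % n` partitioning, changing `n` would move most keys.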
####Replication vs Sharding vs backups####
- Replication - hot standby and failover, redundant data
- Sharding/Partitioning - Dividing the dataset
- Backups/Archives - Disaster Recovery
####Functional Programming####
- Functions as first-class objects
- Advantages: Composability, Purity (no shared data), Reusability, Idiomatic (improved readability)
- Disadvantages: Language with GC, Not natural on imperative hardware
- Higher-order functions
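
First-class and higher-order functions can be shown in a few lines. The helper names below (`compose`, `square`, `increment`) are illustrative, not taken from any library.

```python
from functools import reduce

def compose(*funcs):
    """Return a function applying `funcs` right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda x: x)

def square(x):
    return x * x

def increment(x):
    return x + 1

# Functions are values: they can be passed in, returned, and stored.
square_then_increment = compose(increment, square)

# map and filter are higher-order functions: they take functions as arguments.
evens_squared = list(map(square, filter(lambda n: n % 2 == 0, range(6))))
```

Because `square` and `increment` are pure (no shared state), `compose` can combine them in any order without surprises, which is the composability advantage listed above.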
####MapReduce####
- 2 Functions map and reduce
- Map: maps from domain to co-domain. Produces a single result in the co-domain or a list of results.
- Reduce: Summarizing/Aggregation functionality - Aggregates across a bunch of records.
- Keys are present for distribution among machines (sharding), i.e., each record is a key-value pair.
- SQL equivalent of MapReduce: SELECT REDUCE(data) FROM dataset GROUP BY key.
- Identity Mappers and Reducers
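
The map, group-by-key, and reduce steps above can be sketched in memory on a single machine. The driver `run_mapreduce` and the mapper and reducer names are hypothetical; a real framework distributes the groups across machines by key.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply `mapper` to each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):    # a mapper may emit 0..n pairs
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Aggregate all the values seen for one key."""
    return sum(values)

def identity_mapper(record):
    """Pass an existing (key, value) record through unchanged."""
    yield record

def identity_reducer(key, values):
    """Return the grouped values untouched."""
    return values
```

For example, `run_mapreduce(["a b a", "b c"], word_mapper, sum_reducer)` groups the emitted pairs by word and sums them. Running the identity mapper and reducer together only performs the group-by-key step.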
####Hadoop####
- 2 parts, Compute Layer and Data Layer
- HDFS basics - blocks, replication and rack-awareness
- HDFS basics - Namenode
- Compute basics - Differences between Hadoop 1.0 and 2.0. YARN - Resource Manager, Application Master, Node Manager
- Distinction between Map tasks and map functions
- Advantages of YARN
- MapReduce workflow: InputFormat, RecordReader, InputSplits, Map tasks, Combiners, Shuffle/Sort, Reduce tasks, OutputFormat.
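
The workflow stages above, and the difference between a map task and a map function, can be sketched as plain functions. This is a single-process sketch, not Hadoop's API: the function names are hypothetical, and splits are modeled as fixed-size groups of lines rather than byte ranges of HDFS blocks.

```python
from collections import defaultdict
from itertools import groupby

def map_fn(line):
    """Map function: called once per record."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: local pre-aggregation on the map side, before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def map_task(split):
    """Map task: processes one whole input split, calling map_fn per record."""
    pairs = [kv for record in split for kv in map_fn(record)]
    return combine(pairs)

def shuffle_sort(task_outputs):
    """Merge all map outputs, sort by key, and group the values per key."""
    merged = sorted(kv for output in task_outputs for kv in output)
    return [(key, [v for _, v in group])
            for key, group in groupby(merged, key=lambda kv: kv[0])]

def reduce_task(grouped):
    """Reduce task: aggregate the value list of every key."""
    return {key: sum(values) for key, values in grouped}

def job(lines, split_size=2):
    """Split the input, run one map task per split, shuffle, then reduce."""
    splits = [lines[i:i + split_size] for i in range(0, len(lines), split_size)]
    return reduce_task(shuffle_sort(map_task(s) for s in splits))
```

The combiner is safe here because addition is associative and commutative; it cuts the number of pairs sent through the shuffle without changing the result.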
####MapReduce problems####
- Count occurrences of each word in a book (word count)
- Total number of words in a book
- tf-idf calculations for ranking
- Cosine distance measure between 2 documents
- Matrix multiplication in MapReduce
- ...
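
As one worked instance of the problems above, matrix multiplication C = A x B can be expressed as a single map and reduce pass. This sketch assumes each input record is a tuple `(matrix_name, row, col, value)` and that the output dimensions are known up front; the function names are hypothetical, and the grouping is done in memory rather than by a framework.

```python
from collections import defaultdict

def mm_map(record, n_rows_a, n_cols_b):
    """Send each element to every output cell (i, k) it contributes to."""
    name, row, col, value = record
    if name == "A":                       # A[i][j] is needed by C[i][k] for all k
        for k in range(n_cols_b):
            yield (row, k), ("A", col, value)
    else:                                 # B[j][k] is needed by C[i][k] for all i
        for i in range(n_rows_a):
            yield (i, col), ("B", row, value)

def mm_reduce(values):
    """For one output cell, pair A and B entries on the shared index j and sum."""
    a = {j: v for name, j, v in values if name == "A"}
    b = {j: v for name, j, v in values if name == "B"}
    return sum(a[j] * b[j] for j in a.keys() & b.keys())

def multiply(records, n_rows_a, n_cols_b):
    """In-memory driver: group the mapper output by cell, then reduce each cell."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mm_map(record, n_rows_a, n_cols_b):
            groups[key].append(value)
    return {cell: mm_reduce(values) for cell, values in groups.items()}
```

The key is the output cell, so every reducer holds exactly the row of A and the column of B it needs. The cost is replication: each input element is emitted once per output cell it touches.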
####Case Study - Key takeaways####
- Atomicity to ensure failure handling
- C10K problem
- Asynchronous IO to handle C10K problem - Netty, Twisted (higher abstractions) on kqueue, epoll, NIO (primitives)
- Data store persistence models in Redis - Snapshots, AOF/WAL (Append-only files/Write ahead logs)
- IMAP - Mail reading, IMAP IDLE - mail push
- Microservices - Vert.x
- CORS - Cross-Origin Resource Sharing, JSONP
- OpenNLP, Stanford NLP (http://www-nlp.stanford.edu/) etc.
- Solr and Lucene - Problems with multi-tenant Solr
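
The asynchronous IO approach to the C10K problem above can be sketched with Python's `asyncio`, which sits on top of primitives such as epoll and kqueue. This is a simulation under stated assumptions: `asyncio.sleep` stands in for waiting on a socket, and `handle_client` and `serve` are hypothetical names.

```python
import asyncio

async def handle_client(client_id, delay=0.01):
    """Simulated connection handler; the sleep stands in for waiting on a socket."""
    await asyncio.sleep(delay)      # yields to the event loop instead of blocking
    return client_id

async def serve(n_clients):
    """Handle all clients concurrently on one thread via the event loop."""
    tasks = (handle_client(i) for i in range(n_clients))
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    # Many connections wait concurrently without one OS thread each.
    results = asyncio.run(serve(10_000))
    print(len(results), "clients handled")
```

A thread-per-connection server would need as many threads as connections; here each waiting connection costs only a small coroutine object, which is the idea that frameworks like Netty and Twisted build on.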
####Stream Processing####
- Move data to compute
- Use case: Low latency analytics
- Batch vs Stream - High accuracy, High latency vs Low accuracy, Low latency
- Complex Event Processing, Online MapReduce, Stream Processing etc.
- S4.io, Storm, MS StreamInsight etc.
- Storm model - Topology of Spouts and Bolts
- Joins on timestamps.
- Algorithms are strictly single-pass, or multi-pass over in-memory windows of data. Only a small amount of data can be held.
- Bolts have to be fast to prevent backlog/loss of events
- Windowing - Sliding windows, hopping windows, point windows (each data point is looked into individually)
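
The window types above can be sketched as generators over an event stream. This is a count-based sketch (windows measured in number of events, not timestamps), which is an assumption; engines such as Storm or StreamInsight define their windows in their own terms. `hopping_windows` assumes `hop <= size`.

```python
from collections import deque

def sliding_windows(events, size):
    """Yield the latest `size` events each time a new event arrives."""
    window = deque(maxlen=size)         # old events fall out automatically
    for event in events:
        window.append(event)
        if len(window) == size:
            yield list(window)

def hopping_windows(events, size, hop):
    """Yield windows of `size` events that start every `hop` events (hop <= size)."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == size:
            yield list(buffer)
            buffer = buffer[hop:]       # keep the overlap, drop the rest

def point_windows(events):
    """Each data point is its own window."""
    for event in events:
        yield [event]
```

Each generator holds at most `size` events in memory and reads the stream once, matching the single-pass, small-state constraint noted above. A hopping window with `hop == size` has no overlap between windows.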
####Lambda Architecture####
- Best of Batch and Streaming
- Composed of Batch, Speed and Serving layers
- The user sees query results merged from the Batch and Speed layers
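
The three layers above can be sketched with event counting as the example query. This is a toy, single-process sketch: the class `LambdaCounts` and its method names are hypothetical, and the batch run is assumed to be instantaneous, so the speed view can simply be cleared afterwards.

```python
from collections import Counter

class LambdaCounts:
    """Toy Lambda Architecture that counts events per key."""

    def __init__(self):
        self.master = []                # append-only master dataset
        self.batch_view = Counter()     # recomputed from scratch by the batch layer
        self.speed_view = Counter()     # incremental counts since the last batch run

    def ingest(self, key):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view over all data, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serving: merge the batch view with the real-time view."""
        return self.batch_view[key] + self.speed_view[key]
```

Queries stay correct between batch runs because recent events are covered by the speed view, while the batch recomputation periodically replaces any error accumulated in the incremental path.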