Skip to content

Instantly share code, notes, and snippets.

@karanth
Last active April 13, 2021 16:38
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save karanth/9392098 to your computer and use it in GitHub Desktop.
Save karanth/9392098 to your computer and use it in GitHub Desktop.
Distributed Computing Models, Definitions and Brewer's theorem

####Basics####

  • Memory Hierarchy - Tape, Disk, SSD, Memory, Cache
  • Kryder's law
  • Long Tail vs 80/20 rule
  • Drawbacks of monolithic systems - Supercomputers
  • Distributed Systems - Advantages & Problems
  • CAP theorem - Consistency, Availability and Partition Tolerance
  • PACELC - if(Partition){ Tradoffs: Consistency vs Availability } else { Tradeoffs: Consistency vs Latency }
  • Concurrency vs Parallelism
  • Parallel vs Distributed computing
  • Horizontal vs Vertical Scaling - Tradeoffs
  • Network Computing Models: Client-Server, N-tier, P2P, Hybrid
  • RPCs vs REVs vs Mobile Agents
  • REST vs SOAP
  • JSON vs XML
  • Service-Oriented Architectures (SOA)
  • HTTP verbs in REST calls for CRUD
  • Distributed Computing Models: Parallel Algorithms in Shared Memory Systems, Parallel Algorithms in Message Passing and Distributed Algorithms in Message passing.

####Cloud Computing Characteristics####

  • On demand Self Service
  • Ubiquitous network access
  • Resource Pooling
  • Rapid Elasticity
  • Measured Service - Pay as you go - Utility computing
  • Virtualized Resources - Infinite compute

####Cloud Computing Technology Enablers####

  • Hardware - virtualization + multicore
  • Internet Technologies - Web 2.0 + Web services + SOA
  • Systems Management - Data center automation + autonomic computing
  • Distributed Computing - Utility + Grid Computing

####Cloud Deployment Models####

  • Public clouds - EC2, Azure

  • Private clouds - Appliance

  • Community clouds - OpenCirrus

  • Hybrid clouds - Combination

  • Public vs Private Clouds - Legal, Efficiency, Reliability of the cloud, Long term cost benefit

  • Business drivers - Cheaper for small workloads, No setup and maintenance costs, Elasticity - high efficiency

  • Public cloud problems - Security, Compliance, Vendor lock-in

####Cloud Service Models####

  • IaaS - Physical hardware virtualization server, storage, network - Can install any OS/software etc. EC2, Rackspace, RightScale, Azure
  • PaaS - Middleware virtualization - handle to create a database - create and manage a web server (Web Role in Azure) - Hadoop cluster (EMR)
  • SaaS - Application virtualization - Salesforce.com CRM, Web mail, Google docs

Decreasing flexibility, increasing automation

####IaaS - Compute and Storage as a service####

  • Flexible - load any OS, install any Software, but management and maintenance overhead
  • Storage as a Service
  • S3, RDS, SimpleDB, EBS
  • Files are objects - max of 5GB
  • Bucket like directories
  • Consistency not guaranteed.
  • Default replication - 3 Reduced Redundancy Storage - 2 for data protection
  • Regions - geographies
  • Versioning - like source code
  • SimpleDB - Key value store
  • Snapshot, replication, mirroring
  • EC2 - instance (compute unit)
  • PEM - configurations - web servers
  • AMI - Amazon Machine Images
  • Pre-built AMI or install software and save as AMI
  • Availability Zone - virtual data center
  • Elastic load balancer
  • Instance Storage, EBS-storage, S3 storage
  • VPC - virtual private cloud - configure own networking, routing etc., VPN to your enterprise

####SaaS####

  • Application as a Service - Pay for what you use, no installation all browser, any platform
  • Piracy, upgrade, efficient resources, lower costs because of lower support
  • Authentication vs Authorization
  • oAuth2 - Fine grained authorization properties, revocable permissions - no password sharing.
  • Access token, refresh token

####Virtualization####

  • Applications can be moved from one VM to another
  • Load balancing + consolidation + isolation
  • Software Virtualization
  • System Virtualization - Testing use case Hardware/OS virtualization
  • Process Virtualization - OS / User virtualization JVM, .net

####System Virtualization####

  • Native Hypervisors or Type 1 hypervisors
  • Hosted Hypervisors or Type 2 hypervisors
  • Hybrid Hypervisors - Paravirtualization or Binary Translation and Hardware-assisted Virtualization

####Trap and Emulate virtualization####

  • Privilege level or protection ring in processors
  • 0-3 rings - guests operate at ring 3. Privileged action - hypervisor emulates and stores state of every guest.
  • Problems - performance context switches + extra layer of indirection
  • Problems - sensitive instructions are a subset of privileged instructions - otherwise failure.
  • popf in x86 not trapped
  • binary translation - runtime binary rewrite of sensitive instructions
  • paravirtualization - rewrite all sensitive calls with HyperV APIs. but per hypervisor

####Hardware-assisted virtualization####

#####Processor Virtualization#####

  • VT-x - Vanderpool - only on x86 processors
  • Intel - 2 processor modes - VMX root operation (Hyper-V) and VMX non-root operation (guests)

#####Memory virtualization#####

  • TLB, page tables
  • Shadow page tables
  • Extended page tables (EPT) one more level of indirection - Full control by the guest
  • VPID (Virtual processor ID) = TLB contains a virtual id in the buffer

#####IO virtualization#####

  • Interrupt remapping
  • DMA remapping (zerocopy)

####Misc####

  • Grid vs Cloud
  • Data Modeling

####Sharding####

  • Row sharding or Horizontal Partitioning
  • Column sharding or Vertical Partitioning
  • Round-Robin Partitioning
  • Range Partitioning
  • Hash Partitioning - Consistent Hashing
  • Lookup-based Partitioning

####Replication vs Sharding vs backups####

  • Replication - hot standby and failover, redundant data
  • Sharding/Partitioning - Dividing the dataset
  • Backups/Archives - Disaster Recovery

####Functional Programming####

  • Functions as first-class objects
  • Advantages: Composability, Purity (no shared data), Reusability, Idiomatic (improved readability)
  • Disadvantages: Language with GC, Not natural on imperative hardware
  • Higher-order functions

####MapReduce####

  • 2 Functions map and reduce
  • Map: maps from domain to co-domain. Produces a single result in the co-domain or a list of results.
  • Reduce: Summarizing/Aggregation functionality - Aggregates across a bunch of records.
  • Keys are present for distribution among machines (sharding), i.e., each record is a key-value pair.
  • SQL equivalent of MapReduce, SELECT REDUCE(data) FROM dataset GROUP BY (key).
  • Identity Mappers and Reducers

####Hadoop####

  • 2 parts, Compute Layer and Data Layer
  • HDFS basics - blocks, replication and rack-awareness
  • HDFS basics - Namenode
  • Compute basics - Differences between Hadoop 1.0 and 2.0. YARN - Resource Manager, Application Master, Node Manager
  • Distinction between Map tasks and map functions
  • Advantages of YARN
  • MapReduce workflow: InputFormat, RecordReader, InputSplits, Map tasks, Combiners, Shuffle/Sort, Reduce tasks, OutputFormat.

####MapReduce problems####

  • Count number of words in a book
  • Total number of words in a book
  • tf-idf calculations for ranking
  • Cosine distance measure between 2 documents
  • Matrix multiplication in MapReduce
  • ...

####Case Study - Key takeaways####

  • Atomicity to ensure failure handling
  • C10K problem
  • Asynchronous IO to handle C10K problem - Netty, Twisted (higher abstractions) on kqueue, epoll, NIO (primitives)
  • Data store persistence models in Redis - Snapshots, AOF/WAL (Append-only files/Write ahead logs)
  • IMAP - Mail reading, IMAP IDLE - mail push
  • Microservices - Vert.x
  • CORS - Cross-Origin Resource Sharing, JSONP
  • Opennlp, http://www-nlp.stanford.edu/ etc.
  • Solr and Lucene - Problems with multi-tenant Solr

####Stream Processing####

  • Move data to compute
  • Use case: Low latency analytics
  • Batch vs Stream - High accuracy, High latency vs Low accuracy, Low latency
  • Complex Event Processing, Online MapReduce, Stream Processing etc.
  • S4.io, Storm, MS StreamInsight etc.
  • Storm model - Topology of Spouts and Bolts
  • Joins on timestamps.
  • Algorithms are strictly single pass or multi-pass on windows of data (in-memory). Only small amount of data can be held.
  • Bolts have to be fast to prevent backlog/loss of events
  • Windowing - Sliding window, hopping windows, point windows (each data point is looked into individually)

####Lambda Architecture####

  • Best of Batch and Streaming
  • Composed of Batch, Speed and Serving layers
  • User sees merged query from Batch + Speed Layers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment