Assessing Spark & Flink from Clojure POV
**Concerns
- Interactivity
-- Incremental extension or modification of running system
- Modularity
-- Pluggable serialization, storage, conveyance, scheduling, lifecycle hooks etc etc
**Spark Summary
- RDDs: represent a lazily-computed distributed dataset
-- each dataset is broken up into 'partitions' that individually sit on different machines
-- just a DAG with edges annotated to describe relationship between partitions of parent and partitions of child
-- DAG is extended via 'transformations' that close over a supplied function
-- eventually an 'action' materializes the root, causing its dependencies to evaluate (toy Clojure sketch after this block)
--- caching & durability policies at nodes; automatic recomputation on worker failure
-- suitable for interactive computation
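
A toy Clojure sketch of the lazy-DAG idea (my own model, not Spark's actual classes): transformations extend a data DAG that closes over a function, and an action walks it to force evaluation.

;; Toy model: each node is a map holding its parent and a per-partition
;; transform; nothing runs until an 'action' walks the DAG.
(defn leaf-rdd [partitions]
  {:partitions partitions :parent nil})

(defn transform [parent f]
  ;; 'transformation': extends the DAG with a node that closes over f
  {:parent parent :xform f})

(defn compute-partitions [{:keys [partitions parent xform]}]
  (if parent
    (map xform (compute-partitions parent))   ; recompute from lineage
    partitions))

(defn collect-action [rdd]
  ;; 'action': materializes the root, forcing its dependencies to evaluate
  (doall (mapcat identity (compute-partitions rdd))))

;; (-> (leaf-rdd [[1 2] [3 4]])
;;     (transform #(map inc %))
;;     collect-action)
;; => (2 3 4 5)
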
- RDD Representation
-- Logical representation (sketched as plain Clojure data after this block):
--- list of partitions
--- function for computing each partition
--- list of 'dependencies' (parent RDDs plus the edge annotation relating their partitions)
--- optional partitioner & optional 'preferred locations'
-- Programmatic representation
--- Concrete classes inheriting from base class
--- No formal programmatic distinction between transformations and actions
--- rddtypeA.transformationX(fn) -> new rddtypeB(fn, rddtypeA)
---- DAG treated as implementation detail, not directly exposed or manipulated
---- Computation context ("Task Context") available to RDD but not directly to fn
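
A hedged Clojure sketch of that logical representation as plain data; the field names are mine, mirroring the list above rather than Spark's Scala API.

(defrecord RDDDescriptor
  [partitions            ; list of partition descriptors
   compute-fn            ; (fn [partition task-context]) -> iterator of elements
   dependencies          ; e.g. [{:parent <RDDDescriptor> :edge :narrow-or-shuffle}]
   partitioner           ; optional: how keys map to partitions
   preferred-locations]) ; optional: partition -> preferred hosts, for locality

;; e.g. a 'mapped' descriptor over a parent: same partitions, one narrow dependency
(defn mapped-rdd [parent f]
  (->RDDDescriptor (:partitions parent)
                   (fn [part ctx] (map f ((:compute-fn parent) part ctx)))
                   [{:parent parent :edge :narrow}]
                   nil
                   nil))
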
- RDD Evaluation
-- RDD value is a logical list of partitions [p1 p2 pN]
-- RDD 'transformations' defined in terms of (transform-fn input-iterator) -> output-iterator
--- e.g. (map (fn [input-partition-iterator] output-partition-iterator)
              [p1-iterator p2-iterator pN-iterator])
--- specialized transforms like elementwise map, filter, mapcat etc. are built on top
--- plays well with the 'sequence' transduction context (would prefer eduction if available?); see the transducer sketch below
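
A sketch of that iterator-in / iterator-out shape in plain Clojure, using 'sequence' with a transducer (illustrative only, no Spark interop).

;; A partition transform is a function from one partition's elements to
;; another's; 'sequence' + a transducer gives the lazy, iterator-friendly shape.
(defn partition-xform [xf]
  (fn [input-partition-iterator]
    ;; iterator-seq adapts a java.util.Iterator; sequence applies xf lazily
    ;; (real interop would still need to hand Spark back a java Iterator)
    (sequence xf (iterator-seq input-partition-iterator))))

;; elementwise map / filter / mapcat fall out as specializations:
(def inc-evens (partition-xform (comp (filter even?) (map inc))))

;; (inc-evens (.iterator [1 2 3 4]))  ;=> (3 5)
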
- Coordination
-- Coordination of job execution happens at driver
--- In the same process as, and interleaved with, the job description
-- User supplies a driver program, typically launched via a shell command (e.g. spark-submit)
-- Driver program builds up RDDs via transformations and performs actions (minimal driver sketch after this block)
--- Behind the scenes, driver process coordinates with workers ('executors') and cluster manager
--- API entry point is a "Spark Context"; leaf RDDs created via Spark Context.
--- RDDs created in driver are serialized and shipped to executors
---- RDD methods are invoked on BOTH driver and executors
-- 'Actions' invoked at the driver trigger executor computation & IO side effects
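
A minimal driver sketch from Clojure via the Java API; this assumes spark-core on the classpath and a local master, and uses only an action that needs no user function, since shipping Clojure fns to executors is exactly where the serialization questions start.

(import '[org.apache.spark SparkConf]
        '[org.apache.spark.api.java JavaSparkContext])

(defn run-driver []
  (let [conf (-> (SparkConf.)
                 (.setAppName "clj-driver-sketch")
                 (.setMaster "local[2]"))
        sc   (JavaSparkContext. conf)]           ; the "Spark Context" entry point
    (try
      (let [rdd (.parallelize sc [1 2 3 4 5])]   ; leaf RDD created via the context
        (.count rdd))                            ; 'action': triggers evaluation
      (finally
        (.stop sc)))))
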
- Inputs & Outputs
-- Data inputs & outputs via side effects; not formalized
--- example input: leaf RDD contains urls; an RDD xform reads data from the urls and outputs via an iterator (sketched below)
-- Jar + main class is supplied to driver
--- driver supplies resources to executors; fairly inscrutable process
-- Driver can launch REPL
--- REPL-created classes supplied to executors via URL classloader
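
The url example sketched as a plain Clojure partition transform (no Spark; network IO only happens once the lazy seq is realized).

;; the 'leaf' partition is a seq of urls; the transform performs the IO side
;; effect and yields the fetched bodies, iterator-in / iterator-out as above
(defn read-urls-xform [url-partition-iterator]
  (map (fn [url] (slurp url))                 ; unformalized IO lives inside the xform
       (iterator-seq url-partition-iterator)))

;; (read-urls-xform (.iterator ["http://example.com"]))
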
- Modularity
-- plug in functions for consuming/creating iterators (great)
-- Downhill from there
-- Fundamental construct (DAG) is hidden behind impl details
--- Cannot be directly fabricated or manipulated
-- Concrete types, not interfaces
--- Worse, Scala types (cumbersome to inhabit from Clojure); the Java API is an afterthought
-- RDD types form a complex web of relationships without formal semantics
-- Conveyance, coordination, and initialization mechanisms are inscrutable and ad hoc
- Interactivity
-- Core design allows extending the computation DAG
--- extension happens within driver process, which has many opaque entanglements
-- Scala shell allows new definitions visible at executors
--- Hacked into the system; not clear how to extend this to Clojure
**Flink Summary
- Job Graph
-- Represents dataflow
--- Edges of system are sources and sinks
--- Various 'operators' in between
-- High level API --> DAG --> Optimizer/Compiler --> 'Job Graph'
--- 'Job Graph' distributed execution managed by JobManager
- Job Graph Representation
-- Set of concrete classes that are just AST-like descriptors (data-shaped sketch after this block)
--- Computation nodes, input & output nodes, edges of various kinds
--- Nodes contain user-supplied functions/components
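
A hedged sketch of the job-graph-as-data idea in Clojure; the shape is mine, not Flink's actual descriptor classes, but it mirrors the source -> operators -> sink structure.

;; nodes hold user-supplied components/functions; edges describe the dataflow
(def word-count-graph
  {:nodes {:src {:type :source   :component :collection-source}
           :tok {:type :operator :component :flat-map      :fn 'split-into-words}
           :cnt {:type :operator :component :group-reduce  :fn 'count-by-word}
           :out {:type :sink     :component :print-sink}}
   :edges [[:src :tok :forward]
           [:tok :cnt :hash-partition]
           [:cnt :out :forward]]})

;; being plain data, a client can fabricate or rewrite the graph before
;; submitting it to the JobManager
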
- Job Graph Evaluation
-- Evaluation executes the dataflow
-- Job Graph turns into an 'Execution Graph' at the JobManager
--- Nodes tell worker components what to do and when to do it, and track high-level state
- Coordination
-- Client process submits Job Graph + resources to JobManager
-- Client process not involved with coordination
- Inputs & Outputs
-- Well-defined interfaces for input and output data
-- Well-defined job submission boundary
-- No built-in facilities for REPL input
- Modularity
-- Fundamental construct (Job Graph) easy to fabricate
-- Interfaces everywhere
--- Most concrete classes are in Java
-- Well-defined components & interactions between them
- Interactivity
-- No affordances specifically for interactivity
-- Provided implementations assume static Job Graph
-- However, it is plausible to modify behavior of Job Manager and of worker nodes