Assessing Spark & Flink from Clojure POV
**Concerns
- Interactivity
-- Incremental extension or modification of running system
- Modularity
-- Pluggable serialization, storage, conveyance, scheduling, lifecycle hooks etc etc
**Spark Summary
- RDDs: represent a lazily-computed distributed dataset
-- each dataset is broken up into 'partitions' that individually sit on different machines
-- just a DAG with edges annotated to describe relationship between partitions of parent and partitions of child
-- DAG is extended via 'transformations' that close over a supplied function
-- eventually an 'action' materializes the root, causing its dependencies to evaluate (toy Clojure sketch after this block)
--- caching & durability policies at nodes; automatic recomputation on worker failure
-- suitable for interactive computation
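
A toy Clojure sketch of the lazy-DAG idea (my own model, not Spark's actual classes): transformations extend a data DAG that closes over a function, and an action walks it to force evaluation.

;; Toy model: each node is a map holding its parent and a per-partition
;; transform; nothing runs until an 'action' walks the DAG.
(defn leaf-rdd [partitions]
  {:partitions partitions :parent nil})

(defn transform [parent f]
  ;; 'transformation': extends the DAG with a node that closes over f
  {:parent parent :xform f})

(defn compute-partitions [{:keys [partitions parent xform]}]
  (if parent
    (map xform (compute-partitions parent))   ; recompute from lineage
    partitions))

(defn collect-action [rdd]
  ;; 'action': materializes the root, forcing its dependencies to evaluate
  (doall (mapcat identity (compute-partitions rdd))))

;; (-> (leaf-rdd [[1 2] [3 4]])
;;     (transform #(map inc %))
;;     collect-action)
;; => (2 3 4 5)
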
- RDD Representation
-- Logical representation (sketched as plain Clojure data after this block):
--- list of partitions
--- function for computing each partition
--- list of 'dependencies' (parent RDDs plus the edge annotation relating their partitions)
--- optional partitioner & optional 'preferred locations'
-- Programmatic representation
--- Concrete classes inheriting from base class
--- No formal programmatic distinction between transformations and actions
--- rddtypeA.transformationX(fn) -> new rddtypeB(fn, rddtypeA)
---- DAG treated as implementation detail, not directly exposed or manipulated
---- Computation context ("Task Context") available to RDD but not directly to fn
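
A hedged Clojure sketch of that logical representation as plain data; the field names are mine, mirroring the list above rather than Spark's Scala API.

(defrecord RDDDescriptor
  [partitions            ; list of partition descriptors
   compute-fn            ; (fn [partition task-context]) -> iterator of elements
   dependencies          ; e.g. [{:parent <RDDDescriptor> :edge :narrow-or-shuffle}]
   partitioner           ; optional: how keys map to partitions
   preferred-locations]) ; optional: partition -> preferred hosts, for locality

;; e.g. a 'mapped' descriptor over a parent: same partitions, one narrow dependency
(defn mapped-rdd [parent f]
  (->RDDDescriptor (:partitions parent)
                   (fn [part ctx] (map f ((:compute-fn parent) part ctx)))
                   [{:parent parent :edge :narrow}]
                   nil
                   nil))
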
- RDD Evaluation
-- RDD value is a logical list of partitions [p1 p2 pN]
-- RDD 'transformations' defined in terms of (transform-fn input-iterator) -> output-iterator
--- e.g. (map (fn [input-partition-iterator] output-partition-iterator)
              [p1-iterator p2-iterator pN-iterator])
--- specialized transforms like elementwise map, filter, mapcat etc. are built on top
--- plays well with the 'sequence' transduction context (would prefer eduction if available?); see the transducer sketch below
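
A sketch of that iterator-in / iterator-out shape in plain Clojure, using 'sequence' with a transducer (illustrative only, no Spark interop).

;; A partition transform is a function from one partition's elements to
;; another's; 'sequence' + a transducer gives the lazy, iterator-friendly shape.
(defn partition-xform [xf]
  (fn [input-partition-iterator]
    ;; iterator-seq adapts a java.util.Iterator; sequence applies xf lazily
    ;; (real interop would still need to hand Spark back a java Iterator)
    (sequence xf (iterator-seq input-partition-iterator))))

;; elementwise map / filter / mapcat fall out as specializations:
(def inc-evens (partition-xform (comp (filter even?) (map inc))))

;; (inc-evens (.iterator [1 2 3 4]))  ;=> (3 5)
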
- Coordination
-- Coordination of job execution happens at driver
--- In the same process as, and interleaved with, the job description
-- User supplies a driver program, typically launched via a shell command (e.g. spark-submit)
-- Driver program builds up RDDs via transformations and performs actions (minimal driver sketch after this block)
--- Behind the scenes, driver process coordinates with workers ('executors') and cluster manager
--- API entry point is a "Spark Context"; leaf RDDs created via Spark Context.
--- RDDs created in driver are serialized and shipped to executors
---- RDD methods are invoked on BOTH driver and executors
-- 'Actions' invoked at the driver trigger executor computation & IO side effects
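
A minimal driver sketch from Clojure via the Java API; this assumes spark-core on the classpath and a local master, and uses only an action that needs no user function, since shipping Clojure fns to executors is exactly where the serialization questions start.

(import '[org.apache.spark SparkConf]
        '[org.apache.spark.api.java JavaSparkContext])

(defn run-driver []
  (let [conf (-> (SparkConf.)
                 (.setAppName "clj-driver-sketch")
                 (.setMaster "local[2]"))
        sc   (JavaSparkContext. conf)]           ; the "Spark Context" entry point
    (try
      (let [rdd (.parallelize sc [1 2 3 4 5])]   ; leaf RDD created via the context
        (.count rdd))                            ; 'action': triggers evaluation
      (finally
        (.stop sc)))))
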
- Inputs & Outputs
-- Data inputs & outputs via side effects; not formalized
--- example input: leaf RDD contains urls; an RDD xform reads data from the urls and outputs via an iterator (sketched below)
-- Jar + main class is supplied to driver
--- driver supplies resources to executors; fairly inscrutable process
-- Driver can launch REPL
--- REPL-created classes supplied to executors via URL classloader
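
The url example sketched as a plain Clojure partition transform (no Spark; network IO only happens once the lazy seq is realized).

;; the 'leaf' partition is a seq of urls; the transform performs the IO side
;; effect and yields the fetched bodies, iterator-in / iterator-out as above
(defn read-urls-xform [url-partition-iterator]
  (map (fn [url] (slurp url))                 ; unformalized IO lives inside the xform
       (iterator-seq url-partition-iterator)))

;; (read-urls-xform (.iterator ["http://example.com"]))
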
- Modularity
-- plug in functions for consuming/creating iterators (great)
-- Downhill from there
-- Fundamental construct (DAG) is hidden behind impl details
--- Cannot be directly fabricated or manipulated
-- Concrete types, not interfaces
--- Worse, Scala types (cumbersome to inhabit from Clojure); the Java API is an afterthought
-- RDD types form a complex web of relationships without formal semantics
-- Conveyance, coordination, and initialization mechanisms are inscrutable and ad hoc
- Interactivity
-- Core design allows extending the computation DAG
--- extension happens within driver process, which has many opaque entanglements
-- Scala shell allows new definitions visible at executors
--- Hacked into the system; not clear how to extend this to Clojure
**Flink Summary
- Job Graph
-- Represents dataflow
--- Edges of system are sources and sinks
--- Various 'operators' in between
-- High level API --> DAG --> Optimizer/Compiler --> 'Job Graph'
--- 'Job Graph' distributed execution managed by JobManager
- Job Graph Representation
-- Set of concrete classes that are just AST-like descriptors (data-shaped sketch after this block)
--- Computation nodes, input & output nodes, edges of various kinds
--- Nodes contain user-supplied functions/components
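
A hedged sketch of the job-graph-as-data idea in Clojure; the shape is mine, not Flink's actual descriptor classes, but it mirrors the source -> operators -> sink structure.

;; nodes hold user-supplied components/functions; edges describe the dataflow
(def word-count-graph
  {:nodes {:src {:type :source   :component :collection-source}
           :tok {:type :operator :component :flat-map      :fn 'split-into-words}
           :cnt {:type :operator :component :group-reduce  :fn 'count-by-word}
           :out {:type :sink     :component :print-sink}}
   :edges [[:src :tok :forward]
           [:tok :cnt :hash-partition]
           [:cnt :out :forward]]})

;; being plain data, a client can fabricate or rewrite the graph before
;; submitting it to the JobManager
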
- Job Graph Evaluation
-- Evaluation executes the dataflow
-- Job Graph turns into an 'Execution Graph' at the JobManager
--- Nodes tell worker components what to do and when to do it, and track high-level state
- Coordination
-- Client process submits Job Graph + resources to JobManager
-- Client process not involved with coordination
- Inputs & Outputs
-- Well-defined interfaces for input and output data
-- Well-defined job submission boundary
-- No built-in facilities for REPL input
- Modularity
-- Fundamental construct (Job Graph) easy to fabricate
-- Interfaces everywhere
--- Most concrete classes are in Java
-- Well-defined components & interactions between them
- Interactivity
-- No affordances specifically for interactivity
-- Provided implementations assume static Job Graph
-- However, it is plausible to modify behavior of Job Manager and of worker nodes