April 28, 2017 @ Mode Analytics
- Hadoop is a combination of three projects
- HDFS (Hadoop Distributed File System) -- the file system that stores the data
- MapReduce -- the batch processing framework, an open-source implementation of Google's MapReduce
- YARN (yet another resource negotiator) - resource management
- Spark takes advantage of processing through memory instead of saving to disk (as Hadoop does)
- Hadoop involves writing several Java jobs that chain MapReduce steps and save to disk between them (an expensive operation); Spark allows for that but also enables you to process via memory
- Spark is traditionally written in Scala; works in Python too (PySpark), but with a performance hit
- Spark splits the data into partitions in memory
- Spark uses a master-slave architecture
- Partitions of data are stored on the workers
- Master coordinates how the data is processed & read; purely a coordination role--doesn't really do work
- Transformations are lazily evaluated; actions are immediate
- Transformation examples (see the PySpark sketch below):
- e.g. one transformation (`.filter(lambda y: y < 4)`) turns one "parent RDD" into another "child RDD"
- e.g. `.map(lambda y: (y, 1))` turns `1` -> `(1, 1)` and `2` -> `(2, 1)`
- e.g. `.groupByKey()` turns the data above into `(1, [1, 1, 1])`, `(2, [1, 1, 1, 1])`, etc.
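A minimal PySpark sketch of those examples -- the local context setup and the sample integers are assumptions for illustration:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(
    SparkConf().setMaster("local[*]").setAppName("transformations-demo"))

# Hypothetical parent RDD of small integers
nums = sc.parallelize([1, 1, 1, 2, 2, 2, 2, 5, 6])

filtered = nums.filter(lambda y: y < 4)   # parent RDD -> child RDD
pairs = filtered.map(lambda y: (y, 1))    # 1 -> (1, 1), 2 -> (2, 1), ...
grouped = pairs.groupByKey()              # groups all values per key

# groupByKey yields iterables; materialize them to inspect
print(sorted((k, list(v)) for k, v in grouped.collect()))
# [(1, [1, 1, 1]), (2, [1, 1, 1, 1])]
```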
- Narrow: one to one transformation with parent/child (more efficient)
- One caveat: join w/ co-partitioned inputs
- Wide: multiple child RDDs require data from single or multiple parent RDDs (slower; potential to touch multiple machines)
- `.collect()` brings all of the RDD's partitions back to the driver node (the node that executed the action); it can crash the machine if the collected data is larger than the driver node's memory -- this command is frequently used for debugging but can be dangerous on larger datasets; use `.take()` instead where possible (safer; see the sketch below)
- Transformations are defined but are only executed once an action is called, such as `.collect()`
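A quick sketch of the `.collect()` vs `.take()` distinction -- the RDD and its size here are made up, and the context is reused from the sketch above:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the already-created local context

# Lazily defined: nothing runs until an action is called
big = sc.parallelize(range(10000000)).map(lambda y: (y % 100, 1))

# big.collect() would ship all ~10M records to the driver -- risky.
# take(n) fetches just n elements, scanning as few partitions as possible.
print(big.take(5))  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```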
- Airbnb's Airflow is a DAG visualization tool
- How does Spark actually perform this operation behind the scenes?
- Determine stages of the data by looking for shuffle boundaries
- Find where the data is narrow and group those operations together
- The boundaries are usually drawn right before a wide operation
- A single slow machine in a Spark cluster will be a bottleneck for the entire system (aka "stragglers")
- Data flow through a stage is turned into tasks, such as one partition running n transformations on its data. The number of tasks hints at the amount of parallelism you have in your system (sketch below)
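One way to see those stage boundaries yourself is `toDebugString()`, which prints the RDD lineage; indentation shifts mark the shuffles. The partition count and data here are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

staged = (sc.parallelize(range(100), 4)   # 4 partitions -> 4 tasks per stage
            .map(lambda y: (y % 10, 1))   # narrow: grouped into the same stage
            .groupByKey())                # wide: shuffle boundary, new stage

print(staged.toDebugString().decode())    # PySpark returns bytes here
```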
- Workshop will focus on a local install
- In production, Spark runs on a cluster manager such as YARN or Mesos
- Fun fact: Spark came out of the Mesos project at Berkeley, but has since superseded it
- DataFrames provide structure on top of the data
- Spark SQL allows you to express transformations and maps using traditional SQL syntax
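A small sketch of that idea -- the table name and rows are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical DataFrame: the named, typed columns are the "structure"
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.createOrReplaceTempView("events")

# The same transformation, expressed in plain SQL
spark.sql("SELECT label, COUNT(*) AS n FROM events GROUP BY label").show()
```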
- Machine Learning
- In an ML pipeline, you have a DataFrame, a transformer, an estimator (which trains a model on the data), and a pipeline -- a higher-level abstraction from the ML libraries that uses the structure of DataFrames (sketch below)
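A sketch of those four pieces wired together, loosely following the standard spark.ml text-classification example; the training rows are made up:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame: hypothetical (text, label) training rows
training = spark.createDataFrame(
    [("spark is great", 1.0), ("disk spills everywhere", 0.0)],
    ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # transformer
hashing_tf = HashingTF(inputCol="words", outputCol="features")  # transformer
lr = LogisticRegression(maxIter=10)                             # estimator

# Pipeline: chains the stages; fit() trains and returns a PipelineModel
model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)
```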
- Persistence and Caching
- If there's an RDD that's frequently used or could benefit from caching, call `.persist()` or `.cache()` on it
- Off heap: Tachyon Storage System; experimental feature with an esoteric use case (a sister OSS project that, like Mesos, came out of Berkeley); saves to a separate JVM
- Prefer `.reduceByKey` over `.groupByKey` (!!) -- see the sketch below
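A short sketch of both tips -- the pair data is hypothetical:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
pairs.persist(StorageLevel.MEMORY_ONLY)  # for RDDs, .cache() is shorthand for this

# reduceByKey combines values map-side before the shuffle, so far less data
# crosses the network than with groupByKey followed by a sum
print(pairs.reduceByKey(lambda a, b: a + b).collect())
# [('a', 2), ('b', 1)] (order may vary)
```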