April 28, 2017 @ Mode Analytics
- Hadoop is a combination of three projects
- HDFS (Hadoop Distributed File System) -- the file system that stores the data
- MapReduce -- the batch processing framework, an open-source implementation of Google's MapReduce
- YARN (yet another resource negotiator) - resource management
- Spark takes advantage of processing through memory instead of saving to disk (as Hadoop does)
- Hadoop involves writing several Java jobs that chain MapReduce steps and save to disk between them (an expensive operation); Spark allows for that but also enables you to process via memory
- Spark is traditionally written in Scala; works in Python too (PySpark), but with a performance hit
- Spark splits the data into partitions in memory
- Spark uses a master-slave architecture
- Partitions of data are stored on the workers
- Master coordinates how the data is processed & read; purely a coordination role--doesn't really do work
- Transformations are lazily evaluated; actions are immediate
- Transformation examples (see the PySpark sketch below):
- e.g. one transformation (`.filter(lambda y: y < 4)`) turns one "parent RDD" into another "child RDD"
- e.g. `.map(lambda y: (y, 1))` turns `1` -> `(1, 1)` and `2` -> `(2, 1)`
- e.g. `.groupByKey()` turns the data above into `(1, [1, 1, 1])`, `(2, [1, 1, 1, 1])`, etc.
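A minimal PySpark sketch of those examples -- the local context setup and the sample integers are assumptions for illustration:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(
    SparkConf().setMaster("local[*]").setAppName("transformations-demo"))

# Hypothetical parent RDD of small integers
nums = sc.parallelize([1, 1, 1, 2, 2, 2, 2, 5, 6])

filtered = nums.filter(lambda y: y < 4)   # parent RDD -> child RDD
pairs = filtered.map(lambda y: (y, 1))    # 1 -> (1, 1), 2 -> (2, 1), ...
grouped = pairs.groupByKey()              # groups all values per key

# groupByKey yields iterables; materialize them to inspect
print(sorted((k, list(v)) for k, v in grouped.collect()))
# [(1, [1, 1, 1]), (2, [1, 1, 1, 1])]
```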
- Narrow: one to one transformation with parent/child (more efficient)
- One caveat: join w/ co-partitioned inputs
- Wide: multiple child RDDs require data from single or multiple parent RDDs (slower; potential to touch multiple machines)
- `.collect()` brings all of the RDD's partitions back to the driver node (the node that executed the action); it can crash the machine if the collected data is larger than the driver node's memory -- this command is frequently used for debugging but can be dangerous on larger datasets; use `.take()` instead where possible (safer; see the sketch below)
- Transformations are defined but are only executed once an action is called, such as `.collect()`
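A quick sketch of the `.collect()` vs `.take()` distinction -- the RDD and its size here are made up, and the context is reused from the sketch above:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the already-created local context

# Lazily defined: nothing runs until an action is called
big = sc.parallelize(range(10000000)).map(lambda y: (y % 100, 1))

# big.collect() would ship all ~10M records to the driver -- risky.
# take(n) fetches just n elements, scanning as few partitions as possible.
print(big.take(5))  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```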
- Airbnb's Airflow is a DAG visualization tool
- How does Spark actually perform this operation behind the scenes?
- Determine stages of the data by looking for shuffle boundaries
- Find where the data is narrow and group those operations together
- The boundaries are usually drawn right before a wide operation
- A single slow machine in a Spark cluster will be a bottleneck for the entire system (aka "stragglers")
- Data flow through a stage is turned into tasks, such as one partition running n transformations on its data. The number of tasks hints at the amount of parallelism you have in your system (sketch below)
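One way to see those stage boundaries yourself is `toDebugString()`, which prints the RDD lineage; indentation shifts mark the shuffles. The partition count and data here are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

staged = (sc.parallelize(range(100), 4)   # 4 partitions -> 4 tasks per stage
            .map(lambda y: (y % 10, 1))   # narrow: grouped into the same stage
            .groupByKey())                # wide: shuffle boundary, new stage

print(staged.toDebugString().decode())    # PySpark returns bytes here
```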
- Workshop will focus on a local install
- In production, Spark runs on a cluster manager such as YARN or Mesos
- Fun fact: Spark came out of the Mesos project at Berkeley, but has since superseded it
- DataFrames provide structure on top of the data
- Spark SQL allows you to express transformations and maps using traditional SQL syntax
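A small sketch of that idea -- the table name and rows are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical DataFrame: the named, typed columns are the "structure"
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.createOrReplaceTempView("events")

# The same transformation, expressed in plain SQL
spark.sql("SELECT label, COUNT(*) AS n FROM events GROUP BY label").show()
```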
- Machine Learning
- In an ML pipeline, you have a DataFrame, a transformer, an estimator (which trains a model on the data), and a pipeline -- a higher-level abstraction from the ML libraries that uses the structure of DataFrames (sketch below)
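A sketch of those four pieces wired together, loosely following the standard spark.ml text-classification example; the training rows are made up:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame: hypothetical (text, label) training rows
training = spark.createDataFrame(
    [("spark is great", 1.0), ("disk spills everywhere", 0.0)],
    ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # transformer
hashing_tf = HashingTF(inputCol="words", outputCol="features")  # transformer
lr = LogisticRegression(maxIter=10)                             # estimator

# Pipeline: chains the stages; fit() trains and returns a PipelineModel
model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)
```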
- Persistence and Caching
- If there's an RDD that's frequently used or could benefit from caching, call `.persist()` or `.cache()` on it
- Off heap: Tachyon Storage System; experimental feature with an esoteric use case (a sister OSS project that, like Mesos, came out of Berkeley); saves to a separate JVM
- Prefer `.reduceByKey` over `.groupByKey` (!!) -- see the sketch below
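A short sketch of both tips -- the pair data is hypothetical:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
pairs.persist(StorageLevel.MEMORY_ONLY)  # for RDDs, .cache() is shorthand for this

# reduceByKey combines values map-side before the shuffle, so far less data
# crosses the network than with groupByKey followed by a sum
print(pairs.reduceByKey(lambda a, b: a + b).collect())
# [('a', 2), ('b', 1)] (order may vary)
```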