DataEng '17 - Intro to Spark Workshop Notes

Spark Fundamentals Workshop

April 28, 2017 @ Mode Analytics

Spark vs Hadoop

  • Hadoop is a combination of three projects:
    1. HDFS - the Hadoop distributed file system that stores the data
    2. MapReduce - the batch-processing framework, an open-source implementation of Google's MapReduce
    3. YARN (Yet Another Resource Negotiator) - resource management
  • Spark takes advantage of processing in memory instead of saving intermediate results to disk the way Hadoop does
  • A typical Hadoop workflow involves writing several Java jobs that chain MapReduce steps and save to disk between them (an expensive operation); Spark supports that model but also lets you process entirely in memory
  • Spark is traditionally written in Scala; it works in Python too (PySpark), but with a performance hit

Resilient Distributed Datasets (RDDs)

  • Spark splits the data into partitions in memory
  • Spark uses a master-slave architecture
  • Partitions of data are stored on the workers
  • Master coordinates how the data is processed and read; it plays a purely coordinating role and doesn't do the work itself (see the sketch below)
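
A minimal PySpark sketch of an RDD split across partitions (the app name and partition count here are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")  # master coordinates 4 local worker threads

# Distribute a Python list across 4 partitions; the partitions live
# on the workers, not on the master/driver.
rdd = sc.parallelize(range(1, 9), numSlices=4)
print(rdd.getNumPartitions())  # -> 4
```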

Transformations and Actions

  • Transformations are lazily evaluated; actions are immediate
    • e.g. a transformation like .filter(lambda y: y < 4) turns one "parent RDD" into another "child RDD"
    • e.g. .map(lambda y: (y, 1)) turns 1 -> (1, 1) and 2 -> (2, 1)
    • e.g. .groupByKey() turns the data above into (1, [1, 1, 1]), (2, [1, 1, 1, 1]), etc. (see the sketch after this list)
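
Putting those three operations together in a hedged PySpark sketch (the sample data is made up); note that nothing executes until the action at the end:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-demo")

nums = sc.parallelize([1, 1, 1, 2, 2, 2, 2, 3, 5])

# Transformations: each returns a new child RDD; nothing runs yet.
small = nums.filter(lambda y: y < 4)    # drops the 5
pairs = small.map(lambda y: (y, 1))     # 1 -> (1, 1), 2 -> (2, 1), ...
groups = pairs.groupByKey()             # (1, [1, 1, 1]), (2, [1, 1, 1, 1]), ...

# Action: triggers evaluation of the whole chain.
print([(k, list(v)) for k, v in groups.collect()])
```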

Narrow vs Wide Transformations

  • Narrow: a one-to-one dependency between parent and child partitions (more efficient; no shuffle)
    • One caveat: a join with co-partitioned inputs is also narrow
  • Wide: child partitions require data from several parent partitions (slower; a shuffle with the potential to touch multiple machines); see the sketch below
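
A sketch of the distinction, reusing the SparkContext from above (sample data is illustrative):

```python
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=3)

# Narrow: each partition is transformed independently; no data movement.
doubled = rdd.mapValues(lambda v: v * 2)

# Wide: all values for a key must end up in the same partition, so
# Spark shuffles data across the cluster (a stage boundary).
grouped = rdd.groupByKey()
```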

Actions

  • .collect() - brings every partition of the RDD back to the driver node (the node running the program that called the action); it can crash the driver if the collected data is larger than the driver's memory. Frequently used for debugging, but dangerous on larger datasets; prefer .take() where possible, since it returns only the first n elements (see the sketch below)
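
A quick sketch contrasting the two actions (continuing the hypothetical session above):

```python
big = sc.parallelize(range(10_000_000))

# Risky: materializes all ten million elements in driver memory.
# all_rows = big.collect()

# Safer: pulls back only the first five elements.
print(big.take(5))  # -> [0, 1, 2, 3, 4]
```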

Building a Directed Acyclic Graph (DAG)

  • Transformations are defined but are only executed once an action is called, such as .collect()
  • Airbnb's Airflow is one tool for defining and visualizing DAGs
  • How does Spark actually perform this operation behind the scenes?
    • Determine stages of the data by looking for shuffle boundaries
      • Find where the data is narrow and group those operations together
      • The boundaries are usually drawn right before a wide operation
      • A single slow machine in a Spark cluster can bottleneck the entire system (such machines are known as "stragglers")
    • Turn the data flow through a stage into tasks, e.g. one partition running n transformations on its data. The number of tasks hints at the degree of parallelism in your system (see the sketch below)
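
One way to peek at the stages Spark plans is an RDD's toDebugString(), which prints the lineage with shuffle boundaries shown as indentation changes (a sketch, same hypothetical session; toDebugString() returns bytes in PySpark, hence the decode):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = pairs.reduceByKey(lambda a, b: a + b)

# The printed lineage shows a ShuffledRDD -- the shuffle boundary
# where Spark splits this job into two stages.
print(counts.toDebugString().decode())
```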

Ways to Run Spark

  • Workshop will focus on a local install
  • YARN/Mesos
    • Fun fact: Spark grew out of the Mesos project at UC Berkeley, but it has since superseded Mesos (see the config sketch below)
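
The cluster manager is selected through the master URL in the Spark config; a minimal sketch (the YARN and Mesos lines are illustrative, not part of the workshop's local setup):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("workshop")
conf.setMaster("local[*]")             # local install: use all cores on this machine
# conf.setMaster("yarn")               # or hand resource management to YARN
# conf.setMaster("mesos://host:5050")  # or to a Mesos master (hypothetical URL)

sc = SparkContext(conf=conf)
```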

Spark DataFrames and Spark SQL

  • DataFrames add structure (a schema) on top of the data
  • Spark SQL lets you express transformations and maps using traditional SQL syntax (see the first sketch below)
  • Machine Learning
    • In an ML Pipeline you have a DataFrame, a transformer, an estimator (which trains a model on data), and a pipeline -- a higher-level abstraction from the ML libraries that builds on the structure of DataFrames
  • Persistence and Caching
    • If there's an RDD that's frequently reused or could benefit from caching, call .persist() or .cache()
    • Off-heap - the Tachyon storage system; an experimental feature with an esoteric use case (a sister OSS project that came out of the Mesos work at Berkeley); saves data to a separate JVM
    • Prefer .reduceByKey() over .groupByKey() (!!) -- reduceByKey combines values within each partition before the shuffle, so far less data crosses the network (see the second sketch below)
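
A minimal Spark SQL sketch (column and table names are made up; assumes Spark 2.x, where SparkSession is the entry point):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# A DataFrame is, roughly, an RDD plus a schema.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])

# Express the transformation as traditional SQL...
df.createOrReplaceTempView("events")
spark.sql("SELECT label, COUNT(*) AS n FROM events GROUP BY label").show()

# ...or equivalently with the DataFrame API.
df.groupBy("label").count().show()
```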
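
And a sketch of why reduceByKey is preferred, with caching thrown in (same hypothetical SparkContext as the earlier examples):

```python
words = sc.parallelize(["a", "b", "a", "c", "a", "b"]).map(lambda w: (w, 1))
words.cache()  # reused twice below, so keep it in memory

# groupByKey shuffles every (word, 1) pair across the network,
# then sums on the far side of the shuffle.
slow = words.groupByKey().mapValues(sum)

# reduceByKey sums within each partition first (a map-side combine),
# so far less data crosses the shuffle boundary.
fast = words.reduceByKey(lambda a, b: a + b)

print(sorted(fast.collect()))  # -> [('a', 3), ('b', 2), ('c', 1)]
```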

