Introducing Apache Spark

  • What use cases are a good fit for Apache Spark? How do you work with Spark?
    • create RDDs, transform them, and execute actions to get the result of a computation
    • All computations happen in memory = "memory is cheap" (we do need enough memory to fit all the data in)
      • the fewer disk operations, the faster (you do know it, don't you?)
    • You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
    • Data usually lives on a distributed file system like Hadoop HDFS or in a NoSQL database like Cassandra
    • Data mining = analysis / insights / analytics
      • log mining
      • twitter stream processing
  • What's an RDD?
    • RDD = resilient distributed dataset
    • different types, e.g. key-value RDDs
    • RDD API (see the sketch after this list)
      • collect
      • take
      • count
      • reduce
      • reduceByKey, groupByKey, sortByKey
      • saveAs...
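
A minimal word-count sketch of that API in Scala, assuming a SparkContext named sc (as in the Spark shell); the HDFS paths are hypothetical.

```scala
val lines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
val counts = lines
  .flatMap(_.split("\\s+"))   // transformation: lines -> words
  .map(word => (word, 1))     // transformation: build a key-value RDD
  .reduceByKey(_ + _)         // transformation: sum the counts per word

counts.take(5)                // action: first 5 (word, count) pairs
counts.count()                // action: number of distinct words
counts.saveAsTextFile("hdfs://namenode:8020/data/word-counts")
```
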
  • What's SparkContext?
    • the main entry point to Spark
    • sc under the Spark shell
    • create your own in standalone apps (see the sketch after this list)
    • imports + new SparkContext
    • master: local (1 thread) or local[n] where n is the number of threads to use (use * for the max)
    • on a cluster: the cluster URL spark://host:port
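
A sketch of creating your own SparkContext in a standalone app; the app name and master URL below are assumptions, not part of the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my-spark-app")   // hypothetical name
      .setMaster("local[*]")        // or a cluster URL such as "spark://host:7077"
    val sc = new SparkContext(conf)

    // ... create and process RDDs here ...

    sc.stop()
  }
}
```
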
  • How to create RDDs? (see the sketch after this list)
    • sc.textFile("hdfs://...")
    • sc.parallelize
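
A quick sketch of both creation methods, assuming an existing sc; the path is hypothetical.

```scala
// From a file on a distributed file system (one element per line)
val fromFile = sc.textFile("hdfs://namenode:8020/logs/app.log")

// From a local Scala collection
val fromCollection = sc.parallelize(1 to 1000)
```
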
  • What are transformations?
    • they're lazy = nothing is computed at the time of calling a transformation; only an action triggers execution (see the sketch after the actions list below)
    • filter
    • map
    • knowing how to write applications using functional programming concepts helps a lot
  • What are actions?
    • count
    • first
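
A laziness sketch, assuming an existing sc: transformations only describe the computation, actions trigger it.

```scala
val numbers = sc.parallelize(1 to 100)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing computed yet
val squares = evens.map(n => n * n)        // still nothing computed

squares.count()   // action: runs the whole pipeline
squares.first()   // action: another job over the same lineage
```
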
  • cache method (see the sketch after this list)
    • Stages of execution
    • laziness and computation graph optimization
    • unpersist
    • Eviction (LRU by default)
    • Cache API
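
A caching sketch, assuming an existing sc and a hypothetical log file.

```scala
import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("hdfs://namenode:8020/logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_ONLY)       // same as errors.cache()
errors.count()                                 // first action materialises and caches the RDD
errors.filter(_.contains("timeout")).count()   // reuses the cached partitions

errors.unpersist()                             // explicit eviction (otherwise LRU applies under memory pressure)
```
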
  • Spark Cluster
    • a Driver
    • Workers
    • Tasks
    • Executors
    • Tasks run locally or on cluster
    • Clusters using Mesos, YARN or standalone mode
    • Local mode uses local threads
  • Task scheduling
    • task graph
    • aware of data locality & partitions (to avoid shuffles) - see the sketch after this list
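
A partitioning sketch, assuming an existing sc; the data and the partition count of 8 are made up. Co-partitioning both sides lets the join avoid a full shuffle.

```scala
import org.apache.spark.HashPartitioner

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val visits = sc.parallelize(Seq((1, "/home"), (1, "/about"), (3, "/docs")))

val partitioner = new HashPartitioner(8)
val usersPart   = users.partitionBy(partitioner).cache()
val visitsPart  = visits.partitionBy(partitioner)

usersPart.join(visitsPart).collect()   // both sides share a partitioner, so no full shuffle
usersPart.partitions.length            // 8
```
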
  • Supported storage systems
    • using the Hadoop InputFormat API
    • HBase
    • HDFS
    • S3
    • to save the content of an RDD use rdd.saveAs... (see the sketch after this list)
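
A storage sketch, assuming an existing sc, a reachable HDFS namenode and an S3 bucket with the S3A connector and credentials configured; all paths are hypothetical.

```scala
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/events")
val fromS3   = sc.textFile("s3a://my-bucket/data/events")

fromHdfs.saveAsTextFile("hdfs://namenode:8020/data/events-copy")
```
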
  • Data locality awareness
  • Fault recovery
    • lineage (to recompute lost data) - see the sketch after this list
    • partitions
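
A lineage sketch, assuming an existing sc: Spark remembers how each RDD was derived, so lost partitions can be recomputed from their parents.

```scala
val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(rdd.toDebugString)   // prints the chain of parent RDDs, i.e. the lineage
```
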
  • Advanced features
    • controllable partitioning
    • controllable storage formats
    • shared variables: broadcasts & accumulators (see the sketch after this list)
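
A shared-variables sketch, assuming Spark 2.x (sc.longAccumulator; Spark 1.x used sc.accumulator) and an existing sc; the lookup table and data are made up.

```scala
val lookup   = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only value shipped once per executor
val badKeys  = sc.longAccumulator("bad keys")          // counter aggregated back on the driver

val data = sc.parallelize(Seq("a", "b", "x", "a"))
val resolved = data.map { key =>
  if (!lookup.value.contains(key)) badKeys.add(1)
  lookup.value.getOrElse(key, 0)
}

resolved.collect()
println(badKeys.value)   // 1 ("x" was not in the lookup table)
```
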
  • Writing Spark applications
    • define a dependency on Spark (see the build sketch after this list)
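
A minimal build.sbt sketch (sbt build definitions are Scala); the Scala and Spark versions are assumptions - match them to your cluster.

```scala
name := "my-spark-app"

scalaVersion := "2.11.8"

// "provided" because spark-submit puts the Spark jars on the classpath at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
```
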
  • Running Spark applications
    • standalone applications
    • interactive shell - the fastest approach to master Spark
    • How to run it - ./bin/spark-shell (FIXME what are the available options)
    • Tab completion
  • Exercises (use Docker whenever possible)
    • Getting started using simple line count example
    • Prepare environment to measure computation time of different pipelines
    • Read data from Hadoop HDFS, Twitter, Cassandra, Kafka, Redis
    • Setting up a Spark cluster
    • Start spark-shell using the cluster and see how UI shows it under Running Applications
  • Administration
    • web UI - standalone master on port 8080
  • Level of Parallelism
    • All the pair RDD ops take an optional second parameter for the number of tasks (see the sketch after this list)
  • Shuffle service
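
A parallelism sketch, assuming an existing sc; the partition count of 16 is an arbitrary example.

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _, 16)   // optional second argument: number of partitions/tasks for the shuffle
counts.partitions.length                    // 16
```
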