Introduction to Apache Spark

  • What use cases are a good fit for Apache Spark? How to work with Spark?
    • create RDDs, transform them, and execute actions to get result of a computation
    • All computations happen in memory = "memory is cheap" (though we do need enough memory to fit all the data in)
      • the fewer disk operations, the faster (you do know it, don't you?)
    • You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
    • Data usually lives in a distributed file system like Hadoop HDFS or in NoSQL databases like Cassandra
    • Data mining = analysis / insights / analytics
      • log mining
      • twitter stream processing
  • What's RDD?
    • RDD = resilient distributed dataset
    • different types, e.g. key-value (pair) RDDs
    • RDD API (see the sketch after this list)
    • collect
    • take
    • count
    • reduce
    • reduceByKey, groupByKey, sortByKey
    • saveAs...
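
A minimal sketch of these operations in spark-shell, where `sc` is the SparkContext the shell provides (the output path is made up):

```scala
// Inside spark-shell, where sc (a SparkContext) is predefined.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
nums.collect()       // Array(1, 2, 3, 4, 5) -- all elements, fetched to the driver
nums.take(2)         // Array(1, 2)          -- the first n elements
nums.count()         // 5
nums.reduce(_ + _)   // 15

// Key-value (pair) RDDs get extra operations via an implicit conversion.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairs.groupByKey().collect()         // Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2)))
pairs.sortByKey().collect()          // sorted by key

// The saveAs... family writes an RDD out, e.g. one text file per partition.
pairs.saveAsTextFile("/tmp/pairs-out")   // hypothetical output path
```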
  • What's SparkContext?
    • main entry point to Spark
    • sc under the Spark shell
    • create your own in standalone apps (see the sketch below)
    • imports + SparkContext
    • local (1 thread) or local[n] where n is the number of threads to use (use * for the max)
    • on a cluster with a cluster URL spark://host:port
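
A minimal standalone-app sketch, assuming a local master; the app and object names are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {   // hypothetical application
  def main(args: Array[String]): Unit = {
    // local[*] = one thread per core; swap in local, local[n],
    // or a cluster URL such as spark://host:7077.
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).count())   // prints 100
    } finally {
      sc.stop()   // always release the context's resources
    }
  }
}
```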
  • How to create RDDs? (see the sketch below)
    • sc.textFile("hdfs://...")
    • sc.parallelize
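
Both creation paths side by side (paths and host names are placeholders):

```scala
// From external storage -- any Hadoop-supported URI (hdfs://, s3a://, file://).
val lines = sc.textFile("hdfs://namenode:8020/logs/app.log")   // hypothetical path

// From an in-memory collection on the driver (handy for tests and demos).
val numbers = sc.parallelize(1 to 1000)
```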
  • What are transformations? (see the sketch below)
    • they're lazy = nothing happens at the time of calling a transformation
    • filter
    • map
    • knowing how to write applications using functional programming concepts helps a lot
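
A sketch of the laziness: the two transformations below only build up the computation graph, and nothing runs until an action is called (the input file is hypothetical):

```scala
val lines = sc.textFile("app.log")   // hypothetical input file

// Transformations: recorded in the lineage graph, not executed yet.
val errors  = lines.filter(_.contains("ERROR"))
val lengths = errors.map(_.length)

// Only now, at the action, does Spark actually read and process the file.
println(lengths.count())
```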
  • What are actions? (see the sketch below)
    • count
    • first
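
Both actions in one snippet (same hypothetical input file):

```scala
val lines = sc.textFile("app.log")   // hypothetical input file
println(lines.count())   // total number of lines -- runs the whole job
println(lines.first())   // just the first line -- a much cheaper action
```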
  • cache method (see the sketch after this list)
    • Stages of execution
    • laziness and computation graph optimization
    • unpersist
    • Eviction (LRU by default)
    • Cache API
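
A cache/unpersist sketch (hypothetical input):

```scala
val errors = sc.textFile("app.log").filter(_.contains("ERROR"))   // hypothetical input

errors.cache()     // mark the RDD to be kept in memory once computed

errors.count()     // first action: computes the partitions and caches them
errors.count()     // second action: served from the cache, no recomputation

errors.unpersist() // drop the cached partitions explicitly;
                   // otherwise eviction is LRU when memory fills up
```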
  • Spark Cluster
    • a Driver
    • Workers
    • Tasks
    • Executors
    • Tasks run locally or on cluster
    • Clusters using Mesos, YARN or standalone mode
    • Local mode uses local threads
  • Task scheduling
    • task graph
    • aware of data locality & partitions (to avoid shuffles)
  • Supported storage systems
    • using the Hadoop InputFormat API
    • HBase
    • HDFS
    • S3
    • to save the contents of an RDD use rdd.saveAs...
  • Data locality awareness
  • Fault recovery (see the sketch after this list)
    • lineage (to recompute lost data)
    • partitions
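
One way to see the lineage Spark keeps for recomputing lost partitions is toDebugString (the input path is made up):

```scala
val counts = sc.textFile("input.txt")   // hypothetical input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs (the lineage) used for fault recovery.
println(counts.toDebugString)
```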
  • Advanced features
    • controllable partitioning
    • controllable storage formats
    • shared variables: broadcasts & accumulators (see the sketch after this list)
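
A sketch of both kinds of shared variables; the lookup table is invented, and longAccumulator assumes Spark 2.x or later:

```scala
// Broadcast: read-only data shipped once per executor instead of per task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // hypothetical lookup table

// Accumulator: tasks only add to it; the driver reads the total.
val misses = sc.longAccumulator("misses")   // requires Spark 2.x+

val resolved = sc.parallelize(Seq("a", "x", "b")).map { key =>
  if (!lookup.value.contains(key)) misses.add(1)
  lookup.value.getOrElse(key, -1)
}
resolved.collect()      // Array(1, -1, 2)
println(misses.value)   // 1 -- read on the driver after the action
```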
  • Writing Spark applications
    • define a dependency on Spark (see the build.sbt sketch below)
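
A minimal build.sbt sketch; the version numbers are assumptions to be matched to your cluster:

```scala
// build.sbt -- versions are assumptions; align them with your cluster.
name := "my-spark-app"   // hypothetical project name
scalaVersion := "2.12.10"
// "provided" because spark-submit supplies the Spark jars at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"
```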
  • Running Spark applications
    • standalone applications
    • interactive shell
      • the fastest approach to master Spark
      • how to run it - ./bin/spark-shell (FIXME what are the available options)
      • Tab completion
  • Exercises (use Docker whenever possible)
    • Getting started with a simple line count example (see the sketch after this list)
    • Prepare an environment to measure the computation time of different pipelines
    • Read data from Hadoop HDFS, Twitter, Cassandra, Kafka, Redis
    • Setting up a Spark cluster
    • Start spark-shell against the cluster and see how the UI lists it under Running Applications
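
A sketch covering the first two exercises: a line count plus a tiny timing helper (the input file is a placeholder):

```scala
// A crude timing helper for comparing pipelines in spark-shell.
def timed[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000)   // milliseconds
}

val lines = sc.textFile("README.md")   // hypothetical input file
val (n, ms) = timed(lines.count())
println(s"$n lines counted in $ms ms")
```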
  • Administration
    • web UI - standalone master at master:8080
  • Level of Parallelism
    • all the pair RDD ops take an optional second parameter for the number of tasks (see the sketch below)
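
For example, reduceByKey accepts an explicit number of shuffle tasks:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _)      // default parallelism
pairs.reduceByKey(_ + _, 8)   // 8 tasks (partitions) for the shuffle
```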
  • Shuffle service
@maelfosso

Hi,
Thank you very much for this sample and the detailed explanation.

I built an interactive data-query platform using MySQL and Spring Boot that analyses data from many data sources (MySQL, Excel, Access, PostgreSQL). Now I face a problem: the platform crashes because the data is too big, and the operations slow the platform down.

Given that problem, I'm thinking about using Apache Spark for its interactive query abilities. What do you think, please?
I don't know how to make all the databases that contain the data to analyse (MySQL, Excel, Access, PostgreSQL) interact with Apache Spark and Spring (Spring is used to build the client part from which the user will analyse the data). Any suggestions, please?
