Introduction to Apache Spark

  • What use cases are a good fit for Apache Spark? How to work with Spark?
    • create RDDs, transform them, and execute actions to get result of a computation
    • All computations happen in memory = "memory is cheap" (though we do need enough memory to fit all the data in)
      • the fewer disk operations, the faster (you do know it, don't you?)
    • You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
    • Data usually lives in a distributed file system like Hadoop HDFS or in NoSQL databases like Cassandra
    • Data mining = analysis / insights / analytics
      • log mining
      • twitter stream processing
  • What's RDD?
    • RDD = resilient distributed dataset
    • different types, e.g. key-value (pair) RDDs
    • RDD API (see the sketch after this list)
    • collect
    • take
    • count
    • reduce
    • reduceByKey, groupByKey, sortByKey
    • saveAs...
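
A minimal sketch of these operations in spark-shell, where `sc` is the SparkContext the shell provides (the output path is made up):

```scala
// Inside spark-shell, where sc (a SparkContext) is predefined.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
nums.collect()       // Array(1, 2, 3, 4, 5) -- all elements, fetched to the driver
nums.take(2)         // Array(1, 2)          -- the first n elements
nums.count()         // 5
nums.reduce(_ + _)   // 15

// Key-value (pair) RDDs get extra operations via an implicit conversion.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairs.groupByKey().collect()         // Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2)))
pairs.sortByKey().collect()          // sorted by key

// The saveAs... family writes an RDD out, e.g. one text file per partition.
pairs.saveAsTextFile("/tmp/pairs-out")   // hypothetical output path
```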
  • What's SparkContext?
    • main entry point to Spark
    • sc under the Spark shell
    • create your own in standalone apps (see the sketch below)
    • imports + SparkContext
    • local (1 thread) or local[n] where n is the number of threads to use (use * for the max)
    • on a cluster with a cluster URL spark://host:port
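
A minimal standalone-app sketch, assuming a local master; the app and object names are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {   // hypothetical application
  def main(args: Array[String]): Unit = {
    // local[*] = one thread per core; swap in local, local[n],
    // or a cluster URL such as spark://host:7077.
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).count())   // prints 100
    } finally {
      sc.stop()   // always release the context's resources
    }
  }
}
```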
  • How to create RDDs? (see the sketch below)
    • sc.textFile("hdfs://...")
    • sc.parallelize
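
Both creation paths side by side (paths and host names are placeholders):

```scala
// From external storage -- any Hadoop-supported URI (hdfs://, s3a://, file://).
val lines = sc.textFile("hdfs://namenode:8020/logs/app.log")   // hypothetical path

// From an in-memory collection on the driver (handy for tests and demos).
val numbers = sc.parallelize(1 to 1000)
```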
  • What are transformations? (see the sketch below)
    • they're lazy = nothing happens at the time of calling a transformation
    • filter
    • map
    • knowing how to write applications using functional programming concepts helps a lot
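
A sketch of the laziness: the two transformations below only build up the computation graph, and nothing runs until an action is called (the input file is hypothetical):

```scala
val lines = sc.textFile("app.log")   // hypothetical input file

// Transformations: recorded in the lineage graph, not executed yet.
val errors  = lines.filter(_.contains("ERROR"))
val lengths = errors.map(_.length)

// Only now, at the action, does Spark actually read and process the file.
println(lengths.count())
```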
  • What are actions? (see the sketch below)
    • count
    • first
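
Both actions in one snippet (same hypothetical input file):

```scala
val lines = sc.textFile("app.log")   // hypothetical input file
println(lines.count())   // total number of lines -- runs the whole job
println(lines.first())   // just the first line -- a much cheaper action
```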
  • cache method (see the sketch after this list)
    • Stages of execution
    • laziness and computation graph optimization
    • unpersist
    • Eviction (LRU by default)
    • Cache API
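
A cache/unpersist sketch (hypothetical input):

```scala
val errors = sc.textFile("app.log").filter(_.contains("ERROR"))   // hypothetical input

errors.cache()     // mark the RDD to be kept in memory once computed

errors.count()     // first action: computes the partitions and caches them
errors.count()     // second action: served from the cache, no recomputation

errors.unpersist() // drop the cached partitions explicitly;
                   // otherwise eviction is LRU when memory fills up
```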
  • Spark Cluster
    • a Driver
    • Workers
    • Tasks
    • Executors
    • Tasks run locally or on cluster
    • Clusters using Mesos, YARN or standalone mode
    • Local mode uses local threads
  • Task scheduling
    • task graph
    • aware of data locality & partitions (to avoid shuffles)
  • Supported storage systems
    • using the Hadoop InputFormat API
    • HBase
    • HDFS
    • S3
    • to save the contents of an RDD use rdd.saveAs...
  • Data locality awareness
  • Fault recovery (see the sketch after this list)
    • lineage (to recompute lost data)
    • partitions
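
One way to see the lineage Spark keeps for recomputing lost partitions is toDebugString (the input path is made up):

```scala
val counts = sc.textFile("input.txt")   // hypothetical input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs (the lineage) used for fault recovery.
println(counts.toDebugString)
```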
  • Advanced features
    • controllable partitioning
    • controllable storage formats
    • shared variables: broadcasts & accumulators (see the sketch after this list)
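
A sketch of both kinds of shared variables; the lookup table is invented, and longAccumulator assumes Spark 2.x or later:

```scala
// Broadcast: read-only data shipped once per executor instead of per task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // hypothetical lookup table

// Accumulator: tasks only add to it; the driver reads the total.
val misses = sc.longAccumulator("misses")   // requires Spark 2.x+

val resolved = sc.parallelize(Seq("a", "x", "b")).map { key =>
  if (!lookup.value.contains(key)) misses.add(1)
  lookup.value.getOrElse(key, -1)
}
resolved.collect()      // Array(1, -1, 2)
println(misses.value)   // 1 -- read on the driver after the action
```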
  • Writing Spark applications
    • define a dependency on Spark (see the build.sbt sketch below)
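
A minimal build.sbt sketch; the version numbers are assumptions to be matched to your cluster:

```scala
// build.sbt -- versions are assumptions; align them with your cluster.
name := "my-spark-app"   // hypothetical project name
scalaVersion := "2.12.10"
// "provided" because spark-submit supplies the Spark jars at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"
```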
  • Running Spark applications
    • standalone applications
    • interactive shell
      • the fastest approach to master Spark
      • how to run it - ./bin/spark-shell (FIXME what are the available options)
      • Tab completion
  • Exercises (use Docker whenever possible)
    • Getting started with a simple line count example (see the sketch after this list)
    • Prepare an environment to measure the computation time of different pipelines
    • Read data from Hadoop HDFS, Twitter, Cassandra, Kafka, Redis
    • Setting up a Spark cluster
    • Start spark-shell against the cluster and see how the UI lists it under Running Applications
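
A sketch covering the first two exercises: a line count plus a tiny timing helper (the input file is a placeholder):

```scala
// A crude timing helper for comparing pipelines in spark-shell.
def timed[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000)   // milliseconds
}

val lines = sc.textFile("README.md")   // hypothetical input file
val (n, ms) = timed(lines.count())
println(s"$n lines counted in $ms ms")
```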
  • Administration
    • web UI - standalone master at master:8080
  • Level of Parallelism
    • all the pair RDD ops take an optional second parameter for the number of tasks (see the sketch below)
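
For example, reduceByKey accepts an explicit number of shuffle tasks:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _)      // default parallelism
pairs.reduceByKey(_ + _, 8)   // 8 tasks (partitions) for the shuffle
```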
  • Shuffle service
@maelfosso

Hi,
Thank you very much for this sample and the detailed explanation.

I built an interactive data-query platform using MySQL and Spring Boot that analyses data from many data sources (MySQL, Excel, Access, PostgreSQL). Now I face a problem: the platform crashes because the data is too big, and the operations slow the platform down.

Given that problem, I'm thinking about using Apache Spark for its interactive query abilities. What do you think, please?
I don't know how to make all the databases that contain the data to analyse (MySQL, Excel, Access, PostgreSQL) interact with Apache Spark and Spring (Spring is used to build the client part from which the user will analyse the data). Any suggestions, please?
