@vkroz
Last active January 30, 2023 03:05
Scala

Spark / EMR cookbook

val sample: Array[String] = . . . . .
// 1. foreach style
sample.foreach { println }
// 2. for-comprehension style
for (name <- sample) println(name)
// 3. for-comprehension with a guard: print only names starting with "2"
for (name <- sample if name.startsWith("2"))
  println(name)
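// A minimal sketch, not from the original gist: the same guard combined with
// `yield` builds a new collection instead of printing. `demoNames`,
// `startsWith2` and `viaFilter` are hypothetical names chosen for illustration.
val demoNames: Array[String] = Array("2021-a", "1999-b", "2022-c")
val startsWith2: Array[String] =
  for (name <- demoNames if name.startsWith("2")) yield name
// The comprehension above selects the same elements as an explicit filter
// (the compiler rewrites it to withFilter followed by map).
val viaFilter: Array[String] = demoNames.filter(_.startsWith("2"))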
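// Initializing Spark (in spark-shell, `sc` is already provided)
import org.apache.spark.{SparkConf, SparkContext}
val conf: SparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc: SparkContext = new SparkContext(conf)
// Read text files into an RDD of lines; glob patterns and .gz files are supported
val rdd = sc.textFile("/my/path/*.gz")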
// Persist into memory for fast re-use
rdd.cache()
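// A hedged sketch, not from the original gist: for RDDs, cache() is shorthand
// for persist(StorageLevel.MEMORY_ONLY). When the data may not fit in memory,
// one option is to pick a storage level explicitly. `cacheCtx`, `demoCached`
// and `demoCount` are hypothetical names; the data is made up for illustration.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val cacheCtx = SparkContext.getOrCreate(
  new SparkConf().setAppName("cache-demo").setMaster("local[*]"))
val demoCached = cacheCtx.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)
val demoCount = demoCached.count() // the first action materializes the persisted data
demoCached.unpersist()             // release the storage once it is no longer needed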
// Basic map/reduce: total number of characters across all lines
val lineLength = rdd.map(l => l.length)
val totalLength = lineLength.reduce((a, b) => a + b)
// or as a single expression
rdd.map(s => s.length).reduce((a, b) => a + b)
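// A hedged sketch, not from the original gist: reduce throws on an empty RDD,
// and an Int total can overflow on large inputs. One option is to map to Long
// and use fold with a zero value. `foldCtx`, `demoLines` and `demoTotal` are
// hypothetical names; `demoLines` stands in for an RDD of text lines.
import org.apache.spark.{SparkConf, SparkContext}
val foldCtx = SparkContext.getOrCreate(
  new SparkConf().setAppName("fold-demo").setMaster("local[*]"))
val demoLines = foldCtx.parallelize(Seq("spark", "emr", "cookbook"))
val demoTotal: Long = demoLines.map(_.length.toLong).fold(0L)(_ + _)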
// rdd from collection
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
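// A hedged sketch, not from the original gist: once a local collection is
// parallelized, the usual RDD operations apply. The second argument to
// parallelize (the number of partitions) is optional; 2 is an arbitrary choice.
// `parCtx`, `demoRdd`, `demoSum` and `demoEvens` are hypothetical names.
import org.apache.spark.{SparkConf, SparkContext}
val parCtx = SparkContext.getOrCreate(
  new SparkConf().setAppName("parallelize-demo").setMaster("local[*]"))
val demoRdd = parCtx.parallelize(Seq(1, 2, 3, 4, 5), 2)
val demoSum = demoRdd.reduce((a, b) => a + b)        // sum of the elements
val demoEvens = demoRdd.filter(_ % 2 == 0).collect() // bring the results back to the driver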
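// Accumulator: map is lazy, so updates are applied only when an action runs
val accum = sc.longAccumulator
val mapped = distData.map { x => accum.add(x); x }
mapped.count() // the action triggers the map; accum.value stays 0 until then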