@ceteri
Last active May 14, 2020 13:12
Intro to Apache Spark: code example for RDD animation
// load error messages from a log into memory
// then interactively search for various patterns
// base RDD
val lines = sc.textFile("log.txt")
// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()
// actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()
ERROR	php: dying for unknown reasons
WARN	dave, are you angry at me?
ERROR	did mysql just barf?
WARN	xylons approaching
ERROR	mysql cluster: replace with spark cluster
scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (1 partitions)
MappedRDD[3] at map at <console>:16 (1 partitions)
FilteredRDD[2] at filter at <console>:14 (1 partitions)
MappedRDD[1] at textFile at <console>:12 (1 partitions)
HadoopRDD[0] at textFile at <console>:12 (1 partitions)
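
The same filter / split / map steps can be tried without a Spark shell. The following is a minimal sketch, not part of the original gist: it uses plain Scala collections in place of RDDs, and the names `LogSearchSketch`, `sampleLines`, `errorMessages` and `countContaining` are hypothetical. It assumes the level and the message in each log line are separated by a tab, which is what `split("\t")` in the gist expects.

```scala
// Plain Scala collections standing in for RDDs (no Spark required).
// Assumption: each log line is "<LEVEL>\t<message>", matching split("\t") in the gist.
object LogSearchSketch {
  // the five sample log lines, with the tab written explicitly
  val sampleLines: Seq[String] = Seq(
    "ERROR\tphp: dying for unknown reasons",
    "WARN\tdave, are you angry at me?",
    "ERROR\tdid mysql just barf?",
    "WARN\txylons approaching",
    "ERROR\tmysql cluster: replace with spark cluster"
  )

  // same steps as the gist: keep ERROR lines, split on tab, take the message field
  def errorMessages(lines: Seq[String]): Seq[String] =
    lines.filter(_.startsWith("ERROR")).map(_.split("\t")).map(r => r(1))

  // counterpart of messages.filter(_.contains(pattern)).count()
  def countContaining(messages: Seq[String], pattern: String): Int =
    messages.count(_.contains(pattern))

  def main(args: Array[String]): Unit = {
    val messages = errorMessages(sampleLines)
    println(countContaining(messages, "mysql")) // 2
    println(countContaining(messages, "php"))   // 1
  }
}
```

Unlike the RDD version, these collections are evaluated eagerly and keep no lineage, so there is no counterpart to `cache()` or `toDebugString` here; the sketch only checks the filtering logic against the sample log.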