// @shlomiv, last active December 21, 2015 12:58
// point at the entire file as an RDD and call it fs (all lazy: nothing is read yet)
val fs = sc.textFile("/data01/fs.txt")
// let's find all lines that contain the string "song" (case-insensitive), and cache that data source
val songs = fs.filter(x=>x.toLowerCase.contains("song")).cache
// now that we are trying to count, all the previous lazy computations have to be realized, so this will take about 85
// seconds to complete, but after that the filtered lines will be completely cached.
songs.count
// let's try that again, now that the data is cached
songs.count
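// the timings quoted in these comments can be checked from the shell. a minimal sketch, assuming a
// hypothetical helper named `timed` (it is not part of Spark): it runs a block, prints the elapsed
// wall-clock time, and returns the block's result unchanged.
def timed[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block // the lazy work, if any, is forced here
  val seconds = (System.nanoTime() - start) / 1e9
  println(label + " took " + seconds + " s")
  result
}
// usage, e.g.: timed("cached count") { songs.count }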
// we now realize that our previous predicate was too general, and included things like "songwriter",
// so say we want only sentences containing the standalone word "song".
val onlysongs = songs.filter(x=>x.contains(" song "))
// let's count. again, this will realize the lazy computation we just defined, but this time it will take just a few seconds
onlysongs.count
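// note that " song " is case-sensitive and misses the word at the start or end of a line, or next to
// punctuation ("song."). a possible alternative, sketched here as an assumption rather than what the
// gist does: match "song" on word boundaries with a case-insensitive regex. the names `songWord` and
// `containsWordSong` are hypothetical.
val songWord = "(?i)\\bsong\\b".r
def containsWordSong(line: String): Boolean = songWord.findFirstIn(line).isDefined
// usage, e.g.: songs.filter(containsWordSong).count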
// now finally, let's write this to the filesystem. this will take longer because of the I/O involved, around 30 seconds
songs.saveAsTextFile("/tmp/songs")