Learn Scalding with Alice
git clone
cd scalding
./sbt scalding-repl/console
import scala.io.Source
// Read the book as an iterator of lines
val alice = Source.fromURL("").getLines
// Add the line numbers, which we might want later
val aliceLineNum = alice.zipWithIndex.toList
// Now for scalding, TypedPipe is the main scalding object representing
// your data.
val alicePipe = TypedPipe.from(aliceLineNum)
val aliceWordList = alicePipe.map { line => line._1.split("\\s+").toList }
// Three things: map, function, tuples
// but that's ugly, so we can use tuple matching to be clearer:
val aliceWordList = alicePipe.map { case (text, lineno) =>
  text.split("\\s+").toList
}
// But we want words, not lists of words. We need to flatten!
val aliceWords = aliceWordList.flatten
// Scala has a common function for this map + flatten == flatMap
val aliceWords = alicePipe.flatMap { case (text, _) => text.split("\\s+").toList }
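// A minimal plain-Scala sketch of the same identity, runnable without Scalding.
// `sampleLines` is made-up illustration data, not text from the book.
val sampleLines = List(("down the rabbit hole", 0), ("the pool of tears", 1))
val sampleViaFlatten = sampleLines.map { case (text, _) => text.split("\\s+").toList }.flatten
val sampleViaFlatMap = sampleLines.flatMap { case (text, _) => text.split("\\s+").toList }
// Both hold the same eight words, in the same order
assert(sampleViaFlatten == sampleViaFlatMap)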
// Now let's add a count for each word:
val aliceWithCount = aliceWords.map { word => (word, 1L) }
// let's sum them for each word:
val wordCount = aliceWithCount.group.sum
// or: .group.sum == .sumByKey
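// For intuition, a plain-Scala sketch of what summing by key computes.
// `sampleWords` is a made-up in-memory list, used only for illustration.
val sampleWords = List("the", "rabbit", "the", "queen", "the")
val sampleCounts = sampleWords
  .map { word => (word, 1L) }
  // plays the role of .group
  .groupBy { case (word, _) => word }
  // plays the role of .sum
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// sampleCounts("the") is 3 and sampleCounts("queen") is 1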
// let's print them to the screen (REPL only)
wordCount.dump
// Let's print just the ones with more than 100 appearances:
wordCount.filter { case (word, count) => count > 100 }.dump
// but which is the most frequent word?
// use :paste to enter multi-line expressions
val top10 = wordCount
  .groupAll
  .sortBy { case (word, count) => -count }
  .take(10)
top10.dump
// Where is Alice? What is with the ()?
// use :paste to enter multi-line expressions
val top20 = wordCount
  .groupAll
  .sortBy { case (word, count) => -count }
  .take(20)
  .values // ignore the ()-all key
top20.dump
// there she is!
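// The same top-N idea on a plain Scala Map, as a sketch.
// `sampleTally` holds made-up counts, not real counts from the book.
val sampleTally = Map("the" -> 3L, "rabbit" -> 1L, "queen" -> 2L)
val sampleTop2 = sampleTally.toList
  .sortBy { case (_, count) => -count } // largest count first
  .take(2)
// sampleTop2 is List(("the", 3), ("queen", 2))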
// what is the last line on which each word appears?
// How to solve this?
// * (flat)map text to (word, lineno) pairs
// * for each word, take the maximum line num
// * then join the line number to the original input
val wordLine = alicePipe.flatMap { case (text, line) =>
  text.split("\\s+").toList.map { word => (word, line) }
}
// Take the max
// see all the functions on grouped things here:
val lastLine = wordLine.group.max
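// now look up the initial line (the val name here is an assumption):
val wordWithLastLine = lastLine
  .map { case (word, lastLine) => (lastLine, word) } // same as .swap, by the way
  .group
  .join(alicePipe.swap.group)
wordWithLastLine.dump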
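// The three steps listed earlier, sketched on plain Scala collections so they
// can be checked without Scalding. `sampleText` is made-up illustration data.
val sampleText = List(("the white rabbit", 0), ("the queen", 1), ("a rabbit hole", 2))
// 1. (flat)map text to (word, lineno) pairs
val sampleWordLine = sampleText.flatMap { case (text, line) =>
  text.split("\\s+").toList.map { word => (word, line) }
}
// 2. for each word, take the maximum line number
val sampleLastLine = sampleWordLine
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, pairs.map(_._2).max) }
// 3. join the line number back to the original input
val sampleLineText = sampleText.map(_.swap).toMap
val sampleJoined = sampleLastLine.map { case (word, line) =>
  (word, (line, sampleLineText(line)))
}
// sampleJoined("rabbit") is (2, "a rabbit hole"); sampleJoined("the") is (1, "the queen")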
/** That's it.
 * You have learned the basics:
 * TypedPipe, map/flatMap/filter
 * groups do reduce/join: max, sum, join, take, sortBy
 */

harcek commented Mar 31, 2015

As of the Scalding 0.13.1 release, the way to start the REPL in local mode is:

> ./sbt "scalding-repl/run --local"

Doesn't impact the code, but for people following along, the URL for the book has changed:
