@epishkin
Last active August 29, 2015 14:11
Examples for Optimizing Scalding Jobs
// 2 m/r jobs :-(
.unique('item_id_from, 'item_id_to, 'user_id) // 1st m/r
.groupBy('item_id_from, 'item_id_to) { _.size('count) } // 2nd m/r
// 1 m/r job but more code
.map('user_id -> 'user_id) { id: String => Set(id) }
.groupBy('item_id_from, 'item_id_to) {
  _.sum[Set[String]]('user_id)
}
.map('user_id -> 'count) { ids: Set[String] => ids.size }
// 1 m/r job :-)
.groupBy('item_id_from, 'item_id_to) {
  _.mapReduceMap[String, Set[String], Int]('user_id -> 'count)(Set(_))(_ ++ _)(_.size)
}
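The three steps passed to mapReduceMap above (wrap each user_id in a Set, union the Sets, take the size) can be tried out without a cluster. Below is a minimal plain-Scala sketch of the same logic on an in-memory collection. It uses only the standard library, not Scalding; the object name, the function `distinctUsersPerPair`, and the sample rows are hypothetical and exist only for illustration.

```scala
// Plain-Scala sketch of the map / reduce / map steps (no Scalding involved).
// All names and sample data here are hypothetical.
object MapReduceMapSketch {
  // (item_id_from, item_id_to, user_id)
  type Row = (String, String, String)

  def distinctUsersPerPair(rows: Seq[Row]): Map[(String, String), Int] =
    rows
      .groupBy { case (from, to, _) => (from, to) }  // group by the item pair
      .map { case (key, group) =>
        val count = group
          .map { case (_, _, user) => Set(user) }    // map: wrap each user_id in a Set
          .reduce(_ ++ _)                            // reduce: union the Sets
          .size                                      // final map: number of distinct users
        key -> count
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("a", "b", "u1"),
      ("a", "b", "u1"), // duplicate user for the same pair
      ("a", "b", "u2"),
      ("a", "c", "u1")
    )
    println(distinctUsersPerPair(rows))
  }
}
```

Because Set union is associative and commutative, the reduce step can be applied to partial results in any order, which is what lets the whole computation fit into a single grouping step.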
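// 1 m/r job for big numbers (HLL); the count is approximate
.groupBy('item_id_from, 'item_id_to) {
  // presumably a project helper that provides the implicit String => Array[Byte]
  // conversion that approximateUniqueCount needs
  import util.utf8
  _.approximateUniqueCount[String]('user_id -> 'count)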
}
.map('count -> 'count) { count: Double => count.toLong }
def readTaps = ParquetAvroSource.project[Tap](tapsInput,
  Projection[Tap]("campaign_id", "action_id", "header.device.storage_key")).read