@zezutom
Last active December 20, 2015 20:57
A list of commonly used words as a broadcast variable
class TextAnalyser(val sc: SparkContext, ...) {
...
// Instance variable
val _commonWords = sc.broadcast(TextAnalyser.loadCommonWords())
...
// In a worker thread
def analyse(rdd: RDD[String]): TextStats = {
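// Copy the broadcast handle into a local val so the closure captures only the handle,
// not the enclosing TextAnalyser instance, which would otherwise have to be serialized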
val commonWords = _commonWords
...
// Like an accumulator, a Broadcast is a wrapper; its 'value' method provides access to the actual data.
.filter(!commonWords.value.contains(_)) // Filter out all too common words
...
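
The elided parts above can be filled in many ways. The following is a minimal, self-contained sketch of the same pattern, not the original implementation: the `TextStats` fields, the contents of `loadCommonWords()`, the tokenising regex and the `BroadcastDemo` driver are all assumptions made for illustration. Only `sc.broadcast`, the local copy of the handle and the `value`-based filter come from the snippet.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical result type; the original TextStats is not shown.
case class TextStats(totalWords: Long, distinctUncommonWords: Long)

object TextAnalyser {
  // Hypothetical stand-in for the original loader, which is not shown.
  def loadCommonWords(): Set[String] = Set("a", "an", "and", "the", "of", "to")
}

// Not Serializable on purpose: if a closure captured `this`,
// Spark would reject the task as not serializable.
class TextAnalyser(val sc: SparkContext) {

  // Instance variable: the broadcast is created once on the driver.
  private val _commonWords: Broadcast[Set[String]] =
    sc.broadcast(TextAnalyser.loadCommonWords())

  def analyse(rdd: RDD[String]): TextStats = {
    // Local copy of the handle, so the closure below captures only this val.
    val commonWords = _commonWords

    // Assumed tokenisation: lower-case, split on runs of non-word characters.
    val words = rdd
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .cache()

    // Filter out all too common words via the broadcast value.
    val uncommon = words.filter(!commonWords.value.contains(_)).distinct()

    TextStats(words.count(), uncommon.count())
  }
}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("The quick brown fox", "jumps over the lazy dog"))
      println(new TextAnalyser(sc).analyse(lines))
    } finally {
      sc.stop()
    }
  }
}
```

Removing the local `val commonWords = _commonWords` and referring to `_commonWords` directly inside the filter would make the closure reference `this`. With a non-serializable `TextAnalyser`, Spark would then be expected to fail with a "Task not serializable" error, which is the problem the local copy avoids.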