@leelakrishna
Last active January 29, 2021 13:01
Spark Scala - most frequent words
// load the input as an RDD of lines (assumes spark-shell, where `sc` is predefined)
val f = sc.textFile("sample.txt")
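// word count: split each line on runs of whitespace, drop empty tokens, then sum a 1 per word
val wc = f.flatMap(l => l.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _)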
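// swap (word, count) to (count, word) so the count becomes the sort key
val wc_swap = wc.map(_.swap)
// sort by count, descending, into a single partition so the saved output is one ordered file
val hifreq_words = wc_swap.sortByKey(ascending = false, numPartitions = 1)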
hifreq_words.saveAsTextFile("hifreq_words")
// take the 20 most frequent words as a local array on the driver
val top20 = hifreq_words.take(20)
// convert the array back to an RDD so it can be written with saveAsTextFile
val top20rdd = sc.parallelize(top20)
top20rdd.saveAsTextFile("hifreq_top20")
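
// A minimal local sketch of the same pipeline using plain Scala collections,
// which may be handy for checking the logic on a small input without Spark.
// Assumptions (not part of the original gist): the helper name `topWords`,
// splitting on runs of whitespace, dropping empty tokens, and breaking ties
// between equal counts alphabetically.
def topWords(lines: Seq[String], n: Int): Seq[(Int, String)] =
  lines
    .flatMap(_.split("\\s+"))                         // split each line into words
    .filter(_.nonEmpty)                               // drop the empty token left by leading whitespace
    .groupBy(identity)                                // word -> all of its occurrences
    .toSeq                                            // leave the Map so equal counts do not collide as keys
    .map { case (word, occ) => (occ.size, word) }     // (count, word), same shape as wc_swap
    .sortBy { case (count, word) => (-count, word) }  // descending count, then word
    .take(n)

// Usage on an in-memory sample: "a" occurs 3 times, "b" twice, "c" once.
val sampleTop = topWords(Seq("a b a", "b a c"), 2)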