@arnesund
Created July 13, 2015 13:55
Word count of Twitter hashtags using Apache Spark
# Count the number of occurrences for each hashtag,
# by first extracting the hashtag and lowercasing it,
# then do a standard word count with map and reduceByKey
countsRDD = (filteredTweetsRDD
             .flatMap(lambda tweet: [hashtag['text'].lower() for hashtag in tweet['entities']['hashtags']])
             .map(lambda tag: (tag, 1))
             .reduceByKey(lambda a, b: a + b)
             )
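# Get the 20 most used hashtags (order countsRDD descending by count).
# The key function takes a single (tag, count) tuple; tuple-unpacking
# lambdas such as `lambda (key, value): ...` are a syntax error in Python 3.
countsRDD.takeOrdered(20, key=lambda kv: -kv[1])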
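The same counting logic can be checked locally without a Spark cluster. The following is a minimal plain-Python sketch, not the Spark job itself: `sample_tweets`, `count_hashtags`, and `top_hashtags` are hypothetical names introduced here, and the sample records only assume the `entities` / `hashtags` / `text` shape that the snippet above reads.

```python
from collections import Counter
import heapq

# Hypothetical sample records; only the fields read by the Spark snippet are present.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Spark"}, {"text": "BigData"}]}},
    {"entities": {"hashtags": [{"text": "spark"}]}},
    {"entities": {"hashtags": []}},
    {"entities": {"hashtags": [{"text": "bigdata"}, {"text": "Python"}, {"text": "SPARK"}]}},
]


def count_hashtags(tweets):
    """Local stand-in for flatMap + map + reduceByKey: count lowercased hashtags."""
    counts = Counter()
    for tweet in tweets:
        for hashtag in tweet["entities"]["hashtags"]:
            counts[hashtag["text"].lower()] += 1
    return counts


def top_hashtags(counts, n):
    """Local stand-in for takeOrdered(n, key=...): the n smallest items by negated count."""
    return heapq.nsmallest(n, counts.items(), key=lambda kv: -kv[1])


counts = count_hashtags(sample_tweets)
top = top_hashtags(counts, 2)
print(top)
```

Lowercasing before counting is what merges `Spark`, `spark`, and `SPARK` into one key, and negating the count in the key function turns the ascending order of `takeOrdered` into a descending one.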