Created July 13, 2015 13:55
Word count of Twitter hashtags using Apache Spark
# Count the number of occurrences for each hashtag
# by first extracting the hashtag and lowercasing it,
# then doing a standard word count with map and reduceByKey
countsRDD = (filteredTweetsRDD
    .flatMap(lambda tweet: [hashtag['text'].lower() for hashtag in tweet['entities']['hashtags']])
    .map(lambda tag: (tag, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Get the 20 most used hashtags (order countsRDD descending by count).
# The key function indexes into the (hashtag, count) pair instead of using
# tuple-parameter unpacking, which was removed in Python 3.
countsRDD.takeOrdered(20, key=lambda pair: -pair[1])
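
To check the counting logic without a Spark cluster, the same steps can be sketched in plain Python. This is only a local stand-in, not the Spark job itself: `sample_tweets`, `count_hashtags` and `top_hashtags` are hypothetical names, and the sample data is made up to mimic the `entities.hashtags` structure the snippet above reads from.

```python
from collections import Counter

# Hypothetical sample data (not real tweets), shaped like the parts of a
# tweet dict that the Spark snippet accesses.
sample_tweets = [
    {'entities': {'hashtags': [{'text': 'Spark'}, {'text': 'BigData'}]}},
    {'entities': {'hashtags': [{'text': 'spark'}]}},
    {'entities': {'hashtags': []}},
    {'entities': {'hashtags': [{'text': 'Python'}, {'text': 'SPARK'}]}},
]


def count_hashtags(tweets):
    """Local stand-in for the flatMap / map / reduceByKey steps."""
    tags = (hashtag['text'].lower()
            for tweet in tweets
            for hashtag in tweet['entities']['hashtags'])
    return Counter(tags)


def top_hashtags(counts, n):
    """Local stand-in for takeOrdered(n, key=...): highest counts first."""
    return sorted(counts.items(), key=lambda pair: -pair[1])[:n]


counts = count_hashtags(sample_tweets)
print(top_hashtags(counts, 2))
```

Lowercasing before counting is what merges `Spark`, `spark` and `SPARK` into a single key; without it each spelling would get its own count.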