Skip to content

Instantly share code, notes, and snippets.

@arnesund
Created July 13, 2015 12:34
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
Star You must be signed in to star a gist
Save arnesund/2b671d945c73fe655eaa to your computer and use it in GitHub Desktop.
Load tweets into Spark and filter
# Extract tweets from MongoDB
allTweets = []
for doc in db.tweets.find():
allTweets.append(doc['tweet'])
# Load tweets into Spark for analysis
allTweetsRDD = sc.parallelize(allTweets, 8)
# Set up filter to only get tweets from the last week
DAYS_LIMIT=7
limit = datetime.datetime.now() - datetime.timedelta(days=DAYS_LIMIT)
limit_unixtime = time.mktime(limit.timetuple())
# Filter tweets to get rid of those who either have no hashtags or are too old
tweetsWithTagsRDD = allTweetsRDD.filter(lambda t: len(t['entities']['hashtags']) > 0)
filteredTweetsRDD = tweetsWithTagsRDD.filter(lambda t: time.mktime(parser.parse(t['created_at']).timetuple()) > limit_unixtime)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment