Last active June 26, 2016 15:41
Scala solution to David's command-line puzzler for the April 17th (2015) NDG meeting (Description in the file).
import java.io.File
import scala.io.Source

/*
 * Problem description: Given a .tsv file where the first column is a tweet’s text content, the second
 * column is the id of the user who made the tweet, and the third column is the lat lon of the tweet (if
 * any), with a space between the two values, how would you find the most frequently used hashtag for
 * each user who has tweeted more than five times and has at least one tweet from the continental US?
 */

val inputPath: String = "/home/ndg/project/jurgens/command-line-demo/command-line-challenge-input.BIG.tsv"

Source.fromFile(new File(inputPath))
  // Get an iterator over the rows in the .tsv.
  .getLines()
  // Map each row into an Array with the .tsv entries.
  .map(_.split("\t"))
  // Filter out rows without a user Id column.
  .filterNot(_.size < 2)
  // Filter out rows with an empty user Id.
  .filterNot(_(1).trim == "")
  // Convert to a list so we can call groupBy.
  .toList
  // Group the rows according to user Id.
  .groupBy(_(1))
  // Keep users with more than 5 tweets.
  .filter(_._2.size > 5)
  // Keep users with at least one tweet inside the lat/lon bounding box used
  // here for the continental U.S.
  .filter(_._2.exists(v => v.size > 2 &&
    24.3115 < v(2).split(" ")(0).toDouble &&
    v(2).split(" ")(0).toDouble < 49.2341 &&
    -124.626080 < v(2).split(" ")(1).toDouble &&
    v(2).split(" ")(1).toDouble < -62.361014))
  // Map the rows for each user to that user's hashtag counts.
  .mapValues(_
    // Map each row to the hashtags used in the tweet text.
    // (Tokenization performed using single space separation).
    .map(_(0).split(" ").filter(_.startsWith("#")))
    // Combine all the hashtags used by a user in their individual
    // tweets.
    .flatten
    // Group the hashtags by unique hashtags.
    .groupBy(ht => ht)
    // Map the hashtag groupings to the number of occurrences.
    .mapValues(_.size))
  // Drop users who never used a hashtag, since maxBy throws on an empty
  // collection.
  .filter(_._2.nonEmpty)
  // For each remaining user, take the highest (occurrences, hashtag) pair
  // and keep its hashtag. (If there's a tie, the hashtag that sorts last
  // lexicographically is returned.)
  .mapValues(_.maxBy(_.swap)._1)
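
The hashtag-counting step can be tried on its own, without the input file. The sketch below is only an illustration: `topHashtag` and the sample tweets are hypothetical names and data, not part of the original script. It returns an `Option` so that a user with no hashtags yields `None`, and it uses the same `maxBy(_.swap)` comparison, so a tie in counts goes to the hashtag that sorts last lexicographically.

```scala
// Hypothetical helper (not part of the original script): returns the most
// frequent hashtag across a user's tweets, or None if no hashtag was used.
def topHashtag(tweets: Seq[String]): Option[String] = {
  val counts: Map[String, Int] = tweets
    // Tokenize on single spaces and keep the tokens that start with '#'.
    .flatMap(_.split(" ").filter(_.startsWith("#")).toSeq)
    // Count the occurrences of each distinct hashtag.
    .groupBy(identity)
    .map { case (tag, occurrences) => (tag, occurrences.size) }
  // Compare (count, hashtag) pairs: the highest count wins, and ties go to
  // the hashtag that sorts last lexicographically.
  if (counts.isEmpty) None else Some(counts.maxBy(_.swap)._1)
}

// Made-up sample tweets: #scala and #spark both occur twice.
val sample = Seq("loving #scala and #spark", "more #scala today", "#spark again")
println(topHashtag(sample))               // prints Some(#spark)
println(topHashtag(Seq("no tags here")))  // prints None
```

Returning an `Option` is one possible design choice here; filtering out users without hashtags before taking the maximum would work as well.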