@ignasi35
Last active August 29, 2015 14:18
// see http://www.meetup.com/Spark-Barcelona/events/220256138/
lazy val file1 = sc.textFile("smalldataset/pa*000000")
lazy val file123 = sc.textFile("smalldataset/pa*")
lazy val countries = sc.textFile("smalldataset/coun*")
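// ex1: number of distinct prefixes (the part of the first field before the first '.') in a single file
val ex1 = file1.map(_.split("\\s")).map(pieces => pieces(0).split("\\.")(0)).distinct().count()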
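// ex2: for lines starting with "en" that contain " Kayak " as a whole field,
// sum the third field (parsed as Int) per prefix of the first field
val ex2Filtered = file123.filter(line => line.startsWith("en") && line.contains(" Kayak "))
val ex2tupled = ex2Filtered.map(_.split("\\s")).map { pieces => (pieces(0).split("\\.")(0), pieces(2).toInt) }
val ex2 = ex2tupled.reduceByKey(_ + _).collect().mkString(",")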
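// The 'countries' file has human-readable names, while wikimedia links have
// canonicalized names (replacing ' ' with '_', presumably for URL safety).
// Parse and canonicalize the countries.
// Note: split("\\s")(0) keeps only the first whitespace-separated token, so the
// replaceAll never sees a space and multi-word names are cut to their first word.
val countryName = countries.map { line => (line.split("\\s")(0).replaceAll(" ", "_"), 1) }
// Only country names in English are of interest, so keep just the "en" lines (hack ;-) )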
val countryVisits = file123.filter { line => line.startsWith("en") }.map { line => line.split("\\s") }.map { xs => (xs(1), xs(2).toInt) }
val preCountries = countryName.join(countryVisits)
val ex3 = preCountries.map { case (k, v) => (k, v._2) }.reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10).mkString(",")
println(ex1) // 328
println(ex2) // (en,35)
println(ex3) // (Indonesia,1423),(Canada,1154),(India,1084),(Australia,1074),(China,1024),(Germany,924),(Japan,893),(Singapore,883),(Israel,882),(Cuba,859)
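// A minimal sketch, not part of the original exercise: one way to canonicalize
// multi-word country names so they can match the '_'-separated page titles.
// Assumption: each line of the 'countries' file holds just the country name.
// `canonicalize` is a hypothetical helper, not something Spark or the dataset provides.
def canonicalize(name: String): String = name.trim.replaceAll("\\s+", "_")
// Possible usage (left commented out because it would change the ex3 figures above):
// val countryNameFull = countries.map(line => (canonicalize(line), 1)).distinct()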