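import org.apache.hadoop.conf.Configuration

val path = "/public/yelp-dataset/yelp_review.csv"
// Work on a copy of the Hadoop configuration so the custom delimiter does not affect other reads in this session.
val conf = new Configuration(sc.hadoopConfiguration)
// Split records on "\r" instead of the default newline, so a review whose text spans several lines stays in one record.
conf.set("textinputformat.record.delimiter", "\r")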
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
// Read the file through the new Hadoop API; each element is a (byte offset, record text) pair.
val yelpReview = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
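// Total number of records found with the "\r" delimiter.
yelpReview.count
// Preview the first 10 records as plain strings.
yelpReview.map(r => r._2.toString).take(10).foreach(println)
// Count records by number of fields (split on the "," between quoted values); more than one distinct count points to malformed records.
yelpReview.map(r => (r._2.toString.split("\",\"").size, 1)).reduceByKey(_ + _).collect.foreach(println)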