@dmpetrov
Created March 6, 2017 05:31
Read the Reddit comments dataset into Spark
// Code for the blog post:
// https://fullstackml.com/2015/11/24/where-to-find-terabyte-size-dataset-for-machine-learning/
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
// Load the raw TSV dump of Reddit comments (May 2015) as an RDD of lines.
val fileName = "reddit-May2015.tsv"
val textFile = sc.textFile(fileName)

// Split each line on tabs, keep only well-formed rows with exactly 22 fields,
// and wrap each one in a Row so it can be converted into a DataFrame.
val rdd = textFile.map(_.split("\t")).filter(_.length == 22).map { p =>
  Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9),
      p(10), p(11), p(12), p(13), p(14), p(15), p(16), p(17), p(18), p(19),
      p(20), p(21))
}
// Column names for the 22 fields; every column is typed as a string for simplicity.
val schemaString = "created_utc,ups,subreddit_id,link_id,name,score_hidden,author_flair_css_class,author_flair_text,subreddit,id,removal_reason,gilded,downs,archived,author,score,retrieved_on,body,distinguished,edited,controversiality,parent_id"
val schema = StructType(
  schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true)))
// Build the DataFrame from the RDD of Rows and the schema, then preview it.
val df = sqlContext.createDataFrame(rdd, schema)
df.show()
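
// A follow-up sketch (not part of the original gist): since every column above is
// declared as StringType, numeric fields such as "score" need an explicit cast
// before aggregation. Column names ("subreddit", "score") come from schemaString.
val topSubreddits = df
  .withColumn("score_int", col("score").cast(IntegerType))
  .groupBy("subreddit")
  .agg(count("*").as("comments"), avg("score_int").as("avg_score"))
  .orderBy(desc("comments"))
topSubreddits.show(10)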