Spark wholeTextFiles
import org.apache.spark.{SparkConf, SparkContext}

// Run Spark locally in a single JVM
val sparkConf = new SparkConf().setMaster("local").setAppName("text")
val sc = new SparkContext(sparkConf)
val hadoopConf = sc.hadoopConfiguration
// Register the filesystem for the s3n:// scheme used below and set the AWS credentials (replace the placeholders)
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "youraccesskeyid")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secretkey")
// wholeTextFiles yields one (path, fullContents) pair per file; split each file into lowercase words
val docs = sc.wholeTextFiles("s3n://files.sparks.public/data/enwiki_category_text/part-00000").map { case (name, contents) =>
  (name, contents.replaceAll("[^A-Za-z']+", " ").trim.toLowerCase.split("\\s+"))
}
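// println on an Array prints only its JVM reference, so print each (path, words) pair explicitly
docs.collect().foreach { case (name, words) => println(s"$name: ${words.mkString(" ")}") }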
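
// A minimal sketch of the tokenization step above, runnable without Spark or S3.
// `tokenize` and `sample` are hypothetical names introduced for illustration only;
// the snippet above inlines the same expression inside the map.
def tokenize(text: String): Array[String] =
  // Replace every run of characters other than letters and apostrophes with one space,
  // trim the ends, lowercase, then split on whitespace
  text.replaceAll("[^A-Za-z']+", " ").trim.toLowerCase.split("\\s+")

val sample = "Hello, World! It's 2024."
println(tokenize(sample).mkString(", ")) // prints: hello, world, it's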