@phatak-dev
Last active November 14, 2021 23:46
Spark wholeTextFiles
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setMaster("local").setAppName("text")
val sc = new SparkContext(sparkConf)
val hadoopConf = sc.hadoopConfiguration
// Set the AWS credentials for the s3n:// scheme (replace the placeholder values below)
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId","youraccesskeyid")
hadoopConf.set("fs.s3n.awsSecretAccessKey","secretkey")
// Read each file as a (path, contents) pair, then strip non-letters, lowercase, and split into words
val docs = sc.wholeTextFiles("s3n://files.sparks.public/data/enwiki_category_text/part-00000").map { case (name, contents) =>
  (name, contents.replaceAll("[^A-Za-z']+", " ").trim.toLowerCase.split("\\s+"))
}
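// collect() returns an Array, so print each (file name, words) pair instead of the array reference
docs.collect().foreach { case (name, words) => println(name + ": " + words.mkString(" ")) }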
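
The cleanup applied inside the map step can be tried locally, without Spark or S3 access. The following is a minimal sketch: `TokenizeSketch`, `tokenize`, and the sample string are hypothetical names introduced here for illustration and are not part of the original gist; only the regular expressions are taken from it.

```scala
// Hypothetical helper (not in the original gist) that isolates the
// per-file cleanup so it can be checked on a plain string.
object TokenizeSketch {
  // Replace every run of characters other than letters and apostrophes
  // with a single space, trim, lowercase, then split on whitespace.
  def tokenize(contents: String): Array[String] =
    contents.replaceAll("[^A-Za-z']+", " ").trim.toLowerCase.split("\\s+")

  def main(args: Array[String]): Unit = {
    val sample = "Spark's wholeTextFiles: reads 2 files!"
    // prints: spark's, wholetextfiles, reads, files
    println(tokenize(sample).mkString(", "))
  }
}
```

Note that digits are dropped along with punctuation, and that an empty or letter-free input yields an array containing one empty string, so a downstream word count may want to filter out empty tokens.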