@holacode
Last active November 1, 2015 10:31
Reading a file from S3 using Spark: jars and config for spark-shell
Adding multiple jars to the Spark classpath
# Build a comma-separated jar list and pass it to the --jars flag
./spark-shell --jars $(echo ~/lib/*.jar | tr ' ' ',')
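The one-liner above splits on spaces, so it breaks if a jar path contains one, and it also picks up the -sources and -javadoc jars. A minimal sketch of a more defensive variant follows; `jar_list` is a hypothetical helper name, not part of Spark, and the `~/lib` location is carried over from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join the jars found in a directory with commas,
# skipping -sources and -javadoc jars, which are not needed at runtime.
jar_list() {
  local dir="$1" out="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue              # glob matched nothing: no jars here
    case "$jar" in
      *-sources.jar|*-javadoc.jar) continue ;;
    esac
    out="${out:+$out,}$jar"                # add a comma only between entries
  done
  printf '%s\n' "$out"
}

# Usage (assumes the jars live in ~/lib, as in the command above):
# ./spark-shell --jars "$(jar_list ~/lib)"
```

Quoting the command substitution keeps the whole list as a single argument to --jars.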
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId","yourID")
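sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey","yourSecretAccessKey")
// S3 URIs need two slashes followed by the bucket name
val input = sc.textFile("s3://yourbucket/pathtoyourcsv")
// Some custom processing: key each line by its second comma-separated field
val pairs = input.map(x => (x.split(",")(1), x))
// Should print a (second field, full line) pair for the first line of the CSV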
pairs.first
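As an alternative to the three `sc.hadoopConfiguration.set` calls, the same properties can probably be supplied when launching the shell, since Spark copies `spark.hadoop.*` properties into the Hadoop configuration. The sketch below only assembles the arguments. `s3_conf_args` is a hypothetical helper name, and it assumes the credentials are already exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print spark-shell arguments, one per line, that mirror
# the sc.hadoopConfiguration.set calls above.
# Assumption: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the environment.
s3_conf_args() {
  : "${AWS_ACCESS_KEY_ID:?set AWS_ACCESS_KEY_ID first}"
  : "${AWS_SECRET_ACCESS_KEY:?set AWS_SECRET_ACCESS_KEY first}"
  printf '%s\n' \
    --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem" \
    --conf "spark.hadoop.fs.s3.awsAccessKeyId=$AWS_ACCESS_KEY_ID" \
    --conf "spark.hadoop.fs.s3.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY"
}

# Usage (bash 4+ for mapfile):
# mapfile -t s3_args < <(s3_conf_args)
# ./spark-shell "${s3_args[@]}" --jars $(echo ~/lib/*.jar | tr ' ' ',')
```

Note that values passed on the command line are visible in the process list, so setting them inside the shell session, as above, may still be preferable on shared machines.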
## List of jars required for accessing S3 from Spark; the -sources and -javadoc jars can be omitted.
aspectjrt.jar commons-codec-1.6.jar httpclient-4.3.6.jar joda-time-2.8.1.jar
aspectjweaver.jar commons-logging-1.1.3.jar httpcore-4.3.3.jar spring-beans-3.0.7.jar
aws-java-sdk-1.10.30.jar freemarker-2.3.18.jar jackson-annotations-2.5.3.jar spring-context-3.0.7.jar
aws-java-sdk-1.10.30-javadoc.jar guava-18.0.jar jackson-core-2.5.3.jar spring-core-3.0.7.jar
aws-java-sdk-1.10.30-sources.jar hadoop-aws-2.6.0.jar jackson-databind-2.5.3.jar
aws-java-sdk-flow-build-tools-1.10.30.jar hadoop-common-2.6.0.jar javax.mail-api-1.4.6.jar