Created October 14, 2015 20:27
Set up Apache Spark 1.5+ with Hadoop 2.6+ s3a
# For a local environment
# Install hadoop and apache-spark via homebrew
# Apache Spark conf file
# libexec/conf/spark-defaults.conf
# Make the AWS jars available to Spark
spark.executor.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
# Add file
# libexec/conf/hdfs-site.xml
<?xml version="1.0"?>
cfeduke commented Dec 20, 2015

A couple of important things:

aws-java-sdk-1.7.4 is required as of Dec 19, 2015. Even though it's a 2014-built JAR, the API changed in later releases.[1]
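
Since the exact SDK version matters, it can help to check which jar versions a classpath string actually names. The following is only a rough sketch: the `jar_versions` helper is hypothetical, and it assumes jar file names follow the usual `name-version.jar` pattern. The classpath is the one from the `spark-defaults.conf` lines in the gist.

```python
import re

# Classpath copied from the spark-defaults.conf lines in the gist.
LIB = "/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib"
CLASSPATH = f"{LIB}/aws-java-sdk-1.7.4.jar:{LIB}/hadoop-aws-2.7.1.jar"


def jar_versions(classpath):
    """Map each jar's artifact name to the version parsed from its file name.

    Hypothetical helper; assumes names look like <artifact>-<version>.jar.
    """
    versions = {}
    for entry in classpath.split(":"):
        filename = entry.rsplit("/", 1)[-1]
        match = re.fullmatch(r"(.+)-(\d+(?:\.\d+)*)\.jar", filename)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


versions = jar_versions(CLASSPATH)
print(versions)
```

If `versions["aws-java-sdk"]` is anything other than `1.7.4`, the classpath is probably picking up a different SDK build than the one described here.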

The S3A properties can be set in the HDFS configuration:


Or they can be set like this, where sc is an instance of SparkContext:

sc.hadoopConfiguration.set("fs.s3a.impl", "see above")
// ...
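
For PySpark users, the same idea can be sketched as a small helper that pushes a dictionary of properties into any object exposing `set(key, value)`. This is an illustration under assumptions, not a definitive recipe: `apply_hadoop_settings` and `RecordingConf` are hypothetical names, and the property value is left as the same placeholder used above rather than a real class name.

```python
def apply_hadoop_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry, passing values as strings."""
    for key, value in settings.items():
        hadoop_conf.set(key, str(value))
    return hadoop_conf


class RecordingConf:
    """Stand-in for a Hadoop Configuration so the sketch runs without Spark."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


# Placeholder value: substitute whatever your HDFS configuration uses.
settings = {"fs.s3a.impl": "see above"}
conf = apply_hadoop_settings(RecordingConf(), settings)
print(conf.values)
```

In a real PySpark session the object to pass in is usually obtained with `sc._jsc.hadoopConfiguration()`; note that `_jsc` is a private attribute, so treat that access path as an assumption that may change between Spark versions.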

If you encounter an NPE while trying to write to an S3A URL:

re: Lost task 1.3 in stage 4.0 (TID 42, x.x.x.x): java.lang.NullPointerException
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(

then additional configuration is needed for the temporary scratch space. If your files are small enough, you can instead just set:

sc.hadoopConfiguration.set("", "true")

as this bypasses the temporary local disk scratch space.[2]


I've tried to leave an even more concise version of what you need on my own gist here:

Thanks for this.
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
also works
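
The `--packages` value is a comma-separated list of Maven coordinates in `group:artifact:version` form. A minimal sketch of assembling that command line, assuming the two coordinates quoted above (`packages_arg` is a hypothetical helper, not part of Spark):

```python
def packages_arg(coordinates):
    """Join Maven coordinates into the comma-separated form --packages expects."""
    for coord in coordinates:
        parts = coord.split(":")
        # Each coordinate needs exactly group, artifact, and version.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected group:artifact:version, got {coord!r}")
    return ",".join(coordinates)


cmd = [
    "pyspark",
    "--packages",
    packages_arg([
        "com.amazonaws:aws-java-sdk-pom:1.10.34",
        "org.apache.hadoop:hadoop-aws:2.6.0",
    ]),
]
print(" ".join(cmd))
```

Building the argument as a list makes it easy to hand to `subprocess.run(cmd)` later without worrying about shell quoting.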

