Set up Apache Spark 1.5+ with Hadoop 2.6+ s3a
# For a local environment
# Install hadoop and apache-spark via homebrew
# Apache Spark conf file
# libexec/conf/spark-defaults.conf
# Make the AWS jars available to Spark
spark.executor.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
# Add file
# libexec/conf/hdfs-site.xml
# http://stackoverflow.com/questions/30262567/unable-to-load-aws-credentials-when-using-spark-sql-through-beeline
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>
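
With the AWS JARs on the classpath and the keys in hdfs-site.xml, s3a:// URLs should work directly from spark-shell. A quick smoke test, sketched below under the assumption of a hypothetical bucket `my-bucket` containing an object `data.txt` (replace with something you can actually read; this needs a live Spark install, so treat it as a sketch rather than a verified program):

```scala
// In spark-shell, where sc is the pre-built SparkContext.
// s3a://my-bucket/data.txt is a hypothetical object -- substitute your own.
val lines = sc.textFile("s3a://my-bucket/data.txt")
// A wrong classpath or missing credentials typically surfaces here
// as a ClassNotFoundException or an AWS credentials error.
println(lines.count())
```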
@cfeduke commented Dec 20, 2015

A couple of important things:

aws-java-sdk-1.7.4 is required as of Dec 19, 2015. Even though it's a JAR built in 2014, the SDK's API changed in later releases, so newer versions are incompatible.[1]

The Hadoop/HDFS configuration can also be set in spark.properties:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY

Or they can be set like this, where sc is an instance of SparkContext:

sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESSKEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRETKEY")

If you encounter an NPE while trying to write to an S3A URL:

re: Lost task 1.3 in stage 4.0 (TID 42, x.x.x.x): java.lang.NullPointerException
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)

then additional configuration is needed for the temporary scratch space. If your files are small enough you can instead just:

sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")

as this bypasses the temporary local disk scratch space.[2]
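
If you would rather keep this out of application code, the same setting can presumably be placed in spark-defaults.conf using the spark.hadoop. prefix, mirroring the properties shown earlier (untested here, but it follows the same prefix convention):

# libexec/conf/spark-defaults.conf
spark.hadoop.fs.s3a.fast.upload true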

  1. https://issues.apache.org/jira/browse/HADOOP-12420
  2. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
@chicagobuss commented Mar 1, 2016

Thanks for this - I struggled with the same thing for over a week until I found this gist. I've tried to leave an even more concise version of what you need on my own gist here: https://gist.github.com/chicagobuss/6557dbf1ad97e5a09709

@jobonilla commented Sep 7, 2016

Thanks for this.
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
also works
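
Note that the package versions should match the local Hadoop build (2.7.1 in the Homebrew paths above). Assuming a 2.7.1 install, the equivalent invocation would presumably be the following, since hadoop-aws 2.7.1 was built against aws-java-sdk 1.7.4:

pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1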
