@thekensta
Created October 14, 2015 20:27
Set up Apache Spark 1.5+ with Hadoop 2.6+ s3a
# For a local environment
# Install hadoop and apache-spark via homebrew
# Apache Spark conf file
# libexec/conf/spark-defaults.conf
# Make the AWS jars available to Spark
spark.executor.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
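Both classpath lines above carry the same pair of JAR paths, and the version numbers in them depend on the local Homebrew install. A minimal sketch of generating the two lines, assuming the JARs sit under `share/hadoop/tools/lib` as in the paths above. The helper names `aws_classpath` and `spark_defaults_lines` are hypothetical, not part of Spark or Hadoop:

```python
from pathlib import Path


def aws_classpath(hadoop_libexec):
    """Return a ':'-joined classpath of the AWS SDK and hadoop-aws JARs.

    hadoop_libexec is the Hadoop libexec directory, for example
    /usr/local/Cellar/hadoop/2.7.1/libexec (the layout used above).
    """
    lib_dir = Path(hadoop_libexec) / "share" / "hadoop" / "tools" / "lib"
    jars = []
    for pattern in ("aws-java-sdk-*.jar", "hadoop-aws-*.jar"):
        matches = sorted(lib_dir.glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no {pattern} found in {lib_dir}")
        jars.extend(matches)
    return ":".join(str(jar) for jar in jars)


def spark_defaults_lines(hadoop_libexec):
    """Build the executor and driver classpath lines for spark-defaults.conf."""
    classpath = aws_classpath(hadoop_libexec)
    return [
        f"spark.executor.extraClassPath {classpath}",
        f"spark.driver.extraClassPath {classpath}",
    ]
```

Printing the result of `spark_defaults_lines("/usr/local/Cellar/hadoop/2.7.1/libexec")` should give lines of the same shape as the two above, provided that directory exists locally.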
# Add file
# libexec/conf/hdfs-site.xml
# http://stackoverflow.com/questions/30262567/unable-to-load-aws-credentials-when-using-spark-sql-through-beeline
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.s3a.access.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>xxx</value>
</property>
</configuration>
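The file above has a fixed shape: one `<property>` element per setting. As a rough sketch, it could be generated rather than edited by hand. `build_s3a_site_xml` is a hypothetical helper, and where the key values come from (for instance environment variables) is left to the caller as an assumption:

```python
import xml.etree.ElementTree as ET


def build_s3a_site_xml(access_key, secret_key):
    """Render a Hadoop configuration document with the two s3a credential properties."""
    configuration = ET.Element("configuration")
    settings = (
        ("fs.s3a.access.key", access_key),
        ("fs.s3a.secret.key", secret_key),
    )
    for name, value in settings:
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ElementTree escapes characters such as & and < inside the values.
    return '<?xml version="1.0"?>\n' + ET.tostring(configuration, encoding="unicode")
```

The returned string can then be written to `libexec/conf/hdfs-site.xml`.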
@cfeduke

cfeduke commented Dec 20, 2015

A couple of important things:

aws-java-sdk-1.7.4 is required as of Dec 19, 2015. Even though it's a JAR built in 2014, it is the one to use, because the SDK's API changed in later releases.[1]

In spark.properties, the Hadoop configuration can be set by prefixing each key with spark.hadoop.:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY

Or they can be set like this, where sc is an instance of SparkContext:

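sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESSKEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRETKEY")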
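For PySpark users, the two approaches above can be sketched in Python. The helper names here are hypothetical. The only thing assumed about the configuration object is a `set(name, value)` method, which Hadoop's `Configuration` provides. In PySpark that object is commonly reached through `sc._jsc.hadoopConfiguration()`, which is a private attribute rather than a documented API, so treat that part as an assumption:

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3a_settings(access_key, secret_key):
    """Collect the three Hadoop properties used in the comment above."""
    return {
        "fs.s3a.impl": S3A_IMPL,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }


def spark_hadoop_properties(settings):
    """Prefix Hadoop keys with 'spark.hadoop.' for use in a Spark properties file."""
    return [f"spark.hadoop.{key}={value}" for key, value in settings.items()]


def apply_to_hadoop_conf(hadoop_conf, settings):
    """Set each property on an object exposing set(name, value)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


# Hypothetical usage inside a PySpark shell, where sc is the SparkContext:
#   apply_to_hadoop_conf(sc._jsc.hadoopConfiguration(),
#                        s3a_settings("ACCESSKEY", "SECRETKEY"))
```

Keeping the settings in one dictionary means the properties-file form and the runtime form cannot drift apart.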

If you encounter an NPE while trying to write to an S3A URL:

re: Lost task 1.3 in stage 4.0 (TID 42, x.x.x.x): java.lang.NullPointerException
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)

then additional configuration is needed for the temporary scratch space. If your files are small enough, you can instead just set:

sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")

as this bypasses the temporary local disk scratch space.[2]
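The choice described above can be sketched as a small helper that returns the Hadoop settings to apply. `s3a_upload_settings` is a hypothetical name, and treating `fs.s3a.buffer.dir` as the scratch-space property is an assumption based on the hadoop-aws documentation [2], so check it against your Hadoop version:

```python
import os
import tempfile


def s3a_upload_settings(small_files, buffer_dir=None):
    """Pick either fast upload or an explicit local scratch directory."""
    if small_files:
        # Bypasses the local disk scratch space, as described above.
        return {"fs.s3a.fast.upload": "true"}
    if buffer_dir is None:
        # Assumed default location; any writable local directory would do.
        buffer_dir = os.path.join(tempfile.gettempdir(), "s3a")
    os.makedirs(buffer_dir, exist_ok=True)
    return {"fs.s3a.buffer.dir": buffer_dir}
```

Each key in the returned dictionary can be passed to `hadoopConfiguration.set` in the same way as the credentials.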

  1. https://issues.apache.org/jira/browse/HADOOP-12420
  2. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

@chicagobuss

Thanks for this - I struggled with the same thing for over a week until I found this gist. I've tried to leave an even more concise version of what you need on my own gist here: https://gist.github.com/chicagobuss/6557dbf1ad97e5a09709

@jobonilla

Thanks for this. The following also works:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
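`--packages` takes a comma-separated list of Maven coordinates in `group:artifact:version` form. A small sketch of assembling that command line. `pyspark_command` is a hypothetical helper, and the versions in the usage note are simply the ones quoted above:

```python
def pyspark_command(packages):
    """Build the argument list for launching pyspark with extra Maven packages."""
    for coordinate in packages:
        if len(coordinate.split(":")) != 3:
            raise ValueError(f"expected group:artifact:version, got {coordinate!r}")
    return ["pyspark", "--packages", ",".join(packages)]
```

Passing `["com.amazonaws:aws-java-sdk-pom:1.10.34", "org.apache.hadoop:hadoop-aws:2.6.0"]` yields an argument list that can be handed to `subprocess.call`.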
