Set up Apache Spark 1.5+ with Hadoop 2.6+ s3a
# For a local environment
# Install hadoop and apache-spark via homebrew

# Apache Spark conf file
# libexec/conf/spark-defaults.conf

# Make the AWS jars available to Spark
spark.executor.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
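As an alternative sketch (not part of the original setup): Spark forwards any property prefixed with `spark.hadoop.` into the Hadoop configuration, so the S3A credentials can also be kept in `spark-defaults.conf` itself. The key values here are placeholders.

```
# Optional alternative: Spark passes "spark.hadoop."-prefixed properties
# through to the Hadoop configuration, so credentials can live here instead.
# Replace xxx with your real keys.
spark.hadoop.fs.s3a.access.key xxx
spark.hadoop.fs.s3a.secret.key xxx
```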
# Add file
# libexec/conf/hdfs-site.xml
# http://stackoverflow.com/questions/30262567/unable-to-load-aws-credentials-when-using-spark-sql-through-beeline
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>
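Once the jars and credentials are in place, a quick smoke test from `spark-shell` (not in the original gist; the bucket and key below are hypothetical placeholders) might look like:

```scala
// Sketch: verify the s3a setup by reading a small object.
// "my-bucket" and the key are placeholders -- substitute your own.
// `sc` is the SparkContext that spark-shell provides.
val lines = sc.textFile("s3a://my-bucket/some/key.txt")
println(lines.count())
```

If the classpath or credentials are wrong, this is typically where a `ClassNotFoundException` for the S3A filesystem or a credentials error surfaces.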
Thanks for this - I struggled with the same thing for over a week until I found this gist. I've tried to leave an even more concise version of what you need on my own gist here: https://gist.github.com/chicagobuss/6557dbf1ad97e5a09709
Thanks for this.
A couple of important things:

`aws-java-sdk-1.7.4` is required as of Dec 19, 2015. Even though it is a JAR built in 2014, the API changed. [1] In
`spark.properties` the HDFS configurations can be set, or they can be set programmatically, where `sc` is an instance of `SparkContext`.
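The comment's original code snippet was lost in extraction; as a hedged sketch of the programmatic route, assuming an existing `SparkContext` named `sc` and placeholder key values, it could look like:

```scala
// Sketch: set S3A credentials on the Hadoop configuration carried by
// an existing SparkContext. Key values are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", "xxx")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xxx")

// Reads and writes through s3a:// URLs made via this context
// will pick up these credentials.
```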
If you encounter an NPE while trying to write to an S3A URL, additional configuration is needed for the temporary scratch space. If your files are small enough, you can instead write them in a way that bypasses the temporary local disk scratch space. [2]
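The configuration snippet the comment referred to did not survive extraction; as a sketch of one common scratch-space fix (the property name is `fs.s3a.buffer.dir` in Hadoop 2.6+, but the directory shown is a placeholder), an extra entry in `hdfs-site.xml` points S3A at a writable local buffer directory:

```xml
<!-- Sketch: give S3A an explicit, writable local buffer directory.
     /tmp/s3a is a placeholder; use a path that exists on every node. -->
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/tmp/s3a</value>
</property>
```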