# For a local environment
# Install hadoop and apache-spark via homebrew
# Apache Spark conf file
# libexec/conf/spark-defaults.conf
# Make the AWS jars available to Spark
spark.executor.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar
# Add file
# libexec/conf/hdfs-site.xml
# http://stackoverflow.com/questions/30262567/unable-to-load-aws-credentials-when-using-spark-sql-through-beeline
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>
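
The two extraClassPath values above hard-code the Homebrew Cellar path and the jar versions, so they go stale when Hadoop is upgraded. As a rough sketch (not part of the original gist), the lines could be generated from the install directory instead. The helper names `aws_extra_classpath` and `spark_defaults_lines` are hypothetical, and the default versions simply mirror the ones used above.

```python
from pathlib import PurePosixPath


def aws_extra_classpath(hadoop_home, hadoop_version="2.7.1", sdk_version="1.7.4"):
    """Return the colon-separated classpath for the two AWS jars.

    hadoop_home is the Homebrew install directory, for example
    /usr/local/Cellar/hadoop/2.7.1 (an assumption taken from the gist).
    """
    lib_dir = PurePosixPath(hadoop_home) / "libexec" / "share" / "hadoop" / "tools" / "lib"
    jars = [
        lib_dir / f"aws-java-sdk-{sdk_version}.jar",
        lib_dir / f"hadoop-aws-{hadoop_version}.jar",
    ]
    return ":".join(str(jar) for jar in jars)


def spark_defaults_lines(hadoop_home, **versions):
    """Build the two spark-defaults.conf lines (executor and driver)."""
    classpath = aws_extra_classpath(hadoop_home, **versions)
    return [
        f"spark.executor.extraClassPath {classpath}",
        f"spark.driver.extraClassPath {classpath}",
    ]


if __name__ == "__main__":
    # Print lines that can be pasted into libexec/conf/spark-defaults.conf.
    for line in spark_defaults_lines("/usr/local/Cellar/hadoop/2.7.1"):
        print(line)
```

The sketch only builds strings; it does not check that the jars exist at that location.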
cfeduke commented Dec 20, 2015
A couple of important things:

aws-java-sdk-1.7.4 is required as of Dec 19, 2015. Even though it's a 2014-built JAR, the API changed.[1]

In spark.properties the HDFS configurations can be set:
    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key=ACCESSKEY
    spark.hadoop.fs.s3a.secret.key=SECRETKEY

Or they can be set like this, where sc is an instance of SparkContext:
    sc.hadoopConfiguration.set("fs.s3a.impl", "see above")
    // ...

If you encounter an NPE while trying to write to an S3A URL, then additional configuration is needed for the temporary scratch space. If your files are small enough you can instead just set:
    sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")
as this bypasses the temporary local disk scratch space.[2]

chicagobuss commented Mar 1, 2016
Thanks for this - I struggled with the same thing for over a week until I found this gist. I've tried to leave an even more concise version of what you need on my own gist here: https://gist.github.com/chicagobuss/6557dbf1ad97e5a09709

jobonilla commented Sep 7, 2016
Thanks for this.
    pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
also works.
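
For PySpark users, the properties from cfeduke's comment can be collected in one place and handed to a SparkConf. The following is a minimal sketch under a few assumptions: the helper names `s3a_spark_properties` and `build_spark_conf` are hypothetical, and `fs.s3a.buffer.dir` is my guess at the "temporary scratch space" setting the comment alludes to (the comment itself does not name it). It relies on Spark copying any `spark.hadoop.*` property into the Hadoop configuration, which is the same mechanism the spark.properties lines use.

```python
def s3a_spark_properties(access_key, secret_key, fast_upload=False, buffer_dir=None):
    """Return the spark.hadoop.* properties for S3A as a plain dict."""
    props = {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
    if fast_upload:
        # Bypasses the local disk scratch space, per the comment above;
        # intended for small files only.
        props["spark.hadoop.fs.s3a.fast.upload"] = "true"
    if buffer_dir is not None:
        # Assumption: fs.s3a.buffer.dir is the scratch-space setting the
        # comment refers to; the original does not name the property.
        props["spark.hadoop.fs.s3a.buffer.dir"] = buffer_dir
    return props


def build_spark_conf(props):
    """Copy the properties into a SparkConf (requires pyspark)."""
    from pyspark import SparkConf  # imported lazily so the dict helper works without pyspark

    conf = SparkConf()
    for key, value in props.items():
        conf.set(key, value)
    return conf


if __name__ == "__main__":
    # Print in spark.properties form; ACCESSKEY and SECRETKEY are placeholders.
    props = s3a_spark_properties("ACCESSKEY", "SECRETKEY", fast_upload=True)
    for key, value in sorted(props.items()):
        print(f"{key}={value}")
```

The resulting conf would be passed to `SparkContext(conf=...)`. Keeping real credentials out of source files (for example, reading them from the environment) is advisable but outside what the thread covers.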