@eddies
Created July 29, 2016 08:00
Spark 2.0.0 and Hadoop 2.7 with s3a setup

Standalone Spark 2.0.0 with s3

Tested with:

  • Spark 2.0.0 pre-built for Hadoop 2.7
  • Mac OS X 10.11
  • Python 3.5.2

Goal

Use s3 within pyspark with minimal hassle.

Load required libraries

If $SPARK_HOME/conf/spark-defaults.conf does not exist, create it by copying $SPARK_HOME/conf/spark-defaults.conf.template.

In $SPARK_HOME/conf/spark-defaults.conf include:

spark.jars.packages                com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
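
As a quick, optional sanity check (a minimal sketch, not part of the original setup), the value should be visible on the SparkConf of a pyspark shell started after the edit:

# In the pyspark shell, `sc` is the pre-created SparkContext.
# If spark-defaults.conf was picked up, this returns the string configured above.
sc.getConf().get("spark.jars.packages")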

AWS Credentials

In $SPARK_HOME/conf/hdfs-site.xml include:

<?xml version="1.0"?>
<configuration>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_KEY_HERE</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_HERE</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_KEY_HERE</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_HERE</value>
</property>
</configuration>
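
With the packages and credentials in place, a smoke test from pyspark might look like the following minimal sketch (the bucket and key are hypothetical placeholders, not real paths):

# Hypothetical smoke test: read a text file through the s3a:// scheme.
# "my-bucket" and "some/key.txt" are placeholders.
rdd = sc.textFile("s3a://my-bucket/some/key.txt")
print(rdd.count())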

Notes

Per HADOOP-12420:

the rule for s3a work now and in future "use a consistent version of the amazon libraries with which hadoop was built with"

With a future version of Spark built against Hadoop 2.8, you should be able to use the more granular aws-java-sdk-s3 artifact.

Things that didn't work

  1. Defining aws_access_key_id and aws_secret_access_key in ~/.aws/credentials, e.g.:

    [default]
    aws_access_key_id=YOUR_KEY_HERE
    aws_secret_access_key=YOUR_SECRET_HERE
    
    [profile_foo]
    aws_access_key_id=YOUR_KEY_HERE
    aws_secret_access_key=YOUR_SECRET_HERE
    
  2. Setting the PYSPARK_SUBMIT_ARGS environment variable, e.g.

    export PYSPARK_SUBMIT_ARGS="--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell" 

Things that also worked but were less optimal

  1. Calling pyspark with --packages argument:

    pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
  2. Defining AWS credentials in code, e.g.:

    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_HERE")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_HERE")
@adam-phillipps

Thanks! That was very helpful for me.

@baoGia1404

Thanks! It saved my first Spark project.

@Raghavsalotra

Thanks a lot! It was really helpful.

@wrschneider

You can use the ~/.aws/credentials file; it just takes some additional config. By default, Spark/Hadoop does not include the ProfileCredentialsProvider in its credential chain, whereas the AWS SDK does. This is confusing because the standard environment variables "just work". But this is fixable.
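
For example, one way to do that (a minimal sketch, assuming a Hadoop version whose s3a connector supports the fs.s3a.aws.credentials.provider property, i.e. 2.8 or later) is to add the AWS SDK's profile-based provider explicitly:

# Sketch: let s3a use the AWS SDK's ProfileCredentialsProvider so that
# ~/.aws/credentials is consulted. Requires fs.s3a.aws.credentials.provider
# support (Hadoop 2.8+).
sc._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.profile.ProfileCredentialsProvider")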

@sithara

sithara commented Aug 8, 2019

Not sure why I get "FileSystem:2639 - Cannot load filesystem java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated" when I try to run this locally 😞. I'd appreciate any help.

Thanks

@absognety

Along with the above AWS credentials, some additional configuration parameters are required. I spent about an hour and finally read a CSV file from S3 successfully. Here is the final list of parameters I used:

sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_HERE")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_HERE")
sc._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")

You also have to add fs.s3a.impl, fs.s3a.path.style.access, and fs.s3a.endpoint to hdfs-site.xml and call pyspark with the --packages argument as stated above.
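
For reference, the read itself would then be along these lines (a sketch; the bucket and path are hypothetical placeholders):

# Hypothetical example: read a CSV from an ap-south-1 bucket via s3a.
df = spark.read.csv("s3a://my-bucket/path/to/file.csv", header=True)
df.show(5)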

@LiningZheng

Thank you!

@youzouu

youzouu commented Oct 26, 2022

Hello, I have a problem loading a file from AWS S3 in Jupyter. I think it is caused by the aws-java-sdk and hadoop-aws libraries being missing, so I want to follow your tutorial to install them. When I create the copy of spark-defaults.conf from $SPARK_HOME/conf/spark-defaults.conf.template (like this: sudo cp spark-defaults.conf spark-defaults.conf.template), afterwards I can't write inside it to include: spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2.
With each step I take I encounter a problem, and solving one problem leads me to another. So if someone can help me, please do. Thanks a lot.

@Littlehhao

I configured fs.s3a.endpoint as http://minio-hs:9000 and it cannot be resolved. That is the service address inside k8s, but if I configure it with the IP instead, it connects.

Can anyone tell me why?
