@tobilg
Last active May 22, 2023 14:57
Description on how to use a custom S3 endpoint (like Rados Gateway for Ceph)

Custom S3 endpoints with Spark

To use custom endpoints with the latest Spark distribution, you need to add an external package (hadoop-aws). Custom endpoints can then be configured according to the docs.

Use the hadoop-aws package

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2

SparkContext configuration

Add this to your application, or in the spark-shell:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>");
sc.hadoopConfiguration.set("fs.s3a.access.key","<<ACCESS_KEY>>");
sc.hadoopConfiguration.set("fs.s3a.secret.key","<<SECRET_KEY>>");

If your endpoint doesn't support HTTPS, then you'll need the following:

sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false");
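
For applications that create their own SparkSession (for example a Structured Streaming job), the same settings can usually be supplied before the session is created by prefixing each key with spark.hadoop., which Spark copies into the Hadoop configuration. A minimal sketch, assuming Spark 2.x or later with hadoop-aws on the classpath; the application name is hypothetical and the <<...>> values are the same placeholders as above:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values: replace <<...>> with your own endpoint and credentials.
val spark = SparkSession.builder()
  .appName("custom-s3-endpoint") // hypothetical application name
  .config("spark.hadoop.fs.s3a.endpoint", "<<ENDPOINT>>")
  .config("spark.hadoop.fs.s3a.access.key", "<<ACCESS_KEY>>")
  .config("spark.hadoop.fs.s3a.secret.key", "<<SECRET_KEY>>")
  // Only needed when the endpoint does not support HTTPS.
  .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
  .getOrCreate()
```

Setting the values on the builder keeps them in place before any S3A filesystem is first used, which avoids depending on when sc.hadoopConfiguration is modified.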

S3 url usage

You can use s3a urls like this:

s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>

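It is also possible to put the credentials in the path, although this exposes the secret key in logs and URLs, and newer Hadoop releases no longer support it: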

s3a://<<ACCESS_KEY>>:<<SECRET_KEY>>@<<BUCKET>>/<<FOLDER>>/<<FILE>>
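
As a rough illustration of the plain s3a URL form, the following spark-shell sketch reads a text file and prints its first lines. It assumes the endpoint and credentials were already set as described above, and that the placeholder path is replaced with the location of an existing text object:

```scala
// Run inside spark-shell, where `sc` is already defined.
val lines = sc.textFile("s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>")

// Print the first five lines to confirm that the endpoint is reachable.
lines.take(5).foreach(println)
```
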
@anirtek
anirtek commented Dec 3, 2019

I am using Spark Structured Streaming. When should I set the values using sc.hadoopConfiguration? I am new to Spark, and I am not sure whether these values can be set after the Spark session is created. Can someone help me out here?

ghost commented Jan 10, 2020

Idk how common this use case is, but I have a pipeline that would be greatly simplified if I was able to use multiple S3 endpoints in a single spark submit. That is, I have a minio endpoint along with a standard aws s3 endpoint. Is there a way to configure the session on the fly such that I can use both?

@anirtek
anirtek commented Apr 6, 2020

@jamesthegiantpeach were you able to resolve your issue?

@ysennoun
Hello, fantastic work!
It works for me when I use a custom S3 (MinIO on Kubernetes) without TLS. But when I use my S3 with a self-signed TLS certificate, even if I set the following value:

spark.hadoop.fs.s3a.endpoint=https://<my-endpoint-name>:9000

I get this exception:

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on bucket:
com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException:
PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException:
unable to find valid certification path to requested target: 
Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: 
unable to find valid certification path to requested target at org.apache.hadoop.fs.s3a.S3AUtils.translateException

I can't find anywhere on the Internet how to set the path to the self-signed certificate. Do you know how?

@minyk
minyk commented Apr 20, 2020

@ysennoun IMO, the JVM (which runs Spark) needs a custom CA cert. Try googling something like: "custom certs into jvm"

@shubhamprasad0
Idk how common this use case is, but I have a pipeline that would be greatly simplified if I was able to use multiple S3 endpoints in a single spark submit. That is, I have a minio endpoint along with a standard aws s3 endpoint. Is there a way to configure the session on the fly such that I can use both?

@jamesthegiantpeach: I have a similar use case. How did you solve this problem in your case?
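
One possible approach to the multiple-endpoint question, not confirmed by anyone in this thread: Hadoop 2.8 and later support per-bucket S3A options of the form fs.s3a.bucket.<bucket>.<option>, which override the global fs.s3a.* values for that bucket only. A sketch for spark-shell, assuming a newer hadoop-aws than the 2.7.2 used in the gist; the bucket name minio-data, the endpoint URL, and the placeholder names are all hypothetical:

```scala
// Run inside spark-shell, where `sc` is already defined.
val hadoopConf = sc.hadoopConfiguration

// Global settings: used for every bucket without an override (here: AWS S3).
hadoopConf.set("fs.s3a.access.key", "<<AWS_ACCESS_KEY>>")
hadoopConf.set("fs.s3a.secret.key", "<<AWS_SECRET_KEY>>")

// Per-bucket overrides for the hypothetical MinIO bucket "minio-data".
hadoopConf.set("fs.s3a.bucket.minio-data.endpoint", "http://minio.example.internal:9000")
hadoopConf.set("fs.s3a.bucket.minio-data.connection.ssl.enabled", "false")
hadoopConf.set("fs.s3a.bucket.minio-data.path.style.access", "true")
hadoopConf.set("fs.s3a.bucket.minio-data.access.key", "<<MINIO_ACCESS_KEY>>")
hadoopConf.set("fs.s3a.bucket.minio-data.secret.key", "<<MINIO_SECRET_KEY>>")
```

With settings like these, s3a://minio-data/... paths would go to the MinIO endpoint while all other s3a:// paths keep using the global configuration.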

@amityadav-alphonso
amityadav-alphonso commented Feb 1, 2023

I am getting this error when reading from a local S3 instance.

The confs I have set are:

val sparkConf = new SparkConf().setAppName("testing")
  .set("fs.s3a.endpoint", "http://127.0.0.1:9090")
  .set("fs.s3a.multipart.size", "104857600")
  .set("fs.s3a.connection.ssl.enabled", "false")
  .set("fs.s3a.access.key", "adfadf")
  .set("fs.s3a.secret.key", "qerqer")

java.lang.IllegalArgumentException
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1293)
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1215)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)

@BKRami
BKRami commented May 22, 2023

You have to enable path-style access, because a custom endpoint is addressed differently from AWS directly:
.set("fs.s3a.path.style.access", "true")
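
Put together, a configuration for a local endpoint might look like the sketch below. It reuses the values from the question above and makes two assumptions that were not confirmed in this thread: that the hadoop-aws version in use is 2.8 or later (where fs.s3a.path.style.access is available), and that the keys carry the spark.hadoop. prefix, since Spark only copies SparkConf entries with that prefix into the Hadoop configuration. Whether this resolves the IllegalArgumentException above is not established here.

```scala
import org.apache.spark.SparkConf

// Endpoint and credentials are the example values from the question above.
val sparkConf = new SparkConf()
  .setAppName("testing")
  .set("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9090")
  .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
  // Address buckets as http://host:port/bucket instead of http://bucket.host:port.
  .set("spark.hadoop.fs.s3a.path.style.access", "true")
  .set("spark.hadoop.fs.s3a.access.key", "adfadf")
  .set("spark.hadoop.fs.s3a.secret.key", "qerqer")
```
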
