@smartnose
Last active May 22, 2021 11:30
Custom credential provider for AWS EMRFS and Spark applications

Background

Frequently, our EMR applications need to perform cross-account reads and writes, i.e., the cluster is created under one AWS billing account, but the data lives under another (let's call it the "guest account"). Because of security concerns, we cannot grant blanket S3 access to the guest account's data. Instead, we rely on the assume-role feature of AWS STS to provide ephemeral credentials for read/write operations. The basic logic for calling the STS service is not difficult, but there are some pitfalls when you want to integrate assume-role authentication with EMRFS.
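The assume-role call itself is a single STS request. Here is a minimal sketch using the AWS SDK for Java v1; the role ARN, session name, and duration are placeholders, not values from this gist.

import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class AssumeRoleExample {
    public static void main(String[] args) {
        // STS client that authenticates with the cluster's own (billing account) credentials.
        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

        // Ask STS for temporary credentials scoped to the guest-account role.
        AssumeRoleRequest request = new AssumeRoleRequest()
                .withRoleArn("arn:aws:iam::123456789012:role/guest-account-read-write") // placeholder ARN
                .withRoleSessionName("emrfs-cross-account-session")
                .withDurationSeconds(3600);

        Credentials creds = sts.assumeRole(request).getCredentials();

        // The ephemeral credential set: access key, secret key, session token, and expiry.
        System.out.println("Access key: " + creds.getAccessKeyId());
        System.out.println("Expires at: " + creds.getExpiration());
    }
}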

Custom credential provider

For Hadoop/Spark, the authentication process is handled within the file system itself, so application code can write to an S3 file without worrying about the underlying nitty-gritty details. EMRFS is an implementation of the S3 file system, and it provides an extension point so you can plug in a custom credential provider. You can then enable cross-account read/write access from Hadoop/Spark with the following steps:

  1. Create a role in the guest account (where the data lives) that has read/write permission on the data
  2. Make sure your EMR cluster can assume this role
  3. Write a custom credential provider that assumes the role through the STS API (see the sketch after this list)
  4. Copy the credential provider JAR onto the classpath of your EMR application (e.g. /usr/share/aws/emr/emrfs/auxlib)
  5. Update emrfs-site as described here
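For step 3, an EMRFS custom credential provider is a class that implements the AWS SDK's AWSCredentialsProvider interface. The sketch below simply delegates to the SDK's STSAssumeRoleSessionCredentialsProvider; the package, class name, and role ARN are illustrative and not taken from the actual zillow provider.

package com.example.emrfs; // hypothetical package

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;

public class CrossAccountCredentialsProvider implements AWSCredentialsProvider {

    // The SDK provider calls sts:AssumeRole and renews the temporary
    // credentials before they expire.
    private final STSAssumeRoleSessionCredentialsProvider delegate =
            new STSAssumeRoleSessionCredentialsProvider.Builder(
                    "arn:aws:iam::123456789012:role/guest-account-read-write", // placeholder ARN
                    "emrfs-cross-account-session")
                    .build();

    @Override
    public AWSCredentials getCredentials() {
        return delegate.getCredentials();
    }

    @Override
    public void refresh() {
        delegate.refresh();
    }
}

For step 5, the class is then referenced from emrfs-site, typically via the fs.s3.customAWSCredentialsProvider property.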

Limitation

There is a serious limitation to the custom credential provider due to the way credential providers are cached. Specifically, the S3 file system will try a chain of credential providers (at least the custom credential provider and the default AWS credential provider), and it caches the last working provider for subsequent S3 accesses until the credentials expire. This means you cannot use two different credential providers within the same EMR application. Say you want to use one credential for s3://one_bucket/data and another for s3://another_bucket/..; there is simply no way to do that, because the Hadoop S3 file system will always reuse the credential that succeeded before.

One way to fix this is to allow the S3 URI to carry the assume-role name, and have the custom credential provider assume different roles for different URIs. For example, we might have s3://one_bucket/data?use-role=role1 and s3://another_bucket/data?use-role=role2.
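Below is a rough sketch of that idea. It assumes EMRFS constructs the provider with the URI being accessed and the Hadoop Configuration (a two-argument constructor); the use-role query parameter and all names here are hypothetical.

package com.example.emrfs; // hypothetical package

import java.net.URI;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;
import org.apache.hadoop.conf.Configuration;

public class PerUriRoleCredentialsProvider implements AWSCredentialsProvider {

    private final AWSCredentialsProvider delegate;

    public PerUriRoleCredentialsProvider(URI uri, Configuration conf) {
        // Pull the role name out of the query string, e.g. "use-role=role1".
        String role = "default-role"; // hypothetical fallback
        String query = uri.getQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2 && kv[0].equals("use-role")) {
                    role = kv[1];
                }
            }
        }
        // Assume a different role per URI, so each bucket can use its own credentials.
        delegate = new STSAssumeRoleSessionCredentialsProvider.Builder(
                "arn:aws:iam::123456789012:role/" + role, // placeholder account id
                "emrfs-per-uri-session").build();
    }

    @Override
    public AWSCredentials getCredentials() {
        return delegate.getCredentials();
    }

    @Override
    public void refresh() {
        delegate.refresh();
    }
}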

@mnoumanshahzad

Thank you for sharing these insights...!
I see that the open-sourced zillow/aws-custom-credentials-provider was developed by you.
I am curious about some details, as I am not able to find the answers myself.

In the credentials-provider, you implemented the refresh method.
How does this refresh method get invoked by the underlying Hadoop services?
Is it safe to assume that having an implementation of the refresh mechanism is sufficient and that the Hadoop service will invoke this method?

I found the following explanation for Hadoop 3's AssumedRoleCredentialProvider:

This AWS Credential provider will read in the fs.s3a.assumed.role options needed to connect to the 
Security Token Service Assumed Role API, first authenticating with the full credentials, then assuming the
specific role specified. It will then refresh this login at the configured rate of fs.s3a.assumed.role.session.duration

and the description in the configuration for the session duration is:

<property>
  <name>fs.s3a.assumed.role.session.duration</name>
  <value>30m</value>
  <description>
    Duration of assumed roles before a refresh is attempted.
    Only used if AssumedRoleCredentialProvider is the AWS credential provider.
    Range: 15m to 1h
  </description>
</property>
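For context, the surrounding S3A configuration selects the provider and the role roughly like this (the ARN is just a placeholder):

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
</property>
<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value>arn:aws:iam::123456789012:role/guest-account-read-write</value>
</property>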

I tried to scan through the Hadoop code base to figure out how this refresh is possible, but so far I have failed to pinpoint the relevant code block.
