Hadoop Properties for Spark in Cloud (s3a, buckets)

The following is a list of Hadoop properties for Spark to use HDFS and cloud storage more effectively.

spark.hadoop.-prefixed Spark properties are used to configure the Hadoop Configuration that Spark broadcasts to tasks. Use spark.sparkContext.hadoopConfiguration to review the effective properties (a sketch follows the list below).

  • spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2
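
A minimal sketch (the app name and local master are assumptions for illustration): a spark.hadoop.-prefixed property set at session build time shows up, with the prefix stripped, in the Hadoop Configuration that tasks receive.

```scala
import org.apache.spark.sql.SparkSession

// Any spark.hadoop.-prefixed property ends up in the Hadoop Configuration
// with the spark.hadoop. prefix stripped.
val spark = SparkSession.builder()
  .appName("hadoop-props-demo")  // hypothetical app name
  .master("local[*]")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Review the effective Hadoop properties
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("mapreduce.fileoutputcommitter.algorithm.version"))  // prints 2
```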

Google Cloud Storage

Read the Google Cloud Storage Connector for Spark and Hadoop documentation; a configuration sketch follows.
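
A minimal sketch of wiring the connector into a Spark session, assuming the gcs-connector JAR is on the classpath; the property names follow the connector documentation, and the keyfile path and bucket name are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gcs-demo")  // hypothetical app name
  .master("local[*]")
  // FileSystem implementation for gs:// URIs
  .config("spark.hadoop.fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  // Authenticate with a service account (hypothetical keyfile path)
  .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
  .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")
  .getOrCreate()

spark.read.text("gs://my-bucket/input").show()  // hypothetical bucket
```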

Amazon S3

Read the Hadoop-AWS module documentation. The key S3A properties (applied in the sketch after the list):

  • fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
  • fs.s3a.multiobjectdelete.enable = false
  • fs.s3a.fast.upload = true
  • fs.s3a.endpoint
  • fs.s3a.access.key
  • fs.s3a.secret.key
  • fs.s3a.path.style.access = true
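
A minimal sketch that applies the properties above via the spark.hadoop. prefix; the endpoint, the environment variables holding the credentials, and the bucket name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-demo")  // hypothetical app name
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")  // hypothetical endpoint
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  // Credentials read from the environment rather than hard-coded
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

spark.read.text("s3a://my-bucket/input").show()  // hypothetical bucket
```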