Spark Configuration Flags

Configurations

  1. Default Parallelism

    • spark.default.parallelism
    • Suggested Value: 32 to 5000
    • Description: Sets the default level of parallelism for RDD operations. 32 is a moderate starting point suited to medium-sized clusters, but this number can be cranked way up; the Spark documentation suggests at least one task per CPU core across all executors. Note that this default only governs the RDD API and will not cap parallelism elsewhere. Because the appropriate value changes with expected job size, it is perhaps wise to set it via heuristics evaluated in the infrastructure that kicks off EMR jobs (see the sketch following this item).
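As an illustration of that last point, the infrastructure launching a job might derive the parallelism from the expected input size before submitting. A minimal sketch; the helper itself, the 128 MiB-per-partition target, and the clamping bounds are illustrative assumptions rather than values taken from this list:

```scala
// Hypothetical helper: derive spark.default.parallelism from expected input size
// at job-launch time. The bytes-per-partition target and the 32..5000 clamp are
// assumptions for illustration only.
object ParallelismHeuristic {
  def suggestedParallelism(expectedInputBytes: Long,
                           targetBytesPerPartition: Long = 128L * 1024 * 1024,
                           min: Int = 32,
                           max: Int = 5000): Int = {
    val raw = math.ceil(expectedInputBytes.toDouble / targetBytesPerPartition).toInt
    math.min(max, math.max(min, raw))
  }
}

// The result would then be passed along at submit time,
// e.g. --conf spark.default.parallelism=<n>
```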
  2. Driver Extra Java Options

    • spark.driver.extraJavaOptions
    • Suggested Value: -Djts.overlay=ng
    • Description: Extra Java options for the driver. The -Djts.overlay=ng option tells JTS to use its next-generation overlay algorithm. This is highly recommended, as it avoids the floating point precision issues commonly encountered when overlaying geometries. Essentially: try the fast overlay strategy first and fall back on slower, more forgiving strategies.
  3. Executor Extra Java Options

    • spark.executor.extraJavaOptions
    • Suggested Value: -Djts.overlay=ng
    • Description: Extra Java options for the executors. The -Djts.overlay=ng option tells JTS to use its next-generation overlay algorithm. This is highly recommended, as it avoids the floating point precision issues commonly encountered when overlaying geometries. Essentially: try the fast overlay strategy first and fall back on slower, more forgiving strategies.
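Since a missing executor-side flag is easy to overlook, one sanity check is to read the system property from inside a task. A minimal sketch, assuming a SparkSession named `spark` has already been created with the options above:

```scala
// Check that -Djts.overlay=ng made it to both the driver and the executor JVMs.
// Assumes an existing SparkSession named `spark`.
val driverValue = System.getProperty("jts.overlay")       // expected: "ng"

val executorValues = spark.range(0, 4)
  .rdd
  .map(_ => System.getProperty("jts.overlay"))             // read inside a task
  .distinct()
  .collect()

println(s"driver jts.overlay=$driverValue, executors=${executorValues.mkString(",")}")
```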
  4. Hive Metastore Configuration

    • spark.hadoop.hive.metastore.client.factory.class
    • Suggested Value: com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
    • Description: Configures Spark to use the AWS Glue Data Catalog as the Hive metastore. This is highly recommended: the metastore becomes another piece of infrastructure AWS manages, so downtime and debugging are less of a concern.
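For the Glue-backed metastore to be visible from code, Hive support also needs to be enabled on the session. A minimal sketch, assuming the Glue catalog client is on the classpath (as it is on EMR) and using a placeholder application name:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: point the Hive metastore client at AWS Glue and enable Hive support
// so that spark.sql / saveAsTable can see Glue databases.
val spark = SparkSession.builder()
  .appName("glue-catalog-example") // placeholder name
  .config("spark.hadoop.hive.metastore.client.factory.class",
          "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show() // should list Glue Data Catalog databases
```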
  5. Executor Cores

    • spark.executor.cores
    • Suggested Value: 4
    • Description: Number of CPU cores allocated to each executor. Four cores balance task parallelism and performance; 4 is the maximum on serverless and appears to be highly performant.
  6. Executor Memory

    • spark.executor.memory
    • Suggested Value: 10g
    • Description: Memory allocated per executor, here 10GB, suitable for memory-intensive tasks. This is a relatively high value; if optimization is desired, try tuning this down first. 5g or even lower might be reasonable.
  7. Executor Memory Overhead

    • spark.executor.memoryOverhead
    • Suggested Value: 1g
    • Description: Additional overhead memory for each executor, set to 1g. This is worth playing with: 1g is on the higher end of the 'normal' values one encounters here and may be overly cautious. 512m is perhaps as low as this value should go, so there may be some benefit to experimenting with smaller values down to that floor.
  8. Driver Memory

    • spark.driver.memory
    • Suggested Value: 10g
    • Description: Memory allocation for the Spark driver. Set to a relatively liberal value of 10g. There's only one of them, so it probably isn't necessary to be too stingy here.
  9. Driver Memory Overhead

    • spark.driver.memoryOverhead
    • Suggested Value: 2g
    • Description: Additional non-heap memory allocation for the driver, set to 2g. Again, there's only one driver and everything is going to be painful if it dies. May as well give it some room to breathe.
  10. Shuffle Compression

    • spark.shuffle.compress
    • Suggested Value: false
    • Description: When enabled, shuffle data is compressed to save disk space and network bandwidth, which helps when disk and network are the constraining factors. That is unlikely to be the case on EMR Serverless, which is billed according to vCPU and memory usage anyway, so it is left off here to avoid spending CPU on compression.
  11. RDD Compression

    • spark.rdd.compress
    • Suggested Value: false
    • Description: When enabled, serialized RDD partitions are compressed to save disk space. As above, this is best left off, with an aim towards jobs that keep as much data in 'working memory' (RAM) and out of 'long-term memory' (disk) as possible.
  12. Driver Max Result Size

    • spark.driver.maxResultSize
    • Suggested Value: 5g
    • Description: Sets the maximum total size of results returned to the driver (e.g. via collect), here 5GB. This value can be played with, but given the scale of these jobs it is sensible to bump it up from the default of 1g. A word of warning: higher values can cause OOM errors if the driver's memory and overhead are not sufficient for the size of the returned values.
  13. Off-Heap Memory

    • spark.memory.offHeap.enabled
    • Suggested Value: true
    • Description: Enables the use of off-heap memory storage. Essential for applications like GDAL that use native (off-heap) memory; allocating sufficient off-heap memory helps avoid hard-to-debug memory errors.
  14. Off-Heap Memory Size

    • spark.memory.offHeap.size
    • Suggested Value: 512m
    • Description: Sets the size of off-heap memory, here 512m, to support GDAL's memory requirements. Try increasing this value as something of a last resort for dying workers and other memory-related problems. The per-executor footprint these settings imply is sketched below.
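Taken together with the executor settings above, the suggested values imply a rough per-executor memory footprint. The arithmetic below simply sums the values in this list; the exact accounting of off-heap versus overhead varies by Spark release, so treat it as an approximation rather than an EMR sizing formula:

```scala
// Rough per-executor footprint from the suggested values in this list:
//   heap      (spark.executor.memory)          = 10   GiB
//   overhead  (spark.executor.memoryOverhead)  =  1   GiB
//   off-heap  (spark.memory.offHeap.size)      =  0.5 GiB
val executorHeapGiB     = 10.0
val executorOverheadGiB = 1.0
val offHeapGiB          = 0.5
val perExecutorGiB      = executorHeapGiB + executorOverheadGiB + offHeapGiB // ~11.5 GiB

// The driver, similarly, sits at roughly 10 GiB heap + 2 GiB overhead = 12 GiB.
```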
  15. Executor Environment AWS Request Payer

    • spark.executorEnv.AWS_REQUEST_PAYER
    • Suggested Value: requester
    • Description: Sets the AWS Request Payer to 'requester' in the executor environment. This configuration indicates that the requester (i.e., the user running the Spark job) will bear the costs of the AWS requests made by the executors. It is primarily (exclusively?) used by GDAL.
  16. EMR Serverless Driver Environment AWS Request Payer

    • spark.emr-serverless.driverEnv.AWS_REQUEST_PAYER
    • Suggested Value: requester
    • Description: Similar to the executor setting, this applies to the EMR Serverless driver environment, setting the AWS Request Payer to 'requester' for AWS requests made by the Spark driver. It is primarily (exclusively?) used by GDAL.
  17. S3 Use Requester Pays Header

    • spark.hadoop.fs.s3.useRequesterPaysHeader
    • Suggested Value: true
    • Description: Enables the use of the Requester Pays header for S3 requests. When enabled, it signifies that the requester is responsible for the cost of data transfer and requests to Amazon S3.
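Because the requester-pays configuration is spread across executor environment variables (read by GDAL) and the Hadoop configuration (read by the S3 filesystem), a quick runtime check can save debugging later. A minimal sketch, assuming a SparkSession named `spark` configured with the settings above:

```scala
// Confirm the requester-pays plumbing is visible where it matters.
// Assumes an existing SparkSession named `spark`.
val headerFlag = spark.sparkContext.hadoopConfiguration.get("fs.s3.useRequesterPaysHeader")

val executorEnv = spark.range(0, 2)
  .rdd
  .map(_ => sys.env.getOrElse("AWS_REQUEST_PAYER", "<unset>")) // what GDAL will see
  .distinct()
  .collect()

println(s"fs.s3.useRequesterPaysHeader=$headerFlag, executor AWS_REQUEST_PAYER=${executorEnv.mkString(",")}")
```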
  18. S3 Enable Server-Side Encryption

    • spark.hadoop.fs.s3.enableServerSideEncryption
    • Suggested Value: true
    • Description: Turns on server-side encryption for data stored in S3. This setting ensures that data is encrypted at rest within S3.
  19. S3 Server-Side Encryption Algorithm

    • spark.hadoop.fs.s3.serverSideEncryptionAlgorithm
    • Suggested Value: AES256
    • Description: Specifies the encryption algorithm used for server-side encryption in S3. AES256 indicates the use of the AES-256 encryption algorithm, providing strong encryption for data at rest.
  20. Inspecting GC

    • spark.driver.extraJavaOptions
    • Suggested Value: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    • Description: If attempting to dig in and tune garbage collection to maximize performance, and especially to minimize garbage collection bottlenecks, these flags are likely to be useful. Note that spark.driver.extraJavaOptions is a single string, so these options need to be appended to the -Djts.overlay=ng value suggested above rather than set separately. (On newer Java runtimes, the legacy PrintGC flags have been superseded by unified logging via -Xlog:gc*.) See the Spark tuning guide's section on garbage collection for more details.