Spark Configuration Flags

Configurations

  1. Default Parallelism

    • spark.default.parallelism
    • Suggested Value: 32 to 5000
    • Description: Sets the default level of parallelism for RDD operations. 32 is a moderate starting point suited to medium-sized clusters, but this number can be cranked way up; the Spark documentation suggests at least one task per CPU core across all executors. Note that this default only governs the RDD API and will not cap parallelism elsewhere. Because the appropriate value changes with expected job size, it is perhaps wise to set it via heuristics evaluated in the infrastructure that kicks off EMR jobs (see the sketch following this item).
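As an illustration of that last point, the infrastructure launching a job might derive the parallelism from the expected input size before submitting. A minimal sketch; the helper itself, the 128 MiB-per-partition target, and the clamping bounds are illustrative assumptions rather than values taken from this list:

```scala
// Hypothetical helper: derive spark.default.parallelism from expected input size
// at job-launch time. The bytes-per-partition target and the 32..5000 clamp are
// assumptions for illustration only.
object ParallelismHeuristic {
  def suggestedParallelism(expectedInputBytes: Long,
                           targetBytesPerPartition: Long = 128L * 1024 * 1024,
                           min: Int = 32,
                           max: Int = 5000): Int = {
    val raw = math.ceil(expectedInputBytes.toDouble / targetBytesPerPartition).toInt
    math.min(max, math.max(min, raw))
  }
}

// The result would then be passed along at submit time,
// e.g. --conf spark.default.parallelism=<n>
```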
  2. Driver Extra Java Options

    • spark.driver.extraJavaOptions
    • Suggested Value: -Djts.overlay=ng
    • Description: Extra Java options for the driver. The -Djts.overlay=ng option tells JTS to use its next-generation overlay algorithm. This is highly recommended, as it avoids the floating point precision issues commonly encountered when overlaying geometries. Essentially: try the fast overlay strategy first and fall back on slower, more forgiving strategies.
  3. Executor Extra Java Options

    • spark.executor.extraJavaOptions
    • Suggested Value: -Djts.overlay=ng
    • Description: Extra Java options for the executors. The -Djts.overlay=ng option tells JTS to use its next-generation overlay algorithm. This is highly recommended, as it avoids the floating point precision issues commonly encountered when overlaying geometries. Essentially: try the fast overlay strategy first and fall back on slower, more forgiving strategies.
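Since a missing executor-side flag is easy to overlook, one sanity check is to read the system property from inside a task. A minimal sketch, assuming a SparkSession named `spark` has already been created with the options above:

```scala
// Check that -Djts.overlay=ng made it to both the driver and the executor JVMs.
// Assumes an existing SparkSession named `spark`.
val driverValue = System.getProperty("jts.overlay")       // expected: "ng"

val executorValues = spark.range(0, 4)
  .rdd
  .map(_ => System.getProperty("jts.overlay"))             // read inside a task
  .distinct()
  .collect()

println(s"driver jts.overlay=$driverValue, executors=${executorValues.mkString(",")}")
```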
  4. Hive Metastore Configuration

    • spark.hadoop.hive.metastore.client.factory.class
    • Suggested Value: com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
    • Description: Configures Spark to use the AWS Glue Data Catalog as the Hive metastore. This is highly recommended: the metastore becomes another piece of infrastructure AWS manages, so downtime and debugging are less of a concern.
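For the Glue-backed metastore to be visible from code, Hive support also needs to be enabled on the session. A minimal sketch, assuming the Glue catalog client is on the classpath (as it is on EMR) and using a placeholder application name:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: point the Hive metastore client at AWS Glue and enable Hive support
// so that spark.sql / saveAsTable can see Glue databases.
val spark = SparkSession.builder()
  .appName("glue-catalog-example") // placeholder name
  .config("spark.hadoop.hive.metastore.client.factory.class",
          "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show() // should list Glue Data Catalog databases
```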
  5. Executor Cores

    • spark.executor.cores
    • Suggested Value: 4
    • Description: Number of CPU cores allocated to each executor. Four cores balance task parallelism and performance; 4 is the maximum on serverless and appears to be highly performant.
  6. Executor Memory

    • spark.executor.memory
    • Suggested Value: 10g
    • Description: Memory allocated per executor, here 10GB, suitable for memory-intensive tasks. This is a relatively high value; if optimization is desired, try tuning this down first. 5g or even lower might be reasonable.
  7. Executor Memory Overhead

    • spark.executor.memoryOverhead
    • Suggested Value: 1g
    • Description: Additional overhead memory for each executor, set to 1g. This is worth playing with: 1g is on the higher end of the 'normal' values one encounters here and may be overly cautious. 512m is perhaps as low as this value should go, so there may be some benefit to experimenting with smaller values down to that floor.
  8. Driver Memory

    • spark.driver.memory
    • Suggested Value: 10g
    • Description: Memory allocation for the Spark driver. Set to a relatively liberal value of 10g. There's only one of them, so it probably isn't necessary to be too stingy here.
  9. Driver Memory Overhead

    • spark.driver.memoryOverhead
    • Suggested Value: 2g
    • Description: Additional non-heap memory allocation for the driver, set to 2g. Again, there's only one driver and everything is going to be painful if it dies. May as well give it some room to breathe.
  10. Shuffle Compression

    • spark.shuffle.compress
    • Suggested Value: false
    • Description: When enabled, shuffle data is compressed to save disk space and network bandwidth, which helps when disk and network are the constraining factors. That is unlikely to be the case on EMR Serverless, which is billed according to vCPU and memory usage anyway, so it is left off here to avoid spending CPU on compression.
  11. RDD Compression

    • spark.rdd.compress
    • Suggested Value: false
    • Description: When enabled, serialized RDD partitions are compressed to save disk space. As above, this is best left off, with an aim towards jobs that keep as much data in 'working memory' (RAM) and out of 'long-term memory' (disk) as possible.
  12. Driver Max Result Size

    • spark.driver.maxResultSize
    • Suggested Value: 5g
    • Description: Sets the maximum total size of results returned to the driver (e.g. via collect), here 5GB. This value can be played with, but given the scale of these jobs it is sensible to bump it up from the default of 1g. A word of warning: higher values can cause OOM errors if the driver's memory and overhead are not sufficient for the size of the returned values.
  13. Off-Heap Memory

    • spark.memory.offHeap.enabled
    • Suggested Value: true
    • Description: Enables the use of off-heap memory storage. Essential for applications like GDAL that use native (off-heap) memory; allocating sufficient off-heap memory helps avoid hard-to-debug memory errors.
  14. Off-Heap Memory Size

    • spark.memory.offHeap.size
    • Suggested Value: 512m
    • Description: Sets the size of off-heap memory, here 512m, to support GDAL's memory requirements. Try increasing this value as something of a last resort for dying workers and other memory-related problems. The per-executor footprint these settings imply is sketched below.
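Taken together with the executor settings above, the suggested values imply a rough per-executor memory footprint. The arithmetic below simply sums the values in this list; the exact accounting of off-heap versus overhead varies by Spark release, so treat it as an approximation rather than an EMR sizing formula:

```scala
// Rough per-executor footprint from the suggested values in this list:
//   heap      (spark.executor.memory)          = 10   GiB
//   overhead  (spark.executor.memoryOverhead)  =  1   GiB
//   off-heap  (spark.memory.offHeap.size)      =  0.5 GiB
val executorHeapGiB     = 10.0
val executorOverheadGiB = 1.0
val offHeapGiB          = 0.5
val perExecutorGiB      = executorHeapGiB + executorOverheadGiB + offHeapGiB // ~11.5 GiB

// The driver, similarly, sits at roughly 10 GiB heap + 2 GiB overhead = 12 GiB.
```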
  15. Executor Environment AWS Request Payer

    • spark.executorEnv.AWS_REQUEST_PAYER
    • Suggested Value: requester
    • Description: Sets the AWS Request Payer to 'requester' in the executor environment. This configuration indicates that the requester (i.e., the user running the Spark job) will bear the costs of the AWS requests made by the executors. It is primarily (exclusively?) used by GDAL.
  16. EMR Serverless Driver Environment AWS Request Payer

    • spark.emr-serverless.driverEnv.AWS_REQUEST_PAYER
    • Suggested Value: requester
    • Description: Similar to the executor setting, this applies to the EMR Serverless driver environment, setting the AWS Request Payer to 'requester' for AWS requests made by the Spark driver. It is primarily (exclusively?) used by GDAL.
  17. S3 Use Requester Pays Header

    • spark.hadoop.fs.s3.useRequesterPaysHeader
    • Suggested Value: true
    • Description: Enables the use of the Requester Pays header for S3 requests. When enabled, it signifies that the requester is responsible for the cost of data transfer and requests to Amazon S3.
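Because the requester-pays configuration is spread across executor environment variables (read by GDAL) and the Hadoop configuration (read by the S3 filesystem), a quick runtime check can save debugging later. A minimal sketch, assuming a SparkSession named `spark` configured with the settings above:

```scala
// Confirm the requester-pays plumbing is visible where it matters.
// Assumes an existing SparkSession named `spark`.
val headerFlag = spark.sparkContext.hadoopConfiguration.get("fs.s3.useRequesterPaysHeader")

val executorEnv = spark.range(0, 2)
  .rdd
  .map(_ => sys.env.getOrElse("AWS_REQUEST_PAYER", "<unset>")) // what GDAL will see
  .distinct()
  .collect()

println(s"fs.s3.useRequesterPaysHeader=$headerFlag, executor AWS_REQUEST_PAYER=${executorEnv.mkString(",")}")
```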
  18. S3 Enable Server-Side Encryption

    • spark.hadoop.fs.s3.enableServerSideEncryption
    • Suggested Value: true
    • Description: Turns on server-side encryption for data stored in S3. This setting ensures that data is encrypted at rest within S3.
  19. S3 Server-Side Encryption Algorithm

    • spark.hadoop.fs.s3.serverSideEncryptionAlgorithm
    • Suggested Value: AES256
    • Description: Specifies the encryption algorithm used for server-side encryption in S3. AES256 indicates the use of the AES-256 encryption algorithm, providing strong encryption for data at rest.
  20. Inspecting GC

    • spark.driver.extraJavaOptions
    • Suggested Value: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    • Description: If attempting to dig in and tune garbage collection to maximize performance, and especially to minimize garbage collection bottlenecks, these flags are likely to be useful. Note that spark.driver.extraJavaOptions is a single string, so these options need to be appended to the -Djts.overlay=ng value suggested above rather than set separately. (On newer Java runtimes, the legacy PrintGC flags have been superseded by unified logging via -Xlog:gc*.) See the Spark tuning guide's section on garbage collection for more details.