- **Default Parallelism** (`spark.default.parallelism`)
  - Suggested Value: `32` to `5000`
  - Description: Sets the default level of parallelism. `32` indicates a moderate level suitable for medium-sized clusters, but it is just a starting point: this number can be cranked way up, and the Spark documentation suggests one task per CPU core across all executors. Note that this default only caps parallelism for the RDD API; other APIs are unaffected by it. Because the right value changes with expected job size, it is perhaps wise to set it via heuristics evaluated in the infrastructure that kicks off EMR jobs.
- **Driver Extra Java Options** (`spark.driver.extraJavaOptions`)
  - Suggested Value: `-Djts.overlay=ng`
  - Description: Extra Java options for the driver. The `-Djts.overlay=ng` option tells JTS to use its next-generation overlay algorithm. This value is highly recommended, as it avoids issues commonly encountered with geometries related to floating-point precision. Essentially: try the fast overlay strategies and fall back on slower, more forgiving strategies.
- **Executor Extra Java Options** (`spark.executor.extraJavaOptions`)
  - Suggested Value: `-Djts.overlay=ng`
  - Description: Extra Java options for the executors. The `-Djts.overlay=ng` option tells JTS to use its next-generation overlay algorithm. This value is highly recommended, as it avoids issues commonly encountered with geometries related to floating-point precision. Essentially: try the fast overlay strategies and fall back on slower, more forgiving strategies.
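Since the driver and executors are configured separately, the same JTS flag must be set on both keys. A minimal sketch of the resulting conf pairs (the `conf` dict here is illustrative, not an API):

```python
# The JTS overlay flag goes on both the driver and the executors,
# as two separate Spark conf entries.
jts_flag = "-Djts.overlay=ng"
conf = {
    "spark.driver.extraJavaOptions": jts_flag,
    "spark.executor.extraJavaOptions": jts_flag,
}
```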
- **Hive Metastore Configuration** (`spark.hadoop.hive.metastore.client.factory.class`)
  - Suggested Value: `com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory`
  - Description: Configures Spark to use the AWS Glue Data Catalog as its Hive metastore. This is highly recommended: it is another bit of infrastructure AWS manages, so downtime and debugging are less of a concern.
- **Executor Cores** (`spark.executor.cores`)
  - Suggested Value: `4`
  - Description: Number of CPU cores allocated to each executor. Four cores balance task parallelism and performance; `4` is the maximum on Serverless and appears to be highly performant.
- **Executor Memory** (`spark.executor.memory`)
  - Suggested Value: `10g`
  - Description: Memory allocated per executor, here 10 GB, suitable for memory-intensive tasks. This is a relatively high value: if optimization is desired, try tuning it down first. `5g` or lower might even be reasonable.
- **Executor Memory Overhead** (`spark.executor.memoryOverhead`)
  - Suggested Value: `1g`
  - Description: Additional non-heap memory for each executor, set to `1g`. This is potentially worth playing with: `1g` is on the higher end of the 'normal' values one encounters here, which may be overly cautious. `512m` is perhaps as low as this value should go; there may be some benefit to experimenting with values lower than `1g`.
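For context, Spark's own default for this overhead is the larger of 10% of executor memory or 384 MiB, which for a `10g` heap works out close to the `1g` suggested here. A quick sketch of that default (the function name is illustrative):

```python
# Spark's built-in default: memoryOverhead = max(10% of executor memory, 384 MiB).
# For a 10g (10240 MiB) heap this is 1024 MiB, i.e. roughly the 1g suggested above.
def default_overhead_mib(executor_memory_mib: int) -> int:
    return max(int(executor_memory_mib * 0.10), 384)
```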
- **Driver Memory** (`spark.driver.memory`)
  - Suggested Value: `10g`
  - Description: Memory allocated to the Spark driver, set to a relatively liberal `10g`. There is only one driver, so it probably isn't necessary to be too stingy here.
- **Driver Memory Overhead** (`spark.driver.memoryOverhead`)
  - Suggested Value: `2g`
  - Description: Additional non-heap memory for the driver, set to `2g`. Again, there is only one driver, and everything becomes painful if it dies. May as well give it some room to breathe.
- **Shuffle Compression** (`spark.shuffle.compress`)
  - Suggested Value: `false`
  - Description: When enabled, compresses data during shuffle to save disk space and network bandwidth, which is especially useful when disk and network are the constraining factors. That's likely not the case on EMR Serverless, which, anyway, is billed according to vCPU and memory usage, so it is suggested off here.
- **RDD Compression** (`spark.rdd.compress`)
  - Suggested Value: `false`
  - Description: When enabled, compresses serialized RDD partitions to save disk space. Like the above, this is best left off, with an aim toward jobs that keep as much data as possible in 'working memory' (memory) and out of 'long-term memory' (disk).
- **Driver Max Result Size** (`spark.driver.maxResultSize`)
  - Suggested Value: `5g`
  - Description: Sets the maximum total size of results returned to the driver, here 5 GB. This value can be played with, but given the scale of these jobs, it is sensible to bump it up from the default of `1g`. A word of warning: higher values can cause OOM errors if the driver's memory and overhead are not sufficient for the size of the returned values.
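That warning can be encoded as a simple sanity check, run before submitting a job. This is a hypothetical helper (names and the 50% headroom factor are assumptions, not from the source): it just verifies that `maxResultSize` stays comfortably below driver memory.

```python
# Hypothetical pre-flight check: collected results must fit in driver memory
# with some headroom, or collect()-style actions can OOM the driver.
def result_size_fits(max_result: str, driver_memory: str,
                     headroom: float = 0.5) -> bool:
    """Parse '5g'/'512m' style sizes and compare against a headroom fraction."""
    to_mib = lambda s: int(s[:-1]) * (1024 if s[-1].lower() == "g" else 1)
    return to_mib(max_result) <= to_mib(driver_memory) * headroom

# 5g against a 10g driver is right at the 50% line; 8g would not fit
```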
- **Off-Heap Memory** (`spark.memory.offHeap.enabled`)
  - Suggested Value: `true`
  - Description: Enables the use of off-heap memory storage. Essential for applications like GDAL that utilize native (off-heap) memory, ensuring there is sufficient off-heap memory allocated to avoid hard-to-debug memory errors.
- **Off-Heap Memory Size** (`spark.memory.offHeap.size`)
  - Suggested Value: `512m`
  - Description: Sets the size of off-heap memory, here `512m`, to support GDAL's memory requirements. Try increasing this value as something of a last resort for dying workers and memory-related problems.
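Taken together, the memory settings above determine an executor's total footprint: heap plus overhead plus off-heap all have to fit in the worker size requested from EMR Serverless. A small sketch of that arithmetic (the helper names are illustrative):

```python
# Rough executor footprint: spark.executor.memory + spark.executor.memoryOverhead
# + spark.memory.offHeap.size must fit within the requested worker's memory.
UNITS = {"m": 1, "g": 1024}

def to_mib(size: str) -> int:
    """Parse a Spark size string like '10g' or '512m' into MiB."""
    return int(size[:-1]) * UNITS[size[-1].lower()]

def executor_footprint_mib(memory: str, overhead: str, off_heap: str = "0m") -> int:
    return to_mib(memory) + to_mib(overhead) + to_mib(off_heap)

# With this document's suggestions: 10g + 1g + 512m = 11776 MiB per executor
```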
- **Executor Environment AWS Request Payer** (`spark.executorEnv.AWS_REQUEST_PAYER`)
  - Suggested Value: `requester`
  - Description: Sets the AWS Request Payer to `requester` in the executor environment. This configuration indicates that the requester (i.e., the user running the Spark job) will bear the costs of the AWS requests made by the executors. It is primarily (exclusively?) used by GDAL.
- **EMR Serverless Driver Environment AWS Request Payer** (`spark.emr-serverless.driverEnv.AWS_REQUEST_PAYER`)
  - Suggested Value: `requester`
  - Description: Similar to the executor setting, this applies to the EMR Serverless driver environment, setting the AWS Request Payer to `requester` for AWS requests made by the Spark driver. It is primarily (exclusively?) used by GDAL.
- **S3 Use Requester Pays Header** (`spark.hadoop.fs.s3.useRequesterPaysHeader`)
  - Suggested Value: `true`
  - Description: Enables the use of the Requester Pays header for S3 requests. When enabled, it signifies that the requester is responsible for the cost of data transfer and requests to Amazon S3.
- **S3 Enable Server-Side Encryption** (`spark.hadoop.fs.s3.enableServerSideEncryption`)
  - Suggested Value: `true`
  - Description: Turns on server-side encryption for data stored in S3. This setting ensures that data is encrypted at rest within S3.
- **S3 Server-Side Encryption Algorithm** (`spark.hadoop.fs.s3.serverSideEncryptionAlgorithm`)
  - Suggested Value: `AES256`
  - Description: Specifies the encryption algorithm used for server-side encryption in S3. `AES256` indicates the use of the AES-256 encryption algorithm, providing strong encryption for data at rest.
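The five requester-pays and encryption settings above travel together; a minimal sketch of rendering them as `--conf` flags for a job submission (the `s3_conf` dict and `flags` string are illustrative, with values taken from this document's suggestions):

```python
# The requester-pays and S3 encryption suggestions from this document,
# rendered as the --conf flags one would pass when submitting the job.
s3_conf = {
    "spark.executorEnv.AWS_REQUEST_PAYER": "requester",
    "spark.emr-serverless.driverEnv.AWS_REQUEST_PAYER": "requester",
    "spark.hadoop.fs.s3.useRequesterPaysHeader": "true",
    "spark.hadoop.fs.s3.enableServerSideEncryption": "true",
    "spark.hadoop.fs.s3.serverSideEncryptionAlgorithm": "AES256",
}
flags = " ".join(f"--conf {k}={v}" for k, v in s3_conf.items())
```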
- **Inspecting GC** (`spark.driver.extraJavaOptions`)
  - Suggested Value: `-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps`
  - Description: If attempting to dig in and tune garbage collection to maximize performance, and especially to minimize any garbage collection bottlenecks, these flags are likely to be useful. Note that `-XX:+PrintGCDetails` and `-XX:+PrintGCTimeStamps` are JDK 8-era flags; on JDK 9+ the equivalent is unified logging via `-Xlog:gc*`.
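Since these GC flags share `spark.driver.extraJavaOptions` with the JTS overlay flag suggested earlier, they must be combined into a single space-separated string rather than set as separate confs. A minimal sketch:

```python
# spark.driver.extraJavaOptions is one space-separated string, so the GC
# logging flags must be appended to any options already set (e.g. the JTS flag).
gc_flags = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
driver_opts = " ".join(["-Djts.overlay=ng", gc_flags])
# -> pass driver_opts as the single value of spark.driver.extraJavaOptions
```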
Created February 29, 2024 15:36