
Spark Memory Tuning

  • Increase the driver heap to accommodate large DAGs
  • Avoid overly granular executors: use fewer executors with larger heaps and multiple cores per executor
  • Set spark.memory.fraction=0.6 to leave the rest of the heap to executor working memory (shuffle, etc.)
  • Allocate roughly 60% of the instance CPUs to executors; leave headroom for other tasks
  • Disable off-heap memory; it was not stable in our tests
# instance i3.8xlarge | 244GiB | 32CPU | 4*2TiB SSD | 10Gbps
driver-memory 32g
spark.driver.maxResultSize=10g

executor-memory 32g
executor-cores 6
num-executors INSTANCES*4

spark.memory.offHeap.enabled=false
spark.executor.memoryOverhead=12g
spark.memory.fraction=0.6

spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false
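
The same layout can also be expressed programmatically. A minimal Scala sketch, assuming a SparkSession entry point (on YARN, driver and executor sizing normally has to be passed at submit time, so the builder form below is illustrative only):

import org.apache.spark.sql.SparkSession

// Sizing check for i3.8xlarge with 4 executors per instance:
// 4 * (32g heap + 12g overhead) = 176 GiB of 244 GiB, and 4 * 6 = 24 of 32 vCPUs.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  .config("spark.driver.memory", "32g")              // large driver heap for big DAGs
  .config("spark.driver.maxResultSize", "10g")
  .config("spark.executor.memory", "32g")
  .config("spark.executor.cores", "6")
  .config("spark.executor.memoryOverhead", "12g")
  .config("spark.memory.offHeap.enabled", "false")   // off-heap disabled (unstable in these tests)
  .config("spark.memory.fraction", "0.6")
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.shuffle.service.enabled", "false")
  .getOrCreate()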

RDD Persistence

  • Use disk-only persistence when running on SSDs
  • Leave heap memory to the Spark executor
--conf spark.driver.extraJavaOptions="-Dspark.persistence.useDisk=true \
  -Dspark.persistence.useOnHeapMemory=false \
  -Dspark.persistence.useOffHeapMemory=false \
  -Dspark.persistence.keepDeserialized=false \
  -Dspark.persistence.replication=2"
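
The -Dspark.persistence.* flags above appear to be application-specific system properties rather than built-in Spark settings. With stock Spark, the closest equivalent of serialized, disk-only persistence with two replicas is StorageLevel.DISK_ONLY_2; a minimal Scala sketch with a hypothetical input path:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("disk-persistence-sketch").getOrCreate()

// Hypothetical input path; disk persistence in Spark is always serialized.
val rdd = spark.sparkContext.textFile("s3://my-bucket/input/")
val persisted = rdd.persist(StorageLevel.DISK_ONLY_2)  // disk only, replication = 2
persisted.count()                                       // materialize the persisted copies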

Shuffle Tuning

  • Fine-tune the number of shuffle partitions based on your number of executors and cores.
  • Increase the split size when reading data from a blob store (e.g. S3).
spark.sql.shuffle.partitions=SPARK_NUM_EXECUTORS * SPARK_EXECUTOR_CORES * 2

spark.sql.files.maxPartitionBytes=268435456
spark.files.maxPartitionBytes=268435456
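
The partition count can also be derived at runtime from the executor layout. A minimal Scala sketch, assuming an existing SparkSession named spark and executor counts that mirror the submit-time settings:

// Assumed to match the submit-time layout (e.g. INSTANCES * 4 executors, 6 cores each).
val numExecutors = 8
val coresPerExecutor = 6
val shufflePartitions = numExecutors * coresPerExecutor * 2

spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)

// 268435456 bytes = 256 MiB splits when scanning files from a blob store such as S3.
spark.conf.set("spark.sql.files.maxPartitionBytes", (256L * 1024 * 1024).toString)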

Increase task execution resilience

  • Increase network timeouts to cope with transient network issues (e.g. on EMR)
  • Enable blacklisting of executors and increase the number of task retries to cope with degraded instances
spark.sql.broadcastTimeout=36000
spark.network.timeout=120

spark.task.maxFailures=20
spark.blacklist.enabled=true
spark.blacklist.timeout=99h
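
The same settings can be applied on a SparkConf before the context is created; a minimal Scala sketch (note that on Spark 3.1+ the spark.blacklist.* properties were renamed to spark.excludeOnFailure.*; the names below follow this gist):

import org.apache.spark.SparkConf

val resilientConf = new SparkConf()
  .set("spark.sql.broadcastTimeout", "36000")  // seconds
  .set("spark.network.timeout", "120")         // seconds
  .set("spark.task.maxFailures", "20")
  .set("spark.blacklist.enabled", "true")      // spark.excludeOnFailure.enabled on Spark 3.1+
  .set("spark.blacklist.timeout", "99h")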