@oza
Last active October 9, 2022 08:53
How to run Spark on YARN with dynamic resource allocation
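Two configuration files follow. The first is spark-defaults.conf (Spark's default properties file): it enables event logging, the sort-based shuffle manager, and dynamic allocation backed by the external shuffle service.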
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir file:///home/ozawa/sparkeventlogs
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.manager SORT
spark.shuffle.consolidateFiles true
spark.shuffle.spill true
spark.shuffle.memoryFraction 0.75
spark.storage.memoryFraction 0.45
spark.shuffle.spill.compress false
spark.shuffle.compress false
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
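The second file is spark-env.sh (sourced when Spark programs start on a node): it points Spark at the Hadoop configuration directory and sets executor count and memory for YARN.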
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: 'default')
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
HADOOP_CONF_DIR=/home/ubuntu/hadoop/etc/hadoop/
SPARK_EXECUTOR_INSTANCES=14
SPARK_EXECUTOR_MEMORY=4G
SPARK_DRIVER_MEMORY=2G
# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

YARN

  1. General-purpose resource management layer that sits on top of HDFS in a Hadoop cluster
  2. A part of Hadoop

Spark

  1. In-memory distributed processing framework

Spark on YARN

How does it work?

Dynamic resource allocation

  1. Releasing executors (YARN containers) when they sit idle with no tasks to run
  2. Requesting additional executors (YARN containers) when pending tasks pile up (the controlling properties are sketched below)
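As an illustration, this scale-up/scale-down behavior is governed by the standard spark.dynamicAllocation.* properties; the values below are examples only (times in seconds) and would sit in spark-defaults.conf next to the settings above:

# Release an executor after it has been idle for this long
spark.dynamicAllocation.executorIdleTimeout 60
# Request more executors once tasks have been pending for this long
spark.dynamicAllocation.schedulerBacklogTimeout 5
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5
# Bounds on how far the application can scale (maxExecutors here echoes SPARK_EXECUTOR_INSTANCES=14 above)
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 14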

How does it work?

  • Figure (not included in this gist)

How to run Spark on YARN with dynamic resource allocation

  1. Copy the external shuffle service plugin (the spark-<version>-yarn-shuffle.jar shipped with Spark) onto the NodeManager classpath.
  2. Edit yarn-site.xml to register spark_shuffle as an auxiliary service (see the sketch below).
  3. Restart the NodeManagers.
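A minimal yarn-site.xml sketch for step 2; keep whatever auxiliary services (such as mapreduce_shuffle) are already listed and add spark_shuffle alongside them:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>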

Configuration

Both properties below are required: dynamic allocation relies on the external shuffle service to keep serving shuffle files after executors are released.

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
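With the two files above in place under $SPARK_HOME/conf, applications are submitted as usual and need no code changes. A hedged example invocation (the example class, jar path, and master URL form are assumptions that vary with your layout and Spark version):

# Spark 1.x YARN client mode; on Spark 2.x use --master yarn --deploy-mode client
$SPARK_HOME/bin/spark-submit \
  --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples-*.jar 1000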

Conclusion

@harjeet88

Hi, your post is great. I have created a video on the same topic. Please share your comments and views:
https://youtu.be/-9bh_Oue9GM
https://youtu.be/V9E-bWarMNw

@alexvorobiev

Thanks for the post! Is it possible to install multiple versions of Spark on YARN and use dynamic resource allocation for all of them? The shuffle plugin seems to be version-specific.
