Running custom Spark build on a YARN cluster (for PySpark)

Building Spark for PySpark use on top of YARN

Build Spark on your local machine. (This is only necessary if you are using PySpark; otherwise, building on a remote machine works.)

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

Copy the assembly JAR (assembly/target/scala-2.10/...jar) to the corresponding directory in the Spark tree on the cluster node, and also upload it to a location in HDFS.
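The copy step above could be scripted roughly as follows. This is only a sketch: the host name, the remote Spark directory, and the `run` helper are hypothetical, and the JAR name and HDFS directory are assumed from the SPARK_JAR line in the next step. By default it only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/usr/bin/env bash
# All values below are assumptions for illustration; adjust for your setup.
JAR=assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar
CLUSTER_NODE=user@cluster-node     # hypothetical login for the cluster node
REMOTE_SPARK_DIR=spark             # hypothetical Spark checkout on that node
HDFS_DIR=hdfs:///user/laserson/tmp # HDFS directory taken from the SPARK_JAR example

# Hypothetical helper: print each command, and run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# 1. Copy the assembly JAR into the matching directory on the cluster node.
run scp "$JAR" "$CLUSTER_NODE:$REMOTE_SPARK_DIR/assembly/target/scala-2.10/"

# 2. From the cluster node, upload the same JAR into HDFS.
run ssh "$CLUSTER_NODE" hdfs dfs -put "$REMOTE_SPARK_DIR/$JAR" "$HDFS_DIR/"
```

Printing the commands first makes it easy to check the paths before anything is transferred.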

Set the Spark JAR HDFS location

export SPARK_JAR=hdfs:///user/laserson/tmp/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar

On the cluster node, start the PySpark shell

# Find JAVA_HOME by following the symlinks from `which java` to the actual install directory
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-
export HADOOP_CONF_DIR=/etc/hadoop/conf
export IPYTHON=1
bin/pyspark --master yarn-client --num-executors 6 --executor-memory 4g --executor-cores 12
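The comment about following `which java` can be automated along these lines. This is a minimal sketch, assuming a Linux node with GNU `readlink -f` and a Java layout where the binary sits under `.../bin/java` or `.../jre/bin/java`. The function name `java_home_from_path` is hypothetical, and the result should be checked against the actual install before you rely on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a JAVA_HOME candidate from the resolved
# path of the java binary by stripping the trailing path components.
java_home_from_path() {
    local java_bin="$1"
    local home="${java_bin%/bin/java}"  # drop trailing /bin/java
    home="${home%/jre}"                 # drop trailing /jre, if present
    printf '%s\n' "$home"
}

# Only attempt the lookup if java is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    # readlink -f follows the whole symlink chain (e.g. /etc/alternatives).
    resolved="$(readlink -f "$(command -v java)")"
    JAVA_HOME="$(java_home_from_path "$resolved")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

If the printed directory looks right, it can replace the hard-coded JAVA_HOME export above.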

If you just want to use the Scala spark-shell, you can build Spark on the cluster too.


Instead of setting SPARK_JAR, you can now pass --conf spark.yarn.jar=hdfs:///path/to/jar to the Spark command.
