Running custom Spark build on a YARN cluster (for PySpark)

Building Spark for PySpark use on top of YARN

Build Spark on your local machine (this is only necessary if you are using PySpark; otherwise, building on a remote machine works fine). See http://spark.apache.org/docs/latest/building-with-maven.html.

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

Copy the assembly/target/scala-2.10/...jar to the corresponding directory on the cluster node and also into a location in HDFS.
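For example, something like the following (a sketch: the node name and the remote Spark directory are hypothetical; the jar name and HDFS path match the SPARK_JAR setting below):

scp assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar mynode:/opt/spark/assembly/target/scala-2.10/
# then, from the cluster node, put a copy into HDFS
hdfs dfs -mkdir -p /user/laserson/tmp
hdfs dfs -put /opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar /user/laserson/tmp/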

Set the Spark JAR HDFS location

export SPARK_JAR=hdfs:///user/laserson/tmp/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar

On the cluster node, start up the shell

# set JAVA_HOME based on where `which java` actually points
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre
export HADOOP_CONF_DIR=/etc/hadoop/conf
export IPYTHON=1
bin/pyspark --master yarn-client --num-executors 6 --executor-memory 4g --executor-cores 12

If you just want to use the Scala spark-shell, you can build Spark on the cluster too.
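In that case the launch looks the same, just with spark-shell instead of pyspark (a sketch mirroring the PySpark invocation above; adjust the executor settings to your cluster):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre
export HADOOP_CONF_DIR=/etc/hadoop/conf
bin/spark-shell --master yarn-client --num-executors 6 --executor-memory 4g --executor-cores 12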

Comment from @laserson (Author):
Instead of setting SPARK_JAR, you can now pass --conf spark.yarn.jar=hdfs:///path/to/jar to the Spark command.
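For example, mirroring the invocation above, something like:

bin/pyspark --master yarn-client --conf spark.yarn.jar=hdfs:///user/laserson/tmp/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar --num-executors 6 --executor-memory 4g --executor-cores 12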
