@visualskyrim
Created March 7, 2020 03:06
A template for running a Spark job with commonly used options
#!/bin/bash
#########################################################
# The purpose of this script is <----------->
#
# Arguments:
#   VAR_1
#   VAR_2
#########################################################
if [ "$#" -ne 2 ]; then
  echo "Usage: ./run-report.sh <var1> <var2>"
  echo "Example: ./run-report.sh val1 val2"
  exit 1
fi
VAR_1="$1"
VAR_2="$2"
#########################################################
# Script execution
#########################################################
PROJECT_HOME=$(pwd)
JAVA_LIBRARY_PATH=/usr/hdp/current/hadoop/lib/native:/usr/
cd "$PROJECT_HOME"
SPARK_PATH=<----------->
HIVE_SITE=<----------->/conf/hive-site.xml
PROJECT_JAR=$PROJECT_HOME/target/scala-2.11/<----------->-assembly-1.0.0.jar
export SPARK_HOME=$SPARK_PATH
$SPARK_PATH/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue <-----------> \
--driver-memory 14g \
--executor-memory 10g \
--executor-cores 8 \
--num-executors 80 \
--name "<----------->" \
--conf spark.app.name="<----------->" \
--conf spark.eventLog.dir=hdfs://<-----------> \
--conf spark.eventLog.enabled=true \
--conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.yarn.driver.memoryOverhead=8192 \
--conf spark.driver.extraJavaOptions="-Djava.library.path=$JAVA_LIBRARY_PATH -XX:OnOutOfMemoryError=\"kill -9 %p\" -XX:+UseG1GC" \
--conf spark.driver.maxResultSize=3g \
--conf spark.executor.extraJavaOptions="-Djava.library.path=$JAVA_LIBRARY_PATH -XX:+UseG1GC -XX:OnOutOfMemoryError=\"kill -9 %p\" -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--conf spark.hadoop.mapred.output.compress=false \
--conf spark.yarn.max.executor.failures=128 \
--conf spark.memory.fraction=0.2 \
--conf spark.memory.storageFraction=0.2 \
--conf spark.rdd.compress=true \
--conf spark.shuffle.compress=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.shuffle.spill.compress=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.speculation=true \
--conf spark.speculation.interval=5000 \
--conf spark.speculation.multiplier=20.0 \
--conf spark.speculation.quantile=0.95 \
--conf spark.task.maxFailures=1000 \
--conf spark.sql.codegen.wholeStage=true \
--conf spark.sql.files.maxPartitionBytes=1000000000 \
--conf spark.sql.hive.filesourcePartitionFileCacheSize=524288000 \
--conf spark.scheduler.listenerbus.eventqueue.size=120000 \
--conf "spark.hadoop.yarn.timeline-service.enabled"=false \
--conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER \
--files $HIVE_SITE \
--driver-class-path $HIVE_SITE:/home/<----------->/hadoop-<----------->/share/hadoop/common/lib/hadoop-lzo-<----------->.jar \
--jars $PROJECT_JAR,/home/<----------->/hadoop-<----------->/share/hadoop/common/lib/hadoop-lzo-<----------->.jar,/home/<----------->/hadoop-lzo-<----------->-SNAPSHOT.jar \
--class com.<-----main class-----> $PROJECT_JAR \
--var1="$VAR_1" \
--var2="$VAR_2"
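As a quick sanity check before submitting, the cluster's rough memory footprint under the flags above can be estimated as executors × (executor memory + overhead) plus the driver's share. A minimal sketch using this template's numbers (an approximation only; YARN rounds each container up to its minimum allocation increment, so real usage will be somewhat higher):

```shell
#!/bin/bash
# Rough YARN memory footprint for the settings in this template:
# 80 executors x (10g executor memory + 4g memoryOverhead), plus the driver.
EXECUTORS=80
EXEC_MEM_G=10
EXEC_OVERHEAD_G=4    # from spark.yarn.executor.memoryOverhead=4096 (MB)
DRIVER_MEM_G=14
DRIVER_OVERHEAD_G=8  # from spark.yarn.driver.memoryOverhead=8192 (MB)

TOTAL=$(( EXECUTORS * (EXEC_MEM_G + EXEC_OVERHEAD_G) + DRIVER_MEM_G + DRIVER_OVERHEAD_G ))
echo "Approximate cluster memory request: ${TOTAL}g"
```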