Running a Spark application on YARN using Spring Cloud Data Flow

Deploy Spring Cloud Data Flow on YARN
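
This guide assumes a Spring Cloud Data Flow server is already running on your YARN cluster. If you still need to set one up, a minimal sketch follows; the distribution URL, version, and script name are assumptions based on the 1.0.x Spring Cloud Data Flow for Apache YARN documentation, so check the current docs for the exact coordinates.

# Download and unpack the Data Flow YARN server distribution (URL and version are assumptions)
wget http://repo.spring.io/release/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist/1.0.0.RELEASE/spring-cloud-dataflow-server-yarn-dist-1.0.0.RELEASE.zip
unzip spring-cloud-dataflow-server-yarn-dist-1.0.0.RELEASE.zip
cd spring-cloud-dataflow-server-yarn-dist-1.0.0.RELEASE
# Start the server against your Hadoop/YARN cluster (script name assumed)
./bin/dataflow-server-yarn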

Download the Spark distribution and copy the assembly jar file to HDFS

For Spark 1.6.1, do the following:

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
tar xvzf spark-1.6.1-bin-hadoop2.6.tgz
cd spark-1.6.1-bin-hadoop2.6
hadoop fs -mkdir -p /app/spark
hadoop fs -chmod 777 /app/spark
hadoop fs -copyFromLocal lib/spark-assembly-1.6.1-hadoop2.6.0.jar /app/spark/spark-assembly-1.6.1-hadoop2.6.0.jar
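
You can optionally verify that the assembly jar was uploaded:

hadoop fs -ls /app/spark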

Get a test app

We are using a sample app from the Spark distribution that we have compiled as part of a test suite. Download the jar and copy it to HDFS:

wget https://repo.spring.io/snapshot/org/springframework/cloud/task/module/sparkapp-client-task/1.0.0.BUILD-SNAPSHOT/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar
hadoop fs -copyFromLocal sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar /app/spark/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar

From the Spring Cloud Data Flow shell, register the spark-yarn-task app and create a task definition that uses it:

app register --type task --name spark-yarn --uri maven://org.springframework.cloud.task.app:spark-yarn-task:1.0.0.BUILD-SNAPSHOT
task create spark1 --definition "spark-yarn --appName=my-test-pi --appClass=org.apache.spark.examples.JavaSparkPi --appJar=hdfs:///app/spark/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar --sparkAssemblyJar=hdfs:///app/spark/spark-assembly-1.6.1-hadoop2.6.0.jar --appArgs=10"

The following properties must be specified for the test app:

  • appName

  • appClass

  • appJar

  • appArgs

We set them to the values corresponding to the test app we downloaded above.

We also need to specify the location of the Spark assembly jar file using the property:

  • sparkAssemblyJar

Note
If you run this from a local server rather than on the Hadoop cluster itself, you need to provide the Hadoop connection properties. Here is an example for a Hadoop server named "borneo":

--spring.hadoop.fsUri=hdfs://borneo:8020
--spring.hadoop.resourceManagerHost=borneo
--spring.hadoop.resourceManagerPort=8032
--spring.hadoop.jobHistoryAddress=borneo:10020
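
Putting the note together with the definition above, the full task definition for the "borneo" example would look like the following sketch (it assumes the connection settings are passed as additional app properties in the definition):

task create spark1 --definition "spark-yarn --appName=my-test-pi --appClass=org.apache.spark.examples.JavaSparkPi --appJar=hdfs:///app/spark/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar --sparkAssemblyJar=hdfs:///app/spark/spark-assembly-1.6.1-hadoop2.6.0.jar --appArgs=10 --spring.hadoop.fsUri=hdfs://borneo:8020 --spring.hadoop.resourceManagerHost=borneo --spring.hadoop.resourceManagerPort=8032 --spring.hadoop.jobHistoryAddress=borneo:10020"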

Launch the task

task launch spark1

Check the results once the task completes

task execution list
╔═════════╤══╤════════════════════════════╤════════════════════════════╤═════════╗
║Task Name│ID│         Start Time         │          End Time          │Exit Code║
╠═════════╪══╪════════════════════════════╪════════════════════════════╪═════════╣
║spark1   │1 │Tue Jun 07 19:32:27 EDT 2016│Tue Jun 07 19:32:52 EDT 2016│0        ║
╚═════════╧══╧════════════════════════════╧════════════════════════════╧═════════╝

We can now view the log for the container that ran the Spark app using commands on the Hadoop server:

$ more logs/userlogs/application_1465327822589_0008/container_1465327822589_0008_01_000001/stdout
Pi is roughly 3.146104

If log aggregation is enabled, you can use the yarn logs command to accomplish the same.
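
For example, using the application ID from the run above:

yarn logs -applicationId application_1465327822589_0008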

Note
If you are using a small single-node Hadoop cluster for testing, you might have to increase the scheduler settings for the maximum memory per application. We modified the capacity-scheduler.xml file and set yarn.scheduler.capacity.maximum-am-resource-percent to 0.5. We also modified yarn-site.xml and set yarn.nodemanager.resource.memory-mb to 8192; see the sketch below.
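
A sketch of the corresponding configuration entries (file locations assume a stock Apache Hadoop layout under etc/hadoop; adjust the paths for your distribution):

<!-- etc/hadoop/capacity-scheduler.xml: let application masters use up to half of cluster memory -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>

<!-- etc/hadoop/yarn-site.xml: memory available to containers on this node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>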