Running a Spark application on YARN using Spring Cloud Data Flow

Deploy Spring Cloud Data Flow on YARN
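
This guide assumes a Spring Cloud Data Flow server is already running on your YARN cluster. If you still need to set one up, a minimal sketch follows; the distribution URL, version, and script name are assumptions based on the 1.0.x Spring Cloud Data Flow for Apache YARN documentation, so check the current docs for the exact coordinates.

# Download and unpack the Data Flow YARN server distribution (URL and version are assumptions)
wget http://repo.spring.io/release/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist/1.0.0.RELEASE/spring-cloud-dataflow-server-yarn-dist-1.0.0.RELEASE.zip
unzip spring-cloud-dataflow-server-yarn-dist-1.0.0.RELEASE.zip
cd spring-cloud-dataflow-server-yarn-dist-1.0.0.RELEASE
# Start the server against your Hadoop/YARN cluster (script name assumed)
./bin/dataflow-server-yarn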

Download the Spark distribution and copy the assembly jar file to HDFS

For Spark 1.6.1, do the following:

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
tar xvzf spark-1.6.1-bin-hadoop2.6.tgz
cd spark-1.6.1-bin-hadoop2.6
hadoop fs -mkdir -p /app/spark
hadoop fs -chmod 777 /app/spark
hadoop fs -copyFromLocal lib/spark-assembly-1.6.1-hadoop2.6.0.jar /app/spark/spark-assembly-1.6.1-hadoop2.6.0.jar
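
You can optionally verify that the assembly jar was uploaded:

hadoop fs -ls /app/spark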

Get a test app

We are using a sample app from the Spark distribution that we have compiled as part of a test suite. Download the jar and copy it to HDFS:

wget https://repo.spring.io/snapshot/org/springframework/cloud/task/module/sparkapp-client-task/1.0.0.BUILD-SNAPSHOT/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar
hadoop fs -copyFromLocal sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar /app/spark/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar

From the Spring Cloud Data Flow shell, register the spark-yarn-task app and create a task definition that uses it:

app register --type task --name spark-yarn --uri maven://org.springframework.cloud.task.app:spark-yarn-task:1.0.0.BUILD-SNAPSHOT
task create spark1 --definition "spark-yarn --appName=my-test-pi --appClass=org.apache.spark.examples.JavaSparkPi --appJar=hdfs:///app/spark/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar --sparkAssemblyJar=hdfs:///app/spark/spark-assembly-1.6.1-hadoop2.6.0.jar --appArgs=10"

The following properties must be specified for the test app:

  • appName

  • appClass

  • appJar

  • appArgs

We set them to the values corresponding to the test app we downloaded above.

We also need to specify the location of the Spark assembly jar file using the property:

  • sparkAssemblyJar

Note
If you run this from a local server rather than on the Hadoop cluster itself, you need to provide the Hadoop connection properties. Here is an example for a Hadoop server named "borneo":

--spring.hadoop.fsUri=hdfs://borneo:8020
--spring.hadoop.resourceManagerHost=borneo
--spring.hadoop.resourceManagerPort=8032
--spring.hadoop.jobHistoryAddress=borneo:10020
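
Putting the note together with the definition above, the full task definition for the "borneo" example would look like the following sketch (it assumes the connection settings are passed as additional app properties in the definition):

task create spark1 --definition "spark-yarn --appName=my-test-pi --appClass=org.apache.spark.examples.JavaSparkPi --appJar=hdfs:///app/spark/sparkapp-client-task-1.0.0.BUILD-SNAPSHOT-tests.jar --sparkAssemblyJar=hdfs:///app/spark/spark-assembly-1.6.1-hadoop2.6.0.jar --appArgs=10 --spring.hadoop.fsUri=hdfs://borneo:8020 --spring.hadoop.resourceManagerHost=borneo --spring.hadoop.resourceManagerPort=8032 --spring.hadoop.jobHistoryAddress=borneo:10020"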

Launch the task

task launch spark1

Check the results once the task completes

task execution list
╔═════════╤══╤════════════════════════════╤════════════════════════════╤═════════╗
║Task Name│ID│         Start Time         │          End Time          │Exit Code║
╠═════════╪══╪════════════════════════════╪════════════════════════════╪═════════╣
║spark1   │1 │Tue Jun 07 19:32:27 EDT 2016│Tue Jun 07 19:32:52 EDT 2016│0        ║
╚═════════╧══╧════════════════════════════╧════════════════════════════╧═════════╝

We can now view the log for the container that ran the Spark app using commands on the Hadoop server:

$ more logs/userlogs/application_1465327822589_0008/container_1465327822589_0008_01_000001/stdout
Pi is roughly 3.146104

If log aggregation is enabled, you can use the yarn logs command to accomplish the same.
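
For example, using the application ID from the run above:

yarn logs -applicationId application_1465327822589_0008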

Note
If you are using a small single-node Hadoop cluster for testing, you might have to increase the scheduler settings for the maximum memory per application. We modified the capacity-scheduler.xml file and set yarn.scheduler.capacity.maximum-am-resource-percent to 0.5. We also modified yarn-site.xml and set yarn.nodemanager.resource.memory-mb to 8192; see the sketch below.
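
A sketch of the corresponding configuration entries (file locations assume a stock Apache Hadoop layout under etc/hadoop; adjust the paths for your distribution):

<!-- etc/hadoop/capacity-scheduler.xml: let application masters use up to half of cluster memory -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>

<!-- etc/hadoop/yarn-site.xml: memory available to containers on this node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>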