Spark Standalone Cluster
--------Spark Standalone---------
Prerequisites:
* Set the JAVA_HOME env variable
* Configure SSH so the master and workers can talk without a password (see the example session after this list)
  - $ ssh-keygen -- press Enter to accept the defaults; this creates .ssh/id_rsa and .ssh/id_rsa.pub
  - Copy the SSH public key (id_rsa.pub) to the root account on each target host
  - Append the SSH public key to the authorized_keys file on each target host:
    $ cat id_rsa.pub >> ~/.ssh/authorized_keys
* Disable iptables
  - run $ /etc/init.d/iptables stop
  - run $ chkconfig iptables off
* Disable SELinux
  - run $ setenforce 0
  - run $ vim /etc/sysconfig/selinux and set SELINUX=disabled
  - run $ sestatus to check the SELinux status
* Add every node to /etc/hosts so the hosts can resolve one another by name
  - run $ vim /etc/hosts
    add one line per node in the form <ip> <hostname>, e.g. 192.168.1.10 master
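For reference, a minimal passwordless-SSH session could look like the following (master, worker1 and worker2 are placeholder hostnames; adjust them to your cluster):

  master$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # generate a key pair with no passphrase
  master$ ssh-copy-id root@worker1                    # appends id_rsa.pub to worker1's authorized_keys
  master$ ssh-copy-id root@worker2
  master$ ssh root@worker1 hostname                   # should print "worker1" without asking for a password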
Installation:
* Download Spark into /opt
  - $ cd /opt
  - $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz
    or http://apache.mirrors.ovh.net/ftp.apache.org/dist/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz
* Untar the archive
  - $ tar -xzvf spark-1.3.0-bin-hadoop2.4.tgz
* Create a symbolic link to the extracted directory (not the tarball)
  - $ ln -s spark-1.3.0-bin-hadoop2.4 spark
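A quick way to check the layout before going further (paths as assumed above):

  $ ls /opt/spark/bin     # should list spark-shell, spark-submit, spark-class, ...
  $ ls /opt/spark/sbin    # should list start-all.sh, stop-all.sh, ...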
* Edit the conf/slaves file on your master and list the workers' hostnames, one per line (see the example after this list).
* Run sbin/start-all.sh on your master (it is important to run it there rather than on a worker). If everything started, you should get no password prompts, and the cluster manager's web UI should appear at http://masternode:8080, showing all your workers.
* Run sbin/stop-all.sh to stop the cluster.
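A minimal conf/slaves file could look like this (worker1 and worker2 are placeholder hostnames and must match the names used in /etc/hosts):

  # /opt/spark/conf/slaves -- one worker hostname per line
  worker1
  worker2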
OR
* Start the master by hand
  bin/spark-class org.apache.spark.deploy.master.Master
* Start a worker on each worker node, pointing it at the master
  bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master-FQDN>:7077
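For example, with a hypothetical master reachable as master.example.com, that would be:

  master$  ./bin/spark-class org.apache.spark.deploy.master.Master
  worker1$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master.example.com:7077
  worker2$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master.example.com:7077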
If everything is up, run some sample jobs:
  - run $ ./bin/spark-shell --master spark://<master-FQDN>:7077
You can also pass the option --total-executor-cores <numCores> to control the number of cores that spark-shell uses on the cluster.
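Once the shell is attached to the cluster, a quick sanity check (a minimal sketch; the numbers are arbitrary) confirms that work is really distributed:

  scala> val rdd = sc.parallelize(1 to 1000)   // spread 1..1000 across the executors
  scala> rdd.map(_ * 2).reduce(_ + _)          // doubles each element and sums: res0: Int = 1001000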
Submitting Applications
* Run on the Spark Standalone cluster in client deploy mode (the driver runs on the machine you submit from), e.g. the Pi example:
  - run $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<master-FQDN>:7077 --executor-memory 2G --total-executor-cores 10 /opt/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar 1000
* Run on the Spark Standalone cluster in cluster deploy mode (the driver runs on one of the worker nodes), again with the Pi example:
  - run $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<master-FQDN>:7077 --deploy-mode cluster --executor-memory 2G --total-executor-cores 10 /opt/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar 1000
Check the master web UI
  http://<spark-master-IP>:8080 --- check the Running Applications and Completed Applications sections