
Setup Spark manually - gap filling of Insight Spark Intro

1. Log on to the control machine, usually via ssh dd-control

2. Provision the cluster with Terraform:

~/aws-ops-insight/terraform$ terraform apply

3. Find the public IP addresses of the masters and workers in terraform/terraform.tfstate by grepping for public_ip,

OR from the browser: EC2 dashboard - Instances - IPv4 Public IP
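
For example, from the terraform directory (a sketch - the exact tfstate layout varies with the Terraform version):

~/aws-ops-insight/terraform$ grep public_ip terraform.tfstate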

4. Connect to the worker nodes

(Make sure to chmod 400 the .pem key file)
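
For example, matching the key path used in the ssh command below:

chmod 400 .ssh/username-IAM-keypair.pem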

ssh -i .ssh/username-IAM-keypair.pem ubuntu@public-ip-address

5. Follow the instructions in Setup Spark standalone session (steps 5.1 and 5.2 below)

(Note: if a node cannot connect to the internet, a likely reason is that no outbound rule is set. To fix it: go to the EC2 dashboard - Security Groups - locate the security group in use - go to the Outbound tab at the bottom - Edit - add All traffic / Anywhere. Credit to Steven)

(Another solution: paste the following block into terraform/main.tf after ingress_with_cidr_blocks in module "open_all_sg")

  egress_cidr_blocks      = ["10.0.0.0/26"]
  egress_with_cidr_blocks = [
    {
      rule        = "all-all"
      cidr_blocks = "0.0.0.0/0"
    }
  ]
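
(Remember to re-run the apply from the terraform directory so the new egress rule takes effect:)

~/aws-ops-insight/terraform$ terraform apply
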
5.1. On all master and worker machines:
sudo apt-get update
sudo apt-get install openjdk-8-jdk scala   # Spark 2.3.x requires Java 8+, so openjdk-7 will not work
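
Quick check - the JDK should report version 1.8:

java -version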

wget https://dl.bintray.com/sbt/debian/sbt-0.13.7.deb -P ~/Downloads
sudo dpkg -i ~/Downloads/sbt-0.13.7.deb
sudo apt-get update   # pick up the sbt apt repository registered by the .deb
sudo apt-get install sbt

wget http://mirrors.advancedhosters.com/apache/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz -P ~/Downloads
sudo tar zxvf ~/Downloads/spark-2.3.0-bin-hadoop2.7.tgz -C /usr/local
sudo mv /usr/local/spark-2.3.0-bin-hadoop2.7 /usr/local/spark
sudo chown -R ubuntu /usr/local/spark

nano ~/.profile

Paste to .profile:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

. ~/.profile
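
Optional sanity check - spark-submit should now be on the PATH:

spark-submit --version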

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
sudo nano $SPARK_HOME/conf/spark-env.sh

Paste to spark-env.sh:

export JAVA_HOME=/usr
export SPARK_PUBLIC_DNS="<public-dns-need-replacement>"
export SPARK_WORKER_CORES=$(echo $(nproc)*3 | bc)   # 3x oversubscription, e.g. 12 on a 4-core instance

5.2. Configure the master machine:

The master's information can be found at:

  • EC2 dashboard - Instances - Tags

  • terraform/terraform.tfstate: grep aws_instance.cluster_master and use the id to trace back to the public IP.

ssh -i ~/.ssh/personal_aws.pem ubuntu@master-public-dns

touch $SPARK_HOME/conf/slaves
echo "<slave-public-dns-need-replacement>" >> $SPARK_HOME/conf/slaves   # repeat once per worker
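
With two workers, conf/slaves would end up with one public DNS per line (hypothetical hostnames):

ec2-xx-xx-xx-xx.compute-1.amazonaws.com
ec2-yy-yy-yy-yy.compute-1.amazonaws.com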

6. Start Spark session on master

Copy username-IAM-keypair.pem to the master machine, then:
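
One way to copy it, run from the local machine (key name and path assumed to match step 4):

local-machine$ scp -i ~/.ssh/username-IAM-keypair.pem ~/.ssh/username-IAM-keypair.pem ubuntu@master-public-dns:~/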

cp username-IAM-keypair.pem ~/.ssh/id_rsa
chmod 400 ~/.ssh/id_rsa

(~/.ssh/id_rsa is the default private key ssh looks for; start-all.sh on the master uses ssh to launch the workers, so with the key there no -i flag is needed.)

Finally, start Spark:

master-node$ $SPARK_HOME/sbin/start-all.sh
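
To verify: jps, which ships with the JDK, should list a Master process on this node and a Worker process on each worker.

master-node$ jps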

7. Monitor Spark session with web UI

In the web browser on the local machine, navigate to http://master-node-public-ip:8080/ (8080 is the default port of the standalone master web UI)
