
Setup Spark manually - gap filling of Insight Spark Intro

1. Log on to the control machine, usually via ssh dd-control

2. Provision the cluster with Terraform:

~/aws-ops-insight/terraform$ terraform apply

3. Find the public IP addresses of the masters and workers in terraform/terraform.tfstate by grepping for public_ip,

OR from the browser: EC2 dashboard - Instances - IPv4 Public IP
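
For example, from the terraform directory (a sketch - the exact tfstate layout varies with the Terraform version):

~/aws-ops-insight/terraform$ grep public_ip terraform.tfstate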

4. Connect to the worker nodes

(Make sure to chmod 400 the .pem key file)
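
For example, matching the key path used in the ssh command below:

chmod 400 .ssh/username-IAM-keypair.pem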

ssh -i .ssh/username-IAM-keypair.pem ubuntu@public-ip-address

5. Follow the instructions in Setup Spark standalone session (steps 5.1 and 5.2 below)

(Note: if a node cannot connect to the internet, a likely reason is that no outbound rule is set. To fix it: go to the EC2 dashboard - Security Groups - locate the security group in use - go to the Outbound tab at the bottom - Edit - add All traffic / Anywhere. Credit to Steven)

(Another solution: paste the following block into terraform/main.tf after ingress_with_cidr_blocks in module "open_all_sg")

  egress_cidr_blocks      = ["10.0.0.0/26"]
  egress_with_cidr_blocks = [
    {
      rule        = "all-all"
      cidr_blocks = "0.0.0.0/0"
    }
  ]
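
(Remember to re-run the apply from the terraform directory so the new egress rule takes effect:)

~/aws-ops-insight/terraform$ terraform apply
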
5.1. On all master and worker machines:
sudo apt-get update
sudo apt-get install openjdk-8-jdk scala   # Spark 2.3.x requires Java 8+, so openjdk-7 will not work
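
Quick check - the JDK should report version 1.8:

java -version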

wget https://dl.bintray.com/sbt/debian/sbt-0.13.7.deb -P ~/Downloads
sudo dpkg -i ~/Downloads/sbt-0.13.7.deb
sudo apt-get update   # pick up the sbt apt repository registered by the .deb
sudo apt-get install sbt

wget http://mirrors.advancedhosters.com/apache/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz -P ~/Downloads
sudo tar zxvf ~/Downloads/spark-2.3.0-bin-hadoop2.7.tgz -C /usr/local
sudo mv /usr/local/spark-2.3.0-bin-hadoop2.7 /usr/local/spark
sudo chown -R ubuntu /usr/local/spark

nano ~/.profile

Paste to .profile:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

. ~/.profile
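
Optional sanity check - spark-submit should now be on the PATH:

spark-submit --version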

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
sudo nano $SPARK_HOME/conf/spark-env.sh

Paste to spark-env.sh:

export JAVA_HOME=/usr
export SPARK_PUBLIC_DNS="<public-dns-need-replacement>"
export SPARK_WORKER_CORES=$(echo $(nproc)*3 | bc)   # 3x oversubscription, e.g. 12 on a 4-core instance

5.2. Configure the master machine:

The master's information can be found at:

  • EC2 dashboard - Instances - Tags

  • terraform/terraform.tfstate: grep aws_instance.cluster_master and use the id to trace back to the public IP.

ssh -i ~/.ssh/personal_aws.pem ubuntu@master-public-dns

touch $SPARK_HOME/conf/slaves
echo "<slave-public-dns-need-replacement>" >> $SPARK_HOME/conf/slaves   # repeat once per worker
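
With two workers, conf/slaves would end up with one public DNS per line (hypothetical hostnames):

ec2-xx-xx-xx-xx.compute-1.amazonaws.com
ec2-yy-yy-yy-yy.compute-1.amazonaws.com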

6. Start Spark session on master

Copy username-IAM-keypair.pem to the master machine, then:
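
One way to copy it, run from the local machine (key name and path assumed to match step 4):

local-machine$ scp -i ~/.ssh/username-IAM-keypair.pem ~/.ssh/username-IAM-keypair.pem ubuntu@master-public-dns:~/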

cp username-IAM-keypair.pem ~/.ssh/id_rsa
chmod 400 ~/.ssh/id_rsa

(~/.ssh/id_rsa is the default private key ssh looks for; start-all.sh on the master uses ssh to launch the workers, so with the key there no -i flag is needed.)

Finally, start Spark:

master-node$ $SPARK_HOME/sbin/start-all.sh
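
To verify: jps, which ships with the JDK, should list a Master process on this node and a Worker process on each worker.

master-node$ jps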

7. Monitor Spark session with web UI

In the web browser on the local machine, navigate to http://master-node-public-ip:8080/ (8080 is the default port of the standalone master web UI)
