Setting up Apache Spark on the Raspberry Pi Cluster

Preliminaries

Place the Spark folder in a directory of your choice.

For my master this was /home/lieu/dev/spark and for my slaves it was /home/pirate/spark.

Do the following export on the master, because the slaves and the master keep their Spark folders in different directories (we'll make use of this later):

export SLAVE_SPARK_HOME=/home/pirate/spark/spark-2.4.4-bin-without-hadoop

On the slaves, install openjdk-8-jre and unzip.
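
For example, on Raspbian:

sudo apt install -y openjdk-8-jre unzip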

Master node

Folder structure should look like this:

ll /home/lieu/dev/spark

drwxr-xr-x  5 lieu lieu 4096 Sep  5 14:38 ./
drwxr-xr-x 15 lieu lieu 4096 Sep  5 14:38 ../
drwxr-xr-x  2 lieu lieu 4096 Sep  5 12:57 bin/
-rw-r--r--  1 lieu lieu 6148 Sep  5 13:00 .DS_Store
-rwxr-xr-x  1 lieu lieu  300 Sep  5 13:52 env.sh*
drwxr-xr-x  9 lieu lieu 4096 Jan 29  2019 hadoop-3.1.2/
drwxr-xr-x 15 lieu lieu 4096 Sep  5 14:36 spark-2.4.4-bin-without-hadoop/

Check that the Hadoop core-site.xml looks like this:

hadoop-3.1.2/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <description>AWS S3 endpoint to connect to. An up-to-date list is
      provided in the AWS Documentation: regions and endpoints. Without this
      property, the standard region (s3.amazonaws.com) is assumed.
    </description>
    <value>http://192.168.72.156:9000</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID.</description>
    <value>N0262R8RT8...</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key.</description>
    <value>tOdQZa6tMCSGPE/1aVK8Sn6...</value>
  </property>

  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
    <description>Enable S3 path style access ie disabling the default virtual hosting behaviour.
      Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
    </description>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    <description>The implementation class of the S3A Filesystem</description>
  </property>
</configuration>
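
Optionally, sanity-check the endpoint and credentials from the command line before involving Spark. This is only a sketch: it assumes the hadoop-aws module is enabled for the hadoop CLI (for example via HADOOP_OPTIONAL_TOOLS) and that a bucket named test-bucket (a made-up name) already exists on the server at 192.168.72.156:9000.

export HADOOP_OPTIONAL_TOOLS=hadoop-aws              # pull the S3A connector onto the hadoop CLI classpath
hadoop-3.1.2/bin/hadoop fs -ls s3a://test-bucket/    # list the (hypothetical) bucket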

spark-2.4.4-bin-without-hadoop/run.sh

#!/bin/bash

export DD_HOME=/home/lieu/dev/spark
export SPARK_HOME=$DD_HOME/spark-2.4.4-bin-without-hadoop
export PATH=$PATH:$SPARK_HOME/bin
export HADOOP_HOME=$DD_HOME/hadoop-3.1.2
export PATH=$PATH:$HADOOP_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export TERM=xterm-color

./bin/spark-shell --master spark://192.168.10.66:7077 --jars ../bin/slf4j-api-1.7.25.jar,../bin/slf4j-log4j12-1.7.25.jar,../bin/aws-java-sdk-1.11.624.jar,../bin/aws-java-sdk-core-1.11.624.jar,../bin/aws-java-sdk-dynamodb-1.11.624.jar,../bin/aws-java-sdk-kms-1.11.624.jar,../bin/aws-java-sdk-s3-1.11.624.jar,../bin/hadoop-aws-3.1.2.jar,../bin/httpclient-4.5.9.jar,../bin/joda-time-2.10.3.jar
#./bin/spark-shell --master local[4] --jars $(echo ../bin/*.jar | tr ' ' ',')
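
Since run.sh refers to ./bin/spark-shell and to the jars in ../bin relative to the Spark directory, run it from inside spark-2.4.4-bin-without-hadoop:

cd /home/lieu/dev/spark/spark-2.4.4-bin-without-hadoop
chmod +x run.sh   # only needed the first time
./run.sh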

conf/slaves

pirate@192.168.10.83
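
start-slaves.sh logs in to each host listed here over SSH, so the master needs passwordless SSH access to the pirate user. A typical one-time setup, assuming no key pair exists on the master yet:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # generate a key pair on the master
ssh-copy-id pirate@192.168.10.83           # copy the public key to the slave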

conf/spark-env.sh

SPARK_MASTER_HOST=192.168.10.66
SPARK_LOCAL_IP=192.168.10.66

Additionally, we need to edit the following scripts in sbin/, because they assume that the slaves' $SPARK_HOME is the same as the master's.

sbin/start-slaves.sh

Replace the last line with the following:

# Launch the slaves
"${SPARK_HOME}/sbin/slaves.sh" cd "${SLAVE_SPARK_HOME}" \; "${SLAVE_SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"

sbin/stop-slaves.sh

Replace the last line with the following:

"${SPARK_HOME}/sbin/slaves.sh" cd "${SLAVE_SPARK_HOME}" \; "${SLAVE_SPARK_HOME}/sbin"/stop-slave.sh

Slave nodes

conf/spark-env.sh

export DD_HOME=/home/pirate/spark
export HADOOP_HOME=$DD_HOME/hadoop-3.1.2
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
SPARK_MASTER_HOST=192.168.10.66

The following script bootstraps a slave node from scratch: it downloads Hadoop and Spark, fetches the required jars, installs Java 8, imports the certificate, and writes conf/spark-env.sh.

#!/bin/bash
# Download Hadoop 3.1.2
wget http://mirror.ox.ac.uk/sites/rsync.apache.org/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
# Download Apache Spark 2.4.4 without Hadoop
wget http://mirror.ox.ac.uk/sites/rsync.apache.org/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz
# Unzip both files into a directory (call it spark)
mkdir spark
tar -C spark -zxf hadoop-3.1.2.tar.gz
tar -C spark -zxf spark-2.4.4-bin-without-hadoop.tgz
# Download the required .jar files
cd spark
mkdir bin
cd bin
# Files required for logging (Ask Chris about this)
wget https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25.jar
wget https://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar
# AWS SDK for Java aws-java-sdk-1.11.624.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.624/aws-java-sdk-1.11.624.jar
# AWS SDK for Java Core 1.11.624
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.624/aws-java-sdk-core-1.11.624.jar
# AWS Java SDK for Amazon DynamoDB 1.11.624
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.11.624/aws-java-sdk-dynamodb-1.11.624.jar
# AWS Java SDK for AWS KMS 1.11.624
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-kms/1.11.624/aws-java-sdk-kms-1.11.624.jar
# AWS Java SDK for Amazon S3 1.11.624
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.624/aws-java-sdk-s3-1.11.624.jar
# Apache Hadoop Amazon Web Services Support 3.1.2
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.2/hadoop-aws-3.1.2.jar
# Apache HttpClient 4.5.9
wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.jar
# Joda Time 2.10.3
wget https://repo1.maven.org/maven2/joda-time/joda-time/2.10.3/joda-time-2.10.3.jar
cd ../../
# Install Java 8
sudo apt install -y openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
# Set up the certificates
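# (This assumes public.crt -- presumably the storage server's certificate -- has already been
# copied to /home/pirate/spark/; the script does not download it.)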
sudo keytool -import -trustcacerts -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -noprompt -alias mycert -file /home/pirate/spark/public.crt
# Now set the environment variables
cd spark
export DD_HOME=$(pwd)   # `pwd | read DD_HOME` would only set the variable in a subshell
export SPARK_HOME=$DD_HOME/spark-2.4.4-bin-without-hadoop
# Edit conf/spark-env.sh
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
chmod +x $SPARK_HOME/conf/spark-env.sh
echo "export DD_HOME=$DD_HOME" >> $SPARK_HOME/conf/spark-env.sh
echo "export SPARK_HOME=$DD_HOME/spark-2.4.4-bin-without-hadoop" >> $SPARK_HOME/conf/spark-env.sh
echo "export PATH=$PATH:$SPARK_HOME/bin" >> $SPARK_HOME/conf/spark-env.sh
echo "export HADOOP_HOME=$DD_HOME/hadoop-3.1.2" >> $SPARK_HOME/conf/spark-env.sh
echo "export PATH=$PATH:$HADOOP_HOME/bin" >> $SPARK_HOME/conf/spark-env.sh
echo "export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native" >> $SPARK_HOME/conf/spark-env.sh
echo "export SPARK_DIST_CLASSPATH=$(hadoop classpath)" >> $SPARK_HOME/conf/spark-env.sh
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf" >> $SPARK_HOME/conf/spark-env.sh
echo "SPARK_MASTER_HOST=192.168.10.66" >> $SPARK_HOME/conf/spark-env.sh
# TODO: copy the core-site.xml shown in the "Master node" section into $HADOOP_HOME/etc/hadoop/.
# It can't simply be hosted on GitHub because it contains the access and secret keys.
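# For example, it could be copied from the master over scp (assuming the master is reachable
# as lieu@192.168.10.66 and the file lives at the path shown in the "Master node" section):
scp lieu@192.168.10.66:/home/lieu/dev/spark/hadoop-3.1.2/etc/hadoop/core-site.xml $DD_HOME/hadoop-3.1.2/etc/hadoop/core-site.xml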
export HADOOP_HOME=$DD_HOME/hadoop-3.1.2
# TODO
# Install prerequisites
sudo apt install -y zip unzip openjdk-8-jre
# Unzip spark.zip
unzip spark.zip
# Add public certificate
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
sudo keytool -import -trustcacerts -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -noprompt -alias mycert -file /home/pirate/spark/public.crt
# TODO: edit /etc/hosts so that 192.168.10.67 resolves to the hostname minio.inzura.local
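# For example, appending the mapping from the TODO above:
echo "192.168.10.67 minio.inzura.local" | sudo tee -a /etc/hosts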