A step-by-step tutorial to set up Apache Spark (PySpark) on Linux and prepare the environment for deep learning with Apache Spark using Deep Learning Pipelines.
Step 1 : Install Python 3 and Jupyter Notebook
Run the following commands. You may need to install pip first, or download any missing packages.
sudo apt install python3-pip
sudo pip3 install jupyter
We can start Jupyter just by running the following command in the terminal: jupyter-notebook. However, I already installed Anaconda, so for me it was unnecessary to install Jupyter this way.
Step 2 : Install Java
Run the following command. After installation, we can verify it by running java -version.
sudo apt-get install default-jre
Step 3 : Install Scala
Run the following command. After installation, we can verify it by running scala -version.
sudo apt-get install scala
Step 4 : Install Py4j
Run the following command. Py4J is what connects Java and Scala code with Python; PySpark uses it to talk to the JVM.
sudo pip3 install py4j
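To get a feel for what Py4J does, here is a minimal sketch of the mechanism (not something you need to run at this step); it assumes a Java GatewayServer is already listening on the default port, which PySpark later starts for you behind the scenes:
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()  # connect to a JVM gateway server that is already running
jvm_random = gateway.jvm.java.util.Random()  # instantiate a Java object from Python
print(jvm_random.nextInt(100))  # call a Java method and get the result back as a Python int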
Step 5 : Install Apache Spark
Download Apache Spark. I downloaded spark-2.4.0-bin-hadoop2.7.tgz. After the download has finished, go to the download directory and extract the archive with the following command. In my case, I saved the file in the home directory.
sudo tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
Step 6 : Set Path
Now we need to tell Python where to actually find Spark. Type the following commands one by one, paying attention to the actual location of your Spark directory. To make these settings persist across sessions, you can also append them to your ~/.bashrc.
export SPARK_HOME='/home/ubuntu/spark-2.4.0-bin-hadoop2.7'
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
sudo chmod 777 spark-2.4.0-bin-hadoop2.7
From here we normally face a problem: we can only import pyspark from the 'spark-2.4.0-bin-hadoop2.7/python' directory. In order to import it globally, we need another module called findspark. Run the following command.
pip3 install findspark
After the installation is complete, pyspark can be imported globally like the following.
import findspark
findspark.init('/home/i/spark-2.4.0-bin-hadoop2.7')
import pyspark
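As a quick sanity check (a minimal sketch; the app name is arbitrary), you can create a local SparkSession right after these imports and run a trivial job to confirm everything is wired up:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.version)  # should print 2.4.0
spark.range(5).show()  # tiny DataFrame job to exercise Spark
spark.stop()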
That's all. In order to use Deep Learning Pipelines provided by Databricks with Apache Spark, follow the steps below. This part may be optional for you.
Step 7 : Integrating Deep Learning Pipelines with Apache Spark
Deep Learning Pipelines aims at enabling everyone to easily integrate scalable deep learning into their workflows, from machine learning practitioners to business analysts. It builds on Apache Spark's ML Pipelines for training, and on Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they can be efficiently done in a few lines of code.
Go to this page and download the deep learning library for Spark. It's a zip file, so unzip it. The file name is randomly generated, so I prefer to rename it to deep-learning:1.5.0-spark2.4-s_2.11, since I downloaded Spark 2.4.0. Now, run the following command, but consider the following points before pressing enter.
$SPARK_HOME/bin/spark-shell --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11
I moved spark-2.4.0-bin-hadoop2.7 into the home directory, so $SPARK_HOME points to /home/i/spark-2.4.0-bin-hadoop2.7 and $SPARK_HOME/bin/spark-shell resolves to /home/i/spark-2.4.0-bin-hadoop2.7/bin/spark-shell. I then ran the whole command above from the directory where I saved the deep learning library for Spark. The rest of the command, databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11, matches the unzipped folder name (it is the package coordinate that Spark will fetch). If you prefer to launch from Jupyter instead of spark-shell, see the sketch after the banner below.
After completing the installation process, we'll end up with this. Congrats!
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 10.0.2)
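Alternatively, if you prefer to attach the package from inside Jupyter rather than through spark-shell, PySpark reads the PYSPARK_SUBMIT_ARGS environment variable; a hedged sketch (it must be set before findspark.init() / importing pyspark so the option is picked up when the JVM starts):
import os

# Same --packages coordinate as above; the trailing 'pyspark-shell' is required.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11 pyspark-shell"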
While using Deep Learning Pipelines, another issue may occur: the jar file paths are missing. To work around it, I manually added the paths of the downloaded jars to the sys.path variable (which is equivalent to PYTHONPATH). To make everything work, run the following code:
import sys, glob, os
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))
import findspark
findspark.init('/home/innat/spark-2.4.0-bin-hadoop2.7')
import pyspark # pyspark library
import sparkdl # deep learning library : tensorflow backend
from keras.applications import InceptionV3 # pre-trained model used for transfer learning
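To see what sparkdl is for, here is a hedged sketch of transfer learning with Deep Learning Pipelines, adapted from the Databricks examples: featurize images with the pre-trained InceptionV3 network and fit a simple classifier on top. The image folder paths, app name, and label values are placeholders for your own dataset:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.image import ImageSchema  # image reader available in Spark 2.4
from sparkdl import DeepImageFeaturizer

spark = SparkSession.builder.appName("dl-pipelines-demo").getOrCreate()

# placeholder image folders, one per class
cats_df = ImageSchema.readImages("data/cats").withColumn("label", lit(0))
dogs_df = ImageSchema.readImages("data/dogs").withColumn("label", lit(1))
train_df = cats_df.union(dogs_df)

# InceptionV3 features + a logistic regression classifier on top
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=10, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train_df)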
As a side note, the Web UI (aka Application UI, webUI, or Spark UI) is the web interface of a Spark application for monitoring and inspecting Spark job executions in a web browser.
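For example (a small sketch with an arbitrary app name), the UI address of a running application can be printed from the SparkContext; for a local session it typically listens on port 4040:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("ui-demo").getOrCreate()
print(spark.sparkContext.uiWebUrl)  # e.g. http://<hostname>:4040
spark.stop()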
That's it.