Step-by-step tutorial to set up Apache Spark (PySpark) on Linux and to prepare an environment for deep learning with Apache Spark using Deep Learning Pipelines.
Run the following command. You may need to install pip first, and any missing packages may need to be downloaded.
sudo apt install python3-pip
sudo pip3 install jupyter
We can start Jupyter by running the following command in the terminal:
jupyter-notebook
However, I have already installed Anaconda, so for me it is unnecessary to install Jupyter this way.
Run the following command to install Java. After installation, we can check it by running java -version.
sudo apt-get install default-jre
Run the following command to install Scala. After installation, we can check it by running scala -version.
sudo apt-get install scala
Run the following command. Py4J is what lets Python talk to Java and Scala code, which is how PySpark communicates with the JVM.
sudo pip3 install py4j
Download Apache Spark. Currently I have downloaded spark-2.4.0-bin-hadoop2.7.tgz, and I saved the file in my home directory. After the download has finished, go to that directory and extract the archive with the following command.
sudo tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
Now we need to tell Python where it can actually find Spark. Run the following commands one after another, and be careful about the location of the Spark directory on your machine.
export SPARK_HOME='/home/ubuntu/spark-2.4.0-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
sudo chmod 777 spark-2.4.0-bin-hadoop2.7
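Note that these exports only last for the current shell session. One way to make them permanent (a sketch, assuming the same install location as above; adjust the path to your machine) is to append them to ~/.bashrc:

```shell
# Persist the Spark environment variables across terminal sessions
# (the path below is an example; use your own Spark directory).
echo "export SPARK_HOME='/home/ubuntu/spark-2.4.0-bin-hadoop2.7'" >> ~/.bashrc
echo 'export PATH=$SPARK_HOME:$PATH' >> ~/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON="jupyter"' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON_OPTS="notebook"' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
source ~/.bashrc   # reload the file in the current shell
```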
From here we normally face a problem: we can only import pyspark from inside the 'spark-2.4.0-bin-hadoop2.7/python' directory. In order to import it globally, we need another module called findspark. Run the following command.
pip3 install findspark
After the installation is complete, we can import pyspark from anywhere, like the following.
import findspark
findspark.init('/home/i/spark-2.4.0-bin-hadoop2.7')
import pyspark
That's all. In order to use the Deep Learning Pipelines library provided by Databricks with Apache Spark, follow the steps below. This part may be optional for you.
Deep Learning Pipelines aims at enabling everyone to easily integrate scalable deep learning into their workflows, from machine learning practitioners to business analysts. It builds on Apache Spark's ML Pipelines for training, and on Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they can be efficiently done in a few lines of code.
Go to this page and download the deep learning library for Spark. It is a zip file, so unzip it. The file name is randomly generated, so I prefer to rename it to
deep-learning:1.5.0-spark2.4-s_2.11, because I have downloaded Spark 2.4.0. Now, run the following command. Before pressing Enter, consider the following issues.
$SPARK_HOME/bin/spark-shell --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11
I saved spark-2.4.0-bin-hadoop2.7 in the home directory, so for me $SPARK_HOME points to /home/i/spark-2.4.0-bin-hadoop2.7 and the shell binary is /home/i/spark-2.4.0-bin-hadoop2.7/bin/spark-shell. I run the whole command above from the directory where I saved the deep-learning library for Spark. Here, the last part,
databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11, is the package coordinate that spark-shell resolves, and it matches the unzipped folder name.
After completing the installation process, we'll end up with this. Congrats!
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 10.0.2)
While using Deep Learning Pipelines, another issue may occur: the jar file paths may be missing. We can set them, though; I manually added the paths of the downloaded jars to the sys.path variable (which is equivalent to PYTHONPATH). To make everything work, run the following code:
import sys, glob, os

# add the jars that spark-shell --packages downloaded into ~/.ivy2
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

import findspark
findspark.init('/home/innat/spark-2.4.0-bin-hadoop2.7')

import pyspark                              # pyspark library
import sparkdl                              # deep learning library : tensorflow backend
from keras.applications import InceptionV3  # transfer learning using pyspark
Finally, the Web UI (also known as the Application UI or Spark UI) is the web interface of a Spark application for monitoring and inspecting Spark job executions in a web browser. While an application is running, it is available at http://localhost:4040 by default.