
A step-by-step tutorial to set up Apache Spark (PySpark) on Linux and prepare an environment for deep learning with Apache Spark using Deep Learning Pipelines.

Step 1 : Install Python 3 and Jupyter Notebook

Run the following commands. You may need to install pip first, or download any missing packages.

sudo apt install python3-pip
sudo pip3 install jupyter

We can start Jupyter just by running jupyter-notebook in the terminal. However, I had already installed Anaconda, so for me it was unnecessary to install Jupyter this way.

Step 2 : Install Java

Run the following command. After installation, we can verify it by running: java -version.

sudo apt-get install default-jre

Step 3 : Install Scala

Run the following command. After installation, we can verify it by running: scala -version.

sudo apt-get install scala

Step 4 : Install Py4j

Run the following command. Py4J enables Python programs to access Java objects in a JVM; PySpark uses it to talk to Spark's JVM processes.

sudo pip3 install py4j

Step 5 : Install Apache Spark

Download Apache Spark from the official downloads page. I downloaded spark-2.4.0-bin-hadoop2.7.tgz and saved the file in my home directory. After the download has finished, go to that directory and extract the archive with the following command.

sudo tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz

Step 6 : Set Path

Now, we need to tell Python where it can actually find Spark. Run the following commands in order, and be careful to adjust the directory paths to match your own setup.

export SPARK_HOME='/home/ubuntu/spark-2.4.0-bin-hadoop2.7'
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
sudo chmod 777 spark-2.4.0-bin-hadoop2.7
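Note that these export commands only last for the current shell session. To make the settings persist, you can append them to ~/.bashrc, roughly as sketched below (adjust the Spark path to wherever you extracted the archive):

```shell
# Append the Spark environment variables to ~/.bashrc so every new
# shell picks them up (a sketch; adjust the path to your setup).
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME="$HOME/spark-2.4.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
export PYSPARK_PYTHON=python3
EOF
# Reload the file in the current shell.
source ~/.bashrc
```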

From here we normally face a problem: we can only import pyspark from inside the spark-2.4.0-bin-hadoop2.7/python directory. In order to import it globally, we need another module called findspark. Run the following command.

pip3 install findspark

After the installation is complete, we can import pyspark globally like the following.

import findspark
findspark.init('/home/i/spark-2.4.0-bin-hadoop2.7')
import pyspark
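Under the hood, findspark.init() is essentially doing path bookkeeping: it locates Spark's bundled Python sources (and the py4j zip that ships with Spark) and puts them on sys.path so that import pyspark works from any directory. A minimal stdlib-only sketch of the same idea, assuming the SPARK_HOME location used above:

```python
import glob
import os
import sys

# Roughly what findspark.init() does: put Spark's bundled Python
# sources on sys.path so "import pyspark" works from any directory.
# The fallback path below is an assumption from this guide's setup.
spark_home = os.environ.get("SPARK_HOME", "/home/i/spark-2.4.0-bin-hadoop2.7")
sys.path.insert(0, os.path.join(spark_home, "python"))

# Spark also ships its own py4j under python/lib as a zip archive;
# zip files can be placed on sys.path directly.
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
    sys.path.insert(0, zip_path)

print(os.path.join(spark_home, "python"))
```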

That's all. In order to use the Deep Learning Pipelines library provided by Databricks with Apache Spark, follow the steps below. This part may be optional for you.

Step 7 : Integrating Deep Learning Pipelines with Apache Spark

Deep Learning Pipelines aims at enabling everyone to easily integrate scalable deep learning into their workflows, from machine learning practitioners to business analysts. It builds on Apache Spark's ML Pipelines for training, and on Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they can be efficiently done in a few lines of code.

Go to the Spark Packages page for spark-deep-learning and download the deep learning library for Spark. It's a zip file, so unzip it. The file name is randomly generated, so I prefer to rename it to deep-learning:1.5.0-spark2.4-s_2.11, since I downloaded Spark 2.4.0. Now, run the following command, but consider the following issues before pressing enter.

$SPARK_HOME/bin/spark-shell --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11

I moved spark-2.4.0-bin-hadoop2.7 into my home directory, so $SPARK_HOME expands to /home/i/spark-2.4.0-bin-hadoop2.7 and the command resolves to /home/i/spark-2.4.0-bin-hadoop2.7/bin/spark-shell. I ran the whole command from the directory where I saved the Deep Learning Pipelines library. The last part, databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11, is the package coordinate that --packages uses to fetch the library.
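The jars that --packages fetches end up in the local Ivy cache under ~/.ivy2/jars, which is why the Python snippet later in this guide globs that directory. A small sketch of where the jar for this package lands (the exact file name pattern, group_artifact-version.jar, is an assumption worth verifying on your machine):

```python
import os

# --packages coordinates follow the group:artifact:version pattern.
coordinate = "databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11"
group, artifact, version = coordinate.split(":")

# Spark's dependency resolver caches fetched jars under ~/.ivy2/jars,
# typically named group_artifact-version.jar (assumed naming scheme).
jar_name = "{}_{}-{}.jar".format(group, artifact, version)
jar_path = os.path.join(os.path.expanduser("~"), ".ivy2", "jars", jar_name)
print(jar_path)
```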

After completing the installation process, we'll end up with the following banner. Congrats!

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 10.0.2)

While using Deep Learning Pipelines, another issue may occur: a missing jar file path. To work around it, I manually added the paths of the downloaded jars to the sys.path variable (which is equivalent to PYTHONPATH). To make everything work, run the following code:

import sys, glob, os
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

import findspark
findspark.init('/home/innat/spark-2.4.0-bin-hadoop2.7')
import pyspark # pyspark library
import sparkdl # deep learning library : tensorflow backend 
from keras.applications import InceptionV3 # pretrained model for transfer learning

Incidentally, the Web UI (aka Application UI or Spark UI) is the web interface of a Spark application, used to monitor and inspect Spark job executions in a web browser; while an application is running, it is available at http://localhost:4040 by default.

That's it.


Suzan009 commented Mar 21, 2019

Thanks. It works like a charm. Kudos. 👍


innat commented Mar 21, 2019

Glad to know. :)


jai-dewani commented Apr 22, 2020

I am trying the same but in Google Colab, as I have to train on 4 GB of data which my potato laptop can't handle.
When I run step 7, I am left with a Scala interpreter and I am unable to use import sparkdl.

I also tried installing sparkdl with pip, but the package on pip also has some problems, so no success there either.
Can anyone help or suggest some workarounds to use CNNs for image classification in PySpark?


innat commented Apr 23, 2020

The implementation was mainly done on a Linux-based system. Back then, I also tried to run sparkdl in Colab as well as on GCP, but got an installation error from sparkdl. I didn't want to spend time fixing the issue, though. It would be great if you created an issue to get some help from the package developer.
FYI, I tried multi-class image classification using it.


FahdHabib commented Sep 2, 2020

Hi @innat. Can you kindly share the exact packages and their versions (tensorflow, keras etc.) used for this example? Thanks.


FahdHabib commented Sep 2, 2020

@Suzan009, if you can also share the exact packages you have used for running this example, I would be grateful. Thanks


innat commented Sep 2, 2020

@FahdHabib
The only package versions you really need to match are Spark and Hadoop. The rest are good to go.


innat commented Sep 2, 2020

And yes, please use TF < 2.x


FahdHabib commented Sep 2, 2020

Yes, I figured it out. But thanks anyway, @innat. :)


binakotiyal commented Jan 14, 2021

I am trying to do this on Windows 10. How can I integrate the deep learning pipeline on Windows? Please give an equivalent command for "$SPARK_HOME/home/i/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11", or can I write it in a Jupyter notebook?


innat commented Mar 1, 2021

@binakotiyal
I didn't do it myself. These instructions are only for Linux-based OSes. For other operating systems, you may need to search for it; I think it should not be too different from this.
