@innat
Last active November 1, 2021
A step-by-step tutorial to set up Apache Spark (PySpark) on Linux and prepare the environment for deep learning with Apache Spark using Deep Learning Pipelines.

Step 1 : Install Python 3 and Jupyter Notebook

Run the following commands. You may need to install pip first, and some missing packages may need to be downloaded.

sudo apt install python3-pip
sudo pip3 install jupyter

We can start Jupyter by running jupyter-notebook in the terminal. However, I already have Anaconda installed, so for me it's unnecessary to install Jupyter this way.

Step 2 : Install Java

Run the following command. After installation, we can verify it by running java -version.

sudo apt-get install default-jre

Step 3 : Install Scala

Run the following command. After installation, we can verify it by running scala -version.

sudo apt-get install scala

Step 4 : Install Py4j

Run the following command. Py4j is the bridge that connects Python with Java and Scala code running in the JVM; PySpark uses it under the hood.

sudo pip3 install py4j
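
As an aside, here is a small sketch of what py4j actually does: PySpark drives the JVM through a py4j gateway, so Java objects can be called from Python. This is only runnable after Spark is set up in Step 6, and sc._gateway is an internal attribute used here purely for illustration.

import findspark
findspark.init('/home/i/spark-2.4.0-bin-hadoop2.7')      # adjust to your install path
from pyspark import SparkContext

sc = SparkContext("local[1]", "py4j-demo")
jvm = sc._gateway.jvm                                    # py4j gateway into the driver JVM
print(jvm.java.lang.System.getProperty("java.version"))  # a Java call made from Python
sc.stop()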

Step 5 : Install Apache Spark

Download Apache Spark. At the time of writing, I downloaded spark-2.4.0-bin-hadoop2.7.tgz. After the download has finished, go to the download directory and extract the archive with the following command. I keep the extracted folder in my home directory.

sudo tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz

Step 6 : Set Path

Now we need to tell Python where to find Spark. Run the following commands in sequence, and make sure the paths match where you actually extracted Spark. To make the settings permanent, append the export lines to your ~/.bashrc.

export SPARK_HOME='/home/ubuntu/spark-2.4.0-bin-hadoop2.7'  # note the leading slash
export PATH=$SPARK_HOME/bin:$PATH                           # puts spark-shell etc. on the PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
sudo chmod 777 spark-2.4.0-bin-hadoop2.7  # 777 is very permissive; 755 is usually enough

At this point we usually hit a problem: pyspark can only be imported from inside the spark-2.4.0-bin-hadoop2.7/python directory. To import it globally, we need another module called findspark. Run the following command.

pip3 install findspark

After the installation is complete, pyspark can be imported from anywhere like this:

import findspark
findspark.init('/home/i/spark-2.4.0-bin-hadoop2.7')  # adjust to your Spark install path
import pyspark
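
As a quick sanity check, the following sketch creates a SparkSession and a tiny DataFrame; if df.show() prints two rows, the installation works. The install path is mine, so adjust it to yours.

import findspark
findspark.init('/home/i/spark-2.4.0-bin-hadoop2.7')  # adjust to your install path
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
df = spark.createDataFrame([(1, "spark"), (2, "works")], ["id", "word"])
df.show()     # should print both rows
spark.stop()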

That's all. If you also want to use the Deep Learning Pipelines package provided by Databricks with Apache Spark, follow the steps below. This part may be optional for you.

Step 7 : Integrating Deep Learning Pipelines with Apache Spark

Deep Learning Pipelines aims at enabling everyone to easily integrate scalable deep learning into their workflows, from machine learning practitioners to business analysts. It builds on Apache Spark's ML Pipelines for training, and on Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they can be efficiently done in a few lines of code.
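
To make "a few lines of code" concrete, here is a sketch of applying a pretrained network with sparkdl, runnable once the package is installed (installation is covered next). The image directory /tmp/images is a hypothetical path, and the code must run inside a Spark session launched with the package on the classpath.

from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

image_df = ImageSchema.readImages("/tmp/images")  # hypothetical image folder
predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",
                               modelName="InceptionV3", decodePredictions=True, topK=5)
predictions = predictor.transform(image_df)
predictions.select("image.origin", "predicted_labels").show(truncate=False)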

Go to this page and download the deep learning library for Spark. It's a zip file, so unzip it. The extracted folder gets a generated name, so I prefer to rename it to deep-learning:1.5.0-spark2.4-s_2.11, matching the Spark 2.4.0 build I downloaded. Now run the following command, but consider the issues below before pressing Enter.

$SPARK_HOME/bin/spark-shell --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11

I moved spark-2.4.0-bin-hadoop2.7 into my home directory, so SPARK_HOME points to /home/i/spark-2.4.0-bin-hadoop2.7 and $SPARK_HOME/bin/spark-shell resolves to the shell binary. I ran the whole command above from the directory where I saved the deep learning library. The last part, databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11, is the package coordinate, which is also what I renamed the unzipped folder to.

After completing the installation process, we'll end up with this. Congrats!

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 10.0.2)

While using Deep Learning Pipelines, another issue may occur: a missing jar file path. We can set it manually; I added the paths of the downloaded jars to the sys.path variable (which is equivalent to PYTHONPATH). To get everything working, run the following code:

import sys, glob, os
# add the jars that spark-shell --packages downloaded into ~/.ivy2/jars
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

import findspark
findspark.init('/home/innat/spark-2.4.0-bin-hadoop2.7')  # adjust to your Spark install path
import pyspark   # the Spark Python API
import sparkdl   # Deep Learning Pipelines (TensorFlow backend)
from keras.applications import InceptionV3  # pretrained model for transfer learning
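
With those imports in place, a minimal transfer-learning sketch using sparkdl's DeepImageFeaturizer looks like this. Here train_df is a hypothetical image DataFrame (e.g. built with ImageSchema.readImages) that has a numeric label column.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer

# InceptionV3 turns each image into a fixed-length feature vector,
# then a simple logistic regression is trained on top of it.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train_df)  # train_df is hypothetical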

Finally, a note on monitoring: the Web UI (aka Application UI, webUI, or Spark UI) is the web interface of a Spark application for monitoring and inspecting Spark job executions in a web browser. By default the driver serves it on port 4040.
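
With a running session, the UI address can also be read directly from the SparkContext (uiWebUrl is a standard property; spark here is an existing SparkSession):

print(spark.sparkContext.uiWebUrl)  # e.g. http://<driver-host>:4040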

That's it.

@innat (Author) commented Sep 2, 2020

@FahdHabib
The only package versions you need to match are Spark and Hadoop. The rest is good to go.

@innat (Author) commented Sep 2, 2020

And yes, please use TF < 2.x

@FahdHabib commented
Yes, I figured it out. But thanks anyway @innat. :)

@binakotiyal commented
I am trying to do this on Windows 10. How can I integrate the deep learning pipeline on Windows? Please give an equivalent command for "$SPARK_HOME/bin/spark-shell --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11", or can I write it in a Jupyter notebook?

@innat (Author) commented Mar 1, 2021

@binakotiyal
I haven't done it myself; these instructions are only for Linux-based OSes. For other OSes you may need to search for it, though I think it should not be too different from this.
