
Apache Spark / Jupyter Notebook Setup Guide for macOS

Reference: https://spark.apache.org/docs/latest/

Overview:

  • Install Java 8+, then add it to PATH and set JAVA_HOME
  • Install Apache Spark (and Scala) via Homebrew
  • Set up environment variables for pyspark
  • Integrate Spark with Jupyter Notebook via Apache Toree

1. Install Java

  • Download and install Java 8 through Homebrew. The java8 cask lives in the homebrew-cask-versions tap, so tap it first.
brew tap homebrew/cask-versions
brew cask install java8
  • Validate the installed Java version
brew cask info java8
java8: 1.8.0_192-b12,750e1c8617c5452694857ad95c3ee230
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
/usr/local/Caskroom/java8/1.8.0_192-b12,750e1c8617c5452694857ad95c3ee230 (246.2MB)
From: https://github.com/Homebrew/homebrew-cask-versions/blob/master/Casks/java8.rb
==> Name
Java Standard Edition Development Kit
==> Artifacts
JDK 8 Update 192.pkg (Pkg)
...
  • Add Java environment variables to your shell profile

    • Open your shell profile in Vim
    # Depending on the shell you use (zsh shown here)
    vim ~/.zshrc
    • Add the following to .zshrc, then reload it with source ~/.zshrc
    # For Apache Spark
    if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

2. Install Apache Spark

Homebrew is a package manager for macOS, similar to apt (http://brew.sh/)

brew update 
brew install scala
brew install apache-spark

3. Set Up Environment Variables

  • Assuming the current Spark version is 2.4.0, add the following to .zshrc:
# For IPython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.4.0/libexec/"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
  export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH  # match the py4j zip shipped under $SPARK_HOME/python/lib
fi
  • Up to this point, you should be able to launch pyspark and spark-shell from the terminal; a quick smoke test follows below.
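  • To confirm, here is a minimal smoke test run inside the pyspark REPL, where sc (the SparkContext) is predefined; the sample data is arbitrary:
# Inside the pyspark shell; `sc` already exists
rdd = sc.parallelize(range(100))
print(rdd.sum())   # 0 + 1 + ... + 99 = 4950
print(sc.version)  # should report 2.4.0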

4. Integrate Spark and Jupyter Notebook

  • Install a Python environment through pyenv, a Python version manager.
pyenv install 3.6.7 

# Set Python 3.6.7 as main python interpreter
pyenv global 3.6.7

# Reload the shell so the new Python takes effect
source ~/.zshrc

# Upgrade pip (e.g., from 10.0.1 to 18.1)
pip install --upgrade pip
  • (Optional) If you see the error "pyspark 2.4.0 requires py4j==0.10.7, which is not installed.", fix it with:
pip install py4j==0.10.7
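  • (Optional) To confirm which py4j version Python picks up, here is a quick check (assuming the pip-installed py4j, which exposes py4j.version):
# Verify that py4j resolves to the version pyspark 2.4.0 expects
from py4j.version import __version__ as py4j_version
print(py4j_version)  # expect 0.10.7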
  • Install Jupyter and Apache Toree
pip install jupyter toree
  • Create a Jupyter kernel for running Spark
jupyter kernelspec list
# Available kernels:
#  python3   /Users/dat/.pyenv/versions/3.6.7/share/jupyter/kernels/python3
jupyter toree install --replace --spark_home=$SPARK_HOME
jupyter kernelspec list

# Available kernels:
#  apache_toree_scala    /Users/dat/Library/Jupyter/kernels/apache_toree_scala
#  python3               /Users/dat/.pyenv/versions/3.6.7/share/jupyter/kernels/python3

5. Launch Jupyter Notebook and Test Your First Spark Application

jupyter notebook
  • Remember to select Apache Toree - Scala as the kernel for Scala notebooks; a PySpark alternative is sketched below.
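  • Alternatively, because step 3 put Spark's Python bindings on PYTHONPATH, the python3 kernel can import pyspark directly. Here is a minimal sketch of a first Spark application; the app name and sample data are arbitrary:
from pyspark.sql import SparkSession

# Start a local Spark session ("FirstSparkApp" is an arbitrary name)
spark = (SparkSession.builder
         .master("local[*]")
         .appName("FirstSparkApp")
         .getOrCreate())

# Build a tiny DataFrame and run a simple aggregation
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])
df.show()
print("Average age:", df.groupBy().avg("age").first()[0])  # 36.0

spark.stop()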