Setup Apache Spark / Jupyter Notebook on macOS

Reference: https://spark.apache.org/docs/latest/

Overview:

  • Install Java 8+, then add it to PATH and set JAVA_HOME
  • Install Scala and Apache Spark through Homebrew
  • Set SPARK_HOME and PYTHONPATH so pyspark can be launched
  • Integrate Spark with Jupyter Notebook via Apache Toree

1. Install Java

  • Download and install Java 8 through brew (note: on newer Homebrew versions, brew cask install has been replaced by brew install --cask):
brew cask install adoptopenjdk/openjdk/adoptopenjdk8
  • Validate the installed Java version:
brew cask info adoptopenjdk8
adoptopenjdk8: 8,262:b10
https://adoptopenjdk.net/
/usr/local/Caskroom/adoptopenjdk8/8,262:b10 (100.2MB)
From: https://github.com/adoptopenjdk/homebrew-openjdk/blob/HEAD/Casks/adoptopenjdk8.rb
==> Name
AdoptOpenJDK 8
==> Artifacts
OpenJDK8U-jdk_x64_mac_hotspot_8u262b10.pkg (Pkg)
...
  • Add the Java environment variable to your shell profile

    • Open the profile in Vim
    # Depending on which shell you use; for zsh, ~/.zshrc does not exist by default, so you have to create it.
    vim ~/.zshrc
    • Add the following to .zshrc:
    # For Apache Spark
    if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi
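
After reloading the profile (source ~/.zshrc), you can sanity-check the variable from Python; this is a minimal sketch, assuming only that Java 8 was installed as above:

import os
import subprocess

# JAVA_HOME should point at the AdoptOpenJDK 8 install directory
print(os.environ.get("JAVA_HOME"))

# "java -version" prints its banner to stderr; a 1.8.x version confirms Java 8 is active
subprocess.run(["java", "-version"])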

2. Install Apache Spark

Homebrew is a package manager for macOS, similar to apt (http://brew.sh/)

brew update 
brew install scala
brew install apache-spark

3. Setup Variables

  • Assume the current Spark version is 2.4.0; adjust the paths below to match your installed version. Add the following to .zshrc:
# For ipython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.4.0/libexec/"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
  export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
fi
  • Up to this point, you should be able to launch pyspark and spark-shell from the terminal; a quick sanity check follows below.
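
For the promised sanity check, the following minimal sketch can be run with plain python (PYTHONPATH now exposes the pyspark package); the app name and row count are arbitrary choices:

from pyspark.sql import SparkSession

# Build a local Spark session; "sanity-check" is an arbitrary app name
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()

# Run a trivial job to confirm the installation works end to end
print(spark.range(100).count())  # expected output: 100

spark.stop()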

4. Integrate Spark and Jupyter Notebook

  • Install a Python environment through pyenv, a Python version manager.
pyenv install 3.6.7 

# Set Python 3.6.7 as main python interpreter
pyenv global 3.6.7

# Reload the shell configuration so the new python takes effect
source ~/.zshrc

# Update pip (e.g. from 10.0.1 to 18.1)
pip install --upgrade pip
  • (Optional) If you see the error "pyspark 2.4.0 requires py4j==0.10.7, which is not installed", fix it with:
pip install py4j==0.10.7
  • Install Jupyter, Apache Toree
pip install jupyter toree
  • Create a Jupyter kernel for Spark through Apache Toree
jupyter kernelspec list
# Available kernels:
#  python3   /Users/dat/.pyenv/versions/3.6.7/share/jupyter/kernels/python3
jupyter toree install --replace --spark_home=$SPARK_HOME
jupyter kernelspec list

# Available kernels:
#  apache_toree_scala    /Users/dat/Library/Jupyter/kernels/apache_toree_scala
#  python3               /Users/dat/.pyenv/versions/3.6.7/share/jupyter/kernels/python3

Launch Jupyter Notebook and Test Our First Spark Application

jupyter notebook
  • Remember to select Apache Toree as your main kernel for Scala notebooks; a minimal first PySpark application is sketched below.
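
As a minimal sketch of a first Spark application, the word count below runs in the python3 kernel (the Toree kernel expects Scala); the word list is made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("first-app").getOrCreate()

# Made-up word list, purely for illustration
words = spark.sparkContext.parallelize(["spark", "jupyter", "spark", "macos"])

# Classic word count: pair each word with 1, then sum counts per key
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('spark', 2), ('jupyter', 1), ('macos', 1)]

spark.stop()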