Configuring Spark 1.6.1 to work with Jupyter 4.x Notebooks on Mac OS X with Homebrew

I've looked around in a number of places and found several blog entries on setting up IPython notebooks to work with Spark. However, since most of those posts were written, both IPython and Spark have been updated: today IPython has been transformed into Jupyter, and Spark is near release 1.6.2. Most of the information to get things working is out there, but I thought I'd capture this point in time with a working configuration and how I set it up.

I rely completely on Homebrew to manage packages on my Mac, so Spark, Jupyter, Python, jEnv and other things are installed via Homebrew. You should be able to achieve the same thing with Anaconda, but I don't know that package manager.

Install Java

Make sure your Java installation is up to date. You can download/update Java from Oracle, have Homebrew install it... whatever works.

I use jEnv to manage Java environments since I have multiple Java installations; it is similar to the alternatives manager in some Linux distributions, but it does add another layer that must be set up correctly. Whatever you use, make sure you are pointing to the correct Java version.
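
If you also use jEnv, a quick sanity check that the expected Java is active before going further:

jenv version
java -version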

Install Scala

brew install scala

This should install Scala 2.11.8 as of this writing.

Homebrew installs Scala under /usr/local/Cellar/scala/, with each version in its own subdirectory (here /usr/local/Cellar/scala/2.11.8/). You can check by executing

brew list scala

/usr/local/Cellar/scala/2.11.8/bin/fsc
/usr/local/Cellar/scala/2.11.8/bin/scala
/usr/local/Cellar/scala/2.11.8/bin/scalac
/usr/local/Cellar/scala/2.11.8/bin/scaladoc
/usr/local/Cellar/scala/2.11.8/bin/scalap
/usr/local/Cellar/scala/2.11.8/etc/bash_completion.d/scala
/usr/local/Cellar/scala/2.11.8/libexec/bin/ (5 files)
/usr/local/Cellar/scala/2.11.8/libexec/lib/ (14 files)
/usr/local/Cellar/scala/2.11.8/share/doc/ (20 files)
/usr/local/Cellar/scala/2.11.8/share/man/ (5 files)

Set your SCALA_HOME environment variable in ~/.bash_profile to point to the Scala directory:

export SCALA_HOME=/usr/local/Cellar/scala/2.11.8/

Make sure to source your updated profile so the change takes effect in the current shell:

. ~/.bash_profile
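
A quick check that the variable took effect and that Scala runs:

echo $SCALA_HOME
scala -version

The second command should report Scala 2.11.8 (or whatever version Homebrew installed).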

Install Spark

brew install apache-spark

Set SPARK_HOME in your ~/.bash_profile. Note that Homebrew keeps the actual Spark distribution under libexec:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.6.1/libexec

And source your updated profile

. ~/.bash_profile

A quick check to see how things look:

$ pyspark
Python 2.7.11 (default, Dec  5 2015, 21:59:29) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/20 19:23:11 INFO SparkContext: Running Spark version 1.6.1
16/03/20 19:23:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/20 19:23:12 INFO SecurityManager: Changing view acls to: rcordell
16/03/20 19:23:12 INFO SecurityManager: Changing modify acls to: rcordell
16/03/20 19:23:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(rcordell); users with modify permissions: Set(rcordell)
16/03/20 19:23:12 INFO Utils: Successfully started service 'sparkDriver' on port 58673.
16/03/20 19:23:12 INFO Slf4jLogger: Slf4jLogger started
16/03/20 19:23:12 INFO Remoting: Starting remoting
16/03/20 19:23:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.0.1.2:58674]
16/03/20 19:23:13 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 58674.
16/03/20 19:23:13 INFO SparkEnv: Registering MapOutputTracker
16/03/20 19:23:13 INFO SparkEnv: Registering BlockManagerMaster
16/03/20 19:23:13 INFO DiskBlockManager: Created local directory at /private/var/folders/z_/rfp5q2cd6db13d19v6yw0n8w0000gn/T/blockmgr-af45c961-d645-481c-81ce-370c8afd6999
16/03/20 19:23:13 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/03/20 19:23:13 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/20 19:23:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/20 19:23:13 INFO SparkUI: Started SparkUI at http://10.0.1.2:4040
16/03/20 19:23:13 INFO Executor: Starting executor ID driver on host localhost
16/03/20 19:23:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58675.
16/03/20 19:23:13 INFO NettyBlockTransferService: Server created on 58675
16/03/20 19:23:13 INFO BlockManagerMaster: Trying to register BlockManager
16/03/20 19:23:13 INFO BlockManagerMasterEndpoint: Registering block manager localhost:58675 with 511.1 MB RAM, BlockManagerId(driver, localhost, 58675)
16/03/20 19:23:13 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.11 (default, Dec  5 2015 21:59:29)
SparkContext available as sc, HiveContext available as sqlContext.
>>> 

Looks good!

Install or update Jupyter

This installation setup is for Jupyter, not IPython. The configurations have changed from one to the other and it can be a bit confusing to dig through what works and what doesn't. Anyway, we'll install the latest Jupyter and configure from there.

sudo pip install jupyter --upgrade

This assumes you already have Jupyter installed; if not, just omit the --upgrade.
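
You can confirm the installed version afterwards (it should be 4.x at the time of writing):

jupyter --version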

Check to make sure you can start a notebook

jupyter notebook

...should open a browser with the Jupyter UI.

Setup the PySpark kernel

Thanks to Jacek Wasilewski for his blog post about setting this up - it helped me tie some of the "loose ends" together. Jacek's guide is written for IPython, so we need to change a few things to get it to work with Jupyter.

Create the kernel directory

mkdir -p ~/Library/Jupyter/kernels/pyspark

Create the kernel file

touch ~/Library/Jupyter/kernels/pyspark/kernel.json

Put the following into the file

{
 "display_name": "pySpark (Spark 1.6.1)",
 "language": "python",
 "argv": [
  "/usr/local/bin/python",
  "-m",
  "ipykernel",
  "--profile=pyspark",
  "-f",
  "{connection_file}"
 ]
}
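
With the kernel file in place, you can confirm that Jupyter sees it from the command line:

jupyter kernelspec list

The pyspark kernel should be listed alongside the default python kernel.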

Here's a bit of hackiness at the junction of the IPython and Jupyter configurations. I may not have this quite right, but it works. If there's a better way to do this, please let me know.

Create the following file (if the pyspark IPython profile doesn't exist yet, create it first with ipython profile create pyspark):

touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py

Put the following inside the file:

# Configure the necessary Spark environment
import os
import sys

# Make sure 'pyspark-shell' is included in the submit args so PySpark
# launches an in-process shell
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# SPARK_HOME must be set (see the profile changes above)
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")
sys.path.insert(0, spark_home + "/python")

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

If your version of py4j is different, you need to update the path accordingly.
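
If you'd rather not hard-code the version at all, a glob can locate whatever py4j ships with your Spark install. This is just a sketch of an alternative to the hard-coded sys.path line above, reusing the spark_home variable from the same script:

import glob
# Pick up whichever py4j-*-src.zip this Spark version ships with
# (assumes exactly one match under $SPARK_HOME/python/lib)
sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python/lib/py4j-*-src.zip'))[0])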

Test

Start Jupyter notebook

jupyter notebook

You should see the PySpark kernel show up in the kernel list (upper right drop down where you would normally select Python or another kernel) - select it.

Watch the command line from where you launched the Jupyter notebook - errors will appear there. If all goes well, you should be able to inspect the SparkContext:

In [1]: sc
<pyspark.context.SparkContext at 0x10cc4a3d0>
In [2]:
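
As a quick smoke test, you can run a trivial job in that next cell; any small RDD action will do:

In [2]: sc.parallelize(range(100)).sum()
Out[2]: 4950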

Done!
