pyspark-on-mac

The best way to run pyspark on a Mac in a virtualized environment is to use Docker. Docker is a container technology that lets developers package their software and ship it, which takes away headaches like setting up the environment properly, configuring logging, setting the right options, or breaking the machine; basically, it removes the It works on my machine excuse. I'll stop babbling about Docker and containers. If you're interested in knowing more, head here: http://unix.stackexchange.com/questions/254956/what-is-the-difference-between-docker-lxd-and-lxc

Step 1: Installing Docker on Mac

We will be using Kitematic (which comes bundled with Docker Toolbox) to manage the Docker environment.

Step 2:

  • Click on the Docker icon in the Mac menu bar and select Kitematic
  • Sign up on Docker Hub (so that you can push your containers to be used on other machines.. remember the It works on my machine excuse?)
  • Search for pyspark-notebook and click on the image provided by jupyter (it is imperative to use the one from jupyter)
  • Once the image is downloaded, it will start an IPython kernel, and the right half of Kitematic will show something like a web preview.

It should look something similar to this screenshot: kitematic_ex

Click on it and it will take you to the browser, where the Jupyter page will ask for a token.

  • Go back to Kitematic and check the bottom left of the terminal-like screen for a string that says token?=. Copy the text after it and paste it into the Jupyter notebook page.

Step 3: (only for using Python 2)

  • Workers start with the Python 3 kernel by default (regardless of how you start the notebook kernel). To change this behavior and strictly use Python 2, you should always execute the following as the first cell in the notebook:
import os
# Tell the Spark workers to use Python 2 instead of the default Python 3
os.environ['PYSPARK_PYTHON'] = 'python2'

Step 4:

  • Finally, a reminder that we are running everything locally; we don't have separate clusters to perform the execution. Hence, when creating the Spark context, you should do the following (a fuller first-cell sketch follows the code):
import pyspark
# 'local[*]' runs Spark locally, using as many worker threads as there are cores
sc = pyspark.SparkContext('local[*]')
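Putting Steps 3 and 4 together, a minimal first cell might look like the sketch below. The app name 'pyspark-on-mac' and the worker-version check are illustrative additions, not part of the original steps; drop the PYSPARK_PYTHON line if you are running a Python 3 notebook kernel.

import os
import sys

# Must be set before the SparkContext is created so the workers pick it up
# (only needed when the notebook kernel itself is Python 2, as in Step 3)
os.environ['PYSPARK_PYTHON'] = 'python2'

import pyspark

# 'local[*]' uses all available local cores; the app name is arbitrary
sc = pyspark.SparkContext('local[*]', 'pyspark-on-mac')

# Optional sanity check: ask a worker which Python version it is running
print(sc.parallelize([0]).map(lambda _: sys.version).first())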

Example:

  • Try running this:
rdd = sc.parallelize(range(1000))  # distribute the numbers 0..999 across the workers
rdd.takeSample(False, 5)           # take 5 random elements, without replacement
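takeSample(False, 5) should return 5 random numbers from the RDD, sampled without replacement. As a slightly bigger, purely illustrative check that work is actually spread over the local cores, something like the following sketch should also run:

# Count the even numbers in 0..9999 and sum their squares
nums = sc.parallelize(range(10000), 8)  # 8 partitions, processed in parallel
evens = nums.filter(lambda x: x % 2 == 0)
print(evens.count())                                          # expected: 5000
print(evens.map(lambda x: x * x).reduce(lambda a, b: a + b))  # sum of the squares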