HowTo for starting an IPython Notebook server
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Setting up the IPython Notebook with PySpark on AMPCamp EC2 clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note** this HowTo assumes that you are using a cluster provided by the AMPLab team, where port 8888 has already been opened. If you have spun up your own cluster using their AMI, you need to go to the Security Groups tab and open that port for traffic. [See here for full details](https://gist.github.com/iamatypeofwalrus/5183133).\n",
"\n",
"You can run the IPython Notebook interface as a more friendly way to interact with your AMPCamp EC2 cluster. The detailed instructions on how to run a public IPython Notebook Server are [here](http://ipython.org/ipython-doc/stable/interactive/public_server.html#running-a-public-notebook-server), but the basics are:\n",
"\n",
"* Create a certificate file for your cluster by typing at the command line:\n",
"\n",
" cd /root\n",
" openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem\n",
" \n",
"* Let's make sure there's a default IPython profile ready for us to use:\n",
"\n",
" ipython profile create default\n",
"\n",
"\n",
"* You will need a hashes password next, which you can create with (note these two lines are a *single* command, copy and paste the whole thing in one shot.\n",
"\n",
"\n",
" python -c \"from IPython.lib import passwd; print passwd()\" \\\n",
" > /root/.ipython/profile_default/nbpasswd.txt\n",
"\n",
"* Verify the password file has a string like `sha1:16a8a30fb9b6:82c0...` in it (your actual value will differ). If you don't get this, repeat the prior step:\n",
"\n",
" cat /root/.ipython/profile_default/nbpasswd.txt\n",
" sha1:16a8a30fb9b6:82c030d3989b0069b9ed603822949a954a2beb21\n",
"\n",
"\n",
"* Put the following into the file `/root/.ipython/profile_default/ipython_notebook_config.py`:\n",
"\n",
" # Configuration file for ipython-notebook.\n",
" c = get_config()\n",
" \n",
" # Notebook config\n",
" c.NotebookApp.certfile = u'/root/mycert.pem'\n",
" c.NotebookApp.ip = '*'\n",
" c.NotebookApp.open_browser = False\n",
" # It is a good idea to put it on a known, fixed port\n",
" c.NotebookApp.port = 8888\n",
" \n",
" PWDFILE=\"/root/.ipython/profile_default/nbpasswd.txt\"\n",
" c.NotebookApp.password = open(PWDFILE).read().strip()\n",
"\n",
"* Put the following into the file `/root/.ipython/profile_default/startup/00-pyspark-setup.py`:\n",
"\n",
" # Configure the necessary Spark environment\n",
" import os\n",
" os.environ['SPARK_HOME'] = '/root/spark/'\n",
" \n",
" # And Python path\n",
" import sys\n",
" sys.path.insert(0, '/root/spark/python')\n",
" \n",
" # Detect the PySpark URL\n",
" CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()\n",
"\n",
"\n",
"* That's it! You can now start the notebook server by typing the following command:\n",
"\n",
" ipython notebook\n",
"\n",
"**Note:** I *strongly* recommend you do this inside a `screen` or `tmux` session so it's persistent. This will let it survive cleanly if you lose your connection to your cluster.\n",
"\n",
"\n",
"You can then connect to the server via `https://[YOUR INSTANCE URL HERE]:8888`. Once you type your password, you should be able to start running code!\n",
"\n",
"**Warning:** the URL for your notebook must start with `https`, not `http`.\n",
"\n",
"The config file above creates a variable called `CLUSTER_URL` which you can use to create your `SparkContext`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print CLUSTER_URL"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"spark://ec2-50-16-173-245.compute-1.amazonaws.com:7077\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's create the context:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pyspark import SparkContext\n",
"sc = SarkContext( CLUSTER_URL, 'pyspark')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And test it by creating a trivial RDD:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sc.parallelize([1,2,3])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"<pyspark.rdd.RDD at 0x1e16d90>"
]
}
],
"prompt_number": 3
},
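{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a further sanity check (a minimal sketch; the numbers are arbitrary), you can run a small distributed computation and confirm that the workers respond:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Square the numbers 0..99 on the cluster and sum the results\n",
"sc.parallelize(range(100)).map(lambda x: x * x).sum()"
],
"language": "python",
"metadata": {},
"outputs": []
},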
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"WARNING: Shutdown this tutorial when you are done with it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of how PySpark works, the above context will hog all your cluster resources. If you are going to do new work and are done with this tutorial, remember to shut it down from the dashboard so you free the cluster for other work."
]
}
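,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before shutting the notebook down, you can also release the cluster resources held by this context from within the notebook itself (a minimal sketch, assuming the `sc` created above):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Stop the SparkContext so its executors are released back to the cluster\n",
"sc.stop()"
],
"language": "python",
"metadata": {},
"outputs": []
}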
],
"metadata": {}
}
]
}