HowTo for starting an IPython Notebook server
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Setting up the IPython Notebook with PySpark on AMPCamp EC2 clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note** this HowTo assumes that you are using a cluster provided by the AMPLab team, where port 8888 has already been opened. If you have spun up your own cluster using their AMI, you need to go to the Security Groups tab and open that port for traffic. [See here for full details](https://gist.github.com/iamatypeofwalrus/5183133).\n",
"\n",
"You can run the IPython Notebook interface as a more friendly way to interact with your AMPCamp EC2 cluster. The detailed instructions on how to run a public IPython Notebook Server are [here](http://ipython.org/ipython-doc/stable/interactive/public_server.html#running-a-public-notebook-server), but the basics are:\n",
"\n",
"* Create a certificate file for your cluster by typing at the command line:\n",
"\n",
" cd /root\n",
" openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem\n",
" \n",
"* Let's make sure there's a default IPython profile ready for us to use:\n",
"\n",
" ipython profile create default\n",
"\n",
"\n",
"* You will need a hashes password next, which you can create with (note these two lines are a *single* command, copy and paste the whole thing in one shot.\n",
"\n",
"\n",
" python -c \"from IPython.lib import passwd; print passwd()\" \\\n",
" > /root/.ipython/profile_default/nbpasswd.txt\n",
"\n",
"* Verify the password file has a string like `sha1:16a8a30fb9b6:82c0...` in it (your actual value will differ). If you don't get this, repeat the prior step:\n",
"\n",
" cat /root/.ipython/profile_default/nbpasswd.txt\n",
" sha1:16a8a30fb9b6:82c030d3989b0069b9ed603822949a954a2beb21\n",
"\n",
"\n",
"* Put the following into the file `/root/.ipython/profile_default/ipython_notebook_config.py`:\n",
"\n",
" # Configuration file for ipython-notebook.\n",
" c = get_config()\n",
" \n",
" # Notebook config\n",
" c.NotebookApp.certfile = u'/root/mycert.pem'\n",
" c.NotebookApp.ip = '*'\n",
" c.NotebookApp.open_browser = False\n",
" # It is a good idea to put it on a known, fixed port\n",
" c.NotebookApp.port = 8888\n",
" \n",
" PWDFILE=\"/root/.ipython/profile_default/nbpasswd.txt\"\n",
" c.NotebookApp.password = open(PWDFILE).read().strip()\n",
"\n",
"* Put the following into the file `/root/.ipython/profile_default/startup/00-pyspark-setup.py`:\n",
"\n",
" # Configure the necessary Spark environment\n",
" import os\n",
" os.environ['SPARK_HOME'] = '/root/spark/'\n",
" \n",
" # And Python path\n",
" import sys\n",
" sys.path.insert(0, '/root/spark/python')\n",
" \n",
" # Detect the PySpark URL\n",
" CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()\n",
"\n",
"\n",
"* That's it! You can now start the notebook server by typing the following command:\n",
"\n",
" ipython notebook\n",
"\n",
"**Note:** I *strongly* recommend you do this inside a `screen` or `tmux` session so it's persistent. This will let it survive cleanly if you lose your connection to your cluster.\n",
"\n",
"\n",
"You can then connect to the server via `https://[YOUR INSTANCE URL HERE]:8888`. Once you type your password, you should be able to start running code!\n",
"\n",
"**Warning:** the URL for your notebook must start with `https`, not `http`.\n",
"\n",
"The config file above creates a variable called `CLUSTER_URL` which you can use to create your `SparkContext`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print CLUSTER_URL"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"spark://ec2-50-16-173-245.compute-1.amazonaws.com:7077\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's create the context:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pyspark import SparkContext\n",
"sc = SarkContext( CLUSTER_URL, 'pyspark')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And test it by creating a trivial RDD:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sc.parallelize([1,2,3])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"<pyspark.rdd.RDD at 0x1e16d90>"
]
}
],
"prompt_number": 3
},
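{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a further sanity check (a minimal sketch; the numbers are arbitrary), you can run a small distributed computation and confirm that the workers respond:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Square the numbers 0..99 on the cluster and sum the results\n",
"sc.parallelize(range(100)).map(lambda x: x * x).sum()"
],
"language": "python",
"metadata": {},
"outputs": []
},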
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"WARNING: Shutdown this tutorial when you are done with it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of how PySpark works, the above context will hog all your cluster resources. If you are going to do new work and are done with this tutorial, remember to shut it down from the dashboard so you free the cluster for other work."
]
}
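,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before shutting the notebook down, you can also release the cluster resources held by this context from within the notebook itself (a minimal sketch, assuming the `sc` created above):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Stop the SparkContext so its executors are released back to the cluster\n",
"sc.stop()"
],
"language": "python",
"metadata": {},
"outputs": []
}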
],
"metadata": {}
}
]
}