psychemedia/election_odds.ipynb

## election_odds.ipynb
{
 "metadata": {
  "name": "",
  "signature": "sha256:e83697ab284a0e10d29ca1760ccf0340015de4f4b316a3e52e1e8e34be8ed200"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# A Quick Look at Election Betting Data\n",
      "\n",
      "For the hundred days or so leading up to the election, I was [scraping oddschecker listed betting odds](https://morph.io/psychemedia/electionodds) for most of the UK parliamentary constituencies (I didn't check that I was scraping them all; this was just a side-side-project...).\n",
      "\n",
      "I didn't look at the data at all in the run-up to the election (my original plan was to look at timeseries within each constituency to try to detect sudden changes in odds that might indicate some sort of major shift in sentiment in each constituency), but with things all settled now, I thought I'd look to see what tales - if any - the betting data might tell, at least at a high level. For example, what did the betting odds have to say about the likely number of seats taken by each party...?\n",
      "\n",
      "*This notebook was developed as part of an exploration into possible forms a student produced notebook could take as part of an assessment process for a course on data management and analysis using a dataset selected by a student. Intended student worktime for the assessment: <10 hours. Comments appreciated...*"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#I'm going to use the pandas library to analyse the data\n",
      "import pandas as pd"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 320
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Data Acquisition"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The data can be pulled directly from the scraper - https://morph.io/psychemedia/electionodds.\n",
      "Using the API, we can write a SQLite query to grab the odds for a particular day:\n",
      "\n",
      "`select * from 'Constituency2015GE' constituency where strftime('%d-%m-%Y',time)=\"06-05-2015\"`\n",
      "\n",
      "The data can be pulled directly into a pandas dataframe from the morph.io scraper API:"
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - IDENTIFY AN APPROPRIATE DATASET FOR USE IN A DATA INVESTIGATION\n",
      "ASSESSMENT - DEMONSTRATE HOW TO USE AN SQL STATEMENT TO RETRIEVE A DATASET\n",
      "\n",
      "NOTE: WE COULD USE pandasql TO RUN A SQL QUERY ON A DATAFRAME PUSHED INTO A SQLITE DATABASE"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#df=pd.read_csv('https://api.morph.io/psychemedia/electionodds/data.csv?key='+SECRETKEY+'&query=select%20*%20from%20%27Constituency2015GE%27%20constituency%20where%20strftime(%27%25d-%25m-%25Y%27%2Ctime)%3D%2206-05-2015%22)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 321
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Or you can download the electionodds data and pop it into a file...\n",
      "#Don't believe the filename - the data is data collected at some point on May 5th\n",
      "df=pd.read_csv('electionodds_thurs.csv')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 322
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO LOAD IN FROM, OR SAVE DATA TO, A DATA FILE IN A RECOGNISED FORMAT "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Data Review\n",
      "\n",
      "Let's just get a quick view over the data to familiarise ourselves with what it contains."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Preview the data\n",
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> LD</td>\n",
        "      <td>  green</td>\n",
        "      <td> 100.00</td>\n",
        "      <td>   100</td>\n",
        "      <td> aberavon</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> FB</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> LD</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> FB</td>\n",
        "      <td> labour</td>\n",
        "      <td>   0.01</td>\n",
        "      <td> 1/100</td>\n",
        "      <td> aberavon</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 323,
       "text": [
        "                        time bookie   party    odds oddsraw constituency\n",
        "0  2015-05-05T09:54:19+00:00     LD   green  100.00     100     aberavon\n",
        "1  2015-05-05T09:54:19+00:00     FB      pc   50.00      50     aberavon\n",
        "2  2015-05-05T09:54:19+00:00     LD      pc   50.00      50     aberavon\n",
        "3  2015-05-05T09:54:19+00:00     WH      pc   50.00      50     aberavon\n",
        "4  2015-05-05T09:54:19+00:00     FB  labour    0.01   1/100     aberavon"
       ]
      }
     ],
     "prompt_number": 323
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `odds` column is simply a decimalised version of the `oddsraw` value. Note that this is not strictly the decimal odds, which would be 1 greater."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#What bookies did we collect data for?\n",
      "df['bookie'].unique()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 324,
       "text": [
        "array(['LD', 'FB', 'WH', 'B3'], dtype=object)"
       ]
      }
     ],
     "prompt_number": 324
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#How many constituencies did we grab data for?\n",
      "len(df['constituency'].unique())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 325,
       "text": [
        "650"
       ]
      }
     ],
     "prompt_number": 325
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#What is the range of odds offered?\n",
      "(df['odds'].min(), df['odds'].max())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 326,
       "text": [
        "(0.001, 9999.0)"
       ]
      }
     ],
     "prompt_number": 326
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#What parties did we collect data for and in what numbers?\n",
      "#Note that we are likely to count the same party several times in each constituency,\n",
      "# once for each bookmaker offering odds on that party in that constituency\n",
      "df['party'].value_counts().head(15)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 327,
       "text": [
        "labour                          2103\n",
        "liberal democrats               2014\n",
        "ukip                            1977\n",
        "conservatives                   1945\n",
        "any other party or candidate     600\n",
        "greens                           580\n",
        "green                            273\n",
        "snp                              224\n",
        "conservative                     159\n",
        "tusc                             130\n",
        "green party                      105\n",
        "pc                                91\n",
        "liberal democrat                  85\n",
        "alliance                          57\n",
        "sinn fein                         57\n",
        "dtype: int64"
       ]
      }
     ],
     "prompt_number": 327
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The count of parties appearances in the dataset is not necessarily very useful, becuase the same party may count more than several times in the same constituency given the odds from different bookmakers. However, the count does clearly show us that there are multiple possible representations of what are presumably the same party name."
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE TWO OR MORE TECHNIQUES THAT PROVIDE AN OVERVIEW OF A NEW DATASET"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Cleaning the Data\n",
      "\n",
      "Note there are several opportunities in the `party` column at least for cleaning the data - for example, *green*, *greens* and *green party* are likely the same party, as are *conservative* and *conservatives*. A quick and dirty cleaning approach would be to right strip \"s\", replace occurrences of \"party\" at the end of the string, and then `strip()` just to clear away any whitespace."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df['party_clean']=df['party'].str.rstrip('s').str.replace(r'party$','').str.strip()\n",
      "df['party_clean'].value_counts()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 328,
       "text": [
        "conservative                    2104\n",
        "labour                          2103\n",
        "liberal democrat                2099\n",
        "ukip                            1977\n",
        "green                            958\n",
        "any other party or candidate     600\n",
        "snp                              224\n",
        "tusc                             130\n",
        "pc                                91\n",
        "alliance                          57\n",
        "sinn fein                         57\n",
        "sdlp                              56\n",
        "dup                               53\n",
        "uup                               46\n",
        "any other                         45\n",
        "...\n",
        "lorraine morgan-brinkhurst    1\n",
        "alfred okam                   1\n",
        "chaka artwell                 1\n",
        "john neville hobb             1\n",
        "any other independant         1\n",
        "les tallon-morri              1\n",
        "the whig                      1\n",
        "christopher tompson           1\n",
        "robin lambert                 1\n",
        "residents for uttlesford      1\n",
        "james kirkcaldy               1\n",
        "europeans                     1\n",
        "roy ivinson                   1\n",
        "christopher gray              1\n",
        "criag pond                    1\n",
        "Length: 309, dtype: int64"
       ]
      }
     ],
     "prompt_number": 328
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Inspect the full range of parties, if required, perhaps as basis for further cleaning\n",
      "#df['party_clean'].unique()\n",
      "#A more advanced approach might be to run the names through a clustering algorithm,\n",
      "#or partial string matcher, to see if there aare any near collisions that should be combined"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 329
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE TWO OR MORE TECHNIQUES THAT CAN BE APPLIED TO CLEAN A DATASET\n",
      "ASSESSMENT - DEMONSTRATE HOW TO SORT A DATASET"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we wanted a strict decimal odds column, we could simply add 1 to the `odds` column:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df['decimal_odds'] = df['odds']+1\n",
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> LD</td>\n",
        "      <td>  green</td>\n",
        "      <td> 100.00</td>\n",
        "      <td>   100</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>  green</td>\n",
        "      <td> 101.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> FB</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  51.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> LD</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  51.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>     pc</td>\n",
        "      <td>  51.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> FB</td>\n",
        "      <td> labour</td>\n",
        "      <td>   0.01</td>\n",
        "      <td> 1/100</td>\n",
        "      <td> aberavon</td>\n",
        "      <td> labour</td>\n",
        "      <td>   1.01</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 330,
       "text": [
        "                        time bookie   party    odds oddsraw constituency  \\\n",
        "0  2015-05-05T09:54:19+00:00     LD   green  100.00     100     aberavon   \n",
        "1  2015-05-05T09:54:19+00:00     FB      pc   50.00      50     aberavon   \n",
        "2  2015-05-05T09:54:19+00:00     LD      pc   50.00      50     aberavon   \n",
        "3  2015-05-05T09:54:19+00:00     WH      pc   50.00      50     aberavon   \n",
        "4  2015-05-05T09:54:19+00:00     FB  labour    0.01   1/100     aberavon   \n",
        "\n",
        "  party_clean  decimal_odds  \n",
        "0       green        101.00  \n",
        "1          pc         51.00  \n",
        "2          pc         51.00  \n",
        "3          pc         51.00  \n",
        "4      labour          1.01  "
       ]
      }
     ],
     "prompt_number": 330
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - ADD A NEW COLUMN TO A DATASET\n",
      "ASSESSMENT - GENERATE A NEW COLUMN FROM A PRE-EXISTING COLUMN"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Filtering / Reducing the Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To start with, let's focus on the odds offered by a single bookmaker. To choose which bookie, let's see how many constituencies each of them have prices for:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for bookie in df['bookie'].unique():\n",
      "    print('{}: {}'.format(bookie,len(df[df['bookie']==bookie]['constituency'].unique())))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "LD: 627\n",
        "FB: 648\n",
        "WH: 648\n",
        "B3: 239\n"
       ]
      }
     ],
     "prompt_number": 331
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMO"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So a good candidate would be WH (William Hill) or LD (Ladbrokes). Let's go with the former..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Grab the data for a particular bookie into a separate dataframe\n",
      "df_wh=df[df['bookie']=='WH']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 332
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE A WAY OF SUBSETTING A DATASET BASED ON ONE OR MORE ROW BASED CRITERIA"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One thing we might want to do over the full dataset is see what the odds were for a partcular constituency offered by a particular bookmarker. We can write a simple convenience function to help us do that."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Write a convenience function to look up odds by constituency\n",
      "def oddsForConstituency(df,bookie,constituency):\n",
      "    ''' Function to find rows associated with a particular bookie in a particular constituency '''\n",
      "    filterView= df[(df['bookie'].str.upper()==bookie.upper()) & \n",
      "                   (df['constituency'].str.lower()==constituency.lower())] \n",
      "    return filterView\n",
      "\n",
      "oddsForConstituency(df,'WH','berwickshire-roxburgh-and-selkirk')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>919</th>\n",
        "      <td> 2015-05-05T09:56:50+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>               snp</td>\n",
        "      <td>   2.25</td>\n",
        "      <td> 9/4</td>\n",
        "      <td> berwickshire-roxburgh-and-selkirk</td>\n",
        "      <td>              snp</td>\n",
        "      <td>   3.25</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>923</th>\n",
        "      <td> 2015-05-05T09:56:50+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td> 150.00</td>\n",
        "      <td> 150</td>\n",
        "      <td> berwickshire-roxburgh-and-selkirk</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 151.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>927</th>\n",
        "      <td> 2015-05-05T09:56:50+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> liberal democrats</td>\n",
        "      <td>   2.00</td>\n",
        "      <td>   2</td>\n",
        "      <td> berwickshire-roxburgh-and-selkirk</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td>   3.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>935</th>\n",
        "      <td> 2015-05-05T09:56:50+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>     conservatives</td>\n",
        "      <td>   1.20</td>\n",
        "      <td> 6/5</td>\n",
        "      <td> berwickshire-roxburgh-and-selkirk</td>\n",
        "      <td>     conservative</td>\n",
        "      <td>   2.20</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 354,
       "text": [
        "                          time bookie              party    odds oddsraw  \\\n",
        "919  2015-05-05T09:56:50+00:00     WH                snp    2.25     9/4   \n",
        "923  2015-05-05T09:56:50+00:00     WH             labour  150.00     150   \n",
        "927  2015-05-05T09:56:50+00:00     WH  liberal democrats    2.00       2   \n",
        "935  2015-05-05T09:56:50+00:00     WH      conservatives    1.20     6/5   \n",
        "\n",
        "                          constituency       party_clean  decimal_odds  \n",
        "919  berwickshire-roxburgh-and-selkirk               snp          3.25  \n",
        "923  berwickshire-roxburgh-and-selkirk            labour        151.00  \n",
        "927  berwickshire-roxburgh-and-selkirk  liberal democrat          3.00  \n",
        "935  berwickshire-roxburgh-and-selkirk      conservative          2.20  "
       ]
      }
     ],
     "prompt_number": 354
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEFINE AND APPLY A SIMPLE PYTHON FUNCTION THAT ACCEPTS ONE OR MORE PARAMETERS AND RETURNS ONE OR MORE VALUES"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Reshaping the Data\n",
      "\n",
      "The data as it stands is in a relatively tidy (Third Normal Form) *long* format. Each row contains a single observation that associates the odds for a single party with a particular bookmaker in each constituency.\n",
      "\n",
      "If we wanted to compare the odds for particular parties within each constituency, it might be more convenient to put the data into a wide format, with a separate column for the odds offered for each party, indexed by cosntituency:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dfp=df_wh.pivot('constituency','party_clean','odds')\n",
      "dfp.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>party_clean</th>\n",
        "      <th>al murray - the pub landlord</th>\n",
        "      <th>alliance</th>\n",
        "      <th>bez</th>\n",
        "      <th>bnp</th>\n",
        "      <th>claire wright</th>\n",
        "      <th>conservative</th>\n",
        "      <th>dup</th>\n",
        "      <th>green</th>\n",
        "      <th>ian steven</th>\n",
        "      <th>john bercow</th>\n",
        "      <th>...</th>\n",
        "      <th>respect</th>\n",
        "      <th>robin scott</th>\n",
        "      <th>sdlp</th>\n",
        "      <th>sinn fein</th>\n",
        "      <th>snp</th>\n",
        "      <th>stephen picton</th>\n",
        "      <th>sylvia hermon</th>\n",
        "      <th>tuv</th>\n",
        "      <th>ukip</th>\n",
        "      <th>uup</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>constituency</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>aberavon</th>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 100.0</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>...</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>      NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>  16</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>aberconwy</th>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>   0.4</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>...</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>      NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>  80</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>aberdeen-north</th>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 150.0</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>...</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 0.100000</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 150</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>aberdeen-south</th>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>  50.0</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>...</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 0.142857</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 250</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>aberdeenshire-west-and-kincardine</th>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>   5.5</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>...</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 0.142857</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td> NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 30 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 333,
       "text": [
        "party_clean                        al murray - the pub landlord  alliance  \\\n",
        "constituency                                                                \n",
        "aberavon                                                    NaN       NaN   \n",
        "aberconwy                                                   NaN       NaN   \n",
        "aberdeen-north                                              NaN       NaN   \n",
        "aberdeen-south                                              NaN       NaN   \n",
        "aberdeenshire-west-and-kincardine                           NaN       NaN   \n",
        "\n",
        "party_clean                        bez  bnp  claire wright  conservative  dup  \\\n",
        "constituency                                                                    \n",
        "aberavon                           NaN  NaN            NaN         100.0  NaN   \n",
        "aberconwy                          NaN  NaN            NaN           0.4  NaN   \n",
        "aberdeen-north                     NaN  NaN            NaN         150.0  NaN   \n",
        "aberdeen-south                     NaN  NaN            NaN          50.0  NaN   \n",
        "aberdeenshire-west-and-kincardine  NaN  NaN            NaN           5.5  NaN   \n",
        "\n",
        "party_clean                        green  ian steven  john bercow ...   \\\n",
        "constituency                                                      ...    \n",
        "aberavon                             NaN         NaN          NaN ...    \n",
        "aberconwy                            NaN         NaN          NaN ...    \n",
        "aberdeen-north                       NaN         NaN          NaN ...    \n",
        "aberdeen-south                       NaN         NaN          NaN ...    \n",
        "aberdeenshire-west-and-kincardine    NaN         NaN          NaN ...    \n",
        "\n",
        "party_clean                        respect  robin scott  sdlp  sinn fein  \\\n",
        "constituency                                                               \n",
        "aberavon                               NaN          NaN   NaN        NaN   \n",
        "aberconwy                              NaN          NaN   NaN        NaN   \n",
        "aberdeen-north                         NaN          NaN   NaN        NaN   \n",
        "aberdeen-south                         NaN          NaN   NaN        NaN   \n",
        "aberdeenshire-west-and-kincardine      NaN          NaN   NaN        NaN   \n",
        "\n",
        "party_clean                             snp  stephen picton  sylvia hermon  \\\n",
        "constituency                                                                 \n",
        "aberavon                                NaN             NaN            NaN   \n",
        "aberconwy                               NaN             NaN            NaN   \n",
        "aberdeen-north                     0.100000             NaN            NaN   \n",
        "aberdeen-south                     0.142857             NaN            NaN   \n",
        "aberdeenshire-west-and-kincardine  0.142857             NaN            NaN   \n",
        "\n",
        "party_clean                        tuv  ukip  uup  \n",
        "constituency                                       \n",
        "aberavon                           NaN    16  NaN  \n",
        "aberconwy                          NaN    80  NaN  \n",
        "aberdeen-north                     NaN   150  NaN  \n",
        "aberdeen-south                     NaN   250  NaN  \n",
        "aberdeenshire-west-and-kincardine  NaN   NaN  NaN  \n",
        "\n",
        "[5 rows x 30 columns]"
       ]
      }
     ],
     "prompt_number": 333
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Preview all the columns\n",
      "dfp.columns"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 334,
       "text": [
        "Index([u'al murray - the pub landlord', u'alliance', u'bez', u'bnp', u'claire wright', u'conservative', u'dup', u'green', u'ian steven', u'john bercow', u'john blackie', u'kerry smith', u'labour', u'lib dem', u'liberal democrat', u'national health action', u'ni conservative', u'pc', u'pirate', u'plaid cymru', u'respect', u'robin scott', u'sdlp', u'sinn fein', u'snp', u'stephen picton', u'sylvia hermon', u'tuv', u'ukip', u'uup'], dtype='object')"
       ]
      }
     ],
     "prompt_number": 334
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO RESHAPE A DATASET, EG USING PIVOT, MELT, STACK OR UNSTACK OPERATORS"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we wanted to limit the dataset to just major parties - that is, parties contesting a large number of seats - we could reduce the dataset as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "party_subset=['green','labour','liberal democrat','conservative','snp','ukip']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 335
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Limit rows in long dataset to selected parties\n",
      "example=df_wh[df_wh['party_clean'].isin(party_subset)]\n",
      "example['party_clean'].unique()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 336,
       "text": [
        "array(['labour', 'liberal democrat', 'ukip', 'conservative', 'snp', 'green'], dtype=object)"
       ]
      }
     ],
     "prompt_number": 336
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Limit columns in wide dataset to selected parties\n",
      "example=dfp[party_subset]\n",
      "example.columns"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 337,
       "text": [
        "Index([u'green', u'labour', u'liberal democrat', u'conservative', u'snp', u'ukip'], dtype='object')"
       ]
      }
     ],
     "prompt_number": 337
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE A WAY OF SUBSETTING A DATASET BASED ON ONE OR MORE COLUMN BASED CRITERIA"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "However, there are significant risks in taking this sort of approach. For example, where an individual is standing in a particular consituency, perhaps under their own name, who has a good chance of winning, their candidacy would not be captured in the reduced dataset.\n",
      "\n",
      "Perhaps a better dataset would be one that includes all major parties *and* any candidates who appear to have a reasonable chance of winning (say, 10 to 1 or better)."
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - CRITIQUE THE APPROPRIATENESS OF A PARTICULAR QUESTION ASKED OF THE DATA IN A PARTICULAR WAY"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Reducing the Dataset to Scottish Constituencies\n",
      "\n",
      "One reduced dataset we might be interested in working with are the Scottish constituencies. We can use the fact that the SNP had a candidate in the seat as a proxy for which constituencies should be included in this set.\n",
      "\n",
      "One way of identifying those seats is to filter the wide data set (which uses constituencies as the index values) to exclude rows in the SNP column with a null (`NaN`) value:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dfp['snp'].dropna().index.values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 338,
       "text": [
        "array(['aberdeen-north', 'aberdeen-south',\n",
        "       'aberdeenshire-west-and-kincardine', 'airdrie-and-shotts', 'angus',\n",
        "       'argyll-and-bute', 'ayr-carrick-and-cumnock', 'ayrshire-central',\n",
        "       'ayrshire-north-and-arran', 'banff-and-buchan',\n",
        "       'berwickshire-roxburgh-and-selkirk',\n",
        "       'caithness-sutherland-and-easter-ross',\n",
        "       'coatbridge-chryston-and-bellshill',\n",
        "       'cumbernauld-kilsyth-and-kirkintill', 'dumfries-and-galloway',\n",
        "       'dumfriesshire-clydesdale-and-tweeddale', 'dunbartonshire-east',\n",
        "       'dunbartonshire-west', 'dundee-east', 'dundee-west',\n",
        "       'dunfermline-and-west-fife', 'east-kilbride-strathaven-and-lesma',\n",
        "       'east-lothian', 'edinburgh-east', 'edinburgh-north-and-leith',\n",
        "       'edinburgh-south', 'edinburgh-south-west', 'edinburgh-west',\n",
        "       'falkirk', 'fife-north-east', 'glasgow-central', 'glasgow-east',\n",
        "       'glasgow-north', 'glasgow-north-east', 'glasgow-north-west',\n",
        "       'glasgow-south', 'glasgow-south-west', 'glenrothes', 'gordon',\n",
        "       'inverclyde', 'inverness-nairn-badenoch-and-strathspey',\n",
        "       'kilmarnock-and-loudoun', 'kirkcaldy-and-cowdenbeath',\n",
        "       'lanark-and-hamilton-east', 'linlithgow-and-falkirk-ea',\n",
        "       'livingston', 'midlothian', 'moray', 'motherwell-and-wishaw',\n",
        "       'na-h-eileanan-an-iar', 'ochill-and-sth-perthshire',\n",
        "       'orkney-and-shetland', 'paisley-and-renf-north',\n",
        "       'paisley-and-renf-south', 'perth-and-n-perthshire',\n",
        "       'renfrewshire-east', 'ross-skye-and-lochaber',\n",
        "       'rutherglen-and-hamilton-west', 'stirling'], dtype=object)"
       ]
      }
     ],
     "prompt_number": 338
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE A TECHNIQUE FOR CLEANING OR REDUCING A DATASET BASED ON THE PRESENCE OF NULL VALUES "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Which Seats Were Safest?\n",
      "\n",
      "The safest seats are the seats with the shortest (smallest) odds. If we sort the table by increasing odds and select the top few, that should give us the most secure seats."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_wh.sort('odds',ascending=True).head(10)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>8438 </th>\n",
        "      <td> 2015-05-05T10:14:59+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>      runnymede-and-weybridge</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5376 </th>\n",
        "      <td> 2015-05-05T10:07:32+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>             hertfordshire-ne</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5362 </th>\n",
        "      <td> 2015-05-05T10:07:27+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>       hertford-and-stortford</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5319 </th>\n",
        "      <td> 2015-05-05T10:07:20+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>                       henley</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1628 </th>\n",
        "      <td> 2015-05-05T09:58:35+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>          brentwood-and-ongar</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10097</th>\n",
        "      <td> 2015-05-05T10:19:14+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>               tyneside-north</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8941 </th>\n",
        "      <td> 2015-05-05T10:16:17+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>                south-shields</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5029 </th>\n",
        "      <td> 2015-05-05T10:06:43+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>               hampshire-east</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4989 </th>\n",
        "      <td> 2015-05-05T10:06:39+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td>                       halton</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4916 </th>\n",
        "      <td> 2015-05-05T10:06:27+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 1/200</td>\n",
        "      <td> hackney-south-and-shoreditch</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.005</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 339,
       "text": [
        "                            time bookie          party   odds oddsraw  \\\n",
        "8438   2015-05-05T10:14:59+00:00     WH  conservatives  0.005   1/200   \n",
        "5376   2015-05-05T10:07:32+00:00     WH  conservatives  0.005   1/200   \n",
        "5362   2015-05-05T10:07:27+00:00     WH  conservatives  0.005   1/200   \n",
        "5319   2015-05-05T10:07:20+00:00     WH  conservatives  0.005   1/200   \n",
        "1628   2015-05-05T09:58:35+00:00     WH  conservatives  0.005   1/200   \n",
        "10097  2015-05-05T10:19:14+00:00     WH         labour  0.005   1/200   \n",
        "8941   2015-05-05T10:16:17+00:00     WH         labour  0.005   1/200   \n",
        "5029   2015-05-05T10:06:43+00:00     WH  conservatives  0.005   1/200   \n",
        "4989   2015-05-05T10:06:39+00:00     WH         labour  0.005   1/200   \n",
        "4916   2015-05-05T10:06:27+00:00     WH         labour  0.005   1/200   \n",
        "\n",
        "                       constituency   party_clean  decimal_odds  \n",
        "8438        runnymede-and-weybridge  conservative         1.005  \n",
        "5376               hertfordshire-ne  conservative         1.005  \n",
        "5362         hertford-and-stortford  conservative         1.005  \n",
        "5319                         henley  conservative         1.005  \n",
        "1628            brentwood-and-ongar  conservative         1.005  \n",
        "10097                tyneside-north        labour         1.005  \n",
        "8941                  south-shields        labour         1.005  \n",
        "5029                 hampshire-east  conservative         1.005  \n",
        "4989                         halton        labour         1.005  \n",
        "4916   hackney-south-and-shoreditch        labour         1.005  "
       ]
      }
     ],
     "prompt_number": 339
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###In Which Seats Was The Uncertainty Largest?\n",
      "\n",
      "This is not simply a question of finding the constituency with the longest odds, but a question about finding the constituency with the longest (largest) favourite's odds.\n",
      "\n",
      "If we find the minimum value across each row in the wide dataset, ignoring the missing values, we get the odds of the favourite. We can then sort on that value in a descending fashion to find the constituencies with the longest favourite odds."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dfp.min(axis=1).order(ascending=False).head(10)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 340,
       "text": [
        "constituency\n",
        "berwickshire-roxburgh-and-selkirk    1.200000\n",
        "dumfries-and-galloway                0.909091\n",
        "edinburgh-south                      0.909091\n",
        "northampton-north                    0.909091\n",
        "torbay                               0.909091\n",
        "st-ives                              0.833333\n",
        "halesowen-and-rowley-regis           0.833333\n",
        "pudsey                               0.833333\n",
        "finchley-and-golders-green           0.800000\n",
        "cornwall-north                       0.800000\n",
        "dtype: float64"
       ]
      }
     ],
     "prompt_number": 340
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO PERFORM OPERATIONS ACROSS A ROW"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### In Which Seats Did the Green Party Have Odds of Better Than 10 to 1 ?\n",
      "\n",
      "We can ask this question by filtering the long dataset using two criteria combined using a Boolean operator:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_wh[ (df_wh['party_clean']=='green') & (df_wh['odds']<=10) ]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>1713</th>\n",
        "      <td> 2015-05-05T09:58:49+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> greens</td>\n",
        "      <td> 0.222222</td>\n",
        "      <td> 2/9</td>\n",
        "      <td> brighton-pavilion</td>\n",
        "      <td> green</td>\n",
        "      <td> 1.222222</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1788</th>\n",
        "      <td> 2015-05-05T09:58:56+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> greens</td>\n",
        "      <td> 4.500000</td>\n",
        "      <td> 9/2</td>\n",
        "      <td>      bristol-west</td>\n",
        "      <td> green</td>\n",
        "      <td> 5.500000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7454</th>\n",
        "      <td> 2015-05-05T10:12:36+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> greens</td>\n",
        "      <td> 5.000000</td>\n",
        "      <td>   5</td>\n",
        "      <td>     norwich-south</td>\n",
        "      <td> green</td>\n",
        "      <td> 6.000000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 341,
       "text": [
        "                           time bookie   party      odds oddsraw  \\\n",
        "1713  2015-05-05T09:58:49+00:00     WH  greens  0.222222     2/9   \n",
        "1788  2015-05-05T09:58:56+00:00     WH  greens  4.500000     9/2   \n",
        "7454  2015-05-05T10:12:36+00:00     WH  greens  5.000000       5   \n",
        "\n",
        "           constituency party_clean  decimal_odds  \n",
        "1713  brighton-pavilion       green      1.222222  \n",
        "1788       bristol-west       green      5.500000  \n",
        "7454      norwich-south       green      6.000000  "
       ]
      }
     ],
     "prompt_number": 341
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - USE A BOOLEAN OPERATOR TO FILTER A DATASET BASED ON TWO OR MORE CRITERIA"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### How Many Seats Were Each Party Favourite In?\n",
      "\n",
      "Trivially, we might think to sort the parties by each constituency in terms of increasing odds, then pick the one with the lowest odds.\n",
      "\n",
      "If we sort the data frame in order of increasing odds, and then group by constituency, the order of the rows within each group will be in increasing order of odds."
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - GENERATE ONE OR MORE QUESTIONS TO ASK OF A SELECTED DATASET"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).get_group(\"aberavon\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>6 </th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td>   0.01</td>\n",
        "      <td> 1/100</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>           labour</td>\n",
        "      <td>   1.01</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>13</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>              ukip</td>\n",
        "      <td>  16.00</td>\n",
        "      <td>    16</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>             ukip</td>\n",
        "      <td>  17.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3 </th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>                pc</td>\n",
        "      <td>  50.00</td>\n",
        "      <td>    50</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>               pc</td>\n",
        "      <td>  51.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> liberal democrats</td>\n",
        "      <td> 100.00</td>\n",
        "      <td>   100</td>\n",
        "      <td> aberavon</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td> 101.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>16</th>\n",
        "      <td> 2015-05-05T09:54:19+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>     conservatives</td>\n",
        "      <td> 100.00</td>\n",
        "      <td>   100</td>\n",
        "      <td> aberavon</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 101.00</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 342,
       "text": [
        "                         time bookie              party    odds oddsraw  \\\n",
        "6   2015-05-05T09:54:19+00:00     WH             labour    0.01   1/100   \n",
        "13  2015-05-05T09:54:19+00:00     WH               ukip   16.00      16   \n",
        "3   2015-05-05T09:54:19+00:00     WH                 pc   50.00      50   \n",
        "10  2015-05-05T09:54:19+00:00     WH  liberal democrats  100.00     100   \n",
        "16  2015-05-05T09:54:19+00:00     WH      conservatives  100.00     100   \n",
        "\n",
        "   constituency       party_clean  decimal_odds  \n",
        "6      aberavon            labour          1.01  \n",
        "13     aberavon              ukip         17.00  \n",
        "3      aberavon                pc         51.00  \n",
        "10     aberavon  liberal democrat        101.00  \n",
        "16     aberavon      conservative        101.00  "
       ]
      }
     ],
     "prompt_number": 342
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO GROUP A DATASET ACCORDING TO ONE OR MORE CRITERIA\n",
      "ASSESSMENT - DEMONSTRATE HOW TO ACCESS A PARTICULAR GROUP AS A GROUP"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we pick the `first()` row in each group, we can generate a dataframe that contains the a single row for each consituency identifying a party with those best odds.\n",
      "\n",
      "We can then group these rows according to the cleaned party name, and count how many rows correspond to each party, ordering the result to show the most heavily favourited party first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "likelyparty=df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).first()\n",
      "likelyparty.groupby('party_clean').size().order(ascending=False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 343,
       "text": [
        "party_clean\n",
        "conservative        278\n",
        "labour              263\n",
        "snp                  55\n",
        "liberal democrat     26\n",
        "dup                   9\n",
        "sinn fein             4\n",
        "ukip                  3\n",
        "sdlp                  3\n",
        "pc                    2\n",
        "sylvia hermon         1\n",
        "respect               1\n",
        "plaid cymru           1\n",
        "john bercow           1\n",
        "green                 1\n",
        "dtype: int64"
       ]
      }
     ],
     "prompt_number": 343
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Interpreting this naively, we see there are 278 seats with the Conservatives as favourite, 263 with Labour as favourite, and so on."
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO PROCESS ELEMENTS IN A GROUP BY GROUP\n",
      "ASSESSMENT - INTERPRET THE RESULTS GENERATED BY ASKING A PARTICULAR QUESTION OF A SELECTED DATASET"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "However, this approach would incorrectly predict seats where there are *joint favourites*, if there are any. Let's see if we can identify constituency seats where low odds are tied..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Start by considering short odds\n",
      "#then group by odds in each constituency\n",
      "#count the rows in each group\n",
      "#order the result\n",
      "#and show the top few results\n",
      "df_wh[df_wh['odds']<=2] \\\n",
      ".groupby(['odds','constituency']) \\\n",
      ".size() \\\n",
      ".order(ascending=False) \\\n",
      ".head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 344,
       "text": [
        "odds      constituency              \n",
        "0.833333  pudsey                        2\n",
        "0.909091  northampton-north             2\n",
        "          torbay                        2\n",
        "0.833333  halesowen-and-rowley-regis    2\n",
        "0.015152  workington                    1\n",
        "dtype: int64"
       ]
      }
     ],
     "prompt_number": 344
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "My reading of this is that there are two parties tied on odds of 0.8333 in Pudsey. Let's check:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_wh[df_wh['constituency']=='pudsey']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>8003</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td>   0.833333</td>\n",
        "      <td> 5/6</td>\n",
        "      <td> pudsey</td>\n",
        "      <td>           labour</td>\n",
        "      <td>   1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8009</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> liberal democrats</td>\n",
        "      <td> 100.000000</td>\n",
        "      <td> 100</td>\n",
        "      <td> pudsey</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td> 101.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8013</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>              ukip</td>\n",
        "      <td>  40.000000</td>\n",
        "      <td>  40</td>\n",
        "      <td> pudsey</td>\n",
        "      <td>             ukip</td>\n",
        "      <td>  41.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8017</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>     conservatives</td>\n",
        "      <td>   0.833333</td>\n",
        "      <td> 5/6</td>\n",
        "      <td> pudsey</td>\n",
        "      <td>     conservative</td>\n",
        "      <td>   1.833333</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 345,
       "text": [
        "                           time bookie              party        odds oddsraw  \\\n",
        "8003  2015-05-05T10:13:53+00:00     WH             labour    0.833333     5/6   \n",
        "8009  2015-05-05T10:13:53+00:00     WH  liberal democrats  100.000000     100   \n",
        "8013  2015-05-05T10:13:53+00:00     WH               ukip   40.000000      40   \n",
        "8017  2015-05-05T10:13:53+00:00     WH      conservatives    0.833333     5/6   \n",
        "\n",
        "     constituency       party_clean  decimal_odds  \n",
        "8003       pudsey            labour      1.833333  \n",
        "8009       pudsey  liberal democrat    101.000000  \n",
        "8013       pudsey              ukip     41.000000  \n",
        "8017       pudsey      conservative      1.833333  "
       ]
      }
     ],
     "prompt_number": 345
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Finding Rows in Constituencies Where Odds Are Tied\n",
      "\n",
      "If we wanted a long dataset containing rows where the odds are tied within a constituency, we could use the following filter command:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Limit rows to rows in constituencies where there are parties with the same odds\n",
      "#That is, where there is more than one member in groups of odds by constituency\n",
      "df_sameodds=df_wh.groupby(['odds','constituency']).filter(lambda x: len(x) > 1)\n",
      "df_sameodds.sort(['odds','constituency']).head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>4933</th>\n",
        "      <td> 2015-05-05T10:06:29+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>   5/6</td>\n",
        "      <td> halesowen-and-rowley-regis</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4948</th>\n",
        "      <td> 2015-05-05T10:06:29+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>   5/6</td>\n",
        "      <td> halesowen-and-rowley-regis</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8003</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>   5/6</td>\n",
        "      <td>                     pudsey</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8017</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> conservatives</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>   5/6</td>\n",
        "      <td>                     pudsey</td>\n",
        "      <td> conservative</td>\n",
        "      <td> 1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7387</th>\n",
        "      <td> 2015-05-05T10:12:26+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>        labour</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td> 10/11</td>\n",
        "      <td>          northampton-north</td>\n",
        "      <td>       labour</td>\n",
        "      <td> 1.909091</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 346,
       "text": [
        "                           time bookie          party      odds oddsraw  \\\n",
        "4933  2015-05-05T10:06:29+00:00     WH         labour  0.833333     5/6   \n",
        "4948  2015-05-05T10:06:29+00:00     WH  conservatives  0.833333     5/6   \n",
        "8003  2015-05-05T10:13:53+00:00     WH         labour  0.833333     5/6   \n",
        "8017  2015-05-05T10:13:53+00:00     WH  conservatives  0.833333     5/6   \n",
        "7387  2015-05-05T10:12:26+00:00     WH         labour  0.909091   10/11   \n",
        "\n",
        "                    constituency   party_clean  decimal_odds  \n",
        "4933  halesowen-and-rowley-regis        labour      1.833333  \n",
        "4948  halesowen-and-rowley-regis  conservative      1.833333  \n",
        "8003                      pudsey        labour      1.833333  \n",
        "8017                      pudsey  conservative      1.833333  \n",
        "7387           northampton-north        labour      1.909091  "
       ]
      }
     ],
     "prompt_number": 346
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO PROCESS A GROUP BASED ON GROUP PROPERTIES"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Which party is \"second\" favourite, ordered by shortest odds first?\n",
      "\n",
      "If we want to get a feel for which party is second favourite (or tied on joint odds with the \"first\" favourite), we can use the `nth()` rather than `first()` method on the odds sorted, constituency grouped form of the dataset, noting that `nth()` starts counting with index 0, so `nth(1)` corresponds to the *second* item in the ordered list:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "secondparty=df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).nth(1)\n",
      "secondparty.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>4933</th>\n",
        "      <td> 2015-05-05T10:06:29+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>   5/6</td>\n",
        "      <td> halesowen-and-rowley-regis</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8003</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>   5/6</td>\n",
        "      <td>                     pudsey</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4364</th>\n",
        "      <td> 2015-05-05T10:05:10+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td> 10/11</td>\n",
        "      <td> finchley-and-golders-green</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 1.909091</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>9960</th>\n",
        "      <td> 2015-05-05T10:18:59+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>      conservative</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td> 10/11</td>\n",
        "      <td>                     torbay</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 1.909091</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2871</th>\n",
        "      <td> 2015-05-05T10:01:50+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> liberal democrats</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td> 10/11</td>\n",
        "      <td>             cornwall-north</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td> 1.909091</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 347,
       "text": [
        "                           time bookie              party      odds oddsraw  \\\n",
        "4933  2015-05-05T10:06:29+00:00     WH             labour  0.833333     5/6   \n",
        "8003  2015-05-05T10:13:53+00:00     WH             labour  0.833333     5/6   \n",
        "4364  2015-05-05T10:05:10+00:00     WH             labour  0.909091   10/11   \n",
        "9960  2015-05-05T10:18:59+00:00     WH       conservative  0.909091   10/11   \n",
        "2871  2015-05-05T10:01:50+00:00     WH  liberal democrats  0.909091   10/11   \n",
        "\n",
        "                    constituency       party_clean  decimal_odds  \n",
        "4933  halesowen-and-rowley-regis            labour      1.833333  \n",
        "8003                      pudsey            labour      1.833333  \n",
        "4364  finchley-and-golders-green            labour      1.909091  \n",
        "9960                      torbay      conservative      1.909091  \n",
        "2871              cornwall-north  liberal democrat      1.909091  "
       ]
      }
     ],
     "prompt_number": 347
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "That result for *pudsey* looks a little odd? In the data frame above, the *conservative* entry had a higher index value, so why is the *labour* party listed as the second item? Let's check:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).get_group(\"pudsey\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>time</th>\n",
        "      <th>bookie</th>\n",
        "      <th>party</th>\n",
        "      <th>odds</th>\n",
        "      <th>oddsraw</th>\n",
        "      <th>constituency</th>\n",
        "      <th>party_clean</th>\n",
        "      <th>decimal_odds</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>8017</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>     conservatives</td>\n",
        "      <td>   0.833333</td>\n",
        "      <td> 5/6</td>\n",
        "      <td> pudsey</td>\n",
        "      <td>     conservative</td>\n",
        "      <td>   1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8003</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>            labour</td>\n",
        "      <td>   0.833333</td>\n",
        "      <td> 5/6</td>\n",
        "      <td> pudsey</td>\n",
        "      <td>           labour</td>\n",
        "      <td>   1.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8013</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td>              ukip</td>\n",
        "      <td>  40.000000</td>\n",
        "      <td>  40</td>\n",
        "      <td> pudsey</td>\n",
        "      <td>             ukip</td>\n",
        "      <td>  41.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8009</th>\n",
        "      <td> 2015-05-05T10:13:53+00:00</td>\n",
        "      <td> WH</td>\n",
        "      <td> liberal democrats</td>\n",
        "      <td> 100.000000</td>\n",
        "      <td> 100</td>\n",
        "      <td> pudsey</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td> 101.000000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 348,
       "text": [
        "                           time bookie              party        odds oddsraw  \\\n",
        "8017  2015-05-05T10:13:53+00:00     WH      conservatives    0.833333     5/6   \n",
        "8003  2015-05-05T10:13:53+00:00     WH             labour    0.833333     5/6   \n",
        "8013  2015-05-05T10:13:53+00:00     WH               ukip   40.000000      40   \n",
        "8009  2015-05-05T10:13:53+00:00     WH  liberal democrats  100.000000     100   \n",
        "\n",
        "     constituency       party_clean  decimal_odds  \n",
        "8017       pudsey      conservative      1.833333  \n",
        "8003       pudsey            labour      1.833333  \n",
        "8013       pudsey              ukip     41.000000  \n",
        "8009       pudsey  liberal democrat    101.000000  "
       ]
      }
     ],
     "prompt_number": 348
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Hmm, the ordering here does seem to put the *conservative* row first. *But I'm not sure why?*"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Comparing Joint/Close First and Second Favourites\n",
      "\n",
      "What happens if there is a swing from first to second favourite in constituencies with short odds across the joint first, or first and second, favourites? Are there particular swings likely from one party to another?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's start by creating a dataframe where each row is an observation for a constituency that shows the joint or first and second favorites, along with their odds:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "col_subset=['constituency','party_clean','odds']\n",
      "m=pd.merge(secondparty[col_subset],likelyparty[col_subset],on='constituency')\n",
      "m.columns = ['constituency','secondparty','odds_second','firstparty','odds_first']\n",
      "m.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>constituency</th>\n",
        "      <th>secondparty</th>\n",
        "      <th>odds_second</th>\n",
        "      <th>firstparty</th>\n",
        "      <th>odds_first</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> halesowen-and-rowley-regis</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 0.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>                     pudsey</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 0.833333</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 0.833333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> finchley-and-golders-green</td>\n",
        "      <td>           labour</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 0.800000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>                     torbay</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td> 0.909091</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td>             cornwall-north</td>\n",
        "      <td> liberal democrat</td>\n",
        "      <td> 0.909091</td>\n",
        "      <td>     conservative</td>\n",
        "      <td> 0.800000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 349,
       "text": [
        "                 constituency       secondparty  odds_second  \\\n",
        "0  halesowen-and-rowley-regis            labour     0.833333   \n",
        "1                      pudsey            labour     0.833333   \n",
        "2  finchley-and-golders-green            labour     0.909091   \n",
        "3                      torbay      conservative     0.909091   \n",
        "4              cornwall-north  liberal democrat     0.909091   \n",
        "\n",
        "         firstparty  odds_first  \n",
        "0      conservative    0.833333  \n",
        "1      conservative    0.833333  \n",
        "2      conservative    0.800000  \n",
        "3  liberal democrat    0.909091  \n",
        "4      conservative    0.800000  "
       ]
      }
     ],
     "prompt_number": 349
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT - DEMONSTRATE HOW TO MERGE TWO OR MORE DATASETS\n",
      "ASSESSMENT - DEMONSTRATE HOW TO MANIPULATE THE MARGINAL PROPERTIES OF A DATA TABLE (EG INDICES, COLUMN HEADINGS)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now look to see what the possible swings are by party away from the favourite to a joint of close second favourite."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Start by finding the rows where the odds are short for the second favourite\n",
      "#then count group sizes swinging from first to second party\n",
      "m[m['odds_second']<=1.5].groupby(['firstparty','secondparty']).size()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 350,
       "text": [
        "firstparty        secondparty     \n",
        "conservative      labour              13\n",
        "                  liberal democrat     2\n",
        "labour            conservative        10\n",
        "                  liberal democrat     1\n",
        "                  snp                  2\n",
        "liberal democrat  conservative         6\n",
        "                  labour               1\n",
        "                  plaid cymru          1\n",
        "plaid cymru       labour               1\n",
        "respect           labour               1\n",
        "snp               conservative         1\n",
        "                  labour               4\n",
        "ukip              conservative         1\n",
        "dtype: int64"
       ]
      }
     ],
     "prompt_number": 350
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This shows, for example, that there are good chances that up to 13 *conservative* seats could go to *labour*, and 3 seats from *liberal democrat* to *conservative*."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Conclusion\n",
      "\n",
      "In this notebook, I have investigated a data set containing election odds for the majority of UK constituencies on a single day prior to the UK General Election, 2015.\n",
      "\n",
      "The data predicited that the Conservatives would win the largest number of seats, though not a majority. The data predicted that the SNP would win a large number of Scottish constituency seats, and that Ed Ball's Pudsey constituency was unsafe for Labour.\n",
      "\n",
      "Analysis of constituencies with low-odds, joint or close first and second favourites suggested more possible \"swings\" from Conservative to Labour than vice versa and mosts swings from Liberal Democrats to the Conservatives."
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "ASSESSMENT NOTES\n",
      "\n",
      "Sports such as gymnastics use criteria based scoring where participants must demonstrate several elements from different difficulty groups (eg http://www.british-gymnastics.org/technical-information/selection/womens-artistic/cat_view/334-regions-and-home-countries/467-south-east/578-event-info ).\n",
      "\n",
      "One approach to asssessing notebooks on an investigation around a free-data-choice activity might be to require students to demonstrate a range of technical skills (perhaps self-identifying them to reinforce reflection about their work) in an appropriate context. I have tried to identify - and abstract - assessment opportunities along the way; should students be required to do the same as part of the assessment as part of a critique of their own work?\n",
      "\n",
      "Looking back over the assessment points, many of the steps were included *becuase the data needed treating in some way in order to ask a particular question or perform a particular transformation*. How can we capture the relationship between questions asked of the data and how those questions prompt certain transformations of the data in order to answer them?\n",
      "\n",
      "Many data anlayses are likely to include false starts that still take time to explore. Students should be allowed to include 'false-start' components in their script if they derive from a plausible initial line of investigation and demonstrate a required element.\n",
      "\n",
      "This notebook has focussed on the demonstration of particular skills using a particular programming language (Python) and programming library within that language (pandas). Some (many? all?) of the questions could have been asked directly of the dataset using SQL.\n",
      "\n",
      "Should the notebook require students to demonstrate solutions to the same problem in different languages?\n",
      "\n",
      "What assessment points are missing?\n",
      "\n",
      "The notebook does not include any graphical representations of the data (no charts).\n",
      "\n",
      "The notebook does not include any statistical analyses, other than simple rankings, sorting and extrema detection.\n",
      "\n",
      "The notebook does not require the student to model any data form or ingest any data into, a database.\n",
      "\n",
      "The notebook does not require students to do any more than single line programming at each step. That is, the student is not required to develop any functions (other than one line lambda functions) at any stage.\n",
      "\n",
      "As a rule of thumb, I estimate that each question cell, code cell, intepretation cell combination will take of the order 5-15 minutes to produce."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}