Skip to content

Instantly share code, notes, and snippets.

@psychemedia
Last active April 13, 2016 22:20
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save psychemedia/11b1b8d15663a303f975 to your computer and use it in GitHub Desktop.
Save psychemedia/11b1b8d15663a303f975 to your computer and use it in GitHub Desktop.
Example of data investigation around election betting odds data
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:e83697ab284a0e10d29ca1760ccf0340015de4f4b316a3e52e1e8e34be8ed200"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A Quick Look at Election Betting Data\n",
"\n",
"For the hundred days or so leading up to the election, I was [scraping oddschecker listed betting odds](https://morph.io/psychemedia/electionodds) for most of the UK parliamentary constituencies (I didn't check that I was scraping them all; this was just a side-side-project...).\n",
"\n",
"I didn't look at the data at all in the run-up to the election (my original plan was to look at timeseries within each constituency to try to detect sudden changes in odds that might indicate some sort of major shift in sentiment in each constituency), but with things all settled now, I thought I'd look to see what tales - if any - the betting data might tell, at least at a high level. For example, what did the betting odds have to say about the likely number of seats taken by each party...?\n",
"\n",
"*This notebook was developed as part of an exploration into possible forms a student produced notebook could take as part of an assessment process for a course on data management and analysis using a dataset selected by a student. Intended student worktime for the assessment: <10 hours. Comments appreciated...*"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#I'm going to use the pandas library to analyse the data\n",
"import pandas as pd"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 320
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Acquisition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data can be pulled directly from the scraper - https://morph.io/psychemedia/electionodds.\n",
"Using the API, we can write a SQLite query to grab the odds for a particular day:\n",
"\n",
"`select * from 'Constituency2015GE' constituency where strftime('%d-%m-%Y',time)=\"06-05-2015\"`\n",
"\n",
"The data can be pulled directly into a pandas dataframe from the morph.io scraper API:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - IDENTIFY AN APPROPRIATE DATASET FOR USE IN A DATA INVESTIGATION\n",
"ASSESSMENT - DEMONSTRATE HOW TO USE AN SQL STATEMENT TO RETRIEVE A DATASET\n",
"\n",
"NOTE: WE COULD USE pandasql TO RUN A SQL QUERY ON A DATAFRAME PUSHED INTO A SQLITE DATABASE"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#df=pd.read_csv('https://api.morph.io/psychemedia/electionodds/data.csv?key='+SECRETKEY+'&query=select%20*%20from%20%27Constituency2015GE%27%20constituency%20where%20strftime(%27%25d-%25m-%25Y%27%2Ctime)%3D%2206-05-2015%22)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 321
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Or you can download the electionodds data and pop it into a file...\n",
"#Don't believe the filename - the data is data collected at some point on May 5th\n",
"df=pd.read_csv('electionodds_thurs.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 322
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO LOAD IN FROM, OR SAVE DATA TO, A DATA FILE IN A RECOGNISED FORMAT "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Review\n",
"\n",
"Let's just get a quick view over the data to familiarise ourselves with what it contains."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Preview the data\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> LD</td>\n",
" <td> green</td>\n",
" <td> 100.00</td>\n",
" <td> 100</td>\n",
" <td> aberavon</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> FB</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> LD</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> FB</td>\n",
" <td> labour</td>\n",
" <td> 0.01</td>\n",
" <td> 1/100</td>\n",
" <td> aberavon</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 323,
"text": [
" time bookie party odds oddsraw constituency\n",
"0 2015-05-05T09:54:19+00:00 LD green 100.00 100 aberavon\n",
"1 2015-05-05T09:54:19+00:00 FB pc 50.00 50 aberavon\n",
"2 2015-05-05T09:54:19+00:00 LD pc 50.00 50 aberavon\n",
"3 2015-05-05T09:54:19+00:00 WH pc 50.00 50 aberavon\n",
"4 2015-05-05T09:54:19+00:00 FB labour 0.01 1/100 aberavon"
]
}
],
"prompt_number": 323
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `odds` column is simply a decimalised version of the `oddsraw` value. Note that this is not strictly the decimal odds, which would be 1 greater."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#What bookies did we collect data for?\n",
"df['bookie'].unique()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 324,
"text": [
"array(['LD', 'FB', 'WH', 'B3'], dtype=object)"
]
}
],
"prompt_number": 324
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#How many constituencies did we grab data for?\n",
"len(df['constituency'].unique())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 325,
"text": [
"650"
]
}
],
"prompt_number": 325
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#What is the range of odds offered?\n",
"(df['odds'].min(), df['odds'].max())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 326,
"text": [
"(0.001, 9999.0)"
]
}
],
"prompt_number": 326
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#What parties did we collect data for and in what numbers?\n",
"#Note that we are likely to count the same party several times in each constituency,\n",
"# once for each bookmaker offering odds on that party in that constituency\n",
"df['party'].value_counts().head(15)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 327,
"text": [
"labour 2103\n",
"liberal democrats 2014\n",
"ukip 1977\n",
"conservatives 1945\n",
"any other party or candidate 600\n",
"greens 580\n",
"green 273\n",
"snp 224\n",
"conservative 159\n",
"tusc 130\n",
"green party 105\n",
"pc 91\n",
"liberal democrat 85\n",
"alliance 57\n",
"sinn fein 57\n",
"dtype: int64"
]
}
],
"prompt_number": 327
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count of parties appearances in the dataset is not necessarily very useful, becuase the same party may count more than several times in the same constituency given the odds from different bookmakers. However, the count does clearly show us that there are multiple possible representations of what are presumably the same party name."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE TWO OR MORE TECHNIQUES THAT PROVIDE AN OVERVIEW OF A NEW DATASET"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Cleaning the Data\n",
"\n",
"Note there are several opportunities in the `party` column at least for cleaning the data - for example, *green*, *greens* and *green party* are likely the same party, as are *conservative* and *conservatives*. A quick and dirty cleaning approach would be to right strip \"s\", replace occurrences of \"party\" at the end of the string, and then `strip()` just to clear away any whitespace."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['party_clean']=df['party'].str.rstrip('s').str.replace(r'party$','').str.strip()\n",
"df['party_clean'].value_counts()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 328,
"text": [
"conservative 2104\n",
"labour 2103\n",
"liberal democrat 2099\n",
"ukip 1977\n",
"green 958\n",
"any other party or candidate 600\n",
"snp 224\n",
"tusc 130\n",
"pc 91\n",
"alliance 57\n",
"sinn fein 57\n",
"sdlp 56\n",
"dup 53\n",
"uup 46\n",
"any other 45\n",
"...\n",
"lorraine morgan-brinkhurst 1\n",
"alfred okam 1\n",
"chaka artwell 1\n",
"john neville hobb 1\n",
"any other independant 1\n",
"les tallon-morri 1\n",
"the whig 1\n",
"christopher tompson 1\n",
"robin lambert 1\n",
"residents for uttlesford 1\n",
"james kirkcaldy 1\n",
"europeans 1\n",
"roy ivinson 1\n",
"christopher gray 1\n",
"criag pond 1\n",
"Length: 309, dtype: int64"
]
}
],
"prompt_number": 328
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Inspect the full range of parties, if required, perhaps as basis for further cleaning\n",
"#df['party_clean'].unique()\n",
"#A more advanced approach might be to run the names through a clustering algorithm,\n",
"#or partial string matcher, to see if there aare any near collisions that should be combined"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 329
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE TWO OR MORE TECHNIQUES THAT CAN BE APPLIED TO CLEAN A DATASET\n",
"ASSESSMENT - DEMONSTRATE HOW TO SORT A DATASET"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted a strict decimal odds column, we could simply add 1 to the `odds` column:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['decimal_odds'] = df['odds']+1\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> LD</td>\n",
" <td> green</td>\n",
" <td> 100.00</td>\n",
" <td> 100</td>\n",
" <td> aberavon</td>\n",
" <td> green</td>\n",
" <td> 101.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> FB</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" <td> pc</td>\n",
" <td> 51.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> LD</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" <td> pc</td>\n",
" <td> 51.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" <td> pc</td>\n",
" <td> 51.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> FB</td>\n",
" <td> labour</td>\n",
" <td> 0.01</td>\n",
" <td> 1/100</td>\n",
" <td> aberavon</td>\n",
" <td> labour</td>\n",
" <td> 1.01</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 330,
"text": [
" time bookie party odds oddsraw constituency \\\n",
"0 2015-05-05T09:54:19+00:00 LD green 100.00 100 aberavon \n",
"1 2015-05-05T09:54:19+00:00 FB pc 50.00 50 aberavon \n",
"2 2015-05-05T09:54:19+00:00 LD pc 50.00 50 aberavon \n",
"3 2015-05-05T09:54:19+00:00 WH pc 50.00 50 aberavon \n",
"4 2015-05-05T09:54:19+00:00 FB labour 0.01 1/100 aberavon \n",
"\n",
" party_clean decimal_odds \n",
"0 green 101.00 \n",
"1 pc 51.00 \n",
"2 pc 51.00 \n",
"3 pc 51.00 \n",
"4 labour 1.01 "
]
}
],
"prompt_number": 330
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - ADD A NEW COLUMN TO A DATASET\n",
"ASSESSMENT - GENERATE A NEW COLUMN FROM A PRE-EXISTING COLUMN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering / Reducing the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start with, let's focus on the odds offered by a single bookmaker. To choose which bookie, let's see how many constituencies each of them have prices for:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for bookie in df['bookie'].unique():\n",
" print('{}: {}'.format(bookie,len(df[df['bookie']==bookie]['constituency'].unique())))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"LD: 627\n",
"FB: 648\n",
"WH: 648\n",
"B3: 239\n"
]
}
],
"prompt_number": 331
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So a good candidate would be WH (William Hill) or LD (Ladbrokes). Let's go with the former..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Grab the data for a particular bookie into a separate dataframe\n",
"df_wh=df[df['bookie']=='WH']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 332
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE A WAY OF SUBSETTING A DATASET BASED ON ONE OR MORE ROW BASED CRITERIA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing we might want to do over the full dataset is see what the odds were for a partcular constituency offered by a particular bookmarker. We can write a simple convenience function to help us do that."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Write a convenience function to look up odds by constituency\n",
"def oddsForConstituency(df,bookie,constituency):\n",
" ''' Function to find rows associated with a particular bookie in a particular constituency '''\n",
" filterView= df[(df['bookie'].str.upper()==bookie.upper()) & \n",
" (df['constituency'].str.lower()==constituency.lower())] \n",
" return filterView\n",
"\n",
"oddsForConstituency(df,'WH','berwickshire-roxburgh-and-selkirk')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>919</th>\n",
" <td> 2015-05-05T09:56:50+00:00</td>\n",
" <td> WH</td>\n",
" <td> snp</td>\n",
" <td> 2.25</td>\n",
" <td> 9/4</td>\n",
" <td> berwickshire-roxburgh-and-selkirk</td>\n",
" <td> snp</td>\n",
" <td> 3.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>923</th>\n",
" <td> 2015-05-05T09:56:50+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 150.00</td>\n",
" <td> 150</td>\n",
" <td> berwickshire-roxburgh-and-selkirk</td>\n",
" <td> labour</td>\n",
" <td> 151.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>927</th>\n",
" <td> 2015-05-05T09:56:50+00:00</td>\n",
" <td> WH</td>\n",
" <td> liberal democrats</td>\n",
" <td> 2.00</td>\n",
" <td> 2</td>\n",
" <td> berwickshire-roxburgh-and-selkirk</td>\n",
" <td> liberal democrat</td>\n",
" <td> 3.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>935</th>\n",
" <td> 2015-05-05T09:56:50+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 1.20</td>\n",
" <td> 6/5</td>\n",
" <td> berwickshire-roxburgh-and-selkirk</td>\n",
" <td> conservative</td>\n",
" <td> 2.20</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 354,
"text": [
" time bookie party odds oddsraw \\\n",
"919 2015-05-05T09:56:50+00:00 WH snp 2.25 9/4 \n",
"923 2015-05-05T09:56:50+00:00 WH labour 150.00 150 \n",
"927 2015-05-05T09:56:50+00:00 WH liberal democrats 2.00 2 \n",
"935 2015-05-05T09:56:50+00:00 WH conservatives 1.20 6/5 \n",
"\n",
" constituency party_clean decimal_odds \n",
"919 berwickshire-roxburgh-and-selkirk snp 3.25 \n",
"923 berwickshire-roxburgh-and-selkirk labour 151.00 \n",
"927 berwickshire-roxburgh-and-selkirk liberal democrat 3.00 \n",
"935 berwickshire-roxburgh-and-selkirk conservative 2.20 "
]
}
],
"prompt_number": 354
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEFINE AND APPLY A SIMPLE PYTHON FUNCTION THAT ACCEPTS ONE OR MORE PARAMETERS AND RETURNS ONE OR MORE VALUES"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Reshaping the Data\n",
"\n",
"The data as it stands is in a relatively tidy (Third Normal Form) *long* format. Each row contains a single observation that associates the odds for a single party with a particular bookmaker in each constituency.\n",
"\n",
"If we wanted to compare the odds for particular parties within each constituency, it might be more convenient to put the data into a wide format, with a separate column for the odds offered for each party, indexed by cosntituency:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dfp=df_wh.pivot('constituency','party_clean','odds')\n",
"dfp.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>party_clean</th>\n",
" <th>al murray - the pub landlord</th>\n",
" <th>alliance</th>\n",
" <th>bez</th>\n",
" <th>bnp</th>\n",
" <th>claire wright</th>\n",
" <th>conservative</th>\n",
" <th>dup</th>\n",
" <th>green</th>\n",
" <th>ian steven</th>\n",
" <th>john bercow</th>\n",
" <th>...</th>\n",
" <th>respect</th>\n",
" <th>robin scott</th>\n",
" <th>sdlp</th>\n",
" <th>sinn fein</th>\n",
" <th>snp</th>\n",
" <th>stephen picton</th>\n",
" <th>sylvia hermon</th>\n",
" <th>tuv</th>\n",
" <th>ukip</th>\n",
" <th>uup</th>\n",
" </tr>\n",
" <tr>\n",
" <th>constituency</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>aberavon</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 100.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 16</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aberconwy</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 0.4</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 80</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aberdeen-north</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 150.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 0.100000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 150</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aberdeen-south</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 50.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 0.142857</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 250</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aberdeenshire-west-and-kincardine</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 5.5</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 0.142857</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 30 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 333,
"text": [
"party_clean al murray - the pub landlord alliance \\\n",
"constituency \n",
"aberavon NaN NaN \n",
"aberconwy NaN NaN \n",
"aberdeen-north NaN NaN \n",
"aberdeen-south NaN NaN \n",
"aberdeenshire-west-and-kincardine NaN NaN \n",
"\n",
"party_clean bez bnp claire wright conservative dup \\\n",
"constituency \n",
"aberavon NaN NaN NaN 100.0 NaN \n",
"aberconwy NaN NaN NaN 0.4 NaN \n",
"aberdeen-north NaN NaN NaN 150.0 NaN \n",
"aberdeen-south NaN NaN NaN 50.0 NaN \n",
"aberdeenshire-west-and-kincardine NaN NaN NaN 5.5 NaN \n",
"\n",
"party_clean green ian steven john bercow ... \\\n",
"constituency ... \n",
"aberavon NaN NaN NaN ... \n",
"aberconwy NaN NaN NaN ... \n",
"aberdeen-north NaN NaN NaN ... \n",
"aberdeen-south NaN NaN NaN ... \n",
"aberdeenshire-west-and-kincardine NaN NaN NaN ... \n",
"\n",
"party_clean respect robin scott sdlp sinn fein \\\n",
"constituency \n",
"aberavon NaN NaN NaN NaN \n",
"aberconwy NaN NaN NaN NaN \n",
"aberdeen-north NaN NaN NaN NaN \n",
"aberdeen-south NaN NaN NaN NaN \n",
"aberdeenshire-west-and-kincardine NaN NaN NaN NaN \n",
"\n",
"party_clean snp stephen picton sylvia hermon \\\n",
"constituency \n",
"aberavon NaN NaN NaN \n",
"aberconwy NaN NaN NaN \n",
"aberdeen-north 0.100000 NaN NaN \n",
"aberdeen-south 0.142857 NaN NaN \n",
"aberdeenshire-west-and-kincardine 0.142857 NaN NaN \n",
"\n",
"party_clean tuv ukip uup \n",
"constituency \n",
"aberavon NaN 16 NaN \n",
"aberconwy NaN 80 NaN \n",
"aberdeen-north NaN 150 NaN \n",
"aberdeen-south NaN 250 NaN \n",
"aberdeenshire-west-and-kincardine NaN NaN NaN \n",
"\n",
"[5 rows x 30 columns]"
]
}
],
"prompt_number": 333
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Preview all the columns\n",
"dfp.columns"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 334,
"text": [
"Index([u'al murray - the pub landlord', u'alliance', u'bez', u'bnp', u'claire wright', u'conservative', u'dup', u'green', u'ian steven', u'john bercow', u'john blackie', u'kerry smith', u'labour', u'lib dem', u'liberal democrat', u'national health action', u'ni conservative', u'pc', u'pirate', u'plaid cymru', u'respect', u'robin scott', u'sdlp', u'sinn fein', u'snp', u'stephen picton', u'sylvia hermon', u'tuv', u'ukip', u'uup'], dtype='object')"
]
}
],
"prompt_number": 334
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO RESHAPE A DATASET, EG USING PIVOT, MELT, STACK OR UNSTACK OPERATORS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to limit the dataset to just major parties - that is, parties contesting a large number of seats - we could reduce the dataset as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"party_subset=['green','labour','liberal democrat','conservative','snp','ukip']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 335
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Limit rows in long dataset to selected parties\n",
"example=df_wh[df_wh['party_clean'].isin(party_subset)]\n",
"example['party_clean'].unique()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 336,
"text": [
"array(['labour', 'liberal democrat', 'ukip', 'conservative', 'snp', 'green'], dtype=object)"
]
}
],
"prompt_number": 336
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Limit columns in wide dataset to selected parties\n",
"example=dfp[party_subset]\n",
"example.columns"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 337,
"text": [
"Index([u'green', u'labour', u'liberal democrat', u'conservative', u'snp', u'ukip'], dtype='object')"
]
}
],
"prompt_number": 337
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE A WAY OF SUBSETTING A DATASET BASED ON ONE OR MORE COLUMN BASED CRITERIA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, there are significant risks in taking this sort of approach. For example, where an individual is standing in a particular consituency, perhaps under their own name, who has a good chance of winning, their candidacy would not be captured in the reduced dataset.\n",
"\n",
"Perhaps a better dataset would be one that includes all major parties *and* any candidates who appear to have a reasonable chance of winning (say, 10 to 1 or better)."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - CRITIQUE THE APPROPRIATENESS OF A PARTICULAR QUESTION ASKED OF THE DATA IN A PARTICULAR WAY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reducing the Dataset to Scottish Constituencies\n",
"\n",
"One reduced dataset we might be interested in working with are the Scottish constituencies. We can use the fact that the SNP had a candidate in the seat as a proxy for which constituencies should be included in this set.\n",
"\n",
"One way of identifying those seats is to filter the wide data set (which uses constituencies as the index values) to exclude rows in the SNP column with a null (`NaN`) value:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dfp['snp'].dropna().index.values"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 338,
"text": [
"array(['aberdeen-north', 'aberdeen-south',\n",
" 'aberdeenshire-west-and-kincardine', 'airdrie-and-shotts', 'angus',\n",
" 'argyll-and-bute', 'ayr-carrick-and-cumnock', 'ayrshire-central',\n",
" 'ayrshire-north-and-arran', 'banff-and-buchan',\n",
" 'berwickshire-roxburgh-and-selkirk',\n",
" 'caithness-sutherland-and-easter-ross',\n",
" 'coatbridge-chryston-and-bellshill',\n",
" 'cumbernauld-kilsyth-and-kirkintill', 'dumfries-and-galloway',\n",
" 'dumfriesshire-clydesdale-and-tweeddale', 'dunbartonshire-east',\n",
" 'dunbartonshire-west', 'dundee-east', 'dundee-west',\n",
" 'dunfermline-and-west-fife', 'east-kilbride-strathaven-and-lesma',\n",
" 'east-lothian', 'edinburgh-east', 'edinburgh-north-and-leith',\n",
" 'edinburgh-south', 'edinburgh-south-west', 'edinburgh-west',\n",
" 'falkirk', 'fife-north-east', 'glasgow-central', 'glasgow-east',\n",
" 'glasgow-north', 'glasgow-north-east', 'glasgow-north-west',\n",
" 'glasgow-south', 'glasgow-south-west', 'glenrothes', 'gordon',\n",
" 'inverclyde', 'inverness-nairn-badenoch-and-strathspey',\n",
" 'kilmarnock-and-loudoun', 'kirkcaldy-and-cowdenbeath',\n",
" 'lanark-and-hamilton-east', 'linlithgow-and-falkirk-ea',\n",
" 'livingston', 'midlothian', 'moray', 'motherwell-and-wishaw',\n",
" 'na-h-eileanan-an-iar', 'ochill-and-sth-perthshire',\n",
" 'orkney-and-shetland', 'paisley-and-renf-north',\n",
" 'paisley-and-renf-south', 'perth-and-n-perthshire',\n",
" 'renfrewshire-east', 'ross-skye-and-lochaber',\n",
" 'rutherglen-and-hamilton-west', 'stirling'], dtype=object)"
]
}
],
"prompt_number": 338
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE A TECHNIQUE FOR CLEANING OR REDUCING A DATASET BASED ON THE PRESENCE OF NULL VALUES "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Which Seats Were Safest?\n",
"\n",
"The safest seats are the seats with the shortest (smallest) odds. If we sort the table by increasing odds and select the top few, that should give us the most secure seats."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df_wh.sort('odds',ascending=True).head(10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8438 </th>\n",
" <td> 2015-05-05T10:14:59+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> runnymede-and-weybridge</td>\n",
" <td> conservative</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5376 </th>\n",
" <td> 2015-05-05T10:07:32+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> hertfordshire-ne</td>\n",
" <td> conservative</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5362 </th>\n",
" <td> 2015-05-05T10:07:27+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> hertford-and-stortford</td>\n",
" <td> conservative</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5319 </th>\n",
" <td> 2015-05-05T10:07:20+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> henley</td>\n",
" <td> conservative</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1628 </th>\n",
" <td> 2015-05-05T09:58:35+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> brentwood-and-ongar</td>\n",
" <td> conservative</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10097</th>\n",
" <td> 2015-05-05T10:19:14+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> tyneside-north</td>\n",
" <td> labour</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8941 </th>\n",
" <td> 2015-05-05T10:16:17+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> south-shields</td>\n",
" <td> labour</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5029 </th>\n",
" <td> 2015-05-05T10:06:43+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> hampshire-east</td>\n",
" <td> conservative</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4989 </th>\n",
" <td> 2015-05-05T10:06:39+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> halton</td>\n",
" <td> labour</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4916 </th>\n",
" <td> 2015-05-05T10:06:27+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.005</td>\n",
" <td> 1/200</td>\n",
" <td> hackney-south-and-shoreditch</td>\n",
" <td> labour</td>\n",
" <td> 1.005</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 339,
"text": [
" time bookie party odds oddsraw \\\n",
"8438 2015-05-05T10:14:59+00:00 WH conservatives 0.005 1/200 \n",
"5376 2015-05-05T10:07:32+00:00 WH conservatives 0.005 1/200 \n",
"5362 2015-05-05T10:07:27+00:00 WH conservatives 0.005 1/200 \n",
"5319 2015-05-05T10:07:20+00:00 WH conservatives 0.005 1/200 \n",
"1628 2015-05-05T09:58:35+00:00 WH conservatives 0.005 1/200 \n",
"10097 2015-05-05T10:19:14+00:00 WH labour 0.005 1/200 \n",
"8941 2015-05-05T10:16:17+00:00 WH labour 0.005 1/200 \n",
"5029 2015-05-05T10:06:43+00:00 WH conservatives 0.005 1/200 \n",
"4989 2015-05-05T10:06:39+00:00 WH labour 0.005 1/200 \n",
"4916 2015-05-05T10:06:27+00:00 WH labour 0.005 1/200 \n",
"\n",
" constituency party_clean decimal_odds \n",
"8438 runnymede-and-weybridge conservative 1.005 \n",
"5376 hertfordshire-ne conservative 1.005 \n",
"5362 hertford-and-stortford conservative 1.005 \n",
"5319 henley conservative 1.005 \n",
"1628 brentwood-and-ongar conservative 1.005 \n",
"10097 tyneside-north labour 1.005 \n",
"8941 south-shields labour 1.005 \n",
"5029 hampshire-east conservative 1.005 \n",
"4989 halton labour 1.005 \n",
"4916 hackney-south-and-shoreditch labour 1.005 "
]
}
],
"prompt_number": 339
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###In Which Seats Was The Uncertainty Largest?\n",
"\n",
"This is not simply a question of finding the constituency with the longest odds, but a question about finding the constituency with the longest (largest) favourite's odds.\n",
"\n",
"If we find the minimum value across each row in the wide dataset, ignoring the missing values, we get the odds of the favourite. We can then sort on that value in a descending fashion to find the constituencies with the longest favourite odds."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dfp.min(axis=1).order(ascending=False).head(10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 340,
"text": [
"constituency\n",
"berwickshire-roxburgh-and-selkirk 1.200000\n",
"dumfries-and-galloway 0.909091\n",
"edinburgh-south 0.909091\n",
"northampton-north 0.909091\n",
"torbay 0.909091\n",
"st-ives 0.833333\n",
"halesowen-and-rowley-regis 0.833333\n",
"pudsey 0.833333\n",
"finchley-and-golders-green 0.800000\n",
"cornwall-north 0.800000\n",
"dtype: float64"
]
}
],
"prompt_number": 340
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO PERFORM OPERATIONS ACROSS A ROW"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### In Which Seats Did the Green Party Have Odds of Better Than 10 to 1 ?\n",
"\n",
"We can ask this question by filtering the long dataset using two criteria combined using a Boolean operator:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df_wh[ (df_wh['party_clean']=='green') & (df_wh['odds']<=10) ]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1713</th>\n",
" <td> 2015-05-05T09:58:49+00:00</td>\n",
" <td> WH</td>\n",
" <td> greens</td>\n",
" <td> 0.222222</td>\n",
" <td> 2/9</td>\n",
" <td> brighton-pavilion</td>\n",
" <td> green</td>\n",
" <td> 1.222222</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1788</th>\n",
" <td> 2015-05-05T09:58:56+00:00</td>\n",
" <td> WH</td>\n",
" <td> greens</td>\n",
" <td> 4.500000</td>\n",
" <td> 9/2</td>\n",
" <td> bristol-west</td>\n",
" <td> green</td>\n",
" <td> 5.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7454</th>\n",
" <td> 2015-05-05T10:12:36+00:00</td>\n",
" <td> WH</td>\n",
" <td> greens</td>\n",
" <td> 5.000000</td>\n",
" <td> 5</td>\n",
" <td> norwich-south</td>\n",
" <td> green</td>\n",
" <td> 6.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 341,
"text": [
" time bookie party odds oddsraw \\\n",
"1713 2015-05-05T09:58:49+00:00 WH greens 0.222222 2/9 \n",
"1788 2015-05-05T09:58:56+00:00 WH greens 4.500000 9/2 \n",
"7454 2015-05-05T10:12:36+00:00 WH greens 5.000000 5 \n",
"\n",
" constituency party_clean decimal_odds \n",
"1713 brighton-pavilion green 1.222222 \n",
"1788 bristol-west green 5.500000 \n",
"7454 norwich-south green 6.000000 "
]
}
],
"prompt_number": 341
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - USE A BOOLEAN OPERATOR TO FILTER A DATASET BASED ON TWO OR MORE CRITERIA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How Many Seats Were Each Party Favourite In?\n",
"\n",
"Trivially, we might think to sort the parties by each constituency in terms of increasing odds, then pick the one with the lowest odds.\n",
"\n",
"If we sort the data frame in order of increasing odds, and then group by constituency, the order of the rows within each group will be in increasing order of odds."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - GENERATE ONE OR MORE QUESTIONS TO ASK OF A SELECTED DATASET"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).get_group(\"aberavon\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6 </th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.01</td>\n",
" <td> 1/100</td>\n",
" <td> aberavon</td>\n",
" <td> labour</td>\n",
" <td> 1.01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> ukip</td>\n",
" <td> 16.00</td>\n",
" <td> 16</td>\n",
" <td> aberavon</td>\n",
" <td> ukip</td>\n",
" <td> 17.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3 </th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> pc</td>\n",
" <td> 50.00</td>\n",
" <td> 50</td>\n",
" <td> aberavon</td>\n",
" <td> pc</td>\n",
" <td> 51.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> liberal democrats</td>\n",
" <td> 100.00</td>\n",
" <td> 100</td>\n",
" <td> aberavon</td>\n",
" <td> liberal democrat</td>\n",
" <td> 101.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td> 2015-05-05T09:54:19+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 100.00</td>\n",
" <td> 100</td>\n",
" <td> aberavon</td>\n",
" <td> conservative</td>\n",
" <td> 101.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 342,
"text": [
" time bookie party odds oddsraw \\\n",
"6 2015-05-05T09:54:19+00:00 WH labour 0.01 1/100 \n",
"13 2015-05-05T09:54:19+00:00 WH ukip 16.00 16 \n",
"3 2015-05-05T09:54:19+00:00 WH pc 50.00 50 \n",
"10 2015-05-05T09:54:19+00:00 WH liberal democrats 100.00 100 \n",
"16 2015-05-05T09:54:19+00:00 WH conservatives 100.00 100 \n",
"\n",
" constituency party_clean decimal_odds \n",
"6 aberavon labour 1.01 \n",
"13 aberavon ukip 17.00 \n",
"3 aberavon pc 51.00 \n",
"10 aberavon liberal democrat 101.00 \n",
"16 aberavon conservative 101.00 "
]
}
],
"prompt_number": 342
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO GROUP A DATASET ACCORDING TO ONE OR MORE CRITERIA\n",
"ASSESSMENT - DEMONSTRATE HOW TO ACCESS A PARTICULAR GROUP AS A GROUP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we pick the `first()` row in each group, we can generate a dataframe that contains the a single row for each consituency identifying a party with those best odds.\n",
"\n",
"We can then group these rows according to the cleaned party name, and count how many rows correspond to each party, ordering the result to show the most heavily favourited party first."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"likelyparty=df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).first()\n",
"likelyparty.groupby('party_clean').size().order(ascending=False)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 343,
"text": [
"party_clean\n",
"conservative 278\n",
"labour 263\n",
"snp 55\n",
"liberal democrat 26\n",
"dup 9\n",
"sinn fein 4\n",
"ukip 3\n",
"sdlp 3\n",
"pc 2\n",
"sylvia hermon 1\n",
"respect 1\n",
"plaid cymru 1\n",
"john bercow 1\n",
"green 1\n",
"dtype: int64"
]
}
],
"prompt_number": 343
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interpreting this naively, we see there are 278 seats with the Conservatives as favourite, 263 with Labour as favourite, and so on."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO PROCESS ELEMENTS IN A GROUP BY GROUP\n",
"ASSESSMENT - INTERPRET THE RESULTS GENERATED BY ASKING A PARTICULAR QUESTION OF A SELECTED DATASET"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, this approach would incorrectly predict seats where there are *joint favourites*, if there are any. Let's see if we can identify constituency seats where low odds are tied..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Start by considering short odds\n",
"#then group by odds in each constituency\n",
"#count the rows in each group\n",
"#order the result\n",
"#and show the top few results\n",
"df_wh[df_wh['odds']<=2] \\\n",
".groupby(['odds','constituency']) \\\n",
".size() \\\n",
".order(ascending=False) \\\n",
".head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 344,
"text": [
"odds constituency \n",
"0.833333 pudsey 2\n",
"0.909091 northampton-north 2\n",
" torbay 2\n",
"0.833333 halesowen-and-rowley-regis 2\n",
"0.015152 workington 1\n",
"dtype: int64"
]
}
],
"prompt_number": 344
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"My reading of this is that there are two parties tied on odds of 0.8333 in Pudsey. Let's check:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df_wh[df_wh['constituency']=='pudsey']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8003</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> labour</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8009</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> liberal democrats</td>\n",
" <td> 100.000000</td>\n",
" <td> 100</td>\n",
" <td> pudsey</td>\n",
" <td> liberal democrat</td>\n",
" <td> 101.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8013</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> ukip</td>\n",
" <td> 40.000000</td>\n",
" <td> 40</td>\n",
" <td> pudsey</td>\n",
" <td> ukip</td>\n",
" <td> 41.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8017</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> conservative</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 345,
"text": [
" time bookie party odds oddsraw \\\n",
"8003 2015-05-05T10:13:53+00:00 WH labour 0.833333 5/6 \n",
"8009 2015-05-05T10:13:53+00:00 WH liberal democrats 100.000000 100 \n",
"8013 2015-05-05T10:13:53+00:00 WH ukip 40.000000 40 \n",
"8017 2015-05-05T10:13:53+00:00 WH conservatives 0.833333 5/6 \n",
"\n",
" constituency party_clean decimal_odds \n",
"8003 pudsey labour 1.833333 \n",
"8009 pudsey liberal democrat 101.000000 \n",
"8013 pudsey ukip 41.000000 \n",
"8017 pudsey conservative 1.833333 "
]
}
],
"prompt_number": 345
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding Rows in Constituencies Where Odds Are Tied\n",
"\n",
"If we wanted a long dataset containing rows where the odds are tied within a constituency, we could use the following filter command:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Limit rows to rows in constituencies where there are parties with the same odds\n",
"#That is, where there is more than one member in groups of odds by constituency\n",
"df_sameodds=df_wh.groupby(['odds','constituency']).filter(lambda x: len(x) > 1)\n",
"df_sameodds.sort(['odds','constituency']).head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4933</th>\n",
" <td> 2015-05-05T10:06:29+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> halesowen-and-rowley-regis</td>\n",
" <td> labour</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4948</th>\n",
" <td> 2015-05-05T10:06:29+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> halesowen-and-rowley-regis</td>\n",
" <td> conservative</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8003</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> labour</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8017</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> conservative</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7387</th>\n",
" <td> 2015-05-05T10:12:26+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.909091</td>\n",
" <td> 10/11</td>\n",
" <td> northampton-north</td>\n",
" <td> labour</td>\n",
" <td> 1.909091</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 346,
"text": [
" time bookie party odds oddsraw \\\n",
"4933 2015-05-05T10:06:29+00:00 WH labour 0.833333 5/6 \n",
"4948 2015-05-05T10:06:29+00:00 WH conservatives 0.833333 5/6 \n",
"8003 2015-05-05T10:13:53+00:00 WH labour 0.833333 5/6 \n",
"8017 2015-05-05T10:13:53+00:00 WH conservatives 0.833333 5/6 \n",
"7387 2015-05-05T10:12:26+00:00 WH labour 0.909091 10/11 \n",
"\n",
" constituency party_clean decimal_odds \n",
"4933 halesowen-and-rowley-regis labour 1.833333 \n",
"4948 halesowen-and-rowley-regis conservative 1.833333 \n",
"8003 pudsey labour 1.833333 \n",
"8017 pudsey conservative 1.833333 \n",
"7387 northampton-north labour 1.909091 "
]
}
],
"prompt_number": 346
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO PROCESS A GROUP BASED ON GROUP PROPERTIES"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Which party is \"second\" favourite, ordered by shortest odds first?\n",
"\n",
"If we want to get a feel for which party is second favourite (or tied on joint odds with the \"first\" favourite), we can use the `nth()` rather than `first()` method on the odds sorted, constituency grouped form of the dataset, noting that `nth()` starts counting with index 0, so `nth(1)` corresponds to the *second* item in the ordered list:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"secondparty=df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).nth(1)\n",
"secondparty.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4933</th>\n",
" <td> 2015-05-05T10:06:29+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> halesowen-and-rowley-regis</td>\n",
" <td> labour</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8003</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> labour</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4364</th>\n",
" <td> 2015-05-05T10:05:10+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.909091</td>\n",
" <td> 10/11</td>\n",
" <td> finchley-and-golders-green</td>\n",
" <td> labour</td>\n",
" <td> 1.909091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9960</th>\n",
" <td> 2015-05-05T10:18:59+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservative</td>\n",
" <td> 0.909091</td>\n",
" <td> 10/11</td>\n",
" <td> torbay</td>\n",
" <td> conservative</td>\n",
" <td> 1.909091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2871</th>\n",
" <td> 2015-05-05T10:01:50+00:00</td>\n",
" <td> WH</td>\n",
" <td> liberal democrats</td>\n",
" <td> 0.909091</td>\n",
" <td> 10/11</td>\n",
" <td> cornwall-north</td>\n",
" <td> liberal democrat</td>\n",
" <td> 1.909091</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 347,
"text": [
" time bookie party odds oddsraw \\\n",
"4933 2015-05-05T10:06:29+00:00 WH labour 0.833333 5/6 \n",
"8003 2015-05-05T10:13:53+00:00 WH labour 0.833333 5/6 \n",
"4364 2015-05-05T10:05:10+00:00 WH labour 0.909091 10/11 \n",
"9960 2015-05-05T10:18:59+00:00 WH conservative 0.909091 10/11 \n",
"2871 2015-05-05T10:01:50+00:00 WH liberal democrats 0.909091 10/11 \n",
"\n",
" constituency party_clean decimal_odds \n",
"4933 halesowen-and-rowley-regis labour 1.833333 \n",
"8003 pudsey labour 1.833333 \n",
"4364 finchley-and-golders-green labour 1.909091 \n",
"9960 torbay conservative 1.909091 \n",
"2871 cornwall-north liberal democrat 1.909091 "
]
}
],
"prompt_number": 347
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That result for *pudsey* looks a little odd? In the data frame above, the *conservative* entry had a higher index value, so why is the *labour* party listed as the second item? Let's check:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df_wh.sort('odds', ascending=True).groupby('constituency', as_index=False).get_group(\"pudsey\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>bookie</th>\n",
" <th>party</th>\n",
" <th>odds</th>\n",
" <th>oddsraw</th>\n",
" <th>constituency</th>\n",
" <th>party_clean</th>\n",
" <th>decimal_odds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8017</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> conservatives</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> conservative</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8003</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> 5/6</td>\n",
" <td> pudsey</td>\n",
" <td> labour</td>\n",
" <td> 1.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8013</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> ukip</td>\n",
" <td> 40.000000</td>\n",
" <td> 40</td>\n",
" <td> pudsey</td>\n",
" <td> ukip</td>\n",
" <td> 41.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8009</th>\n",
" <td> 2015-05-05T10:13:53+00:00</td>\n",
" <td> WH</td>\n",
" <td> liberal democrats</td>\n",
" <td> 100.000000</td>\n",
" <td> 100</td>\n",
" <td> pudsey</td>\n",
" <td> liberal democrat</td>\n",
" <td> 101.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 348,
"text": [
" time bookie party odds oddsraw \\\n",
"8017 2015-05-05T10:13:53+00:00 WH conservatives 0.833333 5/6 \n",
"8003 2015-05-05T10:13:53+00:00 WH labour 0.833333 5/6 \n",
"8013 2015-05-05T10:13:53+00:00 WH ukip 40.000000 40 \n",
"8009 2015-05-05T10:13:53+00:00 WH liberal democrats 100.000000 100 \n",
"\n",
" constituency party_clean decimal_odds \n",
"8017 pudsey conservative 1.833333 \n",
"8003 pudsey labour 1.833333 \n",
"8013 pudsey ukip 41.000000 \n",
"8009 pudsey liberal democrat 101.000000 "
]
}
],
"prompt_number": 348
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hmm, the ordering here does seem to put the *conservative* row first. *But I'm not sure why?*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing Joint/Close First and Second Favourites\n",
"\n",
"What happens if there is a swing from first to second favourite in constituencies with short odds across the joint first, or first and second, favourites? Are there particular swings likely from one party to another?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by creating a dataframe where each row is an observation for a constituency that shows the joint or first and second favorites, along with their odds:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"col_subset=['constituency','party_clean','odds']\n",
"m=pd.merge(secondparty[col_subset],likelyparty[col_subset],on='constituency')\n",
"m.columns = ['constituency','secondparty','odds_second','firstparty','odds_first']\n",
"m.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>constituency</th>\n",
" <th>secondparty</th>\n",
" <th>odds_second</th>\n",
" <th>firstparty</th>\n",
" <th>odds_first</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> halesowen-and-rowley-regis</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> conservative</td>\n",
" <td> 0.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> pudsey</td>\n",
" <td> labour</td>\n",
" <td> 0.833333</td>\n",
" <td> conservative</td>\n",
" <td> 0.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> finchley-and-golders-green</td>\n",
" <td> labour</td>\n",
" <td> 0.909091</td>\n",
" <td> conservative</td>\n",
" <td> 0.800000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> torbay</td>\n",
" <td> conservative</td>\n",
" <td> 0.909091</td>\n",
" <td> liberal democrat</td>\n",
" <td> 0.909091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> cornwall-north</td>\n",
" <td> liberal democrat</td>\n",
" <td> 0.909091</td>\n",
" <td> conservative</td>\n",
" <td> 0.800000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 349,
"text": [
" constituency secondparty odds_second \\\n",
"0 halesowen-and-rowley-regis labour 0.833333 \n",
"1 pudsey labour 0.833333 \n",
"2 finchley-and-golders-green labour 0.909091 \n",
"3 torbay conservative 0.909091 \n",
"4 cornwall-north liberal democrat 0.909091 \n",
"\n",
" firstparty odds_first \n",
"0 conservative 0.833333 \n",
"1 conservative 0.833333 \n",
"2 conservative 0.800000 \n",
"3 liberal democrat 0.909091 \n",
"4 conservative 0.800000 "
]
}
],
"prompt_number": 349
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT - DEMONSTRATE HOW TO MERGE TWO OR MORE DATASETS\n",
"ASSESSMENT - DEMONSTRATE HOW TO MANIPULATE THE MARGINAL PROPERTIES OF A DATA TABLE (EG INDICES, COLUMN HEADINGS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now look to see what the possible swings are by party away from the favourite to a joint of close second favourite."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Start by finding the rows where the odds are short for the second favourite\n",
"#then count group sizes swinging from first to second party\n",
"m[m['odds_second']<=1.5].groupby(['firstparty','secondparty']).size()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 350,
"text": [
"firstparty secondparty \n",
"conservative labour 13\n",
" liberal democrat 2\n",
"labour conservative 10\n",
" liberal democrat 1\n",
" snp 2\n",
"liberal democrat conservative 6\n",
" labour 1\n",
" plaid cymru 1\n",
"plaid cymru labour 1\n",
"respect labour 1\n",
"snp conservative 1\n",
" labour 4\n",
"ukip conservative 1\n",
"dtype: int64"
]
}
],
"prompt_number": 350
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows, for example, that there are good chances that up to 13 *conservative* seats could go to *labour*, and 3 seats from *liberal democrat* to *conservative*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"In this notebook, I have investigated a data set containing election odds for the majority of UK constituencies on a single day prior to the UK General Election, 2015.\n",
"\n",
"The data predicited that the Conservatives would win the largest number of seats, though not a majority. The data predicted that the SNP would win a large number of Scottish constituency seats, and that Ed Ball's Pudsey constituency was unsafe for Labour.\n",
"\n",
"Analysis of constituencies with low-odds, joint or close first and second favourites suggested more possible \"swings\" from Conservative to Labour than vice versa and mosts swings from Liberal Democrats to the Conservatives."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ASSESSMENT NOTES\n",
"\n",
"Sports such as gymnastics use criteria based scoring where participants must demonstrate several elements from different difficulty groups (eg http://www.british-gymnastics.org/technical-information/selection/womens-artistic/cat_view/334-regions-and-home-countries/467-south-east/578-event-info ).\n",
"\n",
"One approach to asssessing notebooks on an investigation around a free-data-choice activity might be to require students to demonstrate a range of technical skills (perhaps self-identifying them to reinforce reflection about their work) in an appropriate context. I have tried to identify - and abstract - assessment opportunities along the way; should students be required to do the same as part of the assessment as part of a critique of their own work?\n",
"\n",
"Looking back over the assessment points, many of the steps were included *becuase the data needed treating in some way in order to ask a particular question or perform a particular transformation*. How can we capture the relationship between questions asked of the data and how those questions prompt certain transformations of the data in order to answer them?\n",
"\n",
"Many data anlayses are likely to include false starts that still take time to explore. Students should be allowed to include 'false-start' components in their script if they derive from a plausible initial line of investigation and demonstrate a required element.\n",
"\n",
"This notebook has focussed on the demonstration of particular skills using a particular programming language (Python) and programming library within that language (pandas). Some (many? all?) of the questions could have been asked directly of the dataset using SQL.\n",
"\n",
"Should the notebook require students to demonstrate solutions to the same problem in different languages?\n",
"\n",
"What assessment points are missing?\n",
"\n",
"The notebook does not include any graphical representations of the data (no charts).\n",
"\n",
"The notebook does not include any statistical analyses, other than simple rankings, sorting and extrema detection.\n",
"\n",
"The notebook does not require the student to model any data form or ingest any data into, a database.\n",
"\n",
"The notebook does not require students to do any more than single line programming at each step. That is, the student is not required to develop any functions (other than one line lambda functions) at any stage.\n",
"\n",
"As a rule of thumb, I estimate that each question cell, code cell, intepretation cell combination will take of the order 5-15 minutes to produce."
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment