Created
February 13, 2015 15:05
-
-
Save dougmet/3cb2c565ef51df705f09 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Predicting the six nations" | |
author: "Douglas Ashton" | |
output: | |
html_document: | |
css: rugby.css | |
--- | |
I don't know a lot about rugby, which can be a problem living in a [rugby town](http://en.wikipedia.org/wiki/Bath_Rugby). Especially when the office sweep stake on the upcoming Wales vs England six nations game goes round: apparently 2-1 is not a valid rugby score. I'm not about to put a pound down without some research. Fortunately England and Wales have played each other before: | |
http://en.wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales | |
and R has some nice tools for grabbing data from the web. | |
## Scraping and cleaning | |
To get the data in a usable form the [rvest](http://cran.r-project.org/web/packages/rvest/index.html) package has some really useful tools. With a few lines of code we can pull the html data into a data frame. | |
```{r scraping, cache=TRUE} | |
library(rvest) | |
library(magrittr) | |
rugbyHTML <- html("http://en.wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales") | |
rugbyData <- rugbyHTML %>% | |
html_nodes("table.wikitable") %>% .[[3]] %>% | |
html_table | |
``` | |
Some grepping and date parsing later we have a cleaned up dataset. | |
```{r echo=FALSE} | |
# Rows with valid data | |
rugbyData <- rugbyData[2:121,] | |
# Sort out the dates | |
rugbyData$Date <- as.Date(rugbyData$Date,format="%e %B %Y") | |
# Split the score into home score and away score | |
scoreList <- regmatches(rugbyData$Score,gregexpr("[0-9]+", rugbyData$Score)) | |
homeScore <- sapply(scoreList, function(x) as.numeric(x[1])) | |
awayScore <- sapply(scoreList, function(x) as.numeric(x[2])) | |
#rugbyData$homeScore <- homeScore | |
#rugbyData$awayScore <- awayScore | |
# A list of the winning score and losing scores | |
winningScore <- ifelse(homeScore > awayScore, homeScore, awayScore) | |
losingScore <- ifelse(homeScore < awayScore, homeScore, awayScore) | |
rugbyData$winningScore <- winningScore | |
rugbyData$losingScore <- losingScore | |
# If you won, take the winning score, otherwise take the losing score | |
walesWin <- rugbyData$Winner=="Wales" | |
walesScore <- homeScore # Any old vector | |
walesScore[walesWin] <- winningScore[walesWin] | |
walesScore[!walesWin] <- losingScore[!walesWin] | |
englandWin <- rugbyData$Winner=="England" | |
englandScore <- homeScore # Any old vector | |
englandScore[englandWin] <- winningScore[englandWin] | |
englandScore[!englandWin] <- losingScore[!englandWin] | |
# Put it in the data.frame | |
rugbyData$englandScore <- englandScore | |
rugbyData$walesScore <- walesScore | |
# Sort out who was home and who was away | |
rugbyData$homeTeam <- rep("Other", nrow(rugbyData)) | |
englandHome <- grepl("London|Bristol|Richmond|Leicester|Yorkshire|Birkenhead|Gloucester", rugbyData$Venue) | |
rugbyData$homeTeam[englandHome] <- "England" | |
walesHome <- grepl("Cardiff|Swansea|Newport|Llanelli", rugbyData$Venue) | |
rugbyData$homeTeam[walesHome] <- "Wales" | |
rugbyData$homeTeam <- factor(rugbyData$homeTeam) | |
rugbyData$Winner <- factor(rugbyData$Winner, levels=c("England", "draw", "Wales")) | |
``` | |
```{r printTable, echo=FALSE, message=FALSE} | |
# | |
library(knitr) | |
kable(rugbyData[1:6,c("Date", "homeTeam", "englandScore", "walesScore")], | |
format="markdown", digits=2,row.names=FALSE, | |
col.names=c("Date", "Home", " England", " Wales"), | |
table.attr = "class=center") | |
``` | |
A quick look at the data suggests things are have been pretty even over the years, with some big wins for England around the turn of the Millenium and Wales dominant in the 60s and 70s. | |
```{r echo=FALSE} | |
library(ggplot2) | |
qplot(Date,englandScore - walesScore, | |
data=rugbyData, color=homeTeam, | |
ylim=c(-60,60), ylab="Score difference (England - Wales)") + | |
scale_color_manual("Home Team", values=c('steelblue','grey','red')) + | |
theme_bw() | |
``` | |
## Who's going to win? | |
If we just look at who has won previous encounters we see that Wales have a slight edge but nothing statistically significant. | |
```{r echo=FALSE} | |
wins <-table(rugbyData$Winner) | |
names(wins) <- c("England Wins", "Draw", "Wales Wins") | |
kable(t(sort(wins, decreasing = TRUE)), format="markdown", row.names=FALSE, | |
caption="Match Wins") | |
``` | |
How about if we take into account home and away form? The game on the 6th will be in Cardiff, will that give the edge to Wales? | |
```{r} | |
walesWin <- rugbyData$englandScore - rugbyData$walesScore < 0 | |
home <- rugbyData$homeTeam | |
fit <- glm(walesWin ~ home - 1, family='binomial') | |
``` | |
```{r echo=FALSE} | |
kable(summary(fit)$coefficients, "markdown", digits=3, | |
caption="summary(fit)$coefficients") | |
x <- fit$coefficients["homeWales"] | |
pWales <- exp(x)/(1+exp(x)) | |
``` | |
<br \> | |
which suggests the chances of a Wales win is `r round(100*pWales)`%. So I'd say yes. It's not a powerful prediction but Wales tend to win in Wales. Good enough for me, I'll go with Wales. OK, so what's the damage going to be? | |
## What's the score? | |
The sweep stake requires scores. This is the bit I really have no idea about, for a football fan used to scores such as [2-0](http://www.bbc.co.uk/sport/0/football/30860467), rugby scores seem arbitrarily large. Back to the data I guess. First up, what's the total? | |
```{r echo=FALSE, warning=FALSE} | |
library(ggplot2) | |
# rugbyData50 <- rugbyData[rugbyData$Date > as.Date("1950-01-01"), ] | |
# fitScore <- lm(englandScore + walesScore ~ Date * Winner, data=rugbyData50) | |
qplot(Date,englandScore + walesScore, | |
data=rugbyData, color=Winner, ylab="Total points") + | |
scale_color_manual("Winning Team", values=c('steelblue','grey','red')) + | |
theme_bw() | |
# geom_abline(intercept=fitScore$coefficients[1] + fitScore$coefficients["WinnerWales"], | |
# slope=fitScore$coefficients["Date"] + fitScore$coefficients["Date:WinnerWales"], | |
# color="red", size=1.5, alpha=0.5) | |
``` | |
Interestingly it looks like the total points has been going up since the 50s. At this point I'm desperate, let's predict the total score by fitting since the 50s and extrapolating. When Wales win there tend to be less points, let's throw that into the model as well, it will screen out those silly big English wins at the Millenium. | |
```{r} | |
rugbyData50 <- rugbyData[rugbyData$Date > as.Date("1950-01-01"), ] | |
fitScore <- lm(englandScore + walesScore ~ Date * Winner, data=rugbyData50) | |
fitDiff <- lm(winningScore - losingScore ~ Date * Winner, data=rugbyData50) | |
tScore <- predict(fitScore, data.frame(Date=as.Date("2015-02-06"), Winner="Wales")) | |
dScore <- predict(fitDiff, data.frame(Date=as.Date("2015-02-06"), Winner="Wales")) | |
``` | |
Which predicts a total score on Friday of `r round(tScore)` and a difference of `r round(dScore)`, giving my final prediction as | |
<center><b> | |
Wales `r round(0.5*(tScore + dScore))` - `r round(0.5*(tScore - dScore))` England | |
</b></center> | |
That'll do for a pound I think. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment