Skip to content

Instantly share code, notes, and snippets.

@dougmet
Created February 13, 2015 15:05
Show Gist options
  • Save dougmet/3cb2c565ef51df705f09 to your computer and use it in GitHub Desktop.
Save dougmet/3cb2c565ef51df705f09 to your computer and use it in GitHub Desktop.
---
title: "Predicting the six nations"
author: "Douglas Ashton"
output:
html_document:
css: rugby.css
---
I don't know a lot about rugby, which can be a problem living in a [rugby town](http://en.wikipedia.org/wiki/Bath_Rugby). Especially when the office sweep stake on the upcoming Wales vs England six nations game goes round: apparently 2-1 is not a valid rugby score. I'm not about to put a pound down without some research. Fortunately England and Wales have played each other before:
http://en.wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales
and R has some nice tools for grabbing data from the web.
## Scraping and cleaning
To get the data in a usable form the [rvest](http://cran.r-project.org/web/packages/rvest/index.html) package has some really useful tools. With a few lines of code we can pull the html data into a data frame.
```{r scraping, cache=TRUE}
library(rvest)
library(magrittr)
rugbyHTML <- html("http://en.wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales")
rugbyData <- rugbyHTML %>%
html_nodes("table.wikitable") %>% .[[3]] %>%
html_table
```
Some grepping and date parsing later we have a cleaned up dataset.
```{r echo=FALSE}
# Rows with valid data
rugbyData <- rugbyData[2:121,]
# Sort out the dates
rugbyData$Date <- as.Date(rugbyData$Date,format="%e %B %Y")
# Split the score into home score and away score
scoreList <- regmatches(rugbyData$Score,gregexpr("[0-9]+", rugbyData$Score))
homeScore <- sapply(scoreList, function(x) as.numeric(x[1]))
awayScore <- sapply(scoreList, function(x) as.numeric(x[2]))
#rugbyData$homeScore <- homeScore
#rugbyData$awayScore <- awayScore
# A list of the winning score and losing scores
winningScore <- ifelse(homeScore > awayScore, homeScore, awayScore)
losingScore <- ifelse(homeScore < awayScore, homeScore, awayScore)
rugbyData$winningScore <- winningScore
rugbyData$losingScore <- losingScore
# If you won, take the winning score, otherwise take the losing score
walesWin <- rugbyData$Winner=="Wales"
walesScore <- homeScore # Any old vector
walesScore[walesWin] <- winningScore[walesWin]
walesScore[!walesWin] <- losingScore[!walesWin]
englandWin <- rugbyData$Winner=="England"
englandScore <- homeScore # Any old vector
englandScore[englandWin] <- winningScore[englandWin]
englandScore[!englandWin] <- losingScore[!englandWin]
# Put it in the data.frame
rugbyData$englandScore <- englandScore
rugbyData$walesScore <- walesScore
# Sort out who was home and who was away
rugbyData$homeTeam <- rep("Other", nrow(rugbyData))
englandHome <- grepl("London|Bristol|Richmond|Leicester|Yorkshire|Birkenhead|Gloucester", rugbyData$Venue)
rugbyData$homeTeam[englandHome] <- "England"
walesHome <- grepl("Cardiff|Swansea|Newport|Llanelli", rugbyData$Venue)
rugbyData$homeTeam[walesHome] <- "Wales"
rugbyData$homeTeam <- factor(rugbyData$homeTeam)
rugbyData$Winner <- factor(rugbyData$Winner, levels=c("England", "draw", "Wales"))
```
```{r printTable, echo=FALSE, message=FALSE}
#
library(knitr)
kable(rugbyData[1:6,c("Date", "homeTeam", "englandScore", "walesScore")],
format="markdown", digits=2,row.names=FALSE,
col.names=c("Date", "Home", " &nbsp; &nbsp; England", " &nbsp; &nbsp; Wales"),
table.attr = "class=center")
```
A quick look at the data suggests things are have been pretty even over the years, with some big wins for England around the turn of the Millenium and Wales dominant in the 60s and 70s.
```{r echo=FALSE}
library(ggplot2)
qplot(Date,englandScore - walesScore,
data=rugbyData, color=homeTeam,
ylim=c(-60,60), ylab="Score difference (England - Wales)") +
scale_color_manual("Home Team", values=c('steelblue','grey','red')) +
theme_bw()
```
## Who's going to win?
If we just look at who has won previous encounters we see that Wales have a slight edge but nothing statistically significant.
```{r echo=FALSE}
wins <-table(rugbyData$Winner)
names(wins) <- c("England Wins", "Draw", "Wales Wins")
kable(t(sort(wins, decreasing = TRUE)), format="markdown", row.names=FALSE,
caption="Match Wins")
```
How about if we take into account home and away form? The game on the 6th will be in Cardiff, will that give the edge to Wales?
```{r}
walesWin <- rugbyData$englandScore - rugbyData$walesScore < 0
home <- rugbyData$homeTeam
fit <- glm(walesWin ~ home - 1, family='binomial')
```
```{r echo=FALSE}
kable(summary(fit)$coefficients, "markdown", digits=3,
caption="summary(fit)$coefficients")
x <- fit$coefficients["homeWales"]
pWales <- exp(x)/(1+exp(x))
```
<br \>
which suggests the chances of a Wales win is `r round(100*pWales)`%. So I'd say yes. It's not a powerful prediction but Wales tend to win in Wales. Good enough for me, I'll go with Wales. OK, so what's the damage going to be?
## What's the score?
The sweep stake requires scores. This is the bit I really have no idea about, for a football fan used to scores such as [2-0](http://www.bbc.co.uk/sport/0/football/30860467), rugby scores seem arbitrarily large. Back to the data I guess. First up, what's the total?
```{r echo=FALSE, warning=FALSE}
library(ggplot2)
# rugbyData50 <- rugbyData[rugbyData$Date > as.Date("1950-01-01"), ]
# fitScore <- lm(englandScore + walesScore ~ Date * Winner, data=rugbyData50)
qplot(Date,englandScore + walesScore,
data=rugbyData, color=Winner, ylab="Total points") +
scale_color_manual("Winning Team", values=c('steelblue','grey','red')) +
theme_bw()
# geom_abline(intercept=fitScore$coefficients[1] + fitScore$coefficients["WinnerWales"],
# slope=fitScore$coefficients["Date"] + fitScore$coefficients["Date:WinnerWales"],
# color="red", size=1.5, alpha=0.5)
```
Interestingly it looks like the total points has been going up since the 50s. At this point I'm desperate, let's predict the total score by fitting since the 50s and extrapolating. When Wales win there tend to be less points, let's throw that into the model as well, it will screen out those silly big English wins at the Millenium.
```{r}
rugbyData50 <- rugbyData[rugbyData$Date > as.Date("1950-01-01"), ]
fitScore <- lm(englandScore + walesScore ~ Date * Winner, data=rugbyData50)
fitDiff <- lm(winningScore - losingScore ~ Date * Winner, data=rugbyData50)
tScore <- predict(fitScore, data.frame(Date=as.Date("2015-02-06"), Winner="Wales"))
dScore <- predict(fitDiff, data.frame(Date=as.Date("2015-02-06"), Winner="Wales"))
```
Which predicts a total score on Friday of `r round(tScore)` and a difference of `r round(dScore)`, giving my final prediction as
<center><b>
Wales `r round(0.5*(tScore + dScore))` - `r round(0.5*(tScore - dScore))` England
</b></center>
That'll do for a pound I think.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment