Last active
August 29, 2015 14:00
-
-
Save jmcastagnetto/11127154 to your computer and use it in GitHub Desktop.
A simple document showing the use of XML to get data from an HTML table, dplyr for data manipulation, lubridate for some simple date mangling, and ggplot2 and googleVis to generate nice charts (interactive in the case of googleVis)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Analysis of the data on Presidents of Peru" | |
author: Jesus M. Castagnetto | |
output: | |
html_document: | |
toc: true | |
theme: readable | |
highlight: tango | |
--- | |
```{r chunkconfig, echo=FALSE} | |
library(knitr) | |
opts_chunk$set(comment=NA, warning=FALSE, message=FALSE) | |
``` | |
**Last generated on: `r date()`** | |
We will use the list of Presidents of Peru from a Wikipedia page, to | |
play a bit with some cool R packages (XML, dplyr, lubridate, ggplot2, | |
and googleVis), which will be used to extract and clean up the data, | |
and later make some summaries and plots. | |
## Requirements | |
For this experiment, we will need the following libraries | |
- XML: to parse and extract a table from an HTML page | |
- dplyr: to do some data manipulation | |
- lubridate: to do some date operations | |
- ggplot2: to generate a nice boxplot | |
- googleVis: to make some interactive tables and plots (I am using | |
the development version from github) | |
```{r} | |
require(XML) | |
require(dplyr) | |
require(lubridate) | |
require(ggplot2) | |
require(googleVis) | |
``` | |
If you don't have them installed, then you might want to run: | |
```{r eval=FALSE} | |
install.packages(c("XML", "dplyr", "lubridate", "ggplot2")) | |
# pre-requisites for the development version of googleVis | |
install.packages(c("devtools","RJSONIO", "knitr", "shiny", "httpuv")) | |
devtools::install_github("mages/googleVis") | |
``` | |
## Getting and mangling the data | |
First, let's read the data from the third HTML table in Wikipedia's | |
page: ["List of Presidents of Peru"](http://en.wikipedia.org/wiki/List_of_Presidents_of_Peru) | |
```{r getdata} | |
src <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_Peru" | |
doc = htmlParse(src, encoding = "UTF-8") | |
tables <- readHTMLTable(doc) | |
# the table we need is the third one | |
t3 <- tables[[3]] | |
``` | |
Then, we ought to fix some weirdness in the data, and will save it to a | |
CSV just in case we want to do some more processing in the future. As we | |
are keeping the original column names from the HTML table, some code is | |
a bit more cumbersome (because we need to use backticks). | |
```{r mangle} | |
# We do not need column #2, which contains an image | |
# also, let's reorder the columns | |
t3 <- t3[,c(3,6,7,4,5)] | |
# convert to dates the start and end term columns | |
fix_date <- function(x) { | |
return(as.Date(strptime(x, format="%B %e, %Y"))) | |
} | |
t3[c("Inaugurated","Left office")] <- lapply(t3[c("Inaugurated","Left office")], fix_date) | |
# cleanup random stuff in []s in a couple of columns | |
t3$`Form of entry` <- sub("\\[7\\]", "", gsub("\n", " - ", t3$`Form of entry`)) | |
t3$President <- gsub("\\[.+\\]", "", as.character(t3$President)) | |
# add the regular end of term (5 yrs) for the current president | |
last <- nrow(t3) | |
if (is.na(t3[last,]$`Left office`)) { | |
tmp <- t3[last,]$Inaugurated | |
year(tmp) <- year(tmp) + 5 # normal presidential term: 5 years | |
t3[last,]$`Left office` <- tmp | |
} | |
# save the cleaned up data | |
write.csv(t3, "peru-presidents.csv", row.names=FALSE) | |
``` | |
## Displaying the data as a sortable and paginated table | |
Let's look at the data we got after scraping Wikipedia and mangling values around. We'll | |
make an interactive table using the *gvisTable* function from the **googleVis** package. | |
We want to paginate the table, because Peru has had `r nrow(t3)` people | |
that held the Presidency at one point or another. The table is a bit | |
wide, so it will look nicer. | |
```{r ptable, results='asis'} | |
opts <- list(width=1000, height=330, showRowNumber=TRUE, page="enable") | |
presidents_table <- gvisTable(t3, options=opts) | |
print(presidents_table, "chart") | |
``` | |
## Creating a timeline chart | |
Now, let's visualize the succession of presidents using a timeline chart | |
as implemented in **googleVis**, coloring each timespan by the what | |
original data calls "Form of entry", which is how a particular person got | |
into the Presidency. There are `r sum(t3[,2]=="")` records that do not | |
have a given value for the aforementioned field, so we will recode those | |
as "*Unknown*". | |
This chart is also a bit wide, because the data spans over | |
`r year(max(t3$Inaugurated)) - year(min(t3$Inaugurated))` years. | |
```{r ptimeline, results='asis'} | |
t3[t3[, 2] == "", 2] <- "Unkown" | |
presidents_timeline <- gvisTimeline( | |
t3, rowlabel="President", start="Inaugurated", end="Left office", | |
barlabel="Form of entry", options=list(height=500, width=1200)) | |
print(presidents_timeline, "chart") | |
``` | |
You might have noticed that at some points in Peru's history we had | |
more than one President, and at other times they seem to change rapidly | |
or to swing back and forth among a number of recurring characters. Such | |
was our lot back then, but we have had better luck for some decades now. | |
## Understanding how they got into power | |
We will make cummulative frequency chart, by using **dplyr** to manipulate | |
and summarize the data and **googleVis** to plot it. We could've | |
used *table()* along with other base functions, but dplyr's syntax is | |
cleaner and more readable. | |
```{r dplyr, results='asis'} | |
# group by "Form of entry", get the counts per group, and sort the | |
# data frame in descending order of counts | |
t3_summary <- t3 %>% group_by("Form of entry") %>% | |
summarise(count=n()) %>% | |
arrange(-count, `Form of entry`) %>% | |
mutate(`Cummulative frequency`=round(100*cumsum(count)/sum(count),2)) | |
# make the cummulative frequency chart and print it | |
t3_summary_chart <- gvisLineChart( | |
t3_summary, xvar="Form of entry", yvar="Cummulative frequency", | |
options=list(height=400, width=800, pointSize=5, | |
title="How peruvian presidents got into office", | |
vAxis="{title:'Cummulative frequency (%)'}", | |
hAxis="{title:'Mode of attaining office'}", | |
legend="{position:'none'}") | |
) | |
print(t3_summary_chart, "chart") | |
``` | |
```{r echo=FALSE} | |
tmp1 <- t3_summary[1:4,] | |
ntmp1 <- nrow(tmp1) | |
cftmp1 <- round(tmp1[ntmp1, 3]) | |
``` | |
In this chart we can plainly see that the first `r ntmp1` modes of | |
attaining office (`r paste0(paste0("\"*",tmp1[1:(ntmp1 - 1),1], collapse = "*\", "), "*\", and \"*", tmp1[ntmp1,1], "*\"")`), comprise the majority (a bit over `r cftmp1`%) of all the | |
ways that the office of President have ever been attained in Peru. | |
## Length of time in office | |
If we wanted to know the distribution of the lengths of time | |
in office for all presidents, we can do some simple data exploration and | |
create a histogram, with the the median and mean overlayed on it: | |
```{r} | |
# length of time in office | |
lio_days <- as.numeric(t3$`Left office` - t3$Inaugurated + 1) # in days | |
lio_yrs <- lio_days / 365.25 # in years | |
lio_yrs_mean <- round(mean(lio_yrs),2) | |
lio_yrs_median <- round(median(lio_yrs),2) | |
hist(lio_yrs, main="Distribution of time in office", xlab="Time span (years)") | |
abline(v=lio_yrs_mean, col="red", lwd=2, lty="dashed") | |
abline(v=lio_yrs_median, col="blue", lwd=2, lty="dashed") | |
text(x=c(lio_yrs_mean+.1, lio_yrs_median+.1), y=c(25,45), | |
labels=c(paste0("Mean=", lio_yrs_mean), paste0("Median=", lio_yrs_median)), | |
pos=4, col=c("red", "blue")) | |
``` | |
We can see a typical right-skewed distribution, with a great majority of short | |
lengths of term in office (as little as `r min(lio_days)` days), and some | |
exceptionally long ones (as much as ~`r round(max(lio_yrs), 2)` years). So | |
in this case, the mean (`r lio_yrs_mean` years) is not very informative, and | |
the median (`r lio_yrs_median` years) looks suspiciosly short. | |
Let's look at these time spans groupíng them by the way each one attained the | |
office. | |
```{r lengthoffice} | |
t3$len_office <- lio_yrs | |
# generate a grouping variable based on the form of entry | |
t3$group <- gsub("^([^-]+) -(.*)", "\\1", t3[,"Form of entry"], fixed=FALSE) | |
# combine the forms of entry with counts less than 10 | |
tg <- table(t3$group) | |
otherlvl <- names(tg[tg < 10]) | |
t3[t3$group %in% otherlvl,]$group <- "Other" | |
# reorder the grouping column by increasing count | |
tg <- table(t3$group) | |
t3$group <- factor(as.character(t3$group), levels=names(tg[order(tg)])) | |
# Make boxplots for each grouping factor | |
t3_plot <- ggplot(t3, aes(group, len_office)) + | |
geom_hline(yintercept=5, colour="gray", linetype="longdash") + | |
geom_boxplot(aes(colour=group)) + | |
ggtitle("Distributions of Peruvian President's terms in office") + | |
coord_flip() + ylab("Length in office (years)") + xlab("How office was attained") + | |
theme_bw() + theme(legend.position="none") | |
t3_plot | |
``` | |
In this chart we have added a reference line, the official time span for a | |
President's term in office in Peru: 5 years. It would seem that if you got | |
into office by "Direct Elections" you have a better chance to reach you | |
usual term (median ~ 4 years), but if you got by another *route* (let's | |
say by "Coup d'état") you are more likely to be there for a short time. | |
In the table below, we can see a set of summary statistics per group, which | |
indicate a distinctive difference between them. | |
```{r results='asis'} | |
t3_grouped <- t3 %>% group_by(group) %>% | |
summarise(n=n(), avg=mean(len_office), sd=sd(len_office), median=median(len_office), | |
min=min(len_office), max=max(len_office), iqr=IQR(len_office)) %>% | |
arrange(-n) | |
t3_grouped[-1] <- sapply(t3_grouped[-1], function (x) { round(x,3) }) | |
t3_grouped_table <- gvisTable(t3_grouped, | |
options=list(width=800, height=200)) | |
print(t3_grouped_table, "chart") | |
``` | |
In fact, using a Kruskal-Walis rank sum test, seems to indicate that the | |
groups are indeed different (p < 0.001). | |
```{r} | |
kruskal.test(len_office ~ group, data=t3) | |
``` | |
There might be a moral in this data, but policital conclusions run the risk | |
of degenerating in random rants, so I'll skip that. | |
## Reproducibility information | |
The source code for this document is available at [https://gist.github.com/jmcastagnetto/11127154](https://gist.github.com/jmcastagnetto/11127154) | |
```{r} | |
sessionInfo() | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment