## peru-presidents.Rmd
---
title: "Analysis of the data on Presidents of Peru"
author: Jesus M. Castagnetto
output:
  html_document:
    toc: true
    theme: readable
    highlight: tango
---

```{r chunkconfig, echo=FALSE}
library(knitr)
opts_chunk$set(comment=NA, warning=FALSE, message=FALSE)
```

**Last generated on: `r date()`**

We will use the list of Presidents of Peru from a Wikipedia page, to
play a bit with some cool R packages (XML, dplyr, lubridate, ggplot2,
and googleVis), which will be used to extract and clean up the data,
and later make some summaries and plots.

## Requirements

For this experiment, we will need the following libraries:

- XML: to parse and extract a table from an HTML page
- dplyr: to do some data manipulation
- lubridate: to do some date operations
- ggplot2: to generate a nice boxplot
- googleVis: to make some interactive tables and plots (I am using
  the development version from github)

```{r}
require(XML)
require(dplyr)
require(lubridate)
require(ggplot2)
require(googleVis)
```

If you don't have them installed, then you might want to run:

```{r eval=FALSE}
install.packages(c("XML", "dplyr", "lubridate", "ggplot2"))
# pre-requisites for the development version of googleVis
install.packages(c("devtools","RJSONIO", "knitr", "shiny", "httpuv"))
devtools::install_github("mages/googleVis")
```
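As a hedged alternative to installing everything unconditionally, the sketch below only reports which of the CRAN packages listed above are not yet available. The names `wanted`, `is_installed`, and `missing_pkgs` are illustrative and are not used elsewhere in this document.

```r
# CRAN packages used in this document (googleVis is handled separately above)
wanted <- c("XML", "dplyr", "lubridate", "ggplot2")

# requireNamespace() returns FALSE, instead of raising an error,
# when a package is not installed
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!is_installed]

# report what is missing; nothing is printed when everything is present
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to actually install them
}
```

Keeping the `install.packages()` call commented out means the check itself never touches the network.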

## Getting and mangling the data

First, let's read the data from the third HTML table in Wikipedia's
page: ["List of Presidents of Peru"](http://en.wikipedia.org/wiki/List_of_Presidents_of_Peru)

```{r getdata}
src <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_Peru"
doc <- htmlParse(src, encoding = "UTF-8")
tables <- readHTMLTable(doc)
# the table we need is the third one
t3 <- tables[[3]]
```
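The position of a table on a Wikipedia page can change whenever the page is edited, so selecting it by index is fragile. A minimal sketch of a more defensive approach is to pick the first table that has the columns we need. The list `tables_demo` is a made-up stand-in for the output of `readHTMLTable()`, and `find_table()` is a hypothetical helper, not part of the **XML** package.

```r
# Toy stand-in for the list returned by readHTMLTable(); the contents are made up
tables_demo <- list(
  data.frame(Note = "navigation box"),
  data.frame(President = "A. Example", Inaugurated = "July 28, 1990",
             `Left office` = "July 28, 1995", check.names = FALSE)
)

# Hypothetical helper: return the first table that has all the wanted columns
find_table <- function(tables, wanted_cols) {
  has_cols <- vapply(tables,
                     function(tb) !is.null(tb) && all(wanted_cols %in% names(tb)),
                     logical(1))
  if (!any(has_cols)) stop("no table with the expected columns was found")
  tables[[which(has_cols)[1]]]
}

picked <- find_table(tables_demo, c("President", "Inaugurated", "Left office"))
names(picked)
```

Failing loudly when no table matches is a deliberate choice here: a silent fallback to the wrong table would corrupt every later step.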

Then, we ought to fix some weirdness in the data, and will save it to a
CSV just in case we want to do some more processing in the future. As we
are keeping the original column names from the HTML table, some code is
a bit more cumbersome (because we need to use backticks).

```{r mangle}
# We do not need column #2, which contains an image
# also, let's reorder the columns
t3 <- t3[,c(3,6,7,4,5)]

# convert to dates the start and end term columns
fix_date <- function(x) {
    return(as.Date(strptime(x, format="%B %e, %Y")))
}
t3[c("Inaugurated","Left office")] <- lapply(t3[c("Inaugurated","Left office")], fix_date)

# cleanup random stuff in []s in a couple of columns
t3$`Form of entry` <- sub("\\[7\\]", "", gsub("\n", " - ", t3$`Form of entry`))
t3$President <- gsub("\\[.+\\]", "", as.character(t3$President))

# add the regular end of term (5 yrs) for the current president
last <- nrow(t3)
if (is.na(t3[last,]$`Left office`)) {
  tmp <- t3[last,]$Inaugurated
  year(tmp) <- year(tmp) + 5  # normal presidential term: 5 years
  t3[last,]$`Left office` <- tmp
}

# save the cleaned up data
write.csv(t3, "peru-presidents.csv", row.names=FALSE)
```
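Two of the clean-up steps above can be tried out on made-up values. The sketch assumes English month names, which is why it temporarily sets the `LC_TIME` locale to `"C"`; it also contrasts the greedy pattern `\\[.+\\]` with a per-marker alternative, `\\[[^]]*\\]`, which is a suggestion and not what the chunk above uses.

```r
# %B matches month names in the current locale, so force English names here
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# made-up sample values in the same style as the Wikipedia table
raw_dates <- c("July 28, 1990", "April 5, 1992", "")
parsed <- as.Date(strptime(raw_dates, format = "%B %e, %Y"))
parsed  # the empty string becomes NA

# footnote markers: "\\[.+\\]" is greedy and removes everything between the
# first "[" and the last "]", while "\\[[^]]*\\]" removes each marker on its own
name <- "A. Example[1] (interim)[2]"
greedy <- gsub("\\[.+\\]", "", name)
per_marker <- gsub("\\[[^]]*\\]", "", name)

Sys.setlocale("LC_TIME", old_locale)  # restore the previous setting
```

With a single footnote marker per name both patterns give the same result; they only differ when a value carries more than one marker.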

## Displaying the data as a sortable and paginated table

Let's look at the data we got after scraping Wikipedia and mangling values around. We'll
make an interactive table using the *gvisTable* function from the **googleVis** package.

We paginate the table because Peru has had `r nrow(t3)` people
who held the Presidency at one point or another. We also make the table a bit
wide, so that it looks nicer.

```{r ptable, results='asis'}
opts <- list(width=1000, height=330, showRowNumber=TRUE, page="enable")
presidents_table <- gvisTable(t3, options=opts)
print(presidents_table, "chart")
```

## Creating a timeline chart

Now, let's visualize the succession of presidents using a timeline chart
as implemented in **googleVis**, coloring each timespan by what the
original data calls "Form of entry", which is how a particular person got
into the Presidency. There are `r sum(t3[,2]=="")` records that do not
have a value for that field, so we will recode those
as "*Unknown*".

This chart is also a bit wide, because the data spans
`r year(max(t3$Inaugurated)) - year(min(t3$Inaugurated))` years.

```{r ptimeline, results='asis'}
t3[t3[, 2] == "", 2] <- "Unknown"
presidents_timeline <- gvisTimeline(
    t3, rowlabel="President", start="Inaugurated", end="Left office",
    barlabel="Form of entry", options=list(height=500, width=1200))
print(presidents_timeline, "chart")
```

You might have noticed that at some points in Peru's history we had
more than one President, and at other times they seem to change rapidly
or to swing back and forth among a number of recurring characters. Such
was our lot back then, but we have had better luck for some decades now.

## Understanding how they got into power

We will make a cumulative frequency chart, using **dplyr** to manipulate
and summarize the data and **googleVis** to plot it. We could have
used *table()* along with other base functions, but dplyr's syntax is
cleaner and more readable.

```{r dplyr, results='asis'}
# group by "Form of entry", get the counts per group, and sort the
# data frame in descending order of counts
# (backticks, not quotes, so that dplyr sees a column name and not a string)
t3_summary <- t3 %>% group_by(`Form of entry`) %>%
    summarise(count=n()) %>%
    arrange(desc(count), `Form of entry`) %>%
    mutate(`Cumulative frequency`=round(100*cumsum(count)/sum(count),2))

# make the cumulative frequency chart and print it
t3_summary_chart <- gvisLineChart(
    t3_summary, xvar="Form of entry", yvar="Cumulative frequency",
    options=list(height=400, width=800, pointSize=5,
                 title="How Peruvian presidents got into office",
                 vAxis="{title:'Cumulative frequency (%)'}",
                 hAxis="{title:'Mode of attaining office'}",
                 legend="{position:'none'}")
    )
print(t3_summary_chart, "chart")
```
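For comparison, here is a minimal sketch of the base R route mentioned above, using *table()* and *cumsum()*. The `entry` vector is made up for illustration and does not come from the scraped data.

```r
# made-up "form of entry" values, only to illustrate the computation
entry <- c("Direct elections", "Coup d'etat", "Direct elections",
           "Appointed", "Coup d'etat", "Direct elections")

# count each value and sort the counts in decreasing order
counts <- sort(table(entry), decreasing = TRUE)

# cumulative share of the total, as a percentage
summary_demo <- data.frame(
  form = names(counts),
  count = as.vector(counts),
  cum_freq = round(100 * cumsum(as.vector(counts)) / sum(counts), 2)
)
summary_demo
```

The result has the same three pieces of information as the dplyr summary: the form of entry, its count, and the running percentage.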

```{r echo=FALSE}
tmp1 <- t3_summary[1:4,]
ntmp1 <- nrow(tmp1)
# extract a plain number (not a one-cell data frame) for use in the text
cftmp1 <- round(tmp1[[3]][ntmp1])
```

In this chart we can plainly see that the first `r ntmp1` modes of
attaining office (`r paste0(paste0("\"*", tmp1[[1]][1:(ntmp1 - 1)], collapse = "*\", "), "*\", and \"*", tmp1[[1]][ntmp1], "*\"")`) comprise the majority (a bit over `r cftmp1`%) of all the
ways that the office of President has ever been attained in Peru.

## Length of time in office

If we want to know the distribution of the lengths of time
in office for all presidents, we can do some simple data exploration and
create a histogram, with the median and mean overlaid on it:

```{r}
# length of time in office
lio_days <- as.numeric(t3$`Left office` - t3$Inaugurated + 1)  # in days
lio_yrs <- lio_days / 365.25   # in years
lio_yrs_mean <- round(mean(lio_yrs),2)
lio_yrs_median <- round(median(lio_yrs),2)
hist(lio_yrs, main="Distribution of time in office", xlab="Time span (years)")
abline(v=lio_yrs_mean, col="red", lwd=2, lty="dashed")
abline(v=lio_yrs_median, col="blue", lwd=2, lty="dashed")
text(x=c(lio_yrs_mean+.1, lio_yrs_median+.1), y=c(25,45),
     labels=c(paste0("Mean=", lio_yrs_mean), paste0("Median=", lio_yrs_median)),
     pos=4, col=c("red", "blue"))
```
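The span arithmetic in the chunk above can be checked on a single term. This is a minimal sketch with one sample pair of dates; the variable names are illustrative.

```r
# sample term: exactly five calendar years
start <- as.Date("2006-07-28")
end   <- as.Date("2011-07-28")

# subtracting two Date objects gives a difftime in days; the + 1 counts both
# the first and the last day of the term
days_in_office <- as.numeric(end - start + 1)

# 365.25 is the average year length once leap years are taken into account
years_in_office <- days_in_office / 365.25
round(years_in_office, 2)
```

Because of the inclusive day count and the 365.25 divisor, a full five-year term comes out marginally above 5 before rounding.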

We can see a typical right-skewed distribution, with a great majority of short
lengths of term in office (as little as `r min(lio_days)` days), and some
exceptionally long ones (as much as ~`r round(max(lio_yrs), 2)` years). So
in this case, the mean (`r lio_yrs_mean` years) is not very informative, and
the median (`r lio_yrs_median` years) looks suspiciously short.

Let's look at these time spans, grouping them by the way each president attained
office.

```{r lengthoffice}
t3$len_office <- lio_yrs
# generate a grouping variable based on the form of entry
t3$group <- gsub("^([^-]+) -(.*)", "\\1", t3[,"Form of entry"], fixed=FALSE)
# combine the forms of entry with counts less than 10
tg <- table(t3$group)
otherlvl <- names(tg[tg < 10])
t3[t3$group %in% otherlvl,]$group <- "Other"
# reorder the grouping column by increasing count
tg <- table(t3$group)
t3$group <- factor(as.character(t3$group), levels=names(tg[order(tg)]))
# Make boxplots for each grouping factor
t3_plot <- ggplot(t3, aes(group, len_office)) +
    geom_hline(yintercept=5, colour="gray", linetype="longdash") +
    geom_boxplot(aes(colour=group)) +
    ggtitle("Distribution of Peruvian presidents' terms in office") +
    coord_flip() + ylab("Length in office (years)") + xlab("How office was attained") +
    theme_bw() + theme(legend.position="none")
t3_plot
```
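The lumping of rare groups into "Other" and the reordering of the factor levels can be seen in isolation on a small made-up vector. The threshold `min_count` is an illustrative parameter; the chunk above hard-codes 10.

```r
# made-up group labels, only to illustrate the lumping step
grp <- c(rep("Direct elections", 4), rep("Coup d'etat", 3),
         "Appointed", "Succession")
min_count <- 3  # illustrative threshold; the chunk above uses 10

# labels that occur fewer than min_count times are replaced by "Other"
tab <- table(grp)
rare <- names(tab[tab < min_count])
grp[grp %in% rare] <- "Other"

# order the factor levels by increasing count, as done for the boxplot
tab <- table(grp)
grp <- factor(grp, levels = names(tab[order(tab)]))
levels(grp)
```

Ordering the levels by count is what makes the boxes appear sorted once `coord_flip()` is applied.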

In this chart we have added a reference line, the official time span for a
President's term in office in Peru: 5 years. It would seem that if you got
into office by "Direct Elections" you have a better chance of completing the
usual term (median ~ 4 years), but if you got in by another *route* (let's
say by "Coup d'état") you are more likely to be there for a short time.

In the table below, we can see a set of summary statistics per group, which
indicate a distinct difference between them.

```{r results='asis'}
t3_grouped <- t3 %>% group_by(group) %>%
    summarise(n=n(), avg=mean(len_office), sd=sd(len_office), median=median(len_office),
              min=min(len_office), max=max(len_office), iqr=IQR(len_office)) %>%
    arrange(-n)
# round every numeric column; lapply() keeps each column as its own vector
t3_grouped[-1] <- lapply(t3_grouped[-1], round, 3)
t3_grouped_table <- gvisTable(t3_grouped,
                              options=list(width=800, height=200))
print(t3_grouped_table, "chart")
```

In fact, a Kruskal-Wallis rank sum test seems to indicate that the
groups are indeed different (p < 0.001).

```{r}
kruskal.test(len_office ~ group, data=t3)
```
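The object returned by *kruskal.test()* is a list of class `htest`, so its parts can be pulled out and reused instead of being copied by hand from the printed output. The sketch below uses a made-up data frame, `demo`, with three clearly separated groups; it is not the presidents data.

```r
# made-up lengths in office (years) for three hypothetical groups
demo <- data.frame(
  len_office = c(0.1, 0.2, 0.3, 0.4, 0.5,
                 1.1, 1.2, 1.3, 1.4, 1.5,
                 4.1, 4.2, 4.3, 4.4, 4.5),
  group = rep(c("a", "b", "c"), each = 5)
)

res <- kruskal.test(len_office ~ group, data = demo)

# individual components of the test result
res$statistic   # Kruskal-Wallis chi-squared
res$parameter   # degrees of freedom (number of groups minus one)
res$p.value
```

Storing the result this way would allow the p-value quoted in the text to be generated from the data on every run.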

There might be a moral in this data, but political conclusions run the risk
of degenerating into random rants, so I'll skip that.

## Reproducibility information

The source code for this document is available at [https://gist.github.com/jmcastagnetto/11127154](https://gist.github.com/jmcastagnetto/11127154)

```{r}
sessionInfo()
```