geoffwoollard/stat545a-2013-hw04_woollard-geo.rmd

## stat545a-2013-hw04_woollard-geo.rmd
stat545a-2013-hw04_woollard-geo
================================
---

```{r, echo=FALSE}
library(plyr)
library(xtable)
library(lattice)
#install.packages("reshape", dependencies=TRUE)
library(reshape)
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
iDat  <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania
removeContinent  <- "Oceania"
gDat  <- droplevels(subset(iDat, continent != removeContinent))
```
Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries

# Crique of another student's code
---
This code comes from homework 3: gao-wen [source](https://gist.github.com/sibyl229/6668047#file-stat545a-2013-hw03_gao-wen-rmd) | [report](http://rpubs.com/less/stat545a-2013-hw03_gao-wen)

#### print_table
She first made a generalized function called `print_table`, this makes the code compact, although one improvement would be to include an option to round certain columns as desired.
```{r}
print_table = function(df) {
  print(xtable(df), type = "html", include.rownames = FALSE)
}
```

### Change in life expectancy
---
She sometimes makes the code uber compact (2 lines!).
* Pro: avoid place holder names like `foo` for objects (like data.frames) that are not going to be used later
* Con: can make the line really long and hard to read, at least on rpubs which doesn't save the whitespace formatting of the source

In fact the original source is acttually much more readable than the rpubs version
#### original source
```{r, eval=FALSE, tidy=FALSE}
trimFrac <- 0.15
print_table(
  ddply(gDat, ~year, summarize,
        trimmedMean = mean(lifeExp, trim=trimFrac)))
```
#### rpubs
```{r, eval=FALSE, tidy=FALSE}
print_table(ddply(gDat, ~year, summarize, trimmedMean = mean(lifeExp, trim = trimFrac)))
```
#### Plot the time course of median world life expenctancy
Ploting this data shows that it is increasing, but at a slower and slower rate
```{r fig.width = 3, fig.height=3}
trimFrac <- 0.15
df  <- ddply(gDat, ~year, summarize, meanLifeExp = mean(lifeExp, trim = trimFrac))
xyplot(meanLifeExp ~ year, df, type=c("a","p"))
```


### Rich and Poor Continents
---
These questions use just a subset of the data from year 2007. Why the year 2007? This is the final year for each country. Perhaps it would be better to have a year variable set to 2007 instead of hardcoding it in, since it it referenced in multiple places. If there was just one variable controling it then it could be changed easily by just modifying it in one spot.

```{r, tidy=FALSE}
gdp_stat <- function(gDatCntnt) {
  minGDP <- min(gDatCntnt$gdpPercap)
  maxGDP <- max(gDatCntnt$gdpPercap)
  return(data.frame(stat.type = c("min", "max"), GDP.per.cap = c(minGDP, maxGDP)))
}
gdpByContinentTall <- ddply(subset(gDat, year == 2007), ~continent, gdp_stat)
gdpByContinentTall <- arrange(gdpByContinentTall, GDP.per.cap)
```

> One quick fix is to get the year from the data itself and then replace all references to `2007` with `finalYear`
```{r}
(finalYear = max(gDat$year))
```

#### Plot the max and min life expectancy in each continent
> [there's more than one way to skin a cat](http://en.wiktionary.org/wiki/there's_more_than_one_way_to_skin_a_cat)

The above results tell me that something of the ordering between continents (omitting Oceania):
* MIN: Africa < Asia < Americas < Europe
* MAX: Africa < Americas < Asia < Europe
<br> But this is just for the year 2007... what's going on in other years. Since there is only a few continents, we can use panel plots.
We tall table has `max` or `min` in one column and the value in another. This makes it very easy to plot on the same graph since we can set `groups = stat.type`.

```{r fig.width = 10, tidy=FALSE}
gdpByContYear <- ddply(gDat, ~continent + year, gdp_stat)
xyplot(GDP.per.cap ~ year | continent,
      gdpByContYear,
       groups = stat.type, auto.key=TRUE, type=c("a","p")
       )
```
> There is not as much information as a box plot or violin plot, but the presentation is perhaps cleaner. But what are we really interested in? We want to compare the different continents, right? I don't really like looking at different panels back and forth, so we can swap the conditioning argument (`|`) and `groups`: ` | A, groups = B` --> ` | B, groups = A`.
```{r fig.width = 10, tidy=FALSE}
xyplot(GDP.per.cap ~ year | stat.type, data = gdpByContYear,
       groups = continent, auto.key=TRUE, type=c("a","p"), scale="free"
       )
```
> note the `scale="free"` argument , which automatically sets the scales on the panels.


### Interesting deviation from predicted life expectancy
---
We can find "oputlying" countries by finding countries that have a large residual. The approach is:
* build function that gets the max residuals
** note that it finds the absolute value of the residual, which can be positive or negative depending on weather it is respectively above or below the fit.
* feed it into `ddply`
* order the output by the residuals.
The function `max_residual` is a function that we feed into `ddply` without the `summarize` function. It is customized to give back what we want, in this case the maximum residual and the year.
```{r, comment=NA, results='asis', tidy=FALSE}
max_residual <- function(cDat){
  clm <- lm(lifeExp ~ I(year-min(cDat$year)), data=cDat)
  largestResidual <- abs(clm$residuals)
  maxRsdl <- max(largestResidual)
  return(c(Max.Residual=maxRsdl,
           Year=cDat[which.max(largestResidual),'year']))
}
interesting <- ddply(gDat, ~country, max_residual)
interesting <- arrange(interesting, abs(Max.Residual), decreasing=T)
print_table(head(interesting))
```
> At the end we checked the output with head. Here we see an issue with the `print_table` function -- it has `.00` after each year, which are always integers. What we need instead is to use the `digits` command in `xtable`. This could be packaged up into `print_table` easily since it works well with data.frames (there aren't issues with rounding factors).

#### My partial solution
```{r, comment=NA, results='asis'}
numCountries  <- 11# number of interesting countries
rounded  <- xtable(head(interesting,numCountries), digits=c(0,0,2,0)) # just use 0 for row number and country
print(rounded, type = "html", include.rownames = FALSE)
```

#### Plot the top `r numCountries` interesting countries
We used the residuals to find countries and their associalted years. We can see the big residuals if we plot the lifeExp vs year with a line -- just look for the data points that are far from the line.
```{r, tidy=FALSE}
interesting <- arrange(interesting, abs(Max.Residual), decreasing = T)
xyplot(lifeExp ~ year | country, gDat,
       subset = country %in% head(interesting,numCountries)$country,
       type = c("p", "r"))
```
As we can wee with Iraq, the largest residual is in `r subset(interesting, subset = country == "Iraq")$Year`, but this isn't some dramatic year, but the year before a change in trend. There are so many wars in Iraq, that it is difficult to relate historical events to the data:
* 1980-1988 [Iran-Iraq War](http://en.wikipedia.org/wiki/Iran%E2%80%93Iraq_War)
* 1990-1991 [Gulf War](http://en.wikipedia.org/wiki/Persian_Gulf_War)
* 2003 - present [Iraq War](http://en.wikipedia.org/wiki/Iraq_War)


# Choose your own adventure
---
This answers the question **Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures**.
Why does the life expencancy decrease so dramatically? Two stricking examples were Cambodia and Rwanda, which had clear conflict times in their history (see [Khmer Rouge rule of Cambodia](http://en.wikipedia.org/wiki/Khmer_Rouge_rule_of_Cambodia) in the 1970s and the [Rwandan Genocide](http://en.wikipedia.org/wiki/Rwandan_Genocide) in the 1990s). But the life expectancy doesn't just reflect a over loss of life, it reflects changes in the average age of the population. Roiughly speaking the life expectancy changes when on the average more young people are dying than old people, or vice-versa. So I decided to look into the relation between population and life expencancy. Obviously I need to somehow normalize, since population is on the oder of 10^6, while life expencancy tends to fluctuate in the range 20-100.

```{r, tidy=FALSE}
wide  <- ddply(subset(gDat, subset = country %in% head(interesting,numCountries)$country),
               ~country, summarize,
               normPop=scale(pop, center=FALSE),
               normLifeExp=scale(lifeExp, center=FALSE),
               year=year
               )
tall  <- melt(wide, id=c('country','year'))
xyplot(data=tall,
       value ~ year | country,
       group = variable, auto.key=TRUE
       )
```

I think the take home message from here is that life exectancy is a much more sensative variable than overall population in visualizing major shifts in the population. There are only a few cases where the overall population is decreases over a 5 year interval (Lesotho, Rwanda, Cambodia), and at these time points the life expectancy has a more defined negative trend.
	stat545a-2013-hw04_woollard-geo
	================================
	---

	```{r, echo=FALSE}
	library(plyr)
	library(xtable)
	library(lattice)
	#install.packages("reshape", dependencies=TRUE)
	library(reshape)
	gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
	iDat <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania
	removeContinent <- "Oceania"
	gDat <- droplevels(subset(iDat, continent != removeContinent))
	```
	Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries

	# Crique of another student's code
	---
	This code comes from homework 3: gao-wen [source](https://gist.github.com/sibyl229/6668047#file-stat545a-2013-hw03_gao-wen-rmd) \| [report](http://rpubs.com/less/stat545a-2013-hw03_gao-wen)

	#### print_table
	She first made a generalized function called `print_table`, this makes the code compact, although one improvement would be to include an option to round certain columns as desired.
	```{r}
	print_table = function(df) {
	print(xtable(df), type = "html", include.rownames = FALSE)
	}
	```

	### Change in life expectancy
	---
	She sometimes makes the code uber compact (2 lines!).
	* Pro: avoid place holder names like `foo` for objects (like data.frames) that are not going to be used later
	* Con: can make the line really long and hard to read, at least on rpubs which doesn't save the whitespace formatting of the source

	In fact the original source is acttually much more readable than the rpubs version
	#### original source
	```{r, eval=FALSE, tidy=FALSE}
	trimFrac <- 0.15
	print_table(
	ddply(gDat, ~year, summarize,
	trimmedMean = mean(lifeExp, trim=trimFrac)))
	```
	#### rpubs
	```{r, eval=FALSE, tidy=FALSE}
	print_table(ddply(gDat, ~year, summarize, trimmedMean = mean(lifeExp, trim = trimFrac)))
	```
	#### Plot the time course of median world life expenctancy
	Ploting this data shows that it is increasing, but at a slower and slower rate
	```{r fig.width = 3, fig.height=3}
	trimFrac <- 0.15
	df <- ddply(gDat, ~year, summarize, meanLifeExp = mean(lifeExp, trim = trimFrac))
	xyplot(meanLifeExp ~ year, df, type=c("a","p"))
	```


	### Rich and Poor Continents
	---
	These questions use just a subset of the data from year 2007. Why the year 2007? This is the final year for each country. Perhaps it would be better to have a year variable set to 2007 instead of hardcoding it in, since it it referenced in multiple places. If there was just one variable controling it then it could be changed easily by just modifying it in one spot.

	```{r, tidy=FALSE}
	gdp_stat <- function(gDatCntnt) {
	minGDP <- min(gDatCntnt$gdpPercap)
	maxGDP <- max(gDatCntnt$gdpPercap)
	return(data.frame(stat.type = c("min", "max"), GDP.per.cap = c(minGDP, maxGDP)))
	}
	gdpByContinentTall <- ddply(subset(gDat, year == 2007), ~continent, gdp_stat)
	gdpByContinentTall <- arrange(gdpByContinentTall, GDP.per.cap)
	```

	> One quick fix is to get the year from the data itself and then replace all references to `2007` with `finalYear`
	```{r}
	(finalYear = max(gDat$year))
	```

	#### Plot the max and min life expectancy in each continent
	> [there's more than one way to skin a cat](http://en.wiktionary.org/wiki/there's_more_than_one_way_to_skin_a_cat)

	The above results tell me that something of the ordering between continents (omitting Oceania):
	* MIN: Africa < Asia < Americas < Europe
	* MAX: Africa < Americas < Asia < Europe
	<br> But this is just for the year 2007... what's going on in other years. Since there is only a few continents, we can use panel plots.
	We tall table has `max` or `min` in one column and the value in another. This makes it very easy to plot on the same graph since we can set `groups = stat.type`.

	```{r fig.width = 10, tidy=FALSE}
	gdpByContYear <- ddply(gDat, ~continent + year, gdp_stat)
	xyplot(GDP.per.cap ~ year \| continent,
	gdpByContYear,
	groups = stat.type, auto.key=TRUE, type=c("a","p")
	)
	```
	> There is not as much information as a box plot or violin plot, but the presentation is perhaps cleaner. But what are we really interested in? We want to compare the different continents, right? I don't really like looking at different panels back and forth, so we can swap the conditioning argument (`\|`) and `groups`: ` \| A, groups = B` --> ` \| B, groups = A`.
	```{r fig.width = 10, tidy=FALSE}
	xyplot(GDP.per.cap ~ year \| stat.type, data = gdpByContYear,
	groups = continent, auto.key=TRUE, type=c("a","p"), scale="free"
	)
	```
	> note the `scale="free"` argument , which automatically sets the scales on the panels.


	### Interesting deviation from predicted life expectancy
	---
	We can find "oputlying" countries by finding countries that have a large residual. The approach is:
	* build function that gets the max residuals
	** note that it finds the absolute value of the residual, which can be positive or negative depending on weather it is respectively above or below the fit.
	* feed it into `ddply`
	* order the output by the residuals.
	The function `max_residual` is a function that we feed into `ddply` without the `summarize` function. It is customized to give back what we want, in this case the maximum residual and the year.
	```{r, comment=NA, results='asis', tidy=FALSE}
	max_residual <- function(cDat){
	clm <- lm(lifeExp ~ I(year-min(cDat$year)), data=cDat)
	largestResidual <- abs(clm$residuals)
	maxRsdl <- max(largestResidual)
	return(c(Max.Residual=maxRsdl,
	Year=cDat[which.max(largestResidual),'year']))
	}
	interesting <- ddply(gDat, ~country, max_residual)
	interesting <- arrange(interesting, abs(Max.Residual), decreasing=T)
	print_table(head(interesting))
	```
	> At the end we checked the output with head. Here we see an issue with the `print_table` function -- it has `.00` after each year, which are always integers. What we need instead is to use the `digits` command in `xtable`. This could be packaged up into `print_table` easily since it works well with data.frames (there aren't issues with rounding factors).

	#### My partial solution
	```{r, comment=NA, results='asis'}
	numCountries <- 11# number of interesting countries
	rounded <- xtable(head(interesting,numCountries), digits=c(0,0,2,0)) # just use 0 for row number and country
	print(rounded, type = "html", include.rownames = FALSE)
	```

	#### Plot the top `r numCountries` interesting countries
	We used the residuals to find countries and their associalted years. We can see the big residuals if we plot the lifeExp vs year with a line -- just look for the data points that are far from the line.
	```{r, tidy=FALSE}
	interesting <- arrange(interesting, abs(Max.Residual), decreasing = T)
	xyplot(lifeExp ~ year \| country, gDat,
	subset = country %in% head(interesting,numCountries)$country,
	type = c("p", "r"))
	```
	As we can wee with Iraq, the largest residual is in `r subset(interesting, subset = country == "Iraq")$Year`, but this isn't some dramatic year, but the year before a change in trend. There are so many wars in Iraq, that it is difficult to relate historical events to the data:
	* 1980-1988 [Iran-Iraq War](http://en.wikipedia.org/wiki/Iran%E2%80%93Iraq_War)
	* 1990-1991 [Gulf War](http://en.wikipedia.org/wiki/Persian_Gulf_War)
	* 2003 - present [Iraq War](http://en.wikipedia.org/wiki/Iraq_War)


	# Choose your own adventure
	---
	This answers the question Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures.
	Why does the life expencancy decrease so dramatically? Two stricking examples were Cambodia and Rwanda, which had clear conflict times in their history (see [Khmer Rouge rule of Cambodia](http://en.wikipedia.org/wiki/Khmer_Rouge_rule_of_Cambodia) in the 1970s and the [Rwandan Genocide](http://en.wikipedia.org/wiki/Rwandan_Genocide) in the 1990s). But the life expectancy doesn't just reflect a over loss of life, it reflects changes in the average age of the population. Roiughly speaking the life expectancy changes when on the average more young people are dying than old people, or vice-versa. So I decided to look into the relation between population and life expencancy. Obviously I need to somehow normalize, since population is on the oder of 10^6, while life expencancy tends to fluctuate in the range 20-100.

	```{r, tidy=FALSE}
	wide <- ddply(subset(gDat, subset = country %in% head(interesting,numCountries)$country),
	~country, summarize,
	normPop=scale(pop, center=FALSE),
	normLifeExp=scale(lifeExp, center=FALSE),
	year=year
	)
	tall <- melt(wide, id=c('country','year'))
	xyplot(data=tall,
	value ~ year \| country,
	group = variable, auto.key=TRUE
	)
	```

	I think the take home message from here is that life exectancy is a much more sensative variable than overall population in visualizing major shifts in the population. There are only a few cases where the overall population is decreases over a 5 year interval (Lesotho, Rwanda, Cambodia), and at these time points the life expectancy has a more defined negative trend.