Skip to content

Instantly share code, notes, and snippets.

@vmeng321
Created September 23, 2013 07:28
Show Gist options
  • Save vmeng321/6667418 to your computer and use it in GitHub Desktop.
Save vmeng321/6667418 to your computer and use it in GitHub Desktop.
Homework03
========================================================
```{r echo=FALSE}
library(stringr)
library(xtable)
library(plyr)
```
This homeork assignment is an excercise in data aggregation with a focus on writing code that is clean and readable.
```{r}
# read from internet
# gDat <- read.delim("http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")
#read locally
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw03", "GapMinder"), "gapminderDataFiveYear.txt"))
```
I begin with the "harder" sections. I hope this saves the TA some time in marking.
Life Expectancy by Year
---------------------------
### Mean versus trimmed mean
We compare the mean life expectancy by year to the trimmed mean (20% from each end).
```{r results="asis"}
mean.lifeExp_by_year <- arrange(ddply(gDat, ~year, summarize, mean=mean(lifeExp), trimmed.mean = mean(lifeExp, trim=0.2)),year)
print(xtable(mean.lifeExp_by_year, digits=2), type="html", include.rownames=F)
```
Curiously, the mean value of earlier years are *higher* than the trimmed mean, while the reverse is true for the later years...
### (Challenge) Life expectancy over time on different continents: the "Wide" format
```{r }
### performing the task with 1 line of code
mean.lifeExp_by_year_by_cont.wide <- daply(gDat,~year, daply, ~continent, summarize, mean=mean(lifeExp), median=median(lifeExp)) #note multiple stats can be computed with one call!
```
This is an odd excercise, that involves essentially "nested" daply() calls. The "inner" call takes data frame for a given year, returns a "array/matrix" of 2 dimensions: 1st dimension the factor levels you split by, and 2nd dimensions the stats you make it calculate. The "outer" call takes the resulting "array/matrix" and inserts another dimension to this array representing the year.
```{r }
### data structure: array of 3 dimensions
class(mean.lifeExp_by_year_by_cont.wide); dim(mean.lifeExp_by_year_by_cont.wide)
```
To print a 2 dimensional table, one would use the multidimensional array indexing, [,...,]. As an example myMultDimArray[,,"statName"] retrieves the matrix with information on a statistic called "statName".
```{r results='asis'}
### print 2d tables of "mean"
print(xtable(mean.lifeExp_by_year_by_cont.wide[,,"mean"], digits=2),type="html", include.rownames=T)
### print 2d table of "median", with table caption
print(xtable(mean.lifeExp_by_year_by_cont.wide[,,"median"], digits=2, caption="Median life expectancy of 5 continents by year"), type="html", include.rownames=T, caption.placement="top")
```
### (Basically the same Challenge) Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. "Wide" format. Extension of another task of unknown difficulty / value.
This task becomes very easy with the use of nested daply()!
```{r }
(thresh <- quantile(gDat$lifeExp, 0.1))
(low_lifeExp <- daply(gDat, ~year, daply, ~continent, summarize, countLowLifeExp= sum(lifeExp< thresh), propLowLifeExp= mean(lifeExp < thresh))) #note that all of these stats can be computed in one call!
```
Note that printing the object results in this "slicing" of multidimensional arrays by the inner-most dimension- in our case the dimension representing statistics we created.
This task actually embeds my knowledge of how to create counts and proportions using comparisons/masks (e.g. `propLowLifeExp= mean(lifeExp < thresh)`). It is probably unnecessary to print so many tables for 1 assignment. saves times for the marker.
```{r results='asis'}
### print 2d tables of "mean"
print(xtable(low_lifeExp[,,"countLowLifeExp"], digits=0, caption="The counts of countries with low life expectancy in each continent by year"),type="html", include.rownames=T, caption.placement="top")
### print 2d table of "median", with table caption
print(xtable(low_lifeExp[,,"propLowLifeExp"], digits=2, caption="The proportion of countries with low life expectancy in each continent by year"), type="html", include.rownames=T, caption.placement="top")
```
GDP per capita by Continent
--------------------------
### A wide table
Create a table with columns "Continent", "Minimum GDP per Capita" and "Maximum GDP per Capita". The continents are arranged by ascending minimum GDP.
```{r results='asis'}
minmax.gdp_by_continent.wide <- arrange(ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap), maxGDP = max(gdpPercap)), minGDP)
print(xtable(minmax.gdp_by_continent.wide, digits=0), type="html", include.rownames=F)
```
### A Tall table
An alternative look at minimum and maximum GDP per capita by continent. Here the "type" of statistic occupies its one column followed by the results of applying that summary statistic.
```{r results='asis'}
# this is not very elegant
fct <- function(dframe){
res <- data.frame(factor_name="min", GDP=min(dframe$gdpPercap))
res<- rbind(res, data.frame(factor_name="max", GDP=max(dframe$gdpPercap)))
}
minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name, GDP)
print(xtable(minmax.gdp_by_continent.tall, digits=0), type="html", include.rownames=F)
```
Spread of GDP per capita
---------------------------
Here we investigate the spread of GDP per capita in each continent by 3 different measures, the standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR).
```{r results='asis'}
spread.gdp_by_continent <- arrange(ddply(gDat, ~continent, summarize, sd=sd(gdpPercap), mad=mad(gdpPercap), IQR=IQR(gdpPercap)), IQR)
print(xtable(spread.gdp_by_continent, digits=0), type="html", include.rownames=F)
```
The above table is sorted by ascending IQR values. Notice that ordering by standard deviations is very different from this ordering by IQR, while the order by mad follows the order of IQR more closely. This suggests that sd may not always be the most meaningful spread to report and interpret for strangely shaped distributions. One often interprets sd as width containing 60% of the data assuming normal distribution, but for highly skewed data this interpretation is far from correct. An IQR, based on actual proporions from the data may be a more useful measure of spread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment