geoffwoollard/stat545a-2013-hw05_woollard-geo.rmd

## stat545a-2013-hw05_woollard-geo.rmd
stat545a-2013-hw05_woollard-geo
================================
---

```{r}
library(plyr)
library(mgcv)
library(lattice)
library(ggplot2)
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
iDat  <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania
removeContinent  <- "Oceania"
gDat  <- droplevels(subset(iDat, continent != removeContinent))
str(gDat)
```
Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries

## Lattice vs. ggplot2
---
The following `ddply` and `xyplot` wizardry comes from [Jenny's course notes](http://www.stat.ubc.ca/~jenny/STAT545A/hw04_univariateLattice.html)


```{r, tidy=FALSE, fig.width = 5, fig.height=5}
# modified from jenny's lattice solution to max and min per continent
lifeExpByContinent <- ddply(gDat, ~ continent + year, function(x) {
  jLevels <- c("min", "max")
  data.frame(lifeExp = range(x$lifeExp),
             stat = factor(jLevels, levels = jLevels))
})
head(lifeExpByContinent)

# in lattice
xyplot(lifeExp ~ year | continent, lifeExpByContinent,
       group = stat, type = "b", grid = "h", as.table = TRUE,
       auto.key = list(columns = 2),
       scales = list(y = list(log = TRUE, equispaced.log = FALSE)))

# same thing but with ggplot2
plt  <- ggplot(lifeExpByContinent, aes(y=lifeExp, x=year, color=stat))
plt + geom_line() + geom_point() +facet_wrap(~continent)
```

## gdpPercap vs. lifeExp
---
This density plot uses `geom_hex()`, which I find easy on the eyes. The correlation between lifeExp and gdpPercap -- or better yet log(gdpPercap) -- is clear with a scatter plot. We use a [LOESS curve](http://en.wikipedia.org/wiki/Local_regression) with `geom_smooth(method='loess')`, since this is the option that is chosen with `geom_smooth(method='auto')` with so many data points; all the data makes the spread are a bit hard to see.
```{r fig.width = 10, fig.height=5}
# density plot
plt  <- ggplot(gDat, aes(x = gdpPercap, y = lifeExp))
plt + scale_x_log10() + geom_hex()
# scatter plot, showing raw data and spread
plt  <- ggplot(gDat, aes(y=lifeExp,x=gdpPercap, color=continent))
plt + facet_wrap(~continent) + geom_point(shape=1) + geom_smooth(method='loess') + scale_x_log10()
```

## [Diabetes data](http://www.kaggle.com/c/pf2012-diabetes/data) from [kaggle.com](http://www.kaggle.com/)
Kaggle hosts data analysis competitions. Real world problems are posted, and experts from around the world provide solutins to prediction and optimization challenges. The Featured competitions as of `r today <- Sys.Date() ; format(today, format="%d %B %Y")` include:
* **$220 000** : Optimize flight routes based on current weather and traffic (GE)
* **$25 000** : Learning to rank hotels to maximize purchases (Expedia)
* **A job** : Keyword Extraction (Facebook)
```{r fig.width = 10, fig.height=5}
# import kaggle diabetes data
kDat  <- read.table("/Users/geoffrey/Downloads/kaggleDataSets/diabetes_trainingSet/training_SyncTranscript.csv", header=TRUE, sep=",", na.strings=c("NULL", 0,""))

# type of physician
kClean  <- kDat[(!is.na(kDat$PhysicianSpecialty)),] # remove the NAs
problemFactor  <- "Developmental – Behavioral Pediatrics"
kClean  <- kClean[kClean$PhysicianSpecialty != problemFactor,] # one factor level is giving a hard time, after googling around I suspect the encoding of the dash
plt  <- ggplot(kClean,
               aes(x=reorder(PhysicianSpecialty,PhysicianSpecialty, length))
               )
plt  + geom_histogram() + scale_y_log10() + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("PhysicianSpecialty") # vertical text
```
	stat545a-2013-hw05_woollard-geo
	================================
	---

	```{r}
	library(plyr)
	library(mgcv)
	library(lattice)
	library(ggplot2)
	gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
	iDat <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania
	removeContinent <- "Oceania"
	gDat <- droplevels(subset(iDat, continent != removeContinent))
	str(gDat)
	```
	Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries

	## Lattice vs. ggplot2
	---
	The following `ddply` and `xyplot` wizardry comes from [Jenny's course notes](http://www.stat.ubc.ca/~jenny/STAT545A/hw04_univariateLattice.html)


	```{r, tidy=FALSE, fig.width = 5, fig.height=5}
	# modified from jenny's lattice solution to max and min per continent
	lifeExpByContinent <- ddply(gDat, ~ continent + year, function(x) {
	jLevels <- c("min", "max")
	data.frame(lifeExp = range(x$lifeExp),
	stat = factor(jLevels, levels = jLevels))
	})
	head(lifeExpByContinent)

	# in lattice
	xyplot(lifeExp ~ year \| continent, lifeExpByContinent,
	group = stat, type = "b", grid = "h", as.table = TRUE,
	auto.key = list(columns = 2),
	scales = list(y = list(log = TRUE, equispaced.log = FALSE)))

	# same thing but with ggplot2
	plt <- ggplot(lifeExpByContinent, aes(y=lifeExp, x=year, color=stat))
	plt + geom_line() + geom_point() +facet_wrap(~continent)
	```

	## gdpPercap vs. lifeExp
	---
	This density plot uses `geom_hex()`, which I find easy on the eyes. The correlation between lifeExp and gdpPercap -- or better yet log(gdpPercap) -- is clear with a scatter plot. We use a [LOESS curve](http://en.wikipedia.org/wiki/Local_regression) with `geom_smooth(method='loess')`, since this is the option that is chosen with `geom_smooth(method='auto')` with so many data points; all the data makes the spread are a bit hard to see.
	```{r fig.width = 10, fig.height=5}
	# density plot
	plt <- ggplot(gDat, aes(x = gdpPercap, y = lifeExp))
	plt + scale_x_log10() + geom_hex()
	# scatter plot, showing raw data and spread
	plt <- ggplot(gDat, aes(y=lifeExp,x=gdpPercap, color=continent))
	plt + facet_wrap(~continent) + geom_point(shape=1) + geom_smooth(method='loess') + scale_x_log10()
	```

	## [Diabetes data](http://www.kaggle.com/c/pf2012-diabetes/data) from [kaggle.com](http://www.kaggle.com/)
	Kaggle hosts data analysis competitions. Real world problems are posted, and experts from around the world provide solutins to prediction and optimization challenges. The Featured competitions as of `r today <- Sys.Date() ; format(today, format="%d %B %Y")` include:
	* $220 000 : Optimize flight routes based on current weather and traffic (GE)
	* $25 000 : Learning to rank hotels to maximize purchases (Expedia)
	* A job : Keyword Extraction (Facebook)
	```{r fig.width = 10, fig.height=5}
	# import kaggle diabetes data
	kDat <- read.table("/Users/geoffrey/Downloads/kaggleDataSets/diabetes_trainingSet/training_SyncTranscript.csv", header=TRUE, sep=",", na.strings=c("NULL", 0,""))

	# type of physician
	kClean <- kDat[(!is.na(kDat$PhysicianSpecialty)),] # remove the NAs
	problemFactor <- "Developmental – Behavioral Pediatrics"
	kClean <- kClean[kClean$PhysicianSpecialty != problemFactor,] # one factor level is giving a hard time, after googling around I suspect the encoding of the dash
	plt <- ggplot(kClean,
	aes(x=reorder(PhysicianSpecialty,PhysicianSpecialty, length))
	)
	plt + geom_histogram() + scale_y_log10() + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("PhysicianSpecialty") # vertical text
	```