Skip to content

Instantly share code, notes, and snippets.

@geoffwoollard
Last active December 24, 2015 20:59
Show Gist options
  • Save geoffwoollard/6862144 to your computer and use it in GitHub Desktop.
Save geoffwoollard/6862144 to your computer and use it in GitHub Desktop.
stat545a-2013-hw05_woollard-geo
================================
---
```{r}
library(plyr)
library(mgcv)
library(lattice)
library(ggplot2)
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
iDat <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania
removeContinent <- "Oceania"
gDat <- droplevels(subset(iDat, continent != removeContinent))
str(gDat)
```
Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries
## Lattice vs. ggplot2
---
The following `ddply` and `xyplot` wizardry comes from [Jenny's course notes](http://www.stat.ubc.ca/~jenny/STAT545A/hw04_univariateLattice.html)
```{r, tidy=FALSE, fig.width = 5, fig.height=5}
# modified from jenny's lattice solution to max and min per continent
lifeExpByContinent <- ddply(gDat, ~ continent + year, function(x) {
jLevels <- c("min", "max")
data.frame(lifeExp = range(x$lifeExp),
stat = factor(jLevels, levels = jLevels))
})
head(lifeExpByContinent)
# in lattice
xyplot(lifeExp ~ year | continent, lifeExpByContinent,
group = stat, type = "b", grid = "h", as.table = TRUE,
auto.key = list(columns = 2),
scales = list(y = list(log = TRUE, equispaced.log = FALSE)))
# same thing but with ggplot2
plt <- ggplot(lifeExpByContinent, aes(y=lifeExp, x=year, color=stat))
plt + geom_line() + geom_point() +facet_wrap(~continent)
```
## gdpPercap vs. lifeExp
---
This density plot uses `geom_hex()`, which I find easy on the eyes. The correlation between lifeExp and gdpPercap -- or better yet log(gdpPercap) -- is clear with a scatter plot. We use a [LOESS curve](http://en.wikipedia.org/wiki/Local_regression) with `geom_smooth(method='loess')`, since this is the option that is chosen with `geom_smooth(method='auto')` with so many data points; all the data makes the spread are a bit hard to see.
```{r fig.width = 10, fig.height=5}
# density plot
plt <- ggplot(gDat, aes(x = gdpPercap, y = lifeExp))
plt + scale_x_log10() + geom_hex()
# scatter plot, showing raw data and spread
plt <- ggplot(gDat, aes(y=lifeExp,x=gdpPercap, color=continent))
plt + facet_wrap(~continent) + geom_point(shape=1) + geom_smooth(method='loess') + scale_x_log10()
```
## [Diabetes data](http://www.kaggle.com/c/pf2012-diabetes/data) from [kaggle.com](http://www.kaggle.com/)
Kaggle hosts data analysis competitions. Real world problems are posted, and experts from around the world provide solutins to prediction and optimization challenges. The Featured competitions as of `r today <- Sys.Date() ; format(today, format="%d %B %Y")` include:
* **$220 000** : Optimize flight routes based on current weather and traffic (GE)
* **$25 000** : Learning to rank hotels to maximize purchases (Expedia)
* **A job** : Keyword Extraction (Facebook)
```{r fig.width = 10, fig.height=5}
# import kaggle diabetes data
kDat <- read.table("/Users/geoffrey/Downloads/kaggleDataSets/diabetes_trainingSet/training_SyncTranscript.csv", header=TRUE, sep=",", na.strings=c("NULL", 0,""))
# type of physician
kClean <- kDat[(!is.na(kDat$PhysicianSpecialty)),] # remove the NAs
problemFactor <- "Developmental – Behavioral Pediatrics"
kClean <- kClean[kClean$PhysicianSpecialty != problemFactor,] # one factor level is giving a hard time, after googling around I suspect the encoding of the dash
plt <- ggplot(kClean,
aes(x=reorder(PhysicianSpecialty,PhysicianSpecialty, length))
)
plt + geom_histogram() + scale_y_log10() + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("PhysicianSpecialty") # vertical text
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment