jeromyanglim/caschools-analysis.rmd

## caschools-analysis.rmd
`r opts_chunk$set(cache=TRUE)`

This is a quick set of analyses of the California Test Score dataset.  The post was produced using R Markdown in RStudio 0.96.  The main purpose of this post is to provide a case study of using R Markdown to prepare a quick reproducible report.  It provides examples of using plots, output, in-line R code, and markdown. The post is designed to be read along side the R Markdown source code, which is available as a gist on github.

<!-- more -->

### Preliminaries
* This post builds on my earlier post which provided a guide for [Getting Started with R Markdown, knitr, and RStudio 0.96](jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html)
* The dataset analysed comes from the `AER` package which is an accompaniment to the book [Applied Econometrics with R](http://www.amazon.com/Applied-Econometrics-R-Use/dp/0387773169) written by [Christian Kleiber](http://wwz.unibas.ch/personen/profil/person/kleiber/) and [Achim Zeileis](http://eeecon.uibk.ac.at/~zeileis/).

### Load packages and data
```{r load_packages, message=FALSE, results='hide'}
# if necessary uncomment and install packages.
# install.packages("AER")
# install.packages("psych")
# install.packages("Hmisc")
# install.packages("ggplot2")
# install.packages("relaimpo")
library(AER) # interesting datasets
library(psych) # describe and psych.panels
library(Hmisc) # describe
library(ggplot2) # plots: ggplot and qplot
library(relaimpo) # relative importance in regression
```

```{r load_data}
# load the California Schools Dataset and give the dataset a shorter name
data(CASchools)
cas <- CASchools

# Convert grade to numeric

# table(cas$grades)
cas$gradesN <- cas$grades == "KK-08"

# Get the set of numeric variables
v <- setdiff(names(cas), c("district", "school",
    			"county", "grades"))

```

### Q 1 What does the CASchools dataset involve?
Quoting the help (i.e., `?CASchools`), the data is "from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999" and the variables are:

    * district: character. District code.
    * school: character. School name.
    * county: factor indicating county.
    * grades: factor indicating grade span of district.
    * students: Total enrollment.
    * teachers: Number of teachers.
    * calworks: Percent qualifying for CalWorks (income assistance).
    * lunch: Percent qualifying for reduced-price lunch.
    * computer: Number of computers.
    * expenditure: Expenditure per student.
    * income: District average income (in USD 1,000).
    * english: Percent of English learners.
    * read: Average reading score.
    * math: Average math score.

Let's look at the basic structure of the data frame. i.e., the number of observations and the types of values:

```{r}
str(cas)
# Hmisc::describe(cas) # For more extensive summary statistics
```


### Q. 2  To what extent does expenditure per student vary?
```{r cas2, message=FALSE}
qplot(expenditure, data = cas) + xlim(0, 8000) +
		xlab("Money spent per student ($)") +
		ylab("Count of schools")

round(t(psych::describe(cas$expenditure)), 1)
```

The greatest expenditure per student is around double that of the least expenditure  per student.


### Q. 3a  What predicts expenditure per student?
```{r}
# Compute and format set of correlations
corExp <- cor(cas["expenditure"],
		cas[setdiff(v, "expenditure")])
corExp <- round(t(corExp),2)
corExp[order(corExp[,1], decreasing = TRUE), ,
		drop = FALSE]
```

More is spent per student in schools :

1. where people with greater incomes live
2. reading scores are higher
3. that are K-6


### Q. 4  what is the relationship between district level maths and reading scores?
```{r cas4, message=FALSE}
ggplot(cas, aes(read, math)) + geom_point() +
		geom_smooth()
```

At the district level, the correlation is very strong (r = The correlation is `r round(cor(cas$read, cas$math), 2)`). From prior experience I'd expect correlations at the individual-level in the .3 to .6 range.  Thus, these results are consistent with group-level relationships  being much larger than individual-level relationships.

### Q. 5 What is the relationship between maths and reading after partialling out other effects?


```{r}
# command has strange syntax requiring column numbers rather than variable names
partial.r(cas[v],
          c(which(names(cas[v]) == "read"), which(names(cas[v]) == "math")),
          which(!names(cas[v]) %in% c("read", "math"))
          )
```

The partial correlation is still very strong but is substantially reduced.


### Q. 6 What fraction of a computer does each student have?
```{r}
cas$compstud <- cas$computer / cas$students
describe(cas$compstud)
qplot(compstud, data = cas)
```

The mean number of computers per student is `r round(mean(cas$compstud), 3)`.


### Q. 7 What is a good model of the combined effect of other variables on academic performance (i.e., math and read)?
```{r cas7}
# Examine correlations between variables
psych::pairs.panels(cas[v])
```

`pairs.panels` shows correlations in the upper triangle, scatterplots in the lower triangle, and variable names and distributions on the main diagonal.
After examining the plot several ideas emerge.

```{r cas7.transformation, tidy=FALSE}
# (a) students is a count and could be log transformed
cas$studentsLog <- log(cas$students)

# (b) teachers is not the variable of interest:
#	it is the number of students per teacher
cas$studteach <- cas$students /cas$teachers
# (c) computers is not the variable of interest:
#  it is the ratio of computers to students
# table(cas$computer==0)
# Note some schools have no computers so ratio would be problematic.
# Take percentage of a computer instead
cas$compstud <- cas$computer / cas$students

# (d) math and reading are correlated highly, reduce to one variable
cas$performance <- as.numeric(
		scale(scale(cas$read) + scale(cas$math)))
```
Normally, I'd add all these transformations to an initial data transformation file that I call in the first block, but for the sake of the narrative, I'll leave them here.

Let's examine correlations between predictors and outcome.
```{r}
m1cor <- cor(cas$performance,
		cas[c("studentsLog", "studteach",	"calworks",
						"lunch", "compstud", "income",
						"expenditure", "gradesN")])
t(round(m1cor, 2))
```


Let's examine the multiple regression.
```{r}
m1 <- lm(performance ~ studentsLog + studteach +
				calworks + lunch + compstud
				+ income + expenditure + grades, data = cas)
summary(m1)
```
And some indicators of predictor relative importance.
```{r}
# calc.relimp from relaimpo package.
(m1relaimpo <- calc.relimp(m1,	type="lmg",	rela=TRUE))
```

Thus, we can conclude that:

1. Income and indicators of income (e.g., low levels of lunch vouchers) are the two main predictors. Thus, schools with greater average income tend to have better student performance.
2. Schools with more computers per student have better student performance.
3. Schools with fewer students per teacher have better student performance.

For more information about relative importance and the `relaimpo` package measures check out [Ulrike Grömping's website](http://prof.beuth-hochschule.de/groemping/relaimpo/).
Of course this is all observational data with the usual caveats regarding causal interpretation.

## Now, let's look at some weird stuff.
### Q. 8.1 What are common words in Californian School names?

```{r}
# create a vector of the words that occur in school names
lw <- unlist(strsplit(cas$school, split = " "))

# create a table of the frequency of school names
tlw <- table(lw)

# extract cells of table with count greater than 3
tlw2 <- tlw[tlw > 3]

# sorted in decreasing order
tlw2 <- sort(tlw2, decreasing = TRUE)

# values as proporitions
tlw2p <- round(tlw2 / nrow(cas), 3)

# show this in a bar graph
tlw2pdf <- data.frame(word = names(tlw2p),
		prop = as.numeric(tlw2p),
		stringsAsFactors = FALSE)
ggplot(tlw2pdf, aes(word, prop)) + geom_bar() + coord_flip()
```

```{r}
# make it log counts
ggplot(tlw2pdf, aes(word, log(prop*nrow(cas)))) +
		geom_bar() + coord_flip()
```

The word "Elementary" appears in almost all school names (`r round(100 * tlw2p["Elementary"], 1)`%).  The word "Union" appears in around half (`r round(100 * tlw2p["Union"], 1)`%).

Other common words pertain to:

* Directions (e.g., South, West),
* Features of the environment
    (e.g., Creek, Vista, View, Valley)
* Spanish words (e.g., rio for river; san for saint)


### Q. 8.2 Is the number of letters in the school's name related to academic performance?
```{r}
cas$namelen <- nchar(cas$school)
table(cas$namelen)
round(cor(cas$namelen, cas[,c("read", "math")]), 2)
```
The answer appears to be "no".


### Q.  8.3 Is the number of words in the school name related to academic performance?
```{r}
cas$nameWordCount <-
		sapply(strsplit(cas$school, " "), length)
table(cas$nameWordCount)
round(cor(cas$nameWordCount, cas[,c("read", "math")]), 2)
```
The answer appears to be "no".


### Q. 8.4 Are schools with nice popular nature words in their name doing better academically?
```{r}
tlw2p #recall the list of popular names
```

```{r}
# Create a quick and dirty list of popular nature names
naturenames <- c("Valley", "View", "Creek",
		"Lake", "Mountain",	"Park", "Rio",
		"Vista", "Grove", "Lakeside")

# work out whether the word is in the school name
schsplit <- strsplit(cas$school, " ")
cas$hasNature <- sapply(schsplit,
		function(X) length(intersect(X, naturenames)) > 0)
round(cor(cas$hasNature, cas[,c("read", "math")]), 2)
```
So we've found a small correlation.
Let's graph the data to see what it means:

```{r}
ggplot(cas, aes(hasNature, read)) +
        geom_boxplot() +
		geom_jitter(position=position_jitter(width=0.1)) +
		xlab("Has a nature name") +
		ylab("Mean student reading score")
```
So in the sample nature schools have slightly better reading score (and if we were to graph it, maths scores). However, the number of schools having nature names is actually somewhat small (n= `r sum(cas$hasNature)`) despite the overall quite large sample size.

But is it statistically significant?
```{r}
t.read <- t.test(cas[cas$hasNature, "read"], cas[!cas$hasNature, "read"])
t.math <- t.test(cas[cas$hasNature, "math"], cas[!cas$hasNature, "math"])
```
So, the p-value is less than .05 for reading (p = `r round(t.read$p.value, 3)`) but not quite for maths (p = `r round(t.math$p.value, 3)`).  Bingo!  After a little bit of data fishing we have found that reading scores are "significantly" greater for those schools with the listed nature names.

**But wait**: I've asked three separate exploratory questions or perhaps six if we take maths into account.

* $\frac{.05}{3} =$ `r 0.05 / 3`
* $\frac{.05}{6} =$ `r 0.05 / 6`

At these Bonferonni corrected p-values,  the result is non-significant. Oh well...


## Review
Anyway, the aim of this post was not to make profound statements about California schools. Rather the aim was to show how easy it is to produce quick reproducible reports with R Markdown. If you haven't already, you may want to open up the R Markdown file used to produce this post in RStudio, and compile the report yourself.

In particular, I can see R Markdown being my tool of choice for:

* Blog posts
* Posts to StackExchange sites
* Materials for training workshops
* Short consulting reports, and
* Exploratory analyses as part of a larger project.

The real question is how far I can push Markdown before I start to miss the control of LaTeX.  Markdown does permit arbitrary HTML. Anyway, if you have any thoughts about the scope of R Markdown, feel free to add a comment.
	`r opts_chunk$set(cache=TRUE)`

	This is a quick set of analyses of the California Test Score dataset. The post was produced using R Markdown in RStudio 0.96. The main purpose of this post is to provide a case study of using R Markdown to prepare a quick reproducible report. It provides examples of using plots, output, in-line R code, and markdown. The post is designed to be read along side the R Markdown source code, which is available as a gist on github.

	<!-- more -->

	### Preliminaries
	* This post builds on my earlier post which provided a guide for [Getting Started with R Markdown, knitr, and RStudio 0.96](jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html)
	* The dataset analysed comes from the `AER` package which is an accompaniment to the book [Applied Econometrics with R](http://www.amazon.com/Applied-Econometrics-R-Use/dp/0387773169) written by [Christian Kleiber](http://wwz.unibas.ch/personen/profil/person/kleiber/) and [Achim Zeileis](http://eeecon.uibk.ac.at/~zeileis/).

	### Load packages and data
	```{r load_packages, message=FALSE, results='hide'}
	# if necessary uncomment and install packages.
	# install.packages("AER")
	# install.packages("psych")
	# install.packages("Hmisc")
	# install.packages("ggplot2")
	# install.packages("relaimpo")
	library(AER) # interesting datasets
	library(psych) # describe and psych.panels
	library(Hmisc) # describe
	library(ggplot2) # plots: ggplot and qplot
	library(relaimpo) # relative importance in regression
	```

	```{r load_data}
	# load the California Schools Dataset and give the dataset a shorter name
	data(CASchools)
	cas <- CASchools

	# Convert grade to numeric

	# table(cas$grades)
	cas$gradesN <- cas$grades == "KK-08"

	# Get the set of numeric variables
	v <- setdiff(names(cas), c("district", "school",
	"county", "grades"))

	```

	### Q 1 What does the CASchools dataset involve?
	Quoting the help (i.e., `?CASchools`), the data is "from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999" and the variables are:

	* district: character. District code.
	* school: character. School name.
	* county: factor indicating county.
	* grades: factor indicating grade span of district.
	* students: Total enrollment.
	* teachers: Number of teachers.
	* calworks: Percent qualifying for CalWorks (income assistance).
	* lunch: Percent qualifying for reduced-price lunch.
	* computer: Number of computers.
	* expenditure: Expenditure per student.
	* income: District average income (in USD 1,000).
	* english: Percent of English learners.
	* read: Average reading score.
	* math: Average math score.

	Let's look at the basic structure of the data frame. i.e., the number of observations and the types of values:

	```{r}
	str(cas)
	# Hmisc::describe(cas) # For more extensive summary statistics
	```



	### Q. 2 To what extent does expenditure per student vary?
	```{r cas2, message=FALSE}
	qplot(expenditure, data = cas) + xlim(0, 8000) +
	xlab("Money spent per student ($)") +
	ylab("Count of schools")

	round(t(psych::describe(cas$expenditure)), 1)
	```

	The greatest expenditure per student is around double that of the least expenditure per student.


	### Q. 3a What predicts expenditure per student?
	```{r}
	# Compute and format set of correlations
	corExp <- cor(cas["expenditure"],
	cas[setdiff(v, "expenditure")])
	corExp <- round(t(corExp),2)
	corExp[order(corExp[,1], decreasing = TRUE), ,
	drop = FALSE]
	```

	More is spent per student in schools :

	1. where people with greater incomes live
	2. reading scores are higher
	3. that are K-6


	### Q. 4 what is the relationship between district level maths and reading scores?
	```{r cas4, message=FALSE}
	ggplot(cas, aes(read, math)) + geom_point() +
	geom_smooth()
	```

	At the district level, the correlation is very strong (r = The correlation is `r round(cor(cas$read, cas$math), 2)`). From prior experience I'd expect correlations at the individual-level in the .3 to .6 range. Thus, these results are consistent with group-level relationships being much larger than individual-level relationships.

	### Q. 5 What is the relationship between maths and reading after partialling out other effects?


	```{r}
	# command has strange syntax requiring column numbers rather than variable names
	partial.r(cas[v],
	c(which(names(cas[v]) == "read"), which(names(cas[v]) == "math")),
	which(!names(cas[v]) %in% c("read", "math"))
	)
	```

	The partial correlation is still very strong but is substantially reduced.


	### Q. 6 What fraction of a computer does each student have?
	```{r}
	cas$compstud <- cas$computer / cas$students
	describe(cas$compstud)
	qplot(compstud, data = cas)
	```

	The mean number of computers per student is `r round(mean(cas$compstud), 3)`.


	### Q. 7 What is a good model of the combined effect of other variables on academic performance (i.e., math and read)?
	```{r cas7}
	# Examine correlations between variables
	psych::pairs.panels(cas[v])
	```

	`pairs.panels` shows correlations in the upper triangle, scatterplots in the lower triangle, and variable names and distributions on the main diagonal.
	After examining the plot several ideas emerge.

	```{r cas7.transformation, tidy=FALSE}
	# (a) students is a count and could be log transformed
	cas$studentsLog <- log(cas$students)

	# (b) teachers is not the variable of interest:
	# it is the number of students per teacher
	cas$studteach <- cas$students /cas$teachers
	# (c) computers is not the variable of interest:
	# it is the ratio of computers to students
	# table(cas$computer==0)
	# Note some schools have no computers so ratio would be problematic.
	# Take percentage of a computer instead
	cas$compstud <- cas$computer / cas$students

	# (d) math and reading are correlated highly, reduce to one variable
	cas$performance <- as.numeric(
	scale(scale(cas$read) + scale(cas$math)))
	```
	Normally, I'd add all these transformations to an initial data transformation file that I call in the first block, but for the sake of the narrative, I'll leave them here.

	Let's examine correlations between predictors and outcome.
	```{r}
	m1cor <- cor(cas$performance,
	cas[c("studentsLog", "studteach", "calworks",
	"lunch", "compstud", "income",
	"expenditure", "gradesN")])
	t(round(m1cor, 2))
	```


	Let's examine the multiple regression.
	```{r}
	m1 <- lm(performance ~ studentsLog + studteach +
	calworks + lunch + compstud
	+ income + expenditure + grades, data = cas)
	summary(m1)
	```
	And some indicators of predictor relative importance.
	```{r}
	# calc.relimp from relaimpo package.
	(m1relaimpo <- calc.relimp(m1, type="lmg", rela=TRUE))
	```

	Thus, we can conclude that:

	1. Income and indicators of income (e.g., low levels of lunch vouchers) are the two main predictors. Thus, schools with greater average income tend to have better student performance.
	2. Schools with more computers per student have better student performance.
	3. Schools with fewer students per teacher have better student performance.

	For more information about relative importance and the `relaimpo` package measures check out [Ulrike Grömping's website](http://prof.beuth-hochschule.de/groemping/relaimpo/).
	Of course this is all observational data with the usual caveats regarding causal interpretation.

	## Now, let's look at some weird stuff.
	### Q. 8.1 What are common words in Californian School names?

	```{r}
	# create a vector of the words that occur in school names
	lw <- unlist(strsplit(cas$school, split = " "))

	# create a table of the frequency of school names
	tlw <- table(lw)

	# extract cells of table with count greater than 3
	tlw2 <- tlw[tlw > 3]

	# sorted in decreasing order
	tlw2 <- sort(tlw2, decreasing = TRUE)

	# values as proporitions
	tlw2p <- round(tlw2 / nrow(cas), 3)

	# show this in a bar graph
	tlw2pdf <- data.frame(word = names(tlw2p),
	prop = as.numeric(tlw2p),
	stringsAsFactors = FALSE)
	ggplot(tlw2pdf, aes(word, prop)) + geom_bar() + coord_flip()
	```

	```{r}
	# make it log counts
	ggplot(tlw2pdf, aes(word, log(prop*nrow(cas)))) +
	geom_bar() + coord_flip()
	```

	The word "Elementary" appears in almost all school names (`r round(100 * tlw2p["Elementary"], 1)`%). The word "Union" appears in around half (`r round(100 * tlw2p["Union"], 1)`%).

	Other common words pertain to:

	* Directions (e.g., South, West),
	* Features of the environment
	(e.g., Creek, Vista, View, Valley)
	* Spanish words (e.g., rio for river; san for saint)


	### Q. 8.2 Is the number of letters in the school's name related to academic performance?
	```{r}
	cas$namelen <- nchar(cas$school)
	table(cas$namelen)
	round(cor(cas$namelen, cas[,c("read", "math")]), 2)
	```
	The answer appears to be "no".


	### Q. 8.3 Is the number of words in the school name related to academic performance?
	```{r}
	cas$nameWordCount <-
	sapply(strsplit(cas$school, " "), length)
	table(cas$nameWordCount)
	round(cor(cas$nameWordCount, cas[,c("read", "math")]), 2)
	```
	The answer appears to be "no".


	### Q. 8.4 Are schools with nice popular nature words in their name doing better academically?
	```{r}
	tlw2p #recall the list of popular names
	```

	```{r}
	# Create a quick and dirty list of popular nature names
	naturenames <- c("Valley", "View", "Creek",
	"Lake", "Mountain", "Park", "Rio",
	"Vista", "Grove", "Lakeside")

	# work out whether the word is in the school name
	schsplit <- strsplit(cas$school, " ")
	cas$hasNature <- sapply(schsplit,
	function(X) length(intersect(X, naturenames)) > 0)
	round(cor(cas$hasNature, cas[,c("read", "math")]), 2)
	```
	So we've found a small correlation.
	Let's graph the data to see what it means:

	```{r}
	ggplot(cas, aes(hasNature, read)) +
	geom_boxplot() +
	geom_jitter(position=position_jitter(width=0.1)) +
	xlab("Has a nature name") +
	ylab("Mean student reading score")
	```
	So in the sample nature schools have slightly better reading score (and if we were to graph it, maths scores). However, the number of schools having nature names is actually somewhat small (n= `r sum(cas$hasNature)`) despite the overall quite large sample size.

	But is it statistically significant?
	```{r}
	t.read <- t.test(cas[cas$hasNature, "read"], cas[!cas$hasNature, "read"])
	t.math <- t.test(cas[cas$hasNature, "math"], cas[!cas$hasNature, "math"])
	```
	So, the p-value is less than .05 for reading (p = `r round(t.read$p.value, 3)`) but not quite for maths (p = `r round(t.math$p.value, 3)`). Bingo! After a little bit of data fishing we have found that reading scores are "significantly" greater for those schools with the listed nature names.

	But wait: I've asked three separate exploratory questions or perhaps six if we take maths into account.

	* $\frac{.05}{3} =$ `r 0.05 / 3`
	* $\frac{.05}{6} =$ `r 0.05 / 6`

	At these Bonferonni corrected p-values, the result is non-significant. Oh well...


	## Review
	Anyway, the aim of this post was not to make profound statements about California schools. Rather the aim was to show how easy it is to produce quick reproducible reports with R Markdown. If you haven't already, you may want to open up the R Markdown file used to produce this post in RStudio, and compile the report yourself.

	In particular, I can see R Markdown being my tool of choice for:

	* Blog posts
	* Posts to StackExchange sites
	* Materials for training workshops
	* Short consulting reports, and
	* Exploratory analyses as part of a larger project.

	The real question is how far I can push Markdown before I start to miss the control of LaTeX. Markdown does permit arbitrary HTML. Anyway, if you have any thoughts about the scope of R Markdown, feel free to add a comment.