jrminter/linModelMASS.Rmd

## linModelMASS.Rmd
Linear Regression with Factor Variables
========================================================

One of the struggles I have in the Coursera Data Analysis course is
understanding the strengths and weaknesses of the algorithms we have
examined because of the messy data sets chosen ( I appreciate that
real data sets __are__ messy). As a chemist by training, now
specializing in microscopy and image processing, I have repeatedly
learned the benefit of trying out an unfamiliar algorithm on a
well-behaved data set before "going where no one has gone before..."

Let's consider a well-behaved data set of measurements of gas use as
a function of temperature taken before and after insulation and frame
some questions...

The vignette in MASS notes
> Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption.

> __Temp:__ Purportedly the average outside temperature in degrees Celsius. (These values is far too low for any 56-week period in the 1960s in South-East England. It might be the weekly average of daily minima.)

> __Gas:__ The weekly gas consumption in 1000s of cubic feet.

> __Insul:__ A factor, before or after insulation.

```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
# make this an external chunk that can be included in any file
# adapted from Jeff Leek's Coursera Slides
rm(list=ls())
options(width = 65)
opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE,
               comment = NA, fig.align = 'center', dpi = 100,
               tidy = FALSE, cache.path = '.cache/',
               fig.path = 'fig/')

options(xtable.type = 'html')
knit_hooks$set(inline = function(x) {
  if(is.numeric(x)) {
    round(x, getOption('digits'))
  } else {
    paste(as.character(x), collapse = ', ')
  }
})
knit_hooks$set(plot = knitr:::hook_plot_html)
```


We take a first peek at the loaded data:

```{r setupAna, echo=FALSE, cache=TRUE}
require(MASS)
str.wd <- '~/work/proj/DataAnalysis/Rmd/linModelMASS/'
setwd(str.wd)
data(whiteside)
print(head(whiteside))
```

First, let's plot the data where we see clear separation between
the two factors. This gives us a high probability of getting
statistically significant effects...

```{r, echo=FALSE, cache=TRUE, fig.width=6, fig.height=4}
#          B    L    T    R
par(mar=c(5.1, 4.1, 4.1, 2.1))
plot(whiteside$Temp, whiteside$Gas, type='n',
     xlab='Temperature', ylab='Gas',
     main='Gas use before and after insulation')
points(whiteside$Temp, whiteside$Gas,pch=19,
       col=(whiteside$Insul=="Before")*1+1)
legend("topright",
       legend=c('Before', 'After'),
       col=c('red', 'black'),
       pch=c(19,19))
par(mar=c(5.1, 4.1, 4.1, 2.1))

```

We can first subset the data, do separate linear models,
and look at the 95% confidence intervals of the parameters

```{r indFits, echo=FALSE, cache=TRUE}
before <- whiteside[whiteside$Insul=="Before", c(2,3)]
after  <- whiteside[whiteside$Insul=="After", c(2,3)]
fit.lm.b <- lm(Gas ~ Temp, data=before)
fit.lm.a <- lm(Gas ~ Temp, data=after)
ci.b <- confint(fit.lm.b, level=0.95)
ci.a <- confint(fit.lm.a, level=0.95)
rownames(ci.b) <- c('intercept.before','slope.before')
rownames(ci.a) <- c('intercept.after','slope.after')
ci <- rbind(ci.b, ci.a)
ci <- ci[order(row.names(ci)),]
ci
```

We note that __both__ the slope and intercept are different at
the 95% confidence level.

Next, we construct a mixed model and look at the confidence
intervals
```{r mixedFit, echo=TRUE, cache=TRUE}
fit.lm.mixed <- lm(Gas ~ Temp + as.factor(Insul), data=whiteside)
ci.lm.mixed <- confint(fit.lm.mixed, level=0.95)
ci.lm.mixed
```

I _believe_ one would interpret the __as.factor(Insul)After__ term as
the change in gas use after insulation for a given temperature. This
_seems_ to ignore the significant difference in slope we saw in the
individual fits.

We can also compute an ANOVA:
```{r mixedFitAnova, echo=TRUE, cache=TRUE}
an.fit.lm.mixed <- anova(fit.lm.mixed)
an.fit.lm.mixed
```

It seems that both the separate and combined models are useful.
I still wonder how one would relate these effects to more fundamental
physical models, such as Fourier's law of heat conduction... It has
been over 30 years since I took Thermodaynamics...

In response to a suggestion on the Coursera forum, I thought
this was interesting

```{r compareAnova, echo=TRUE, cache=TRUE}
fit.lm.1 <- lm(Gas ~ Temp, data=whiteside)
fit.lm.2 <- lm(Gas ~ Temp + Insul, data=whiteside)

an <- anova(fit.lm.1, fit.lm.2)
an
```

As expected, it shows that includuding the ```Insul`` factor
had a large effect on reducing variance. That **was**, after all,
the point of the study,

So let's try an interaction term to see if the interaction
between Insulation and temperature is significant (the slope
in our individual fits might suggest that it is...)

```{r lmInteract, echo=TRUE, cache=TRUE}
fit.lm.3 <- lm(Gas ~ Temp + Insul + Temp*Insul, data=whiteside)
sum.lm.3 <- summary(fit.lm.3)
sum.lm.3

an.lm.3 <- anova(fit.lm.3)
an.lm.3

t.med <- median(whiteside$Temp)
t.med
```

Note on a single LM, the anova doesn't tell more than the F.
Need to understant phys sig

Cool - it shows the interaction term is significant... So how would
I report this?

Something like: For each degree C warmer in temperature, on average
gas consumption decreased by 0.39 kcf. Adding insulation improved
matters, further reducing gas consumption by an average of 2.13 kcF
multiplied by 0.115 for every degree C difference in temperatue
from 4.9 C.
	Linear Regression with Factor Variables
	========================================================

	One of the struggles I have in the Coursera Data Analysis course is
	understanding the strengths and weaknesses of the algorithms we have
	examined because of the messy data sets chosen ( I appreciate that
	real data sets __are__ messy). As a chemist by training, now
	specializing in microscopy and image processing, I have repeatedly
	learned the benefit of trying out an unfamiliar algorithm on a
	well-behaved data set before "going where no one has gone before..."

	Let's consider a well-behaved data set of measurements of gas use as
	a function of temperature taken before and after insulation and frame
	some questions...

	The vignette in MASS notes
	> Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption.

	> __Temp:__ Purportedly the average outside temperature in degrees Celsius. (These values is far too low for any 56-week period in the 1960s in South-East England. It might be the weekly average of daily minima.)

	> __Gas:__ The weekly gas consumption in 1000s of cubic feet.

	> __Insul:__ A factor, before or after insulation.

	```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
	# make this an external chunk that can be included in any file
	# adapted from Jeff Leek's Coursera Slides
	rm(list=ls())
	options(width = 65)
	opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE,
	comment = NA, fig.align = 'center', dpi = 100,
	tidy = FALSE, cache.path = '.cache/',
	fig.path = 'fig/')

	options(xtable.type = 'html')
	knit_hooks$set(inline = function(x) {
	if(is.numeric(x)) {
	round(x, getOption('digits'))
	} else {
	paste(as.character(x), collapse = ', ')
	}
	})
	knit_hooks$set(plot = knitr:::hook_plot_html)
	```


	We take a first peek at the loaded data:

	```{r setupAna, echo=FALSE, cache=TRUE}
	require(MASS)
	str.wd <- '~/work/proj/DataAnalysis/Rmd/linModelMASS/'
	setwd(str.wd)
	data(whiteside)
	print(head(whiteside))
	```

	First, let's plot the data where we see clear separation between
	the two factors. This gives us a high probability of getting
	statistically significant effects...

	```{r, echo=FALSE, cache=TRUE, fig.width=6, fig.height=4}
	# B L T R
	par(mar=c(5.1, 4.1, 4.1, 2.1))
	plot(whiteside$Temp, whiteside$Gas, type='n',
	xlab='Temperature', ylab='Gas',
	main='Gas use before and after insulation')
	points(whiteside$Temp, whiteside$Gas,pch=19,
	col=(whiteside$Insul=="Before")*1+1)
	legend("topright",
	legend=c('Before', 'After'),
	col=c('red', 'black'),
	pch=c(19,19))
	par(mar=c(5.1, 4.1, 4.1, 2.1))

	```

	We can first subset the data, do separate linear models,
	and look at the 95% confidence intervals of the parameters

	```{r indFits, echo=FALSE, cache=TRUE}
	before <- whiteside[whiteside$Insul=="Before", c(2,3)]
	after <- whiteside[whiteside$Insul=="After", c(2,3)]
	fit.lm.b <- lm(Gas ~ Temp, data=before)
	fit.lm.a <- lm(Gas ~ Temp, data=after)
	ci.b <- confint(fit.lm.b, level=0.95)
	ci.a <- confint(fit.lm.a, level=0.95)
	rownames(ci.b) <- c('intercept.before','slope.before')
	rownames(ci.a) <- c('intercept.after','slope.after')
	ci <- rbind(ci.b, ci.a)
	ci <- ci[order(row.names(ci)),]
	ci
	```

	We note that __both__ the slope and intercept are different at
	the 95% confidence level.

	Next, we construct a mixed model and look at the confidence
	intervals
	```{r mixedFit, echo=TRUE, cache=TRUE}
	fit.lm.mixed <- lm(Gas ~ Temp + as.factor(Insul), data=whiteside)
	ci.lm.mixed <- confint(fit.lm.mixed, level=0.95)
	ci.lm.mixed
	```

	I _believe_ one would interpret the __as.factor(Insul)After__ term as
	the change in gas use after insulation for a given temperature. This
	_seems_ to ignore the significant difference in slope we saw in the
	individual fits.

	We can also compute an ANOVA:
	```{r mixedFitAnova, echo=TRUE, cache=TRUE}
	an.fit.lm.mixed <- anova(fit.lm.mixed)
	an.fit.lm.mixed
	```

	It seems that both the separate and combined models are useful.
	I still wonder how one would relate these effects to more fundamental
	physical models, such as Fourier's law of heat conduction... It has
	been over 30 years since I took Thermodaynamics...

	In response to a suggestion on the Coursera forum, I thought
	this was interesting

	```{r compareAnova, echo=TRUE, cache=TRUE}
	fit.lm.1 <- lm(Gas ~ Temp, data=whiteside)
	fit.lm.2 <- lm(Gas ~ Temp + Insul, data=whiteside)

	an <- anova(fit.lm.1, fit.lm.2)
	an
	```

	As expected, it shows that includuding the ```Insul`` factor
	had a large effect on reducing variance. That was, after all,
	the point of the study,

	So let's try an interaction term to see if the interaction
	between Insulation and temperature is significant (the slope
	in our individual fits might suggest that it is...)

	```{r lmInteract, echo=TRUE, cache=TRUE}
	fit.lm.3 <- lm(Gas ~ Temp + Insul + Temp*Insul, data=whiteside)
	sum.lm.3 <- summary(fit.lm.3)
	sum.lm.3

	an.lm.3 <- anova(fit.lm.3)
	an.lm.3

	t.med <- median(whiteside$Temp)
	t.med
	```

	Note on a single LM, the anova doesn't tell more than the F.
	Need to understant phys sig

	Cool - it shows the interaction term is significant... So how would
	I report this?

	Something like: For each degree C warmer in temperature, on average
	gas consumption decreased by 0.39 kcf. Adding insulation improved
	matters, further reducing gas consumption by an average of 2.13 kcF
	multiplied by 0.115 for every degree C difference in temperatue
	from 4.9 C.