Created
February 17, 2013 04:04
-
-
Save jrminter/4970086 to your computer and use it in GitHub Desktop.
Ann analysis of the whiteside data from the MASS package to better understand linear models
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Linear Regression with Factor Variables | |
======================================================== | |
One of the struggles I have in the Coursera Data Analysis course is | |
understanding the strengths and weaknesses of the algorithms we have | |
examined because of the messy data sets chosen ( I appreciate that | |
real data sets __are__ messy). As a chemist by training, now | |
specializing in microscopy and image processing, I have repeatedly | |
learned the benefit of trying out an unfamiliar algorithm on a | |
well-behaved data set before "going where no one has gone before..." | |
Let's consider a well-behaved data set of measurements of gas use as | |
a function of temperature taken before and after insulation and frame | |
some questions... | |
The vignette in MASS notes | |
> Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption. | |
> __Temp:__ Purportedly the average outside temperature in degrees Celsius. (These values is far too low for any 56-week period in the 1960s in South-East England. It might be the weekly average of daily minima.) | |
> __Gas:__ The weekly gas consumption in 1000s of cubic feet. | |
> __Insul:__ A factor, before or after insulation. | |
```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F} | |
# make this an external chunk that can be included in any file | |
# adapted from Jeff Leek's Coursera Slides | |
rm(list=ls()) | |
options(width = 65) | |
opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE, | |
comment = NA, fig.align = 'center', dpi = 100, | |
tidy = FALSE, cache.path = '.cache/', | |
fig.path = 'fig/') | |
options(xtable.type = 'html') | |
knit_hooks$set(inline = function(x) { | |
if(is.numeric(x)) { | |
round(x, getOption('digits')) | |
} else { | |
paste(as.character(x), collapse = ', ') | |
} | |
}) | |
knit_hooks$set(plot = knitr:::hook_plot_html) | |
``` | |
We take a first peek at the loaded data: | |
```{r setupAna, echo=FALSE, cache=TRUE} | |
require(MASS) | |
str.wd <- '~/work/proj/DataAnalysis/Rmd/linModelMASS/' | |
setwd(str.wd) | |
data(whiteside) | |
print(head(whiteside)) | |
``` | |
First, let's plot the data where we see clear separation between | |
the two factors. This gives us a high probability of getting | |
statistically significant effects... | |
```{r, echo=FALSE, cache=TRUE, fig.width=6, fig.height=4} | |
# B L T R | |
par(mar=c(5.1, 4.1, 4.1, 2.1)) | |
plot(whiteside$Temp, whiteside$Gas, type='n', | |
xlab='Temperature', ylab='Gas', | |
main='Gas use before and after insulation') | |
points(whiteside$Temp, whiteside$Gas,pch=19, | |
col=(whiteside$Insul=="Before")*1+1) | |
legend("topright", | |
legend=c('Before', 'After'), | |
col=c('red', 'black'), | |
pch=c(19,19)) | |
par(mar=c(5.1, 4.1, 4.1, 2.1)) | |
``` | |
We can first subset the data, do separate linear models, | |
and look at the 95% confidence intervals of the parameters | |
```{r indFits, echo=FALSE, cache=TRUE} | |
before <- whiteside[whiteside$Insul=="Before", c(2,3)] | |
after <- whiteside[whiteside$Insul=="After", c(2,3)] | |
fit.lm.b <- lm(Gas ~ Temp, data=before) | |
fit.lm.a <- lm(Gas ~ Temp, data=after) | |
ci.b <- confint(fit.lm.b, level=0.95) | |
ci.a <- confint(fit.lm.a, level=0.95) | |
rownames(ci.b) <- c('intercept.before','slope.before') | |
rownames(ci.a) <- c('intercept.after','slope.after') | |
ci <- rbind(ci.b, ci.a) | |
ci <- ci[order(row.names(ci)),] | |
ci | |
``` | |
We note that __both__ the slope and intercept are different at | |
the 95% confidence level. | |
Next, we construct a mixed model and look at the confidence | |
intervals | |
```{r mixedFit, echo=TRUE, cache=TRUE} | |
fit.lm.mixed <- lm(Gas ~ Temp + as.factor(Insul), data=whiteside) | |
ci.lm.mixed <- confint(fit.lm.mixed, level=0.95) | |
ci.lm.mixed | |
``` | |
I _believe_ one would interpret the __as.factor(Insul)After__ term as | |
the change in gas use after insulation for a given temperature. This | |
_seems_ to ignore the significant difference in slope we saw in the | |
individual fits. | |
We can also compute an ANOVA: | |
```{r mixedFitAnova, echo=TRUE, cache=TRUE} | |
an.fit.lm.mixed <- anova(fit.lm.mixed) | |
an.fit.lm.mixed | |
``` | |
It seems that both the separate and combined models are useful. | |
I still wonder how one would relate these effects to more fundamental | |
physical models, such as Fourier's law of heat conduction... It has | |
been over 30 years since I took Thermodaynamics... | |
In response to a suggestion on the Coursera forum, I thought | |
this was interesting | |
```{r compareAnova, echo=TRUE, cache=TRUE} | |
fit.lm.1 <- lm(Gas ~ Temp, data=whiteside) | |
fit.lm.2 <- lm(Gas ~ Temp + Insul, data=whiteside) | |
an <- anova(fit.lm.1, fit.lm.2) | |
an | |
``` | |
As expected, it shows that includuding the ```Insul`` factor | |
had a large effect on reducing variance. That **was**, after all, | |
the point of the study, | |
So let's try an interaction term to see if the interaction | |
between Insulation and temperature is significant (the slope | |
in our individual fits might suggest that it is...) | |
```{r lmInteract, echo=TRUE, cache=TRUE} | |
fit.lm.3 <- lm(Gas ~ Temp + Insul + Temp*Insul, data=whiteside) | |
sum.lm.3 <- summary(fit.lm.3) | |
sum.lm.3 | |
an.lm.3 <- anova(fit.lm.3) | |
an.lm.3 | |
t.med <- median(whiteside$Temp) | |
t.med | |
``` | |
Note on a single LM, the anova doesn't tell more than the F. | |
Need to understant phys sig | |
Cool - it shows the interaction term is significant... So how would | |
I report this? | |
Something like: For each degree C warmer in temperature, on average | |
gas consumption decreased by 0.39 kcf. Adding insulation improved | |
matters, further reducing gas consumption by an average of 2.13 kcF | |
multiplied by 0.115 for every degree C difference in temperatue | |
from 4.9 C. | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment