@davebraze
Last active September 25, 2018 15:29
Basics of factor level ordering
##### Basic factor level ordering and (treatment) contrasts
## set up data.frame with 1 continuous variable and 1 factor with 8 levels.
set.seed(1234)
x <- rnorm(80)
fac <- factor(rep(LETTERS[8:1], 10))
df <- data.frame(x, fac)
## Redraw the 10 observations at level "E" (the 5th level) with mean 1.
df$x[as.integer(df$fac) %% 5 == 0] <- rnorm(10, 1)
head(df, 16) ## Note that the values of fac appear in the data in reverse
## alphabetical order (H down to A).
str(df)
## By default, factor() sorts the levels in ascending alphanumeric order,
## regardless of the order in which the values appear in the data set.
levels(df$fac)
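## A minimal sketch of the same behaviour on a made-up vector (`toy` is
## illustrative only and not part of the data above): the levels come out
## sorted even though "b" appears first in the values.
toy <- c("b", "c", "a", "b")
levels(factor(toy))                        ## "a" "b" "c"
## To keep the order of first appearance instead, pass levels explicitly.
levels(factor(toy, levels = unique(toy)))  ## "b" "c" "a"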
## This is important because the order of levels impacts the specific
## contrasts entailed by each type of contrast coding.
##
## R's default is to use "treatment" contrast coding for unordered factors.
options("contrasts")
contrasts(df$fac)
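## As a sketch of what such a matrix encodes, contr.treatment() builds the
## same kind of coding directly (the 3 level labels here are made up for
## illustration). Each column is an indicator for one non-baseline level,
## and the baseline level's row is all zeros.
toy_contr <- contr.treatment(c("a", "b", "c"))
toy_contr
## The `base` argument selects a different baseline by position.
contr.treatment(c("a", "b", "c"), base = 3)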
## For treatment contrasts (sometimes called dummy coding), each subsequent
## level of a factor is compared pairwise to the first (baseline) level, and
## the intercept is set to the mean of that first level.
summary(lm(x~fac, data=df))
by(df$x, df$fac, mean) ## cell means
## Note that the intercept Estimate corresponds to the cell mean for level
## A. Each of the other Estimates is the given level's cell mean minus the
## cell mean for A.
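## A small self-contained check of this reading, on made-up data (`toy_df`
## and its values are illustrative assumptions, not the data above).
toy_df <- data.frame(y = c(1, 3, 4, 6, 10, 12),
                     g = factor(rep(c("a", "b", "c"), each = 2)))
cell_means <- tapply(toy_df$y, toy_df$g, mean)  ## a = 2, b = 5, c = 11
toy_fit <- lm(y ~ g, data = toy_df)
coef(toy_fit)                                   ## (Intercept), gb, gc
## Rebuild the coefficients by hand: baseline mean, then differences.
base_mean <- cell_means[["a"]]
expected <- c(base_mean,
              cell_means[["b"]] - base_mean,
              cell_means[["c"]] - base_mean)
all.equal(unname(coef(toy_fit)), expected)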
## It's usually an extraordinary coincidence if the baseline level that you
## want happens to be alphanumerically first in order. So what do you do if
## you want something different?
## You can use the relevel() function to specify which level you want to be
## the baseline. That level moves to the front; the other levels keep their
## relative order.
relevel(df$fac, "H")
df$fac <- relevel(df$fac, "H")
## Note that the multiple R-squared for the model does not change, but the
## specific contrasts do (because the baseline has changed).
summary(lm(x~fac, data=df))
by(df$x, df$fac, mean)
## Now the intercept corresponds to the cell mean for H, and all other
## estimates are the difference between the given level and H.
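## A sketch of both points on made-up data (`toy_rl` is illustrative only):
## relevel() changes the coefficients but not the fit itself.
toy_rl <- data.frame(y = c(1, 3, 4, 6, 10, 12),
                     g = factor(rep(c("a", "b", "c"), each = 2)))
fit_a <- lm(y ~ g, data = toy_rl)       ## baseline "a"
toy_rl$g <- relevel(toy_rl$g, ref = "c")
levels(toy_rl$g)                        ## "c" "a" "b"
fit_c <- lm(y ~ g, data = toy_rl)       ## baseline "c"
coef(fit_c)                             ## now differences from "c"
all.equal(fitted(fit_a), fitted(fit_c))
all.equal(summary(fit_a)$r.squared, summary(fit_c)$r.squared)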
## You can use factor() to specify an explicit order for all levels.
## This is sometimes handy for reasons we don't need to get into here.
factor(df$fac, levels = c("E", "F", "G", "H", "A", "B", "C", "D"))
df$fac <- factor(df$fac, levels = c("E", "F", "G", "H", "A", "B", "C", "D"))
summary(lm(x~fac, data=df))
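## A short sketch of one thing to watch for when reordering this way, on a
## made-up factor (`toy_f` is illustrative only): the values themselves are
## unchanged as long as every existing level is listed.
toy_f <- factor(c("a", "b", "c", "a"))
reordered <- factor(toy_f, levels = c("c", "a", "b"))
levels(reordered)                       ## "c" "a" "b"
as.character(reordered)                 ## "a" "b" "c" "a"
## Leaving a level out of `levels` turns its values into NA.
dropped <- factor(toy_f, levels = c("c", "a"))
sum(is.na(dropped))                     ## 1 (the "b" observation)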