---
title: JAGS and lme4
author: Douglas Bates
date: March 28, 2014
output:
  html_document:
    toc: yes
    highlight: tango
    fig_width: 7
    fig_height: 5.5
    fig_caption: true
  pdf_document:
    toc: yes
---
```{r preliminaries,cache=FALSE,include=FALSE}
library(lattice)   # dotplot, densityplot and xyplot are used below
library(rjags)
library(lme4)
library(knitr)
opts_chunk$set(cache=TRUE)
```
# Simple examples of the use of JAGS with linear mixed-effects models
Linear mixed-effects models (LMEMs) are used to model a numeric response on a continuous scale as it depends on a number of covariates. The "effects" of categorical covariates with levels that are fixed across studies (such as _sex_ with levels _female_ and _male_) are represented as _fixed-effects_ coefficients. Categorical covariates whose levels represent a random sample from a population of possible values (such as _subject_ or _item_) are modeled with _random effects_.
In the classical, "sampling theory" approach to statistical inference we regard the parameters as fixed but unknown and the observed responses as a sample from a probability distribution. For this approach there are important differences between fixed-effects coefficients, which are parameters, and random effects, which are unobserved random variables.
In a Bayesian analysis the parameters in the model, the observations, and the random effects are all considered to be random variables. A _Markov-Chain Monte Carlo_ (MCMC) sampler provides a random sample from the distribution of the parameters of interest. It is not important to understand exactly what the "Markov-Chain" part of the name means, but it is important to realize that the samples provided by such a sampler are not the "independent and identically distributed" samples described in introductory courses. The samples have a sequential dependence: the $i$th sample depends upon its predecessors. This is described as _autocorrelation_ in the sample. Sophisticated MCMC sampling methods are designed to minimize autocorrelation, but we must be aware of the possibility that the samples are poorly "mixed" and use diagnostic methods to check for this.
## Example 1: Dyestuff yield
In addition to providing functions for fitting and analyzing linear mixed-effects models, the __lme4__ package for __R__ provides several data sets to illustrate simple models. The _Dyestuff_ data gives the _Yield_ of dyestuff for several batches (_Batch_) of an intermediate product in the manufacturing process.
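A quick look at the structure of the data shows 30 yields grouped into six batches:
```{r dyestuffstr}
str(Dyestuff)
```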
```{r dyestuffplot,fig=TRUE,cache=TRUE,fig.align='center',echo=FALSE,fig.height=3.5,fig.cap="Dotplot of the Dyestuff data. The batches have been reordered by increasing mean yield. The blue line joins the mean yields."}
set.seed(1234543)
print(dotplot(reorder(Batch, Yield) ~ Yield, Dyestuff,
              ylab = "Batch", jitter.y = TRUE, pch = 21, aspect = 0.32,
              xlab = "Yield of dyestuff (grams of standard color)",
              type = c("p", "a")))
```
A model with random effects for the batches is fit as
```{r dyestuffML,cache=TRUE}
summary(fm1 <- lmer(Yield ~ (1|Batch), Dyestuff, REML=FALSE))
```
The only fixed-effects parameter in this model is $\mu$, the mean yield across batches, which is labelled __(Intercept)__ in this summary. The other two parameters are the standard deviation of the random effects for __Batch__, whose maximum likelihood estimate is 37.3, and the standard deviation of the residual "noise" term, whose maximum likelihood estimate is 49.5.
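These estimates can also be extracted directly from the fitted model; for example
```{r fm1extract}
fixef(fm1)     # estimate of mu, labelled (Intercept)
VarCorr(fm1)   # standard deviations of the Batch random effects and the residual
sigma(fm1)     # residual standard deviation
```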
The random effects, representing the deviation from $\mu$ for the mean yield of each batch, are not parameters in the "sampling theory" form of the model. They are considered to be random variables with the (unconditional) distribution
$$
\mathcal{B}\sim\mathcal{N}({\mathbf 0},\sigma_b^2{\mathbf I}) .
$$
The conditional distribution $\mathcal{Y}|\mathcal{B} = \mathbf{b}$ is expressed using _model matrices_, $\mathbf{X}$ and $\mathbf{Z}$, as
$$
\mathcal{Y}|\mathcal{B}=\mathbf{b} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\mathbf{b},\sigma^2\mathbf{I}) .
$$
Combining these two distributions produces the conditional distribution $\mathcal{B}|\mathcal{Y}=\mathbf{y}$, where $\mathbf{y}$ is the vector of observed responses. In the sampling theory approach we do not "estimate" the random effects but, instead, determine their "conditional means", given the observed data.
```{r fm1ranef,cache=TRUE}
ranef(fm1)
```
Because model __fm1__ is so simple, we can use a _profiling_ technique to assess the variability in the parameter estimates.
```{r pr1,cache=TRUE}
pr1 <- profile(fm1)
```
and plot the densities corresponding to the profiled log-likelihood
```{r pr1dens,cache=TRUE,echo=FALSE,fig.align='center',fig.cap="Densities corresponding to the profiled log-likelihoods in model fm1"}
densityplot(pr1,layout=c(1,3),strip=FALSE,strip.left=TRUE)
```
Alternatively, we can plot the profiles in a form like a normal probability plot with vertical lines delimiting confidence intervals.
```{r pr1xy,cache=TRUE,echo=FALSE,fig.align='center',fig.width=6.5,fig.height=3.7,fig.cap="Profile plots of the parameters in model fm1. The vertical bars delimit 50%, 80%, 90%, 95% and 99% profile-based confidence intervals on the parameters"}
xyplot(pr1)
```
The vertical axis is on the scale of standard normal deviates and the vertical lines delimit 50%, 80%, 90%, 95% and 99% profile-based confidence intervals on the parameters. Thus the 95% profile-based confidence interval is bounded by the points where the curve is at $\pm 1.960$ on the vertical axis.
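The heights of these bars are simply standard normal quantiles; for example, the $\pm 1.960$ bound for the 95% interval comes from
```{r normalquantiles}
qnorm(1 - c(0.50, 0.20, 0.10, 0.05, 0.01)/2)  # cutoffs for 50%, 80%, 90%, 95% and 99% intervals
```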
A close examination of the left-hand panel shows that the 99% confidence interval on the parameter written $\sigma_1$ here ($\sigma_b$ in the equations above) does not have a lower bound. The bound should be where the curve reaches `r format(qnorm(0.005))`, but the curve does not reach that value.
Also, notice that the curve flattens out as $\sigma_1$ approaches $0$. We will return to this issue when we discuss the results from the JAGS sampler.
The numerical confidence intervals are returned by the `confint` function.
```{r confintpr1,cache=TRUE}
confint(pr1)
```
## A JAGS sampler approach to the Dyestuff model
__JAGS__ (Just Another Gibbs Sampler) is a system to produce MCMC samples from statistical models. It is particularly well suited to linear mixed-effects and generalized linear mixed-effects models as it provides an efficient method of sampling from the fixed and random effects. The samples can be drawn and analyzed in __R__ using the __rjags__ package.
JAGS has a modeling language similar to that used in __BUGS__, another popular MCMC sampler system. It is easiest to create a file containing the model description. To distinguish the data in the JAGS model from that in the __R__ data frame we will use lower-case names in JAGS, calling the data vectors __yield__ and __batch__. JAGS doesn't provide for factors as in R and hence we must convert a factor like __Batch__ to an integer vector and also specify the number of levels of the factor.
```{r d}
str(d <- with(Dyestuff, list(yield=Yield, batch=as.integer(Batch),
                             J=length(levels(Batch)))))
```
The JAGS model includes specification of the distributions in the probability model and _prior_ distributions on the parameters, which, in the Bayesian formulation, are regarded as random variables.
```{.jags}
model {
  for (i in 1:length(yield)) {
    yield[i] ~ dnorm(mu + b[batch[i]], tau);
  }
  for (j in 1:J) {
    b[j] ~ dnorm(0.0, tau.b);
  }
  mu ~ dnorm(1500, 1.0E-8);
  tau ~ dgamma(1.0E-3, 1.0E-3);
  tau.b ~ dgamma(1.0E-3, 1.0E-3);
  sigma <- 1.0/sqrt(tau);
  sigma.b <- 1.0/sqrt(tau.b);
}
```
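The model description is stored in the file `dyestuff.jag`, which is read by `jags.model` below. If that file is not already present alongside this document, it can be written from __R__ (a minimal sketch that simply reproduces the text above):
```{r writemodel,cache=FALSE}
# write the JAGS model description to "dyestuff.jag" for jags.model() to read
cat('
model {
  for (i in 1:length(yield)) {
    yield[i] ~ dnorm(mu + b[batch[i]], tau);
  }
  for (j in 1:J) {
    b[j] ~ dnorm(0.0, tau.b);
  }
  mu ~ dnorm(1500, 1.0E-8);
  tau ~ dgamma(1.0E-3, 1.0E-3);
  tau.b ~ dgamma(1.0E-3, 1.0E-3);
  sigma <- 1.0/sqrt(tau);
  sigma.b <- 1.0/sqrt(tau.b);
}
', file = "dyestuff.jag")
```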
In a JAGS model a normal distribution is specified by its mean and _precision_, generally written as $\tau$, which is the inverse of the variance. The standard deviation of the per-observation noise, $\sigma$, is the reciprocal square root of $\tau$. Similarly the standard deviation of the unconditional distribution of the random effects, `b[j], j=1,...,J`, is the reciprocal square root of $\tau_b$.
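As a quick check of this relationship, the maximum likelihood estimates quoted above translate to precisions as
```{r mleprecisions}
c(tau = 1/49.5^2, tau.b = 1/37.3^2)  # precision = 1/variance = 1/sigma^2
```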
The two loops at the beginning of the model establish the conditional distribution $\mathcal{Y}|\mathcal{B}=\mathbf{b}$ and the unconditional distribution of $\mathcal{B}$. Because the components of these multivariate distributions are independent with a common variance (a "spherical" covariance structure), it is sufficient to specify the distribution of each component separately.
The _prior_ distribution on `mu` is taken to be a normal distribution with a very small precision (i.e. a very large variance). The mean of this prior distribution is set at 1500, which is close to the overall mean yield. We could set it to zero or some other arbitrary value, but it would not make much of a difference because the prior is so diffuse. Priors on the two precision parameters are taken to be diffuse gamma distributions because the gamma distribution is a "conjugate" distribution for the precision parameter. It is not important to know exactly what a conjugate distribution is; the point is that specifying the prior in this way provides for easier sampling.
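To get a sense of how diffuse these priors are: the precision of $10^{-8}$ on `mu` corresponds to a prior standard deviation of $10^4$, far larger than any plausible variation in the yields.
```{r priorsd}
1/sqrt(1.0E-8)   # prior standard deviation implied by the precision 1.0E-8 on mu
```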
We need to load the __rjags__ package and, importantly for these kinds of models, also load the "glm" module for JAGS.
```{r rjags,cache=FALSE}
library(rjags)
load.module("glm")
```
We now create a JAGS model object in __R__.
```{r m}
m <- jags.model("dyestuff.jag",data=d,n.chains=1,quiet=TRUE)
```
and check the samplers being used with
```{r samplers}
list.samplers(m)
```
We check that the fixed-effects parameters and the random effects are being sampled jointly by the `Linear` sampler. This is why we load the "glm" module, and it is important for the efficiency of the sampling.
A sample is generated with the `coda.samples` function. We will specify that we want to monitor the values of `mu`, `sigma` and `sigma.b` and return them to __R__.
```{r codasamples,cache=TRUE}
system.time(samp <- coda.samples(m, c("mu","sigma","sigma.b"), n.iter=10000))
```
It is very fast to create these samples when the "glm" module is loaded.
It is tempting to skip to the "good part" and look at the summary of the sample and density plots so that we can evaluate the precision of parameter estimates. However, we should first check on the behavior of the sequence of parameter values. We want to see a "well-mixed" sequence, with a plot that looks like a fuzzy caterpillar.
```{r xyplotsamp,echo=FALSE,fig.align='center',fig.cap="Sequential plots of the parameter values in the first JAGS sample",cache=TRUE}
xyplot(samp,layout=c(1,3),strip=FALSE,strip.left=TRUE,scales=list(x=list(axs="i")))
```
The plot of $\mu$ versus the iteration number shows the type of pattern that we want. There are a few spikes and a couple of unusually stable patches just before iteration 1000 and around iteration 9000 but overall it is a stable, fuzzy pattern. The patterns for the standard deviations, $\sigma$ and $\sigma_b$, are problematic.
First consider the pattern for $\sigma_b$. There are several places where the plot flattens out at a value close to zero and stays there for many iterations. The anomalies in the patterns for the other two parameters correspond to these flat spots. When $\sigma_b$ is close to zero with almost no variability, $\sigma$ becomes larger to compensate (there is a fixed amount of variability in the data and if it isn't being absorbed into the random effects it must get absorbed in the per-observation noise) and the pattern for $\mu$ becomes less variable.
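One way to see this compensation numerically is the correlation between the sampled values of `sigma` and `sigma.b`, which we expect to be negative (a quick check, converting the chain to a matrix with columns named after the monitored quantities):
```{r sigmacorrelation}
smat <- as.matrix(samp)                  # combine the monitored values into a matrix
cor(smat[, "sigma"], smat[, "sigma.b"])
```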
The reason that $\sigma_b$ will get stuck near zero is related to the flattening of the profile likelihood curve for $\sigma_b$ near zero. When $\sigma_b$ is close to zero the conditional means of the random effects are all close to zero and the model collapses to
$$
\mathcal{Y}\sim\mathcal{N}(\mathbf{1}\mu,\sigma^2 \mathbf{I}) .
$$
As we see in the plot, it can sometimes take several hundred iterations for the sampler to shift away from near-zero random effects.
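The __coda__ package, loaded with __rjags__, provides diagnostics that quantify this behavior: the autocorrelation in each monitored quantity and the effective sample size, which will typically be much smaller than the 10000 iterations drawn. (With more than one chain, `gelman.diag` provides a further check by comparing between-chain and within-chain variability.)
```{r diagnostics}
autocorr.diag(samp)    # autocorrelation at increasing lags for each monitored quantity
effectiveSize(samp)    # effective number of independent draws
```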
## Versions of packages used
```{r sessionInfo}
sessionInfo()
```