::: {.container-fluid .main-container}
::: {#header .fluid-row}

Gibbs sampling {#gibbs-sampling .title .toc-ignore}

:::

::: {#lesson-5.1 .section .level1} So far, we have demonstrated MCMC for a single parameter. What if we seek the posterior distribution of multiple parameters, and that posterior distribution does not have a standard form? One option is to perform Metropolis-Hastings (M-H) by sampling candidates for all parameters at once, and accepting or rejecting all of those candidates together. While this is possible, it can get complicated. Another (simpler) option is to sample the parameters one at a time.

As a simple example, suppose we have a joint posterior distribution for two parameters [\(\theta\)]{.math .inline} and [\(\phi\)]{.math .inline}, written [\(p(\theta, \phi \mid y) \propto g(\theta, \phi)\)]{.math .inline}. If we knew the value of [\(\phi\)]{.math .inline}, then we would just draw a candidate for [\(\theta\)]{.math .inline} and use [\(g(\theta, \phi)\)]{.math .inline} to compute our Metropolis-Hastings ratio, and possibly accept the candidate. Of course we don't actually know [\(\phi\)]{.math .inline}, so before moving on to the next iteration we perform a similar update for it: draw a candidate for [\(\phi\)]{.math .inline} using some proposal distribution and again use [\(g(\theta, \phi)\)]{.math .inline} to compute our Metropolis-Hastings ratio, this time pretending we know the value of [\(\theta\)]{.math .inline} by substituting its current value from the Markov chain. Once we've drawn for both [\(\theta\)]{.math .inline} and [\(\phi\)]{.math .inline}, that completes one iteration and we begin the next iteration by drawing a new [\(\theta\)]{.math .inline}. In other words, we're just going back and forth, updating the parameters one at a time and plugging the current value of the other parameter into [\(g(\theta, \phi)\)]{.math .inline}.
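
To fix the idea, here is a minimal sketch in R of this one-at-a-time Metropolis scheme. It assumes a user-supplied function lg(theta, phi) that returns the log of the unnormalized joint posterior [\(g(\theta, \phi)\)]{.math .inline}; the function and argument names are illustrative only and are not used elsewhere in this lesson.

mh_one_at_a_time = function(lg, n_iter, theta_init, phi_init, cand_sd_theta, cand_sd_phi) {
  ## lg: function of (theta, phi) returning log g(theta, phi) -- assumed supplied by the user
  theta_now = theta_init
  phi_now = phi_init
  out = matrix(NA, nrow=n_iter, ncol=2, dimnames=list(NULL, c("theta", "phi")))
  for (i in 1:n_iter) {
    ## update theta, holding phi at its current value
    theta_cand = rnorm(n=1, mean=theta_now, sd=cand_sd_theta)
    lalpha = lg(theta_cand, phi_now) - lg(theta_now, phi_now)
    if (log(runif(1)) < lalpha) theta_now = theta_cand
    
    ## update phi, holding theta at its freshly updated value
    phi_cand = rnorm(n=1, mean=phi_now, sd=cand_sd_phi)
    lalpha = lg(theta_now, phi_cand) - lg(theta_now, phi_now)
    if (log(runif(1)) < lalpha) phi_now = phi_cand
    
    out[i, ] = c(theta_now, phi_now)
  }
  out
}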

This idea of one-at-a-time updates is used in what we call Gibbs sampling, which also produces a stationary Markov chain (whose stationary distribution is the posterior). If you recall, this is the namesake of JAGS, "just another Gibbs sampler."

::: {#full-conditional-distributions .section .level3}

Full conditional distributions

Before describing the full Gibbs sampling algorithm, there's one more thing we can do. Using the chain rule of probability, we have [\(p(\theta, \phi \mid y) = p(\theta \mid \phi, y) \cdot p(\phi \mid y)\)]{.math .inline}. Notice that the only difference between [\(p(\theta, \phi \mid y)\)]{.math .inline} and [\(p(\theta \mid \phi, y)\)]{.math .inline} is multiplication by a factor that doesn't involve [\(\theta\)]{.math .inline}. Since the [\(g(\theta, \phi)\)]{.math .inline} function above, when viewed as a function of [\(\theta\)]{.math .inline}, is proportional to both of these expressions, we might as well have replaced it with [\(p(\theta \mid \phi, y)\)]{.math .inline} in our update for [\(\theta\)]{.math .inline}.

This distribution [\(p(\theta \mid \phi, y)\)]{.math .inline} is called the full conditional distribution for [\(\theta\)]{.math .inline}. Why use it instead of [\(g(\theta, \phi)\)]{.math .inline}? In some cases, the full conditional distribution is a standard distribution we know how to sample. If that happens, we no longer need to draw a candidate and decide whether to accept it. In fact, if we treat the full conditional distribution as a candidate proposal distribution, the resulting Metropolis-Hastings acceptance probability becomes exactly 1.
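
To see why, write [\(\theta^*\)]{.math .inline} for the candidate and [\(\theta\)]{.math .inline} for the current value. If the proposal density is the full conditional itself, [\(q(\theta^* \mid \theta) = p(\theta^* \mid \phi, y)\)]{.math .inline}, then because [\(g(\theta, \phi)\)]{.math .inline} is proportional to [\(p(\theta \mid \phi, y)\)]{.math .inline} as a function of [\(\theta\)]{.math .inline}, the Metropolis-Hastings ratio is [\[ \frac{g(\theta^*, \phi) \, q(\theta \mid \theta^*)}{g(\theta, \phi) \, q(\theta^* \mid \theta)} = \frac{p(\theta^* \mid \phi, y) \, p(\theta \mid \phi, y)}{p(\theta \mid \phi, y) \, p(\theta^* \mid \phi, y)} = 1 \, , \]]{.math .display} so every candidate is accepted.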

Gibbs samplers require a little more work up front because you need to find the full conditional distribution for each parameter. The good news is that all full conditional distributions have the same starting point: the full joint posterior distribution. Using the example above, we have [\[ p(\theta \mid \phi, y) \propto p(\theta, \phi \mid y) \]]{.math .display} where we simply now treat [\(\phi\)]{.math .inline} as a known number. Likewise, the other full conditional is [\(p(\phi \mid \theta, y) \propto p(\theta, \phi \mid y)\)]{.math .inline} where here, we consider [\(\theta\)]{.math .inline} to be a known number. We always start with the full posterior distribution. Thus, the process of finding full conditional distributions is the same as finding the posterior distribution of each parameter, pretending that all other parameters are known. :::

::: {#gibbs-sampler .section .level3}

Gibbs sampler

The idea of Gibbs sampling is that we can update multiple parameters by sampling just one parameter at a time, cycling through all parameters and repeating. To perform the update for one particular parameter, we substitute in the current values of all other parameters.

Here is the algorithm. Suppose we have a joint posterior distribution for two parameters [\(\theta\)]{.math .inline} and [\(\phi\)]{.math .inline}, written [\(p(\theta, \phi \mid y)\)]{.math .inline}. If we can find and sample from the full conditional distribution of each parameter, i.e., [\(p(\theta \mid \phi, y)\)]{.math .inline} and [\(p(\phi \mid \theta, y)\)]{.math .inline}, then we can take turns sampling from these distributions like so:

  1. Using [\(\phi_{i-1}\)]{.math .inline}, draw [\(\theta_i\)]{.math .inline} from [\(p(\theta \mid \phi = \phi_{i-1}, y)\)]{.math .inline}.
  2. Using [\(\theta_i\)]{.math .inline}, draw [\(\phi_i\)]{.math .inline} from [\(p(\phi \mid \theta = \theta_i, y)\)]{.math .inline}.

Together, steps 1 and 2 complete one cycle of the Gibbs sampler and produce the draw for [\((\theta_i, \phi_i)\)]{.math .inline} in one iteration of an MCMC sampler. If there are more than two parameters, the same idea applies: one Gibbs cycle includes an update for each of the parameters in turn.
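
For example, with parameters [\(\theta_1, \ldots, \theta_k\)]{.math .inline}, iteration [\(i\)]{.math .inline} of one Gibbs cycle would update each parameter in turn from its full conditional, plugging in the most recent draw of every other parameter: [\[ \theta_{j,i} \sim p \left( \theta_j \mid \theta_{1,i}, \ldots, \theta_{j-1,i}, \theta_{j+1,i-1}, \ldots, \theta_{k,i-1}, y \right), \quad j = 1, \ldots, k \, , \]]{.math .display} where [\(\theta_{j,i}\)]{.math .inline} denotes the draw of parameter [\(j\)]{.math .inline} in iteration [\(i\)]{.math .inline}.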

In the following segments, we will provide a concrete example of finding full conditional distributions and constructing a Gibbs sampler. ::: :::

::: {#lesson-5.2 .section .level1} ::: {#normal-likelihood-unknown-mean-and-variance .section .level3}

Normal likelihood, unknown mean and variance

Let's return to the example where we have a normal likelihood with unknown mean and unknown variance. The model is [\[ \begin{align} y_i \mid \mu, \sigma^2 &\overset{\text{iid}}{\sim} \text{N} ( \mu, \sigma^2 ), \quad i=1,\ldots,n \\ \mu &\sim \text{N}(\mu_0, \sigma_0^2) \\ \sigma^2 &\sim \text{IG}(\nu_0, \beta_0) \, . \end{align} \]]{.math .display} We chose a normal prior for [\(\mu\)]{.math .inline} because, in the case where [\(\sigma^2\)]{.math .inline} is known, the normal is the conjugate prior for [\(\mu\)]{.math .inline}. Likewise, in the case where [\(\mu\)]{.math .inline} is known, the inverse-gamma is the conjugate prior for [\(\sigma^2\)]{.math .inline}. This will give us convenient full conditional distributions in a Gibbs sampler.

Let's first work out the form of the full posterior distribution. When we begin analyzing data, the JAGS software will complete this step for us. However, it is extremely valuable to see and understand how this works.

[\[ \begin{align} p( \mu, \sigma^2 \mid y_1, y_2, \ldots, y_n ) &\propto p(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) p(\mu) p(\sigma^2) \\ &= \prod_{i=1}^n \text{N} ( y_i \mid \mu, \sigma^2 ) \times \text{N}( \mu \mid \mu_0, \sigma_0^2) \times \text{IG}(\sigma^2 \mid \nu_0, \beta_0) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[ -\frac{(y_i - \mu)^2}{2\sigma^2} \right] \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \times \frac{\beta_0^{\nu_0}}{\Gamma(\nu_0)}(\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \end{align} \]]{.math .display}

From here, it is easy to continue on to find the two full conditional distributions we need. First let's look at [\(\mu\)]{.math .inline}, assuming [\(\sigma^2\)]{.math .inline} is known (in which case it becomes a constant and is absorbed into the normalizing constant):

[\[ \begin{align} p(\mu \mid \sigma^2, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \\ &\propto \exp \left[ -\frac{1}{2} \left( \frac{ \sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \right) \right] \\ &\propto \text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \, , \end{align} \]]{.math .display} which we derived in the supplementary material of the last course. So, given the data and [\(\sigma^2\)]{.math .inline}, [\(\mu\)]{.math .inline} follows this normal distribution.
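
For completeness, the last step in the display above follows by collecting the terms involving [\(\mu\)]{.math .inline} and completing the square: [\[ \begin{align} \frac{\sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} &= \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu + c \\ &= \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \left( \mu - \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2} \right)^2 + c' \, , \end{align} \]]{.math .display} where [\(c\)]{.math .inline} and [\(c'\)]{.math .inline} collect terms that do not involve [\(\mu\)]{.math .inline}. Multiplying by [\(-1/2\)]{.math .inline} and exponentiating gives the kernel of the normal distribution above.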

Now let's look at [\(\sigma^2\)]{.math .inline}, assuming [\(\mu\)]{.math .inline} is known: [\[ \begin{align} p(\sigma^2 \mid \mu, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-(\nu_0 + n/2 + 1)} \exp \left[ -\frac{1}{\sigma^2} \left( \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto \text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \, . \end{align} \]]{.math .display}

These two distributions provide the basis of a Gibbs sampler to simulate from a Markov chain whose stationary distribution is the full posterior of both [\(\mu\)]{.math .inline} and [\(\sigma^2\)]{.math .inline}. We simply alternate draws between these two parameters, using the most recent draw of one parameter to update the other.

We will do this in R in the next segment. ::: :::

::: {#lesson-5.3 .section .level1} ::: {#gibbs-sampler-in-r .section .level3}

Gibbs sampler in R

To implement the Gibbs sampler we just described, let's return to our running example where the data are the percent change in total personnel from last year to this year for [\(n=10\)]{.math .inline} companies. We'll still use a normal likelihood, but now we'll relax the assumption that we know the variance of growth between companies, [\(\sigma^2\)]{.math .inline}, and estimate that variance. Instead of the [\(t\)]{.math .inline} prior from earlier, we will use the conditionally conjugate priors, normal for [\(\mu\)]{.math .inline} and inverse-gamma for [\(\sigma^2\)]{.math .inline}.

The first step will be to write functions to simulate from the full conditional distributions we derived in the previous segment. The full conditional for [\(\mu\)]{.math .inline}, given [\(\sigma^2\)]{.math .inline} and data is [\[ \text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \]]{.math .display}

update_mu = function(n, ybar, sig2, mu_0, sig2_0) {
  sig2_1 = 1.0 / (n / sig2 + 1.0 / sig2_0) # full conditional (posterior) variance for mu
  mu_1 = sig2_1 * (n * ybar / sig2 + mu_0 / sig2_0) # full conditional (posterior) mean for mu
  rnorm(n=1, mean=mu_1, sd=sqrt(sig2_1))
}

The full conditional for [\(\sigma^2\)]{.math .inline} given [\(\mu\)]{.math .inline} and data is [\[ \text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \]]{.math .display}

update_sig2 = function(n, y, mu, nu_0, beta_0) {
  nu_1 = nu_0 + n / 2.0
  sumsq = sum( (y - mu)^2 ) # vectorized
  beta_1 = beta_0 + sumsq / 2.0
  out_gamma = rgamma(n=1, shape=nu_1, rate=beta_1) # the gamma's rate corresponds to the inverse-gamma's scale (beta) parameter
  1.0 / out_gamma # reciprocal of a gamma random variable is distributed inv-gamma
}
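
As a quick sanity check (these calls are not part of the original lesson and the argument values are purely illustrative), the two update functions can be tried directly:

set.seed(1)
update_mu(n=10, ybar=1.0, sig2=1.0, mu_0=0.0, sig2_0=1.0) # one draw from p(mu | sig2, y)
update_sig2(n=10, y=rnorm(n=10, mean=1.0), mu=1.0, nu_0=1.0, beta_0=1.0) # one draw from p(sig2 | mu, y)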

With functions for drawing from the full conditionals, we are ready to write a function to perform Gibbs sampling.

gibbs = function(y, n_iter, init, prior) {
  ybar = mean(y)
  n = length(y)
  
  ## initialize
  mu_out = numeric(n_iter)
  sig2_out = numeric(n_iter)
  
  mu_now = init$mu
  
  ## Gibbs sampler
  for (i in 1:n_iter) {
    sig2_now = update_sig2(n=n, y=y, mu=mu_now, nu_0=prior$nu_0, beta_0=prior$beta_0)
    mu_now = update_mu(n=n, ybar=ybar, sig2=sig2_now, mu_0=prior$mu_0, sig2_0=prior$sig2_0)
    
    sig2_out[i] = sig2_now
    mu_out[i] = mu_now
  }
  
  cbind(mu=mu_out, sig2=sig2_out)
}

Now we are ready to set up the problem in R.

y = c(1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9)
ybar = mean(y)
n = length(y)

## prior
prior = list()
prior$mu_0 = 0.0
prior$sig2_0 = 1.0
prior$n_0 = 2.0 # prior effective sample size for sig2
prior$s2_0 = 1.0 # prior point estimate for sig2
prior$nu_0 = prior$n_0 / 2.0 # prior parameter for inverse-gamma
prior$beta_0 = prior$n_0 * prior$s2_0 / 2.0 # prior parameter for inverse-gamma

hist(y, freq=FALSE, xlim=c(-1.0, 3.0)) # histogram of the data
curve(dnorm(x=x, mean=prior$mu_0, sd=sqrt(prior$sig2_0)), lty=2, add=TRUE) # prior for mu
points(y, rep(0,n), pch=1) # individual data points
points(ybar, 0, pch=19) # sample mean

(Figure: histogram of the data with the prior density for [\(\mu\)]{.math .inline} overlaid as a dashed curve, the individual data points as open circles, and the sample mean as a filled point.)

Finally, we can initialize and run the sampler!

set.seed(53)

init = list()
init$mu = 0.0

post = gibbs(y=y, n_iter=1e3, init=init, prior=prior)
head(post)
##             mu      sig2
## [1,] 0.3746992 1.5179144
## [2,] 0.4900277 0.8532821
## [3,] 0.2536817 1.4325174
## [4,] 1.1378504 1.2337821
## [5,] 1.0016641 0.8409815
## [6,] 1.1576873 0.7926196
library("coda")
plot(as.mcmc(post))

(Figure: trace plots and posterior density estimates for mu and sig2, produced by coda's plot method.)

summary(as.mcmc(post))
## 
## Iterations = 1:1000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## mu   0.9051 0.2868  0.00907        0.00907
## sig2 0.9282 0.5177  0.01637        0.01810
## 
## 2. Quantiles for each variable:
## 
##        2.5%    25%    50%   75% 97.5%
## mu   0.3024 0.7244 0.9089 1.090 1.481
## sig2 0.3577 0.6084 0.8188 1.094 2.141
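
As an optional check beyond the summary (these calls are not in the original lesson, but both functions are part of the coda package), we can look at effective sample sizes and autocorrelations:

effectiveSize(as.mcmc(post)) # effective sample size for each parameter
autocorr.diag(as.mcmc(post)) # autocorrelation at selected lags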

As with the Metropolis-Hastings example, these chains appear to have converged. ::: ::: :::
