::: {.container-fluid .main-container}
::: {#header .fluid-row}

Gibbs sampling {#gibbs-sampling .title .toc-ignore}

:::

::: {#lesson-5.1 .section .level1} So far, we have demonstrated MCMC for a single parameter. What if we seek the posterior distribution of multiple parameters, and that posterior distribution does not have a standard form? One option is to perform Metropolis-Hastings (M-H) by sampling candidates for all parameters at once, and accepting or rejecting all of those candidates together. While this is possible, it can get complicated. Another (simpler) option is to sample the parameters one at a time.

As a simple example, suppose we have a joint posterior distribution for two parameters [\(\theta\)]{.math .inline} and [\(\phi\)]{.math .inline}, written [\(p(\theta, \phi \mid y) \propto g(\theta, \phi)\)]{.math .inline}. If we knew the value of [\(\phi\)]{.math .inline}, then we would just draw a candidate for [\(\theta\)]{.math .inline} and use [\(g(\theta, \phi)\)]{.math .inline} to compute our Metropolis-Hastings ratio, and possibly accept the candidate. Of course we don't actually know [\(\phi\)]{.math .inline}, so before moving on to the next iteration we perform a similar update for it: draw a candidate for [\(\phi\)]{.math .inline} using some proposal distribution and again use [\(g(\theta, \phi)\)]{.math .inline} to compute our Metropolis-Hastings ratio, this time pretending we know the value of [\(\theta\)]{.math .inline} by substituting its current value from the Markov chain. Once we've drawn for both [\(\theta\)]{.math .inline} and [\(\phi\)]{.math .inline}, that completes one iteration and we begin the next iteration by drawing a new [\(\theta\)]{.math .inline}. In other words, we're just going back and forth, updating the parameters one at a time and plugging the current value of the other parameter into [\(g(\theta, \phi)\)]{.math .inline}.
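
To fix the idea, here is a minimal sketch in R of this one-at-a-time Metropolis scheme. It assumes a user-supplied function lg(theta, phi) that returns the log of the unnormalized joint posterior [\(g(\theta, \phi)\)]{.math .inline}; the function and argument names are illustrative only and are not used elsewhere in this lesson.

mh_one_at_a_time = function(lg, n_iter, theta_init, phi_init, cand_sd_theta, cand_sd_phi) {
  ## lg: function of (theta, phi) returning log g(theta, phi) -- assumed supplied by the user
  theta_now = theta_init
  phi_now = phi_init
  out = matrix(NA, nrow=n_iter, ncol=2, dimnames=list(NULL, c("theta", "phi")))
  for (i in 1:n_iter) {
    ## update theta, holding phi at its current value
    theta_cand = rnorm(n=1, mean=theta_now, sd=cand_sd_theta)
    lalpha = lg(theta_cand, phi_now) - lg(theta_now, phi_now)
    if (log(runif(1)) < lalpha) theta_now = theta_cand
    
    ## update phi, holding theta at its freshly updated value
    phi_cand = rnorm(n=1, mean=phi_now, sd=cand_sd_phi)
    lalpha = lg(theta_now, phi_cand) - lg(theta_now, phi_now)
    if (log(runif(1)) < lalpha) phi_now = phi_cand
    
    out[i, ] = c(theta_now, phi_now)
  }
  out
}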

This idea of one-at-a-time updates is used in what we call Gibbs sampling, which also produces a stationary Markov chain (whose stationary distribution is the posterior). If you recall, this is the namesake of JAGS, "just another Gibbs sampler."

::: {#full-conditional-distributions .section .level3}

Full conditional distributions

Before describing the full Gibbs sampling algorithm, there's one more thing we can do. Using the chain rule of probability, we have [\(p(\theta, \phi \mid y) = p(\theta \mid \phi, y) \cdot p(\phi \mid y)\)]{.math .inline}. Notice that the only difference between [\(p(\theta, \phi \mid y)\)]{.math .inline} and [\(p(\theta \mid \phi, y)\)]{.math .inline} is multiplication by a factor that doesn't involve [\(\theta\)]{.math .inline}. Since the [\(g(\theta, \phi)\)]{.math .inline} function above, when viewed as a function of [\(\theta\)]{.math .inline}, is proportional to both of these expressions, we might as well have replaced it with [\(p(\theta \mid \phi, y)\)]{.math .inline} in our update for [\(\theta\)]{.math .inline}.

This distribution [\(p(\theta \mid \phi, y)\)]{.math .inline} is called the full conditional distribution for [\(\theta\)]{.math .inline}. Why use it instead of [\(g(\theta, \phi)\)]{.math .inline}? In some cases, the full conditional distribution is a standard distribution we know how to sample. If that happens, we no longer need to draw a candidate and decide whether to accept it. In fact, if we treat the full conditional distribution as a candidate proposal distribution, the resulting Metropolis-Hastings acceptance probability becomes exactly 1.
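
To see why, write [\(\theta^*\)]{.math .inline} for the candidate and [\(\theta\)]{.math .inline} for the current value. If the proposal density is the full conditional itself, [\(q(\theta^* \mid \theta) = p(\theta^* \mid \phi, y)\)]{.math .inline}, then because [\(g(\theta, \phi)\)]{.math .inline} is proportional to [\(p(\theta \mid \phi, y)\)]{.math .inline} as a function of [\(\theta\)]{.math .inline}, the Metropolis-Hastings ratio is [\[ \frac{g(\theta^*, \phi) \, q(\theta \mid \theta^*)}{g(\theta, \phi) \, q(\theta^* \mid \theta)} = \frac{p(\theta^* \mid \phi, y) \, p(\theta \mid \phi, y)}{p(\theta \mid \phi, y) \, p(\theta^* \mid \phi, y)} = 1 \, , \]]{.math .display} so every candidate is accepted.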

Gibbs samplers require a little more work up front because you need to find the full conditional distribution for each parameter. The good news is that all full conditional distributions have the same starting point: the full joint posterior distribution. Using the example above, we have [\[ p(\theta \mid \phi, y) \propto p(\theta, \phi \mid y) \]]{.math .display} where we simply now treat [\(\phi\)]{.math .inline} as a known number. Likewise, the other full conditional is [\(p(\phi \mid \theta, y) \propto p(\theta, \phi \mid y)\)]{.math .inline} where here, we consider [\(\theta\)]{.math .inline} to be a known number. We always start with the full posterior distribution. Thus, the process of finding full conditional distributions is the same as finding the posterior distribution of each parameter, pretending that all other parameters are known. :::

::: {#gibbs-sampler .section .level3}

Gibbs sampler

The idea of Gibbs sampling is that we can update multiple parameters by sampling just one parameter at a time, cycling through all parameters and repeating. To perform the update for one particular parameter, we substitute in the current values of all other parameters.

Here is the algorithm. Suppose we have a joint posterior distribution for two parameters [\(\theta\)]{.math .inline} and [\(\phi\)]{.math .inline}, written [\(p(\theta, \phi \mid y)\)]{.math .inline}. If we can find and sample from the full conditional distribution of each parameter, i.e., [\(p(\theta \mid \phi, y)\)]{.math .inline} and [\(p(\phi \mid \theta, y)\)]{.math .inline}, then we can take turns sampling from these distributions like so:

  1. Using [\(\phi_{i-1}\)]{.math .inline}, draw [\(\theta_i\)]{.math .inline} from [\(p(\theta \mid \phi = \phi_{i-1}, y)\)]{.math .inline}.
  2. Using [\(\theta_i\)]{.math .inline}, draw [\(\phi_i\)]{.math .inline} from [\(p(\phi \mid \theta = \theta_i, y)\)]{.math .inline}.

Together, steps 1 and 2 complete one cycle of the Gibbs sampler and produce the draw for [\((\theta_i, \phi_i)\)]{.math .inline} in one iteration of an MCMC sampler. If there are more than two parameters, the same idea applies: one Gibbs cycle includes an update for each of the parameters in turn.
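
For example, with parameters [\(\theta_1, \ldots, \theta_k\)]{.math .inline}, iteration [\(i\)]{.math .inline} of one Gibbs cycle would update each parameter in turn from its full conditional, plugging in the most recent draw of every other parameter: [\[ \theta_{j,i} \sim p \left( \theta_j \mid \theta_{1,i}, \ldots, \theta_{j-1,i}, \theta_{j+1,i-1}, \ldots, \theta_{k,i-1}, y \right), \quad j = 1, \ldots, k \, , \]]{.math .display} where [\(\theta_{j,i}\)]{.math .inline} denotes the draw of parameter [\(j\)]{.math .inline} in iteration [\(i\)]{.math .inline}.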

In the following segments, we will provide a concrete example of finding full conditional distributions and constructing a Gibbs sampler. ::: :::

::: {#lesson-5.2 .section .level1} ::: {#normal-likelihood-unknown-mean-and-variance .section .level3}

Normal likelihood, unknown mean and variance

Let's return to the example where we have a normal likelihood with unknown mean and unknown variance. The model is [\[ \begin{align} y_i \mid \mu, \sigma^2 &\overset{\text{iid}}{\sim} \text{N} ( \mu, \sigma^2 ), \quad i=1,\ldots,n \\ \mu &\sim \text{N}(\mu_0, \sigma_0^2) \\ \sigma^2 &\sim \text{IG}(\nu_0, \beta_0) \, . \end{align} \]]{.math .display} We chose a normal prior for [\(\mu\)]{.math .inline} because, in the case where [\(\sigma^2\)]{.math .inline} is known, the normal is the conjugate prior for [\(\mu\)]{.math .inline}. Likewise, in the case where [\(\mu\)]{.math .inline} is known, the inverse-gamma is the conjugate prior for [\(\sigma^2\)]{.math .inline}. This will give us convenient full conditional distributions in a Gibbs sampler.

Let's first work out the form of the full posterior distribution. When we begin analyzing data, the JAGS software will complete this step for us. However, it is extremely valuable to see and understand how this works.

[\[ \begin{align} p( \mu, \sigma^2 \mid y_1, y_2, \ldots, y_n ) &\propto p(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) p(\mu) p(\sigma^2) \\ &= \prod_{i=1}^n \text{N} ( y_i \mid \mu, \sigma^2 ) \times \text{N}( \mu \mid \mu_0, \sigma_0^2) \times \text{IG}(\sigma^2 \mid \nu_0, \beta_0) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[ -\frac{(y_i - \mu)^2}{2\sigma^2} \right] \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \times \frac{\beta_0^{\nu_0}}{\Gamma(\nu_0)}(\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \end{align} \]]{.math .display}

From here, it is easy to continue on to find the two full conditional distributions we need. First let's look at [\(\mu\)]{.math .inline}, assuming [\(\sigma^2\)]{.math .inline} is known (in which case it becomes a constant and is absorbed into the normalizing constant):

[\[ \begin{align} p(\mu \mid \sigma^2, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \\ &\propto \exp \left[ -\frac{1}{2} \left( \frac{ \sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \right) \right] \\ &\propto \text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \, , \end{align} \]]{.math .display} which we derived in the supplementary material of the last course. So, given the data and [\(\sigma^2\)]{.math .inline}, [\(\mu\)]{.math .inline} follows this normal distribution.
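
For completeness, the last step in the display above follows by collecting the terms involving [\(\mu\)]{.math .inline} and completing the square: [\[ \begin{align} \frac{\sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} &= \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu + c \\ &= \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \left( \mu - \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2} \right)^2 + c' \, , \end{align} \]]{.math .display} where [\(c\)]{.math .inline} and [\(c'\)]{.math .inline} collect terms that do not involve [\(\mu\)]{.math .inline}. Multiplying by [\(-1/2\)]{.math .inline} and exponentiating gives the kernel of the normal distribution above.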

Now let's look at [\(\sigma^2\)]{.math .inline}, assuming [\(\mu\)]{.math .inline} is known: [\[ \begin{align} p(\sigma^2 \mid \mu, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-(\nu_0 + n/2 + 1)} \exp \left[ -\frac{1}{\sigma^2} \left( \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto \text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \, . \end{align} \]]{.math .display}

These two distributions provide the basis of a Gibbs sampler to simulate from a Markov chain whose stationary distribution is the full posterior of both [\(\mu\)]{.math .inline} and [\(\sigma^2\)]{.math .inline}. We simply alternate draws between these two parameters, using the most recent draw of one parameter to update the other.

We will do this in R in the next segment. ::: :::

::: {#lesson-5.3 .section .level1} ::: {#gibbs-sampler-in-r .section .level3}

Gibbs sampler in R

To implement the Gibbs sampler we just described, let's return to our running example where the data are the percent change in total personnel from last year to this year for [\(n=10\)]{.math .inline} companies. We'll still use a normal likelihood, but now we'll relax the assumption that we know the variance of growth between companies, [\(\sigma^2\)]{.math .inline}, and estimate that variance. Instead of the [\(t\)]{.math .inline} prior from earlier, we will use the conditionally conjugate priors, normal for [\(\mu\)]{.math .inline} and inverse-gamma for [\(\sigma^2\)]{.math .inline}.

The first step will be to write functions to simulate from the full conditional distributions we derived in the previous segment. The full conditional for [\(\mu\)]{.math .inline}, given [\(\sigma^2\)]{.math .inline} and data is [\[ \text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \]]{.math .display}

update_mu = function(n, ybar, sig2, mu_0, sig2_0) {
  sig2_1 = 1.0 / (n / sig2 + 1.0 / sig2_0) # full conditional (posterior) variance for mu
  mu_1 = sig2_1 * (n * ybar / sig2 + mu_0 / sig2_0) # full conditional (posterior) mean for mu
  rnorm(n=1, mean=mu_1, sd=sqrt(sig2_1))
}

The full conditional for [\(\sigma^2\)]{.math .inline} given [\(\mu\)]{.math .inline} and data is [\[ \text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \]]{.math .display}

update_sig2 = function(n, y, mu, nu_0, beta_0) {
  nu_1 = nu_0 + n / 2.0
  sumsq = sum( (y - mu)^2 ) # vectorized
  beta_1 = beta_0 + sumsq / 2.0
  out_gamma = rgamma(n=1, shape=nu_1, rate=beta_1) # the gamma's rate corresponds to the inverse-gamma's scale (beta) parameter
  1.0 / out_gamma # reciprocal of a gamma random variable is distributed inv-gamma
}
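
As a quick sanity check (these calls are not part of the original lesson and the argument values are purely illustrative), the two update functions can be tried directly:

set.seed(1)
update_mu(n=10, ybar=1.0, sig2=1.0, mu_0=0.0, sig2_0=1.0) # one draw from p(mu | sig2, y)
update_sig2(n=10, y=rnorm(n=10, mean=1.0), mu=1.0, nu_0=1.0, beta_0=1.0) # one draw from p(sig2 | mu, y)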

With functions for drawing from the full conditionals, we are ready to write a function to perform Gibbs sampling.

gibbs = function(y, n_iter, init, prior) {
  ybar = mean(y)
  n = length(y)
  
  ## initialize
  mu_out = numeric(n_iter)
  sig2_out = numeric(n_iter)
  
  mu_now = init$mu
  
  ## Gibbs sampler
  for (i in 1:n_iter) {
    sig2_now = update_sig2(n=n, y=y, mu=mu_now, nu_0=prior$nu_0, beta_0=prior$beta_0)
    mu_now = update_mu(n=n, ybar=ybar, sig2=sig2_now, mu_0=prior$mu_0, sig2_0=prior$sig2_0)
    
    sig2_out[i] = sig2_now
    mu_out[i] = mu_now
  }
  
  cbind(mu=mu_out, sig2=sig2_out)
}

Now we are ready to set up the problem in R.

y = c(1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9)
ybar = mean(y)
n = length(y)

## prior
prior = list()
prior$mu_0 = 0.0
prior$sig2_0 = 1.0
prior$n_0 = 2.0 # prior effective sample size for sig2
prior$s2_0 = 1.0 # prior point estimate for sig2
prior$nu_0 = prior$n_0 / 2.0 # prior parameter for inverse-gamma
prior$beta_0 = prior$n_0 * prior$s2_0 / 2.0 # prior parameter for inverse-gamma

hist(y, freq=FALSE, xlim=c(-1.0, 3.0)) # histogram of the data
curve(dnorm(x=x, mean=prior$mu_0, sd=sqrt(prior$sig2_0)), lty=2, add=TRUE) # prior for mu
points(y, rep(0,n), pch=1) # individual data points
points(ybar, 0, pch=19) # sample mean

(Figure: histogram of the data with the prior density for [\(\mu\)]{.math .inline} overlaid as a dashed curve, the individual data points as open circles, and the sample mean as a filled point.)

Finally, we can initialize and run the sampler!

set.seed(53)

init = list()
init$mu = 0.0

post = gibbs(y=y, n_iter=1e3, init=init, prior=prior)
head(post)
##             mu      sig2
## [1,] 0.3746992 1.5179144
## [2,] 0.4900277 0.8532821
## [3,] 0.2536817 1.4325174
## [4,] 1.1378504 1.2337821
## [5,] 1.0016641 0.8409815
## [6,] 1.1576873 0.7926196
library("coda")
plot(as.mcmc(post))

(Figure: trace plots and posterior density estimates for mu and sig2, produced by coda's plot method.)

summary(as.mcmc(post))
## 
## Iterations = 1:1000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## mu   0.9051 0.2868  0.00907        0.00907
## sig2 0.9282 0.5177  0.01637        0.01810
## 
## 2. Quantiles for each variable:
## 
##        2.5%    25%    50%   75% 97.5%
## mu   0.3024 0.7244 0.9089 1.090 1.481
## sig2 0.3577 0.6084 0.8188 1.094 2.141
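
As an optional check beyond the summary (these calls are not in the original lesson, but both functions are part of the coda package), we can look at effective sample sizes and autocorrelations:

effectiveSize(as.mcmc(post)) # effective sample size for each parameter
autocorr.diag(as.mcmc(post)) # autocorrelation at selected lags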

As with the Metropolis-Hastings example, these chains appear to have converged. ::: ::: :::
