sdwfrost/marglike.Rmd

## marglike.Rmd
Under the Bayesian paradigm, inference is based on the posterior probability over the parameters of interest. It's helpful to think of our inferences being conditional on a given model, $M$ with a parameter vector $\theta \in \Theta$. Given a dataset, $D$, and a model, the posterior distribution of the parameter values is given by Bayes' theorem.

\[
\begin{align}
Pr(\theta|D,M) = \frac{Pr(D|\theta,M)Pr(\theta|M)}{Pr(D|M)}
\end{align}
\]

$Pr(D|\theta,M)$ is the likelihood function, $Pr(\theta|M)$ is the prior probability, and $Pr(D|M)$ is known as the marginal likelihood, predictive probability, or evidence. $Pr(D|M)$ is a normalising constant that ensures that $Pr(\theta|D,M)$ integrates to one.

\[
\begin{align}
Pr(D|M) = \int_{\Theta}\Pr(D|\theta,M)Pr(\theta|M) d\theta
\end{align}
\]

However, one often wants to compare the fit of different models. As a function of the model $M$, the marginal likelihood can be interpreted as the likelihood of the model $M$ given the data $D$. Hence, to choose between several models, one simply chooses the one with the highest marginal likelihood. When comparing two models, say $M_0$ and $M_1$, a ratio of marginal likelihoods, known as the Bayes factor, $K$, is usually defined:

\[
\begin{align}
K_{01} = \frac{Pr(D|M_1)}{Pr(D|M_0)}
\end{align}
\]

Interpretations of the Bayes factor have been provided by [Jeffreys](http://books.google.co.uk/books?id=vh9Act9rtzQC) and [Kass and Raftery](http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1995.10476572).
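
As a rough guide, the Kass and Raftery scale is expressed in terms of $2\ln K$. The sketch below maps a log Bayes factor (for $M_1$ against $M_0$) onto their categories. The function name `kr_evidence` and the label used for negative values are my own assumptions for illustration, not something taken from the papers.

```{r}
# Map a log Bayes factor (model 1 versus model 0) to the Kass and Raftery
# (1995) categories, which are defined on the 2*ln(K) scale.
# kr_evidence is a hypothetical helper name; the "Favours M0" label for
# negative values is an assumption added for completeness.
kr_evidence <- function(log_bf){
  two_ln_k <- 2*log_bf
  labs <- c("Favours M0","Not worth more than a bare mention",
            "Positive","Strong","Very strong")
  as.character(cut(two_ln_k,breaks=c(-Inf,0,2,6,10,Inf),labels=labs))
}
kr_evidence(log(c(0.5,2,3.05,30,200)))
```

The cut points are applied to $2\ln K$, so the function takes the log Bayes factor directly and avoids exponentiating large values.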

Unfortunately, in most cases, it is not possible to obtain an exact solution of the marginal likelihood. A number of approaches have been described to obtain an approximate numerical estimate of the marginal likelihood; here, I illustrate two approaches based on *tempering*.

# A simple example

Before I describe what tempering means in this context, let's consider a simple example, for which there *is* an analytical solution for the marginal likelihood. Consider the problem of fitting a set of $n=100$ exponential random variables, $X$ with parameter $\lambda=3$.

We can generate these in R as follows.

```{r}
set.seed(1)
lambd <- 3
n <- 100
x <- rexp(n,lambd)
```

The likelihood of the data given the rate parameter $\lambda$ is as follows.

\[
\begin{align}
Pr(X|\lambda) & = \prod_{i=1}^{n=100} \lambda \rm{exp}(-\lambda x_i) \cr
& = \lambda^n \rm{exp}(-\lambda n \bar{x})
\end{align}
\]

where $\bar{x}$ is the sample mean of $X$.

As described in [Wikipedia](http://en.wikipedia.org/wiki/Exponential_distribution) (which is remarkably handy for distributions), if we assume a Gamma($\alpha$,$\beta$) prior on the rate coefficient, which is the conjugate prior for an exponential distribution, the posterior distribution of $\lambda$ is Gamma($\alpha+n$,$\beta+n \bar{x}$).

\[
\begin{align}
Pr(\lambda|X,\alpha, \beta) & \propto Pr(X|\lambda) \times Pr(\lambda| \alpha, \beta) \cr
& = \lambda^n \rm{exp}(-\lambda n \bar{x}) \times \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} \rm{exp}(-\lambda \beta) \cr
 & = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha+n-1} \rm{exp}(-\lambda (\beta + n \bar{x}))
\end{align}
\]

The marginal likelihood of this model can be calculated by integrating $Pr(X|\lambda) \times Pr(\lambda| \alpha, \beta)$ over $\lambda$.

\[
\begin{align}
\int_{\lambda=0}^{\infty}\frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha+n-1} \rm{exp}(-\lambda (\beta + n \bar{x})) \; d\lambda & = \frac{\beta^\alpha}{\Gamma(\alpha)}  \int_{0}^{\infty}\lambda^{\alpha+n-1} \rm{exp}(-\lambda (\beta + n \bar{x})) \; d\lambda \cr
& = \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+n)}{(\beta+ n \bar{x})^{\alpha+n}}
\end{align}
\]

The log marginal likelihood in R can be calculated as follows.

```{r}
lml <- function(x,alph,bet){
  mux <- mean(x)
  n <- length(x)
  alph*log(bet)-(alph+n)*log(bet+n*mux)+lgamma(alph+n)-lgamma(alph)
}
```

For $\alpha=1$ and $\beta=1$, the log marginal likelihood for these data is around 3.6.

```{r}
alph <- 1
bet <- 1
lml(x,alph,bet)
```
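
As a sanity check on the closed form, the integral can also be evaluated numerically. The following is a minimal sketch that uses a small, made-up data vector (`x_chk`, not the simulated data above) so that the integrand is easy for `integrate()` to handle; all names ending in `_chk` are hypothetical.

```{r}
# Deterministic toy data (hypothetical values), kept small so that the
# integrand is well behaved for integrate().
x_chk <- c(0.2,0.5,0.1,0.4,0.3)
alph_chk <- 1
bet_chk <- 1
# Likelihood times prior, vectorised over lambda as integrate() requires
joint <- function(lambda){
  sapply(lambda,function(l){
    exp(sum(dexp(x_chk,l,log=TRUE))+
        dgamma(l,shape=alph_chk,rate=bet_chk,log=TRUE))
  })
}
lml_numeric <- log(integrate(joint,lower=0,upper=Inf)$value)
# Closed form from the derivation above
n_chk <- length(x_chk)
lml_exact <- alph_chk*log(bet_chk)-lgamma(alph_chk)+
  lgamma(alph_chk+n_chk)-(alph_chk+n_chk)*log(bet_chk+sum(x_chk))
c(numeric=lml_numeric,exact=lml_exact)
```

For larger datasets the integrand becomes sharply peaked, and direct quadrature like this can become unreliable, which is part of the motivation for the sampling-based approaches that follow.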

In many cases, however, we don't have an analytical solution to the posterior distribution or the marginal likelihood. To obtain the posterior, we can use MCMC with Metropolis sampling. I first define a Metropolis sampler which returns a vector of parameter values, log likelihood, log prior and log posterior. As described before, we have to be careful, as the parameter of the exponential distribution has to be positive. I use a simple random walk sampler, rejecting any values of $\lambda<0$.

```{r}
met <- function(x,lambda0,alph,bet,sigma,niters){
  lambdvec <- numeric(niters)
  llvec <- numeric(niters)
  lpvec <- numeric(niters)
  lpostvec <- numeric(niters)
  lambd <- lambda0
  ll <- sum(dexp(x,lambd,log=TRUE))
  lp <- dgamma(lambd,shape=alph,rate=bet,log=TRUE)
  lpost <- ll+lp
  for(i in 1:niters){
    lambds <- lambd+rnorm(1,mean=0,sd=sigma)
    if(lambds>0){
      lls <- sum(dexp(x,lambds,log=TRUE))
      lps <- dgamma(lambds,shape=alph,rate=bet,log=TRUE)
      lposts <- lls+lps
      A <- exp(lposts-lpost)
      if(runif(1)<A){
        lambd <- lambds
        ll <- lls
        lp <- lps
        lpost <- lposts
      }
    }
    lambdvec[i] <- lambd
    llvec[i] <- ll
    lpvec[i] <- lp
    lpostvec[i] <- lpost
  }
  return(list(lambdvec,llvec,lpvec,lpostvec))
}
```

To run the sampler, I provide an initial value for $\lambda$, the standard deviation of the normal distribution used for the random walk, and the number of iterations.

```{r}
lambda0 <- 1
sigma <- 1
niters <- 1000000
out <- met(x,lambda0,alph,bet,sigma,niters)
```
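
The proposal standard deviation affects how often proposals are accepted. One simple way to check this, sketched below, is to count how often the chain moves between iterations; with a continuous proposal, this should equal the acceptance rate, since rejected (or negative) proposals leave the value unchanged. The helper name `acceptance_rate` is my own.

```{r}
# Fraction of transitions at which the chain moved. With a continuous
# proposal, a change in value happens only when a proposal is accepted.
# acceptance_rate is a hypothetical helper name.
acceptance_rate <- function(chain) mean(diff(chain)!=0)
# Toy chain (made-up values): 2 moves in 5 transitions
acceptance_rate(c(1,1,2,2,2,3))
```

Applied to the sampler output, `acceptance_rate(out[[1]])` would give the rate for the run above.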

Now I can plot out the density and compare it with the analytical solution.

```{r}
hist(out[[1]],100,freq=FALSE,main="",xlab=expression(lambda)) # lambda
mux <- mean(x)
curve(dgamma(x,shape=alph+n,rate=bet+n*mux),add=TRUE,col=2,lwd=2)
```

The fit using MCMC gives us the posterior distribution, but not the marginal likelihood. While there are methods to obtain the log marginal likelihood from a sample from the posterior alone, they suffer from poor performance due to high (potentially infinite) variance.

# Tempering

Several approaches to calculating marginal likelihoods are based on the idea of tempering, in which we run MCMC at a range of different (inverse) 'temperatures', obtained by raising the likelihood to a power between 0 and 1; when the power is 0, we sample from the prior, while when the power is 1, we sample from the posterior. While we use the tempered likelihood to determine acceptance probabilities in the MCMC, we will use samples of the untempered likelihood to compute the marginal likelihood.

```{r}
met.temper <- function(x,lambda0,alph,bet,sigma,temp,niters){
  lambdvec <- numeric(niters)
  lltvec <- numeric(niters)
  llvec <- numeric(niters)
  lpvec <- numeric(niters)
  lpostvec <- numeric(niters)
  lambd <- lambda0
  ll <- sum(dexp(x,lambd,log=TRUE))
  llt <- temp*ll
  lp <- dgamma(lambd,shape=alph,rate=bet,log=TRUE)
  lpost <- llt+lp
  for(i in 1:niters){
    lambds <- lambd+rnorm(1,mean=0,sd=sigma)
    if(lambds>0){
      lls <- sum(dexp(x,lambds,log=TRUE))
      llst <- temp*lls
      lps <- dgamma(lambds,shape=alph,rate=bet,log=TRUE)
      lposts <- llst+lps
      A <- exp(lposts-lpost)
      if(runif(1)<A){
        lambd <- lambds
        ll <- lls
        llt <- llst
        lp <- lps
        lpost <- lposts
      }
    }
    lambdvec[i] <- lambd
    llvec[i] <- ll
    lltvec[i] <- llt
    lpvec[i] <- lp
    lpostvec[i] <- lpost
  }
  return(list(lambdvec,llvec,lltvec,lpvec,lpostvec))
}
```

I first run the chain at a range of temperatures. For efficiency, I start with the chain at an inverse temperature of 1 (the `temp` argument in the code), using an initial value of $\lambda$ taken from the end of my original MCMC run. Then, for each subsequent, lower inverse temperature, I start the tempered chain from the final value of the previous tempered chain. In this way, we reduce the number of simulations needed in order to get a representative sample of the log likelihood from each tempered chain.

```{r}
tempvec <- seq(0,1,by=0.01)^5
numtemp <- length(tempvec)
pplist <- list()
burnin <- 1000
niters <- 10000
for(i in numtemp:1){
  l0 <- tail(out[[1]],1)
  out <- met.temper(x,l0,alph,bet,sigma,tempvec[i],niters+burnin)
  pplist[[i]] <- out
}
```

# Power posteriors

The power posterior approach is based on integrating the expectation of the log likelihood (for a given chain) across the inverse temperatures (see [Friel and Pettitt (2008)](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2007.00650.x/full) and [Lartillot and Philippe (2006)](http://sysbio.oxfordjournals.org/content/55/2/195.short)). Friel and Pettitt used a trapezoidal scheme to integrate the expected log likelihood, while a [revised scheme](http://link.springer.com/article/10.1007/s11222-013-9397-1) used an improved trapezoidal scheme, which also requires the variance of the log likelihood.
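
The identity behind this approach can be sketched as follows, writing $t$ for the inverse temperature (my notation, corresponding to `temp` in the code) and $z(t)$ for the normalising constant of the power posterior.

\[
\begin{align}
z(t) & = \int_{\Theta} Pr(D|\theta,M)^{t} Pr(\theta|M) \; d\theta \cr
\frac{d}{dt} \log z(t) & = \frac{1}{z(t)} \int_{\Theta} \log Pr(D|\theta,M) \; Pr(D|\theta,M)^{t} Pr(\theta|M) \; d\theta = E_{t}\left[\log Pr(D|\theta,M)\right] \cr
\log Pr(D|M) & = \log z(1) - \log z(0) = \int_{0}^{1} E_{t}\left[\log Pr(D|\theta,M)\right] \; dt
\end{align}
\]

Here $E_{t}$ denotes an expectation under the power posterior at inverse temperature $t$, and $z(0)=1$ because the prior is a proper density.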

I first extract the mean and the variance of the log likelihoods.

```{r}
ell <- rep(0,numtemp)
vll <- rep(0,numtemp)
for(i in 1:numtemp){
  out <- pplist[[i]]
  # Discard the first burnin samples and keep the remaining niters
  ell[i] <- mean(out[[2]][(burnin+1):(burnin+niters)])
  vll[i] <- var(out[[2]][(burnin+1):(burnin+niters)])
}
```

Let's plot this out.

```{r}
plot(ell~tempvec,type="l",xlab="Temperature",ylab="Mean lnL")
```

You should be able to anticipate a potential problem here with the numerical integration, associated with the rather uninformative prior.

The following uses this information to compute the log marginal likelihood, either using the modified or standard trapezoidal scheme.

```{r}
ppml <- function(ell,vll,tempvec,modify=TRUE){
  N <- length(ell)
  res <- 0
  for(i in 1:(N-1)){
    wts <- tempvec[i+1]-tempvec[i]
    if(modify){ # Modified trapezoidal rule
      res <- res+wts*((ell[i+1]+ell[i])/2.0)-((wts^2)/12)*(vll[i+1]-vll[i])
    }else{
      res <- res+wts*((ell[i+1]+ell[i])/2.0)
    }
  }
  res
}
```

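I can also obtain bounds on the log marginal likelihood from the same grid. Because the expected log likelihood does not decrease as the inverse temperature increases (its derivative is the variance of the log likelihood), the left and right Riemann sums give lower and upper bounds on the integral, respectively. Note that this does not take into account the error in estimating the expectation of the log likelihood, but it does give a nice easy way to evaluate at least one source of error.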

```{r}
boundml <- function(ell,tempvec){
  tempdiff <- tempvec[2:length(tempvec)]-tempvec[1:(length(tempvec)-1)]
  ub <- sum(tempdiff*ell[2:length(ell)])
  lb <- sum(tempdiff*ell[1:(length(ell)-1)])
  c(lb,ub)
}
```

Now I can compute the log marginal likelihood from the series of tempered chains, and compare with the analytical result.

```{r}
ppml(ell,vll,tempvec,FALSE)
ppml(ell,vll,tempvec,TRUE) # modified
boundml(ell,tempvec)
```
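
Because this example is conjugate, the power posterior at inverse temperature $t$ is itself a Gamma distribution, Gamma($\alpha+tn$,$\beta+tn\bar{x}$), so the expected log likelihood is available in closed form. The sketch below uses this to look at the discretisation error of the trapezoidal rule on its own, without any Monte Carlo error. It uses deterministic stand-in data (exponential quantiles rather than the simulated `x`), and all names ending in `_ti` are hypothetical.

```{r}
# Deterministic stand-in data (hypothetical): exponential quantiles, rate 3
x_ti <- qexp(ppoints(100),rate=3)
n_ti <- length(x_ti)
s_ti <- sum(x_ti)
a0 <- 1   # prior shape
b0 <- 1   # prior rate
temps_ti <- seq(0,1,by=0.01)^5
# Power posterior at inverse temperature t is Gamma(a0+t*n, b0+t*s), so
# E[log lambda] = digamma(shape)-log(rate) and E[lambda] = shape/rate.
shape_t <- a0+temps_ti*n_ti
rate_t <- b0+temps_ti*s_ti
ell_exact <- n_ti*(digamma(shape_t)-log(rate_t))-s_ti*shape_t/rate_t
# Standard trapezoidal rule over the temperature grid
lml_trap <- sum(diff(temps_ti)*(head(ell_exact,-1)+tail(ell_exact,-1))/2)
# Closed-form log marginal likelihood for comparison
lml_closed <- a0*log(b0)-lgamma(a0)+lgamma(a0+n_ti)-
  (a0+n_ti)*log(b0+s_ti)
c(trapezoid=lml_trap,exact=lml_closed)
```

Any remaining gap between the two numbers is due to the temperature grid alone, which gives a feel for how much of the error in the MCMC-based estimates might come from the integration scheme rather than from sampling.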

# Stepping stone

The [stepping stone approach](http://sysbio.oxfordjournals.org/content/early/2010/12/27/sysbio.syq085.short) also employs tempered distributions, but rather than integrating the expectation of the log likelihood, it employs importance sampling between adjacent tempered chains to calculate a series of ratios of normalising constants. The product of these ratios gives an estimate of the marginal likelihood. Here, I accumulate the logs of the ratios to estimate the log marginal likelihood.

```{r}
lrss <- 0
for(i in 2:numtemp){
  tempdiff <- tempvec[i]-tempvec[i-1]
  logmaxll <- max(pplist[[i]][[2]][(burnin+1):(burnin+niters)])
  oldll <- pplist[[i-1]][[2]][(burnin+1):(burnin+niters)]
  lrss <- lrss+tempdiff*logmaxll
  lrss <- lrss+log((1/length(oldll))*sum(exp(tempdiff*(oldll-logmaxll))))
}
lrss
```
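
The same calculation can be wrapped in a function that takes a list of untempered log likelihood samples, one per temperature. The following is a sketch under my own naming (`stepping_stone` is a hypothetical helper); it subtracts the maximum of the previous chain's log likelihoods before exponentiating, which is a common choice for numerical stability. To keep the check deterministic, evenly spaced quantiles of the exact Gamma power posteriors stand in for MCMC draws, which is an assumption made purely for illustration.

```{r}
# Stepping stone estimate of the log marginal likelihood.
# loglik_list[[k]] holds untempered log likelihoods sampled at temps[k];
# temps must increase from 0 to 1. (Hypothetical helper name.)
stepping_stone <- function(loglik_list,temps){
  lz <- 0
  for(k in 2:length(temps)){
    dt <- temps[k]-temps[k-1]
    ll_prev <- loglik_list[[k-1]]
    m <- max(ll_prev)   # subtract the maximum before exponentiating
    lz <- lz+dt*m+log(mean(exp(dt*(ll_prev-m))))
  }
  lz
}
# Deterministic check: quantiles of the exact Gamma power posteriors
# (prior Gamma(1,1)) stand in for MCMC draws.
x_ss <- qexp(ppoints(100),rate=3)
n_ss <- length(x_ss)
s_ss <- sum(x_ss)
temps_ss <- seq(0,1,by=0.01)^5
ll_ss <- lapply(temps_ss,function(temp_k){
  lam <- qgamma(ppoints(5000),shape=1+temp_k*n_ss,rate=1+temp_k*s_ss)
  n_ss*log(lam)-s_ss*lam
})
lml_ss <- stepping_stone(ll_ss,temps_ss)
lml_ss_exact <- lgamma(1+n_ss)-(1+n_ss)*log(1+s_ss)
c(stepping_stone=lml_ss,exact=lml_ss_exact)
```

With the tempered chains above, something like `stepping_stone(lapply(pplist,function(o) o[[2]][(burnin+1):(burnin+niters)]),tempvec)` should give an estimate comparable to `lrss`.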

The stepping stone estimator is biased for estimation of the log marginal likelihood, but overall appears to be slightly superior to the power posterior approach. However, as the computational burden is mostly due to the simulation of the tempered distributions, one can easily calculate both estimates.

# Computing Bayes factors

To compare two models, we can calculate a Bayes factor, which is the ratio of their two marginal likelihoods. For simplicity, let us compare two models for the data, in which both assume an exponential distribution, but with different prior values for $\alpha$ and $\beta$, say $\alpha_1=1,\beta_1=1$ versus $\alpha_2=2,\beta_2=0.5$. In principle, we could assume completely different distributions and compare those instead, but this example allows us to use the code above.

```{r}
alph2 <- 2
bet2 <- 0.5
```

The Bayes factor for this new model, compared to the old model, is about 3.05.

```{r}
lbf <- lml(x,alph2,bet2)-lml(x,alph,bet)
lbf # Log Bayes factor
exp(lbf) # Bayes factor
```

## Using marginal likelihoods separately

```{r}
pplist2 <- list()
for(i in 1:numtemp){
  out <- met.temper(x,lambda0,alph2,bet2,sigma,tempvec[i],niters+burnin)
  pplist2[[i]] <- out
}
```

Now I can compare the Bayes factors obtained through simulation. Firstly, using power posteriors.

```{r}
ell2 <- rep(0,numtemp)
vll2 <- rep(0,numtemp)
for(i in 1:numtemp){
  out <- pplist2[[i]]
  # Discard the first burnin samples and keep the remaining niters
  ell2[i] <- mean(out[[2]][(burnin+1):(burnin+niters)])
  vll2[i] <- var(out[[2]][(burnin+1):(burnin+niters)])
}
lbf.pp <- ppml(ell2,vll2,tempvec,TRUE)-ppml(ell,vll,tempvec,TRUE) # modified
lbf.pp
exp(lbf.pp)
```

Similarly, one could use a stepping stone approach to calculate the marginal likelihoods for each model, and hence the Bayes factor.

## Computing the Bayes factor directly

One of the problems with calculating the Bayes factor using two separate marginal likelihoods is that errors in estimating the individual marginal likelihoods combine. This can be a serious issue when the prior is vague and/or the difference in the marginal likelihood is small. Another approach is to compute a bridge between the posterior distributions of the two models, so-called ['model-switch integration'](http://sysbio.oxfordjournals.org/content/55/2/195.short). The difference between this and the previous case is that we sample from a tempered posterior distribution, and integrate the difference between the untempered posterior values.


```{r}
met.switch <- function(x,lambda0,alph1,bet1,alph2,bet2,sigma,temp,niters){
  lambdvec <- numeric(niters)
  lpost1vec <- numeric(niters)
  lpost2vec <- numeric(niters)
  lpostvec <- numeric(niters)
  lambd <- lambda0
  ll1 <- sum(dexp(x,lambd,log=TRUE))
  ll2 <- sum(dexp(x,lambd,log=TRUE))
  lp1 <- dgamma(lambd,shape=alph1,rate=bet1,log=TRUE)
  lp2 <- dgamma(lambd,shape=alph2,rate=bet2,log=TRUE)
  lpost1 <- ll1+lp1
  lpost2 <- ll2+lp2
  lpost <- (1-temp)*lpost1+temp*lpost2
  for(i in 1:niters){
    lambds <- lambd+rnorm(1,mean=0,sd=sigma)
    if(lambds>0){
      lls1 <- sum(dexp(x,lambds,log=TRUE))
      lls2 <- sum(dexp(x,lambds,log=TRUE))
      lps1 <- dgamma(lambds,shape=alph1,rate=bet1,log=TRUE)
      lps2 <- dgamma(lambds,shape=alph2,rate=bet2,log=TRUE)
      lposts1 <- lls1+lps1
      lposts2 <- lls2+lps2
      lposts <- (1-temp)*lposts1+temp*lposts2
      A <- exp(lposts-lpost)
      if(runif(1)<A){
        lambd <- lambds
        ll1 <- lls1
        ll2 <- lls2
        lp1 <- lps1
        lp2 <- lps2
        lpost1 <- lposts1
        lpost2 <- lposts2
        lpost <- lposts
      }
    }
    lambdvec[i] <- lambd
    lpost1vec[i] <- lpost1
    lpost2vec[i] <- lpost2
    lpostvec[i] <- lpost
  }
  return(list(lambdvec,lpost1vec,lpost2vec,lpostvec))
}
```

I run the tempered distributions, but now bridging between the two posterior distributions.

```{r}
switchlist <- list()
for(i in 1:numtemp){
  out <- met.switch(x,lambda0,alph,bet,alph2,bet2,sigma,tempvec[i],niters+burnin)
  switchlist[[i]] <- out
}
```

This tends to be more linear than for the standard power posterior/thermodynamic integration approach, so it may make more sense to space out the temperatures more uniformly between 0 and 1. For simplicity, I've just kept the same temperature regime I used previously.

To calculate the log Bayes factor, I first calculate the difference in the posterior probabilities for each model, across the series of chains.

```{r}
elpt <- rep(0,numtemp)
vlpt <- rep(0,numtemp)
for(i in 1:numtemp){
  out <- switchlist[[i]]
  keep <- (burnin+1):(burnin+niters) # discard the burn-in samples
  diffp <- out[[3]][keep]-out[[2]][keep]
  elpt[i] <- mean(diffp)
  vlpt[i] <- var(diffp)
}
```

I can then use the same code as before to work out the log Bayes factor.

```{r}
lbf.switch <- ppml(elpt,vlpt,tempvec,TRUE)
exp(lbf.switch)
boundml(elpt,tempvec)
```
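
For this conjugate example, the bridging distribution is again a Gamma distribution, with shape and rate interpolated between the two posteriors, so the expected difference in log posterior values can be written down exactly. The sketch below uses this to check the integration step on its own, using deterministic stand-in data (exponential quantiles, not the simulated `x`); the variable names are hypothetical.

```{r}
# Deterministic stand-in data (hypothetical): exponential quantiles, rate 3
x_ms <- qexp(ppoints(100),rate=3)
n_ms <- length(x_ms)
s_ms <- sum(x_ms)
a1 <- 1; b1 <- 1      # prior for the first model
a2 <- 2; b2 <- 0.5    # prior for the second model
temps_ms <- seq(0,1,by=0.01)^5
# Along the path the density is Gamma with interpolated shape and rate
shape_ms <- (1-temps_ms)*a1+temps_ms*a2+n_ms
rate_ms <- (1-temps_ms)*b1+temps_ms*b2+s_ms
elog <- digamma(shape_ms)-log(rate_ms)   # E[log lambda]
elam <- shape_ms/rate_ms                 # E[lambda]
# Expected difference in log(likelihood x prior); the likelihood cancels
const <- a2*log(b2)-lgamma(a2)-a1*log(b1)+lgamma(a1)
ediff <- const+(a2-a1)*elog-(b2-b1)*elam
lbf_trap <- sum(diff(temps_ms)*(head(ediff,-1)+tail(ediff,-1))/2)
# Exact log Bayes factor from the closed-form marginal likelihoods
lml_g <- function(a,b){
  a*log(b)-lgamma(a)+lgamma(a+n_ms)-(a+n_ms)*log(b+s_ms)
}
lbf_exact <- lml_g(a2,b2)-lml_g(a1,b1)
c(trapezoid=lbf_trap,exact=lbf_exact)
```

Plotting `ediff` against `temps_ms` is one way to see how close to linear the integrand is for this pair of models.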

Note how the model-switch integration gives an estimate much closer to the actual Bayes factor than using the difference between two marginal likelihoods.

I can also use a stepping stone approach.

```{r}
lbfss <- 0
out <- switchlist[[1]]
keep <- (burnin+1):(burnin+niters) # discard the burn-in samples
diffp <- out[[3]][keep]-out[[2]][keep]
for(i in 2:numtemp){
  olddiffp <- diffp
  tempdiff <- tempvec[i]-tempvec[i-1]
  out <- switchlist[[i]]
  diffp <- out[[3]][keep]-out[[2]][keep]
  logmaxp <- max(diffp)
  lbfss <- lbfss+tempdiff*logmaxp
  lbfss <- lbfss+log((1/length(diffp))*sum(exp(tempdiff*(olddiffp-logmaxp))))
}
lbfss
```

In another post, I'll demonstrate the same principles, but in Julia.