richarddmorey/binomial-beta.Rnw

## binomial-beta.Rnw
---
title: "Joint and conditional, binomial/beta example"
author: "Richard D. Morey"
date: "4 June 2016"
output: html_document
---

```{r,echo=FALSE,warning=FALSE}
library(rgl)
library(knitr)
library(shape)
knit_hooks$set(webgl = hook_webgl)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(cache = TRUE)
```

```{r}
N = 10
theta = seq(.001,.999,len=100)
y = 0:N
a = 2
b = 2

```

```{r}
jt.par = expand.grid(y=y,theta=theta)


joint.dens = Vectorize(
  function(y,theta) dbinom(y,N,theta)*dbeta(theta,a,b),
      c("y","theta"))

jt.par$dens = mapply(joint.dens,jt.par$y,jt.par$theta)
```

## The model

We assume that the data have a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) with a probability parameter $\theta$, and we will observe $N=`r N`$ binomial trials:
\[
y \sim \mbox{Binomial}(N,\theta)
\]
which implies that
\[
p(y\mid\theta) = \binom{N}{y}\theta^y(1-\theta)^{N-y}
\]

```{r}
thetastar=20
```
This function tells us the probability of various outcomes, assuming we knew the true $\theta$; for instance, the figure below shows the probability of each possible outcome from $y=0$ to $y=`r N`$, assuming that $\theta=`r round(theta[thetastar],2)`$:

```{r}
plot(y,dbinom(y,N,theta[thetastar]),col=y+1,pch=19,cex=1.4,ylab="Conditional density",xlab="y")
abline(h=0,col="gray")
mtext(paste0("theta = ",round(theta[thetastar],2)),3,.1,adj=1,cex=1.3)
```


Later on, it will be helpful to imagine that this is one of several possible sets of probabilities for $y$, embedded in a bivarate space with $y$ on one axis and $\theta$ on the other. The following 3D plot shows this (try rotating and zooming!):

```{r, webgl=TRUE}
plot3d(jt.par$y, jt.par$theta, jt.par$dens,
       xlab="y", ylab="theta",zlab="Joint density",
      col=y+1,
      alpha=jt.par$theta == theta[thetastar])
```

Every value of $\theta$ corresponds to a new set of probabilities for the outcomes. Low values of $\theta$ will yield higher probabilities for smaller numbers of successes, while high values of $\theta$ will yield higher probabilities for the larger numbers of successes. We can visualize this with the following video:

```{r ani1, fig.width=7, fig.height=6, fig.show='animate', interval=.3}
for (th in theta) {
  plot(y,dbinom(y,N,th),col=y+1,pch=19,cex=1.4,ylab="Conditional density",xlab="y",ylim=range(c(jt.par$dens,dbinom(0,N,theta[1]))))
abline(h=0,col="gray")
mtext(paste0("theta = ",round(th,2)),3,.1,adj=1,cex=1.3)
}
```

Of interest to us is how to make an inference about $\theta$ after we observe $y$ successes out of $N=`r N`$ trials.

## Bayes' Theorem

Every Bayesian analysis begins with Bayes' theorem. Since joint probability will be important to us, it will be helpful to think of Bayes theorem as a direct consequence of the definition of conditional probability:
\[
p(\theta\mid y) = \frac{p(\theta, y)}{p(y)}.
\]
This is simply the definition of conditional probability. It implies that
\[
p(\theta\mid y)p(y) = p(\theta, y) = p(y \mid \theta)p(\theta)
\]
which, of course, implies that
\[
p(\theta\mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}
\]
This is called Bayes' theorem. We begin with a "prior" distribution $p(\theta)$ that quantifies a "reasonable belief" about $\theta$ (in some sense) before the data, and then arrive at a "posterior" distribution $p(\theta\mid y)$ that quantifies the "reasonable believe" we have about $\theta$ after observing the data.

For demonstration, I will use the following prior distribution for $\theta$:

```{r}
plot(theta,dbeta(theta,a,b),col="darkgray",lwd=2,ylab="Prior density",xlab="theta",ty='l',ylim=c(0,1.5))
abline(h=0,col="gray")
mtext(paste0("beta(",a,",",b,")"),3,.1,adj=1,cex=1.3)
```

This quantifies a belief that $\theta$ is probably somewhere in the middle of the range and discounts extreme values, but not heavily so. It is a $\beta(`r a`,`r b`)$ [distribution](https://en.wikipedia.org/wiki/Beta_distribution), but this detail is not important; what matters is the shape of the curve.

If we look at the numerator on the right-hand side of Bayes theorem, we see the term $p(\theta\mid y)p(y)$. As we pointed out, this is simply the joint probability of $y$ and $\theta$, $p(\theta, y)$. Notice that this is simply the probabilities for all the outcomes, assuming a particular $\theta$, *multiplied by* the prior probability of that $\theta$. We can see how this is done in the video below. The faint points are the probabilities from the previous animation, before they have been multiplied by the prior. The solid points show the new probabilities after multiplication by the prior. Notice that when the prior probability is low, the points are pushed down toward 0; when the prior probability is high, the points are raised.

```{r ani2, fig.width=7, fig.height=8, fig.show='animate', interval=.3}
fadedcols = apply(sapply(y+1,col2rgb)/255,2,function(v) rgb(v[1],v[2],v[3],.2))
for (th in theta) {
  par(mfrow=c(2,1))
  plot(theta,dbeta(theta,a,b),col="darkgray",lwd=2,ylab="Prior density",xlab="theta",ty='l',ylim=c(0,1.5))
  abline(h=0,col="gray")
  mtext(paste0("beta(",a,",",b,")"),3,.1,adj=1,cex=1.3)
  points(th,dbeta(th,a,b),pch=21,col="blue")
  mtext(paste0("Prior density = ",round(dbeta(th,a,b),2)),3,.1,adj=0,cex=1.3)


  plot(y,dbinom(y,N,th),col=fadedcols,pch=19,cex=1.4,ylab="Conditional density",xlab="y",ylim=range(c(jt.par$dens,dbinom(0,N,theta[1]))))
  points(y,dbinom(y,N,th)*dbeta(th,a,b),col=y+1,pch=19,cex=1.4)
#  arrows(y,dbinom(y,N,th),y,dbinom(y,N,th)*dbeta(th,a,b),col=fadedcols)
  abline(h=0,col="gray")
mtext(paste0("theta = ",round(th,2)),3,.1,adj=1,cex=1.3)
mtext(paste0("Prior density = ",round(dbeta(th,a,b),2)),3,.1,adj=0,cex=1.3)

}
```

If we take all the points from the animation above -- that is, for all values of $\theta$ -- and put them into a bivariate graph with $\theta$ and $y$ on two axes, it looks like the following (try rotating and zooming!):

```{r, webgl=TRUE}
plot3d(jt.par$y, jt.par$theta, jt.par$dens,
       xlab="y", ylab="theta",zlab="Joint density",
      col=y+1)
for(ys in y)
lines3d(jt.par$theta*0+jt.par$y[ys+1],
        jt.par$theta[(1:length(theta)-1)*length(y)+ys+1],
        jt.par$dens[(1:length(theta)-1)*length(y)+ys+1],
        col=ys + 1, lwd=3)
```

This is the joint distribution of $y$ and $\theta$ which is notated or $p(y, \theta)$.

## The effect of an observation

```{r}
ystar = 1
```

By Bayes theorem, the effect of observing $y=`r ystar`$ would be to select a slice from the joint distribution above -- in particular, the red curve where $y=`r ystar`$. Selecting only this curve yields a curve over the possible values of $\theta$ (try rotating and zooming!):


```{r, webgl=TRUE}
py = integrate(function(theta) dbinom(ystar,N,theta)*dbeta(theta,a,b),0,1)[[1]]

plot3d(jt.par$y, jt.par$theta, jt.par$dens,
       xlab="y", ylab="theta",zlab="Joint density",
      col=y+1,alpha=0)

lines3d(jt.par$theta*0+jt.par$y[ystar+1],
        jt.par$theta[(1:length(theta)-1)*length(y)+ystar+1],
        jt.par$dens[(1:length(theta)-1)*length(y)+ystar+1],
        col=ystar + 1, lwd=3)
```

This curve is the joint distribution evaluated at all the values where $y=`r ystar`$. It can be more easily shown as a standard 2D plot:

```{r}
plot(theta,py*dbeta(theta,ystar+a,N-ystar+b),col=ystar+1,lwd=2,ylab="Joint density (slice)",xlab="theta",ty='l')
abline(h=0,col="gray")
mtext(paste0("y = ",ystar),3,.1,adj=1,cex=1.3)
mtext("Does not integrate to 1!",3,.1,adj=0,cex=1.3,col="red")
```

Notice, however, that this curve is a slice through the joint distribution; it does not integrate to 1. Remember that the definition of conditional probability (and Bayes' theorem) specifies that we need to divide by $p(y)$ to get the posterior distribution:
\[
p(\theta\mid y) = \frac{p(y,\theta)}{p(y)}
\]

What is $p(y)$? By definition, it is the constant that makes the curve integrate to 1 (that is, it makes the curve into a probability distribution). We can also think of it as the weighted average (weighted over all the possible $\theta$ values) probability of obtaining $y=`r ystar`$. In notation,
\[
\begin{eqnarray}
p(y)&=&\int_0^1 p(y\mid\theta)p(\theta)\,d\theta\\
&=&E_{p(\theta)}\left[p(y\mid\theta)\right]
\end{eqnarray}
\]
For our particular value observation, it is
\[
\begin{eqnarray}
p(y)&=&\int_0^1 \binom{`r N`}{`r ystar`}\theta^{`r ystar`}(1-\theta)^{`r N-ystar`}\frac{\theta^{`r a`-1}(1-\theta)^{`r b`-1}}{Be(`r a`,`r b`)}\,d\theta\\
&=& \int_0^1`r choose(N,ystar)`\times`r 1/beta(a,b)`\times\theta^{`r ystar`+`r a` - 1}(1-\theta)^{`r N-ystar`+`r b`-1}\,d\theta\\
&=&`r py`
\end{eqnarray}
\]

This can be easily understood through sampling in R.
```{r echo=TRUE}
## Number of simulations to do
M = 100000

## Sample from the prior; a and b were defined before
theta.sample = rbeta(M, a, b)

## Sample data, assuming the sampled prior
## N was defined before
y.sample = rbinom(M, N, theta.sample)

## What is the *marginal* probability of observing the y we observed?
## y was defined before
mean(y.sample == ystar)

```
The sampled number above may differ slightly from the exact one above, due to sampling variability.

## The posterior distribution

Now that we have our normalizing constant $p(y)$, we can simply take our curve and divide, which will yield a new curve that is a probability distribution:

```{r}
plot(theta,dbeta(theta,ystar+a,N-ystar+b),col=ystar+1,lwd=2,ylab="Conditional density",xlab="theta",ty='l')
abline(h=0,col="gray")
mtext(paste0("y = ",ystar),3,.1,adj=1,cex=1.3)
mtext("Integrates to 1.",3,.1,adj=0,cex=1.3,col="blue")
```

The curve above represents the posterior distribution for $\theta$, after having observed $y=`r ystar`$.
	---
	title: "Joint and conditional, binomial/beta example"
	author: "Richard D. Morey"
	date: "4 June 2016"
	output: html_document
	---

	```{r,echo=FALSE,warning=FALSE}
	library(rgl)
	library(knitr)
	library(shape)
	knit_hooks$set(webgl = hook_webgl)
	knitr::opts_chunk$set(echo = FALSE)
	knitr::opts_chunk$set(cache = TRUE)
	```

	```{r}
	N = 10
	theta = seq(.001,.999,len=100)
	y = 0:N
	a = 2
	b = 2

	```

	```{r}
	jt.par = expand.grid(y=y,theta=theta)


	joint.dens = Vectorize(
	function(y,theta) dbinom(y,N,theta)*dbeta(theta,a,b),
	c("y","theta"))

	jt.par$dens = mapply(joint.dens,jt.par$y,jt.par$theta)
	```

	## The model

	We assume that the data have a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) with a probability parameter $\theta$, and we will observe $N=`r N`$ binomial trials:
	\[
	y \sim \mbox{Binomial}(N,\theta)
	\]
	which implies that
	\[
	p(y\mid\theta) = \binom{N}{y}\theta^y(1-\theta)^{N-y}
	\]

	```{r}
	thetastar=20
	```
	This function tells us the probability of various outcomes, assuming we knew the true $\theta$; for instance, the figure below shows the probability of each possible outcome from $y=0$ to $y=`r N`$, assuming that $\theta=`r round(theta[thetastar],2)`$:

	```{r}
	plot(y,dbinom(y,N,theta[thetastar]),col=y+1,pch=19,cex=1.4,ylab="Conditional density",xlab="y")
	abline(h=0,col="gray")
	mtext(paste0("theta = ",round(theta[thetastar],2)),3,.1,adj=1,cex=1.3)
	```


	Later on, it will be helpful to imagine that this is one of several possible sets of probabilities for $y$, embedded in a bivarate space with $y$ on one axis and $\theta$ on the other. The following 3D plot shows this (try rotating and zooming!):

	```{r, webgl=TRUE}
	plot3d(jt.par$y, jt.par$theta, jt.par$dens,
	xlab="y", ylab="theta",zlab="Joint density",
	col=y+1,
	alpha=jt.par$theta == theta[thetastar])
	```

	Every value of $\theta$ corresponds to a new set of probabilities for the outcomes. Low values of $\theta$ will yield higher probabilities for smaller numbers of successes, while high values of $\theta$ will yield higher probabilities for the larger numbers of successes. We can visualize this with the following video:

	```{r ani1, fig.width=7, fig.height=6, fig.show='animate', interval=.3}
	for (th in theta) {
	plot(y,dbinom(y,N,th),col=y+1,pch=19,cex=1.4,ylab="Conditional density",xlab="y",ylim=range(c(jt.par$dens,dbinom(0,N,theta[1]))))
	abline(h=0,col="gray")
	mtext(paste0("theta = ",round(th,2)),3,.1,adj=1,cex=1.3)
	}
	```

	Of interest to us is how to make an inference about $\theta$ after we observe $y$ successes out of $N=`r N`$ trials.

	## Bayes' Theorem

	Every Bayesian analysis begins with Bayes' theorem. Since joint probability will be important to us, it will be helpful to think of Bayes theorem as a direct consequence of the definition of conditional probability:
	\[
	p(\theta\mid y) = \frac{p(\theta, y)}{p(y)}.
	\]
	This is simply the definition of conditional probability. It implies that
	\[
	p(\theta\mid y)p(y) = p(\theta, y) = p(y \mid \theta)p(\theta)
	\]
	which, of course, implies that
	\[
	p(\theta\mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}
	\]
	This is called Bayes' theorem. We begin with a "prior" distribution $p(\theta)$ that quantifies a "reasonable belief" about $\theta$ (in some sense) before the data, and then arrive at a "posterior" distribution $p(\theta\mid y)$ that quantifies the "reasonable believe" we have about $\theta$ after observing the data.

	For demonstration, I will use the following prior distribution for $\theta$:

	```{r}
	plot(theta,dbeta(theta,a,b),col="darkgray",lwd=2,ylab="Prior density",xlab="theta",ty='l',ylim=c(0,1.5))
	abline(h=0,col="gray")
	mtext(paste0("beta(",a,",",b,")"),3,.1,adj=1,cex=1.3)
	```

	This quantifies a belief that $\theta$ is probably somewhere in the middle of the range and discounts extreme values, but not heavily so. It is a $\beta(`r a`,`r b`)$ [distribution](https://en.wikipedia.org/wiki/Beta_distribution), but this detail is not important; what matters is the shape of the curve.

	If we look at the numerator on the right-hand side of Bayes theorem, we see the term $p(\theta\mid y)p(y)$. As we pointed out, this is simply the joint probability of $y$ and $\theta$, $p(\theta, y)$. Notice that this is simply the probabilities for all the outcomes, assuming a particular $\theta$, multiplied by the prior probability of that $\theta$. We can see how this is done in the video below. The faint points are the probabilities from the previous animation, before they have been multiplied by the prior. The solid points show the new probabilities after multiplication by the prior. Notice that when the prior probability is low, the points are pushed down toward 0; when the prior probability is high, the points are raised.

	```{r ani2, fig.width=7, fig.height=8, fig.show='animate', interval=.3}
	fadedcols = apply(sapply(y+1,col2rgb)/255,2,function(v) rgb(v[1],v[2],v[3],.2))
	for (th in theta) {
	par(mfrow=c(2,1))
	plot(theta,dbeta(theta,a,b),col="darkgray",lwd=2,ylab="Prior density",xlab="theta",ty='l',ylim=c(0,1.5))
	abline(h=0,col="gray")
	mtext(paste0("beta(",a,",",b,")"),3,.1,adj=1,cex=1.3)
	points(th,dbeta(th,a,b),pch=21,col="blue")
	mtext(paste0("Prior density = ",round(dbeta(th,a,b),2)),3,.1,adj=0,cex=1.3)


	plot(y,dbinom(y,N,th),col=fadedcols,pch=19,cex=1.4,ylab="Conditional density",xlab="y",ylim=range(c(jt.par$dens,dbinom(0,N,theta[1]))))
	points(y,dbinom(y,N,th)*dbeta(th,a,b),col=y+1,pch=19,cex=1.4)
	# arrows(y,dbinom(y,N,th),y,dbinom(y,N,th)*dbeta(th,a,b),col=fadedcols)
	abline(h=0,col="gray")
	mtext(paste0("theta = ",round(th,2)),3,.1,adj=1,cex=1.3)
	mtext(paste0("Prior density = ",round(dbeta(th,a,b),2)),3,.1,adj=0,cex=1.3)

	}
	```

	If we take all the points from the animation above -- that is, for all values of $\theta$ -- and put them into a bivariate graph with $\theta$ and $y$ on two axes, it looks like the following (try rotating and zooming!):

	```{r, webgl=TRUE}
	plot3d(jt.par$y, jt.par$theta, jt.par$dens,
	xlab="y", ylab="theta",zlab="Joint density",
	col=y+1)
	for(ys in y)
	lines3d(jt.par$theta*0+jt.par$y[ys+1],
	jt.par$theta[(1:length(theta)-1)*length(y)+ys+1],
	jt.par$dens[(1:length(theta)-1)*length(y)+ys+1],
	col=ys + 1, lwd=3)
	```

	This is the joint distribution of $y$ and $\theta$ which is notated or $p(y, \theta)$.

	## The effect of an observation

	```{r}
	ystar = 1
	```

	By Bayes theorem, the effect of observing $y=`r ystar`$ would be to select a slice from the joint distribution above -- in particular, the red curve where $y=`r ystar`$. Selecting only this curve yields a curve over the possible values of $\theta$ (try rotating and zooming!):


	```{r, webgl=TRUE}
	py = integrate(function(theta) dbinom(ystar,N,theta)*dbeta(theta,a,b),0,1)[[1]]

	plot3d(jt.par$y, jt.par$theta, jt.par$dens,
	xlab="y", ylab="theta",zlab="Joint density",
	col=y+1,alpha=0)

	lines3d(jt.par$theta*0+jt.par$y[ystar+1],
	jt.par$theta[(1:length(theta)-1)*length(y)+ystar+1],
	jt.par$dens[(1:length(theta)-1)*length(y)+ystar+1],
	col=ystar + 1, lwd=3)
	```

	This curve is the joint distribution evaluated at all the values where $y=`r ystar`$. It can be more easily shown as a standard 2D plot:

	```{r}
	plot(theta,py*dbeta(theta,ystar+a,N-ystar+b),col=ystar+1,lwd=2,ylab="Joint density (slice)",xlab="theta",ty='l')
	abline(h=0,col="gray")
	mtext(paste0("y = ",ystar),3,.1,adj=1,cex=1.3)
	mtext("Does not integrate to 1!",3,.1,adj=0,cex=1.3,col="red")
	```

	Notice, however, that this curve is a slice through the joint distribution; it does not integrate to 1. Remember that the definition of conditional probability (and Bayes' theorem) specifies that we need to divide by $p(y)$ to get the posterior distribution:
	\[
	p(\theta\mid y) = \frac{p(y,\theta)}{p(y)}
	\]

	What is $p(y)$? By definition, it is the constant that makes the curve integrate to 1 (that is, it makes the curve into a probability distribution). We can also think of it as the weighted average (weighted over all the possible $\theta$ values) probability of obtaining $y=`r ystar`$. In notation,
	\[
	\begin{eqnarray}
	p(y)&=&\int_0^1 p(y\mid\theta)p(\theta)\,d\theta\\
	&=&E_{p(\theta)}\left[p(y\mid\theta)\right]
	\end{eqnarray}
	\]
	For our particular value observation, it is
	\[
	\begin{eqnarray}
	p(y)&=&\int_0^1 \binom{`r N`}{`r ystar`}\theta^{`r ystar`}(1-\theta)^{`r N-ystar`}\frac{\theta^{`r a`-1}(1-\theta)^{`r b`-1}}{Be(`r a`,`r b`)}\,d\theta\\
	&=& \int_0^1`r choose(N,ystar)`\times`r 1/beta(a,b)`\times\theta^{`r ystar`+`r a` - 1}(1-\theta)^{`r N-ystar`+`r b`-1}\,d\theta\\
	&=&`r py`
	\end{eqnarray}
	\]

	This can be easily understood through sampling in R.
	```{r echo=TRUE}
	## Number of simulations to do
	M = 100000

	## Sample from the prior; a and b were defined before
	theta.sample = rbeta(M, a, b)

	## Sample data, assuming the sampled prior
	## N was defined before
	y.sample = rbinom(M, N, theta.sample)

	## What is the marginal probability of observing the y we observed?
	## y was defined before
	mean(y.sample == ystar)

	```
	The sampled number above may differ slightly from the exact one above, due to sampling variability.

	## The posterior distribution

	Now that we have our normalizing constant $p(y)$, we can simply take our curve and divide, which will yield a new curve that is a probability distribution:

	```{r}
	plot(theta,dbeta(theta,ystar+a,N-ystar+b),col=ystar+1,lwd=2,ylab="Conditional density",xlab="theta",ty='l')
	abline(h=0,col="gray")
	mtext(paste0("y = ",ystar),3,.1,adj=1,cex=1.3)
	mtext("Integrates to 1.",3,.1,adj=0,cex=1.3,col="blue")
	```

	The curve above represents the posterior distribution for $\theta$, after having observed $y=`r ystar`$.