---
title: "an example of model comparison by WAIC (draft)"
author: "Katsushi Kagaya"
date: "`r Sys.time()`"
output:
  html_document: default
---
# motivation and aims
How to assess the appropriateness of a statistical model is a challenge in data analysis. The information criterion WAIC (Widely Applicable Information Criterion) is designed to assess the appropriateness of statistical models constructed arbitrarily by scientists, in terms of predictability (Watanabe, 2018). It can be understood as an extension of AIC (Akaike, 1974) to Bayesian inference, and it is applicable to models whose posterior distribution is not close to any normal distribution. Furthermore, WAIC is rigorously supported by mathematical theory (Watanabe, 2010). Here we try to evaluate relatively simple hierarchical models using WAIC.
The aims of this document:
- generate data from a hierarchical (multi-level) statistical model
- build candidate models and fit them to the data
- assess the appropriateness of the candidate models with WAIC
- show unsuccessful and successful cases of model selection
- plot predictive distributions on each candidate model
I will use the same symbols as in Prof. Watanabe's book (Watanabe, 2018).
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# load libraries
```{r, message=FALSE}
library(tidyverse)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```
# define functions
```{r}
check_rhat <- function(fit){
  # check the convergence of the MCMC samples
  all(summary(fit)$summary[,"Rhat"] <= 1.10, na.rm=TRUE)
}
get_waic <- function(stanfit){
  # compute WAIC from the object returned by Stan
  # the log likelihood must be named 'log_lik'
  waic <- function(log_lik) {
    Tn <- - mean(log(colMeans(exp(log_lik))))
    fV <- mean(colMeans(log_lik^2) - colMeans(log_lik)^2)
    waic <- Tn + fV
    waic
  }
  stanfit %>% rstan::extract() %>% .$log_lik %>% waic()
}
nth_pred_Y <- function(pY, n){
  # return the n-th highest-density point (x, density) of the kernel density of pY
  d <- density(pY)
  nthY <- c()
  nthY[1] <- d$y %>% order(decreasing=TRUE) %>% nth(n) %>% d$x[.]
  nthY[2] <- d$y %>% order(decreasing=TRUE) %>% nth(n) %>% d$y[.]
  nthY
}
generate_data <- function(N, G, SD, mu_a, sd_a, mu_b, sd_b, SEED, Xrange=c(0,20)){
  # generate data from a 'true' linear hierarchical model q(x)
  set.seed(SEED)
  a <- rnorm(G, mean=mu_a, sd=sd_a)
  b <- rnorm(G, mean=mu_b, sd=sd_b)
  X <- matrix(nrow=N, ncol=G)
  for (g in 1:G) {
    X[,g] <- sort(runif(N, Xrange[1], Xrange[2]))
  }
  Mu <- matrix(nrow=N, ncol=G)
  for (g in 1:G){
    Mu[,g] <- a[g] * X[,g] + b[g]
  }
  Y <- matrix(nrow=N, ncol=G)
  for (g in 1:G){
    Y[,g] <- Mu[,g] + rnorm(N, mean=0, sd=SD)
  }
  df <- tibble(X=as.vector(X),
               Y=as.vector(Y),
               ID=rep(1:G, each=N),
               N=N,
               G=G)
  df
}
```
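The quantity returned by 'get_waic' is the sample-averaged form of WAIC in the notation of Watanabe's book,
$$
\mathrm{WAIC} = T_n + \frac{V_n}{n}, \qquad
T_n = -\frac{1}{n}\sum_{i=1}^{n}\log \mathbb{E}_w\left[p(X_i|w)\right], \qquad
\frac{V_n}{n} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{V}_w\left[\log p(X_i|w)\right],
$$
where $\mathbb{E}_w[\cdot]$ and $\mathbb{V}_w[\cdot]$ denote the mean and variance over the posterior draws; 'Tn' and 'fV' in the code are the Monte Carlo estimates of these two terms.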
# prepare data
```{r}
df <- generate_data(N=15, G=5, SD=5,
mu_a=0.001, sd_a=0.01,
mu_b=5, sd_b=11,
SEED=2579)
data <- list(
N=nrow(df),
G=df$G %>% unique(),
X=df$X,
Y=df$Y,
ID=df$ID,
pX=seq(0, 20, length.out=100))
tdf <- generate_data(N=1000, G=100, SD=5,
mu_a=0.001, sd_a=0.01,
mu_b=5, sd_b=11,
SEED=2579)
ggplot(df) +
geom_point(aes(X, Y, shape=factor(ID))) +
geom_point(aes(X, Y), pch=16, size=0.1, alpha=0.1, data=tdf) +
geom_abline(aes(slope=0.001, intercept=5), alpha=0.2) +
theme_classic() +
scale_shape_manual(values=c(1:data$G))
```
The data frame 'df' is the generated data used for the modeling and for the visualization. The list 'data' is the input to the Stan scripts; it is read in the 'data block' of the Stan code below. 'tdf' is a larger generated data set, plotted as many tiny dots to show the approximate shape of the true distribution.
Note that the parameters are set artificially and the realized data are generated from the 'true' distribution $q(x)$ using a pseudo random number generator. In this sense, this is an artificial counterpart of a real-world analysis: in usual data analysis we do not know the shape of $q(x)$; the sample $X^n=\{X_1,X_2,...,X_n\}$, i.i.d., is simply given as a set of random variables from $q(x)$. Here we regard the simulation as appropriate and pretend not to know the 'true' $q(x)$. We will try to infer $q(x)$ using the sample $X^n$ and a statistical model $p(x|w)$ with a tunable parameter set $W$ treated as a random variable.
We define the posterior distribution $p(w|X^n)$ as,
$$
p(w|X^n) = \frac{\varphi(w) \cdot \prod_{i=1}^{n} p(X_i|w)}{\int \varphi(w) \cdot \prod_{i=1}^{n} p(X_i|w)\mathrm{d}w},
$$
where $\varphi(w)$ is a prior distribution of the parameter $W$ as a random variable.
Furthermore, we are going to construct the predictive distribution $p(x|X^n)$ defined as,
$$
p(x|X^n) = \int p(x|w)p(w|X^n)\mathrm{d}w.
$$
The $p(x|X^n)$ can be understood as an average of the model $p(x|w)$ over the posterior distribution $p(w|X^n)$. The degree to which $p(x|X^n)$ approximates $q(x)$ (the generalization error, a KL divergence) will be inferred by WAIC.
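As a side note, here is a minimal sketch (not part of the analysis below) of how the posterior averaging in $p(x|X^n)$ can be computed directly from posterior draws; 'mu_draws' and 's_draws' are hypothetical vectors of draws for a simple normal model $p(x|w)=N(x|\mu,\sigma)$:
```{r, eval=FALSE}
# approximate the predictive density p(x0 | X^n) at a point x0
# by averaging p(x0 | w) over posterior draws w = (mu, sigma)
pred_density_at <- function(x0, mu_draws, s_draws) {
  mean(dnorm(x0, mean = mu_draws, sd = s_draws))
}
# e.g. with fake draws:
# mu_draws <- rnorm(4000, 0, 0.1); s_draws <- abs(rnorm(4000, 1, 0.1))
# pred_density_at(0.5, mu_draws, s_draws)
```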
# plot generated data for each group
```{r}
ggplot(df) +
geom_point(aes(X, Y, shape=factor(ID))) +
facet_wrap(~ID) +
theme_classic() +
scale_shape_manual(values=c(1:data$G))
```
# 1 unsuccessful case
## hierarchical model (two hierarchical parameters)
### Stan code
```{stan, output.var='dump', eval=FALSE}
functions {
  // part not related to the data X, Y
  // target function p1(x)
  real f(real mu, real s, real x){
    return(normal_lpdf(x | mu, s));
  }
  real log_lik_Simpson_f(real mu, real s, real lower, real upper, int M){
    vector[M+1] lp;
    real h;
    h = (upper - lower)/M;
    lp[1] = f(mu, s, lower);
    for (m in 1:(M/2))
      lp[2*m] = log(4) + f(mu, s, lower + h*(2*m-1));
    for (m in 1:(M/2-1))
      lp[2*m+1] = log(2) + f(mu, s, lower + h*2*m);
    lp[M+1] = f(mu, s, upper);
    return(log(h/3) + log_sum_exp(lp));
  }
  // part related to the data (X, Y); loop over 1:N; double numerical integration for a and b
  // target function p2(x)
  real f2(real Y, real X, real s, real ak, real bk){
    return(normal_lpdf(Y | ak*X+bk, s));
  }
  // integration for a
  real log_lik_Simpson_f2(real Y, real X, real s,
                          real a_lower, real a_upper,
                          real bk, int M){
    vector[M+1] lp;
    real h;
    h = (a_upper - a_lower)/M;
    lp[1] = f2(Y, X, s, a_lower, bk);
    for (m in 1:(M/2))
      lp[2*m] = log(4) + f2(Y, X, s, a_lower + h*(2*m-1), bk);
    for (m in 1:(M/2-1))
      lp[2*m+1] = log(2) + f2(Y, X, s, a_lower + h*2*m, bk);
    lp[M+1] = f2(Y, X, s, a_upper, bk);
    return(log(h/3) + log_sum_exp(lp));
  }
  // integration for b
  real log_lik_Simpson_rest(real Y, real X, real s,
                            real a_lower, real a_upper,
                            real b_lower, real b_upper, int M){
    vector[M+1] lp;
    real h;
    h = (b_upper - b_lower)/M;
    lp[1] = log_lik_Simpson_f2(Y, X, s, a_lower, a_upper, b_lower, M);
    for (m in 1:(M/2))
      lp[2*m] = log(4) + log_lik_Simpson_f2(Y, X, s, a_lower, a_upper, b_lower + h*(2*m-1), M);
    for (m in 1:(M/2-1))
      lp[2*m+1] = log(2) + log_lik_Simpson_f2(Y, X, s, a_lower, a_upper, b_lower + h*2*m, M);
    lp[M+1] = log_lik_Simpson_f2(Y, X, s, a_lower, a_upper, b_upper, M);
    return(log(h/3) + log_sum_exp(lp));
  }
}
data {
  int N;
  int G;
  real X[N];
  real Y[N];
  int<lower=1, upper=G> ID[N];
  real pX[100];
}
parameters {
  real a[G];
  real b[G];
  real a0;
  real b0;
  real<lower=0> as;
  real<lower=0> bs;
  real<lower=0> s;
}
model {
  for (g in 1:G){
    a[g] ~ normal(a0, as);
    b[g] ~ normal(b0, bs);
  }
  for (n in 1:N) {
    Y[n] ~ normal(a[ID[n]]*X[n] + b[ID[n]], s);
  }
}
generated quantities {
  real pY[100];
  real pa;
  real pb;
  real log_lik[N];
  real log_lik_a;
  real log_lik_b;
  int M;
  M = 100; // number of divisions for integration
  pa = normal_rng(a0, as);
  pb = normal_rng(b0, bs);
  for (i in 1:100)
    pY[i] = normal_rng(pa*pX[i]+pb, s);
  log_lik_a = log_lik_Simpson_f(a0, as, a0-5*as, a0+5*as, M);
  log_lik_b = log_lik_Simpson_f(b0, bs, b0-5*bs, b0+5*bs, M);
  for (n in 1:N){
    log_lik[n] = log_lik_a + log_lik_b +
                 log_lik_Simpson_rest(Y[n], X[n], s,
                                      a0-5*as, a0+5*as,
                                      b0-5*bs, b0+5*bs, M);
  }
}
```
The pairs $\{x_i, Y_i\}, i=1,\dots,N$ are the data. Here we assume that the $Y_i$ are random variables while the $x_i$ are given. The parameters are $\{a, a_0, a_s, b, b_0, b_s, s\}$. Because we want to compare the hierarchical model with non-hierarchical models (defined later) in terms of prediction, we focus on the prediction of obtaining a new group and a new pair $\{x_{new}, Y_{new}\}$ (Watanabe, 2018; McElreath, 2016). Therefore, to build this predictive distribution from the joint posterior draws of $p(w|x^n, Y^n)$, we need to integrate out the group-level parameters $\{a, b\}$. We then generate $Y_{new}$ from the predictive distribution $p(y|x_{new}, x^n, Y^n)$ built as
$$
p(y|x_{new}, x^n, Y^n) = \int p(y|x_{new}, w)\,p(w|x^n, Y^n)\,\mathrm{d}w \\
= \int \left\{ \int\!\!\int N(y \mid a\,x_{new}+b,\ s)\,N(a \mid a_0, a_s)\,N(b \mid b_0, b_s)\,\mathrm{d}a\,\mathrm{d}b \right\} p(w' \mid x^n, Y^n)\,\mathrm{d}w',
$$
where $w'=\{a_0, a_s, b_0, b_s, s\}$.
We perform the numerical integration of the part enclosed in braces using [Simpson's rule](https://en.wikipedia.org/wiki/Simpson%27s_rule); the computation is implemented in the functions block. The range of each integration is set to $\pm 5$ standard deviations around the mean of the corresponding normal distribution (a 10-sigma-wide range).
The realized values 'pY' of the random variable $Y_{new}$ are generated in the generated quantities block, using newly supplied inputs $x_{new}$ ('pX' in the code) and the posterior draws of the parameters $w'=\{a_0, a_s, b_0, b_s, s\}$. Note that the prior distribution $\varphi(w)$ is left at Stan's default non-informative (improper) uniform distribution.
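As a quick sanity check of this kind of marginalization (my own sketch in R, not a port of the Stan functions above), Simpson's rule in log space can be compared with the analytical answer for the one-dimensional case, since $\int N(y \mid b, s)\,N(b \mid b_0, b_s)\,\mathrm{d}b = N\!\left(y \mid b_0, \sqrt{s^2 + b_s^2}\right)$:
```{r, eval=FALSE}
# Simpson's rule in log space: log of the integral of N(y|b,s) * N(b|b0,bs) over b
log_simpson_marginal <- function(y, s, b0, bs, M = 100) {
  lower <- b0 - 5 * bs                     # same +/- 5 sigma range as the Stan code
  upper <- b0 + 5 * bs
  h <- (upper - lower) / M
  b <- lower + h * (0:M)
  lp <- dnorm(y, mean = b, sd = s, log = TRUE) +
    dnorm(b, mean = b0, sd = bs, log = TRUE)
  w <- c(1, rep(c(4, 2), length.out = M - 1), 1)  # Simpson weights 1,4,2,...,2,4,1
  log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }
  log(h / 3) + log_sum_exp(lp + log(w))
}
log_simpson_marginal(3, s = 5, b0 = 5, bs = 11)        # numerical
dnorm(3, mean = 5, sd = sqrt(5^2 + 11^2), log = TRUE)  # analytical
```
With M = 100 divisions the two values agree closely.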
### sampling
To perform sampling, please save the Stan code above as "model_linear.stan" and uncomment the first 3 lines of the next chunk.
```{r}
# stanmodel <- stan_model(file="model_linear.stan") # in R
# fit <- sampling(stanmodel, data=data, iter=2000, warmup=500, seed=2870)
# save(fit, file="fit_model_linear.rdata")
load("fit_model_linear.rdata")
```
## alternative model (one hierarchical parameter)
### Stan code
```{stan, output.var='dump', eval=FALSE}
functions {
  // part not related to the data Y
  // target function p1(x)
  real f(real b0, real s0, real x){
    return(normal_lpdf(x | b0, s0));
  }
  real log_lik_Simpson_f(real b0, real s0, real lower, real upper, int M){
    vector[M+1] lp;
    real h;
    h = (upper - lower)/M;
    lp[1] = f(b0, s0, lower);
    for (m in 1:(M/2))
      lp[2*m] = log(4) + f(b0, s0, lower + h*(2*m-1));
    for (m in 1:(M/2-1))
      lp[2*m+1] = log(2) + f(b0, s0, lower + h*2*m);
    lp[M+1] = f(b0, s0, upper);
    return(log(h/3) + log_sum_exp(lp));
  }
  // part related to the data (Y); loop over 1:N; numerical integration for b
  // target function p2(x)
  real f2(real Y, real s, real bk){
    return(normal_lpdf(Y | bk, s));
  }
  // integration for b
  real log_lik_Simpson_f2(real Y, real s,
                          real b_lower, real b_upper,
                          int M){
    vector[M+1] lp;
    real h;
    h = (b_upper - b_lower)/M;
    lp[1] = f2(Y, s, b_lower);
    for (m in 1:(M/2))
      lp[2*m] = log(4) + f2(Y, s, b_lower + h*(2*m-1));
    for (m in 1:(M/2-1))
      lp[2*m+1] = log(2) + f2(Y, s, b_lower + h*2*m);
    lp[M+1] = f2(Y, s, b_upper);
    return(log(h/3) + log_sum_exp(lp));
  }
}
data {
  int N;
  int G;
  real X[N];
  real Y[N];
  int<lower=1, upper=G> ID[N];
  real pX[100];
}
parameters {
  real b[G];
  real b0;
  real<lower=0> bs;
  real<lower=0> s;
}
model {
  for (g in 1:G){
    b[g] ~ normal(b0, bs);
  }
  for (n in 1:N) {
    Y[n] ~ normal(b[ID[n]], s);
  }
}
generated quantities {
  real pY[100];
  real pb;
  real log_lik[N];
  real log_lik_b;
  int M;
  M = 100; // number of divisions for integration
  pb = normal_rng(b0, bs);
  for (i in 1:100)
    pY[i] = normal_rng(pb, s);
  log_lik_b = log_lik_Simpson_f(b0, bs, b0-5*bs, b0+5*bs, M);
  for (n in 1:N){
    log_lik[n] = log_lik_b +
                 log_lik_Simpson_f2(Y[n], s,
                                    b0-5*bs, b0+5*bs, M);
  }
}
```
This model is the previous model with the slope parameter $a$ removed; therefore we only need to integrate out $b$.
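Analogously to the previous model, the predictive distribution for a new group reduces to a one-dimensional marginalization,
$$
p(y \mid x^n, Y^n) = \int \left\{ \int N(y \mid b,\ s)\,N(b \mid b_0, b_s)\,\mathrm{d}b \right\} p(w' \mid x^n, Y^n)\,\mathrm{d}w',
$$
where $w'=\{b_0, b_s, s\}$, and the inner integral is again evaluated with Simpson's rule.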
### sampling
```{r}
# stanmodel_alt <- stan_model(file="model_linear_alt.stan")
# fit_alt <- sampling(stanmodel_alt, data=data, iter=2000, warmup=500, seed=2870)
# save(fit_alt, file="fit_model_linear_alt.rdata")
load("fit_model_linear_alt.rdata")
```
## alternative model 2 (normal linear regression)
### Stan code
```{stan, output.var='dump', eval=FALSE}
data {
  int N;
  int G;
  real X[N];
  real Y[N];
  int<lower=1, upper=G> ID[N];
  real pX[100];
}
parameters {
  real a;
  real b;
  real<lower=0> s;
}
model {
  for (n in 1:N) {
    Y[n] ~ normal(a*X[n] + b, s);
  }
}
generated quantities {
  real pY[100];
  real log_lik[N];
  for (i in 1:100)
    pY[i] = normal_rng(a*pX[i]+b, s);
  for (n in 1:N){
    log_lik[n] = normal_lpdf(Y[n] | a*X[n] + b, s);
  }
}
```
### sampling
```{r}
# stanmodel_alt2 <- stan_model(file="model_linear_alt2.stan") # in R
# fit_alt2 <- sampling(stanmodel_alt2, data=data, iter=2000, warmup=500, seed=2870)
# save(fit_alt2, file="fit_model_linear_alt2.rdata")
load("fit_model_linear_alt2.rdata")
```
## alternative model 3 (constant model)
### Stan code
```{stan, output.var='dump', eval=FALSE}
data {
  int N;
  int G;
  real X[N];
  real Y[N];
  int<lower=1, upper=G> ID[N];
  real pX[100];
}
parameters {
  real b;
  real<lower=0> s;
}
model {
  for (n in 1:N) {
    Y[n] ~ normal(b, s);
  }
}
generated quantities {
  real pY[100];
  real log_lik[N];
  for (i in 1:100)
    pY[i] = normal_rng(b, s);
  for (n in 1:N){
    log_lik[n] = normal_lpdf(Y[n] | b, s);
  }
}
```
This is another non-hierarchical model, containing only the constant parameter $b$ (and the scale $s$).
### sampling
```{r}
# stanmodel_alt3 <- stan_model(file="model_linear_alt3.stan") # in R
# fit_alt3 <- sampling(stanmodel_alt3, data=data, iter=2000, warmup=500, seed=2870)
# save(fit_alt3, file="fit_model_linear_alt3.rdata")
load("fit_model_linear_alt3.rdata")
```
# WAIC
We compute the WAIC values to evaluate the four candidate models in terms of predictability. WAIC is an information criterion that estimates the prediction loss (cf. http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/waicwbic_e.html).
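More precisely, WAIC is constructed so that its expectation asymptotically matches that of the generalization loss
$$
G_n = -\int q(x)\,\log p(x \mid X^n)\,\mathrm{d}x,
$$
even for singular models (Watanabe, 2010); a smaller WAIC therefore indicates a predictive distribution that is expected to be closer to $q(x)$.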
```{r}
w0 <- get_waic(fit)
w1 <- get_waic(fit_alt)
w2 <- get_waic(fit_alt2)
w3 <- get_waic(fit_alt3)
wx <- c(w0, w1, w2, w3)
barplot(wx, ylab = "WAIC", names.arg=paste("model", 1:4))
wx
```
Model 2 performs best among the four models; thus the model selection fails here. Note, however, that WAIC is itself a random variable depending on the sample, so it fails only 'probabilistically'.
This is not a flaw specific to WAIC; it is a general limitation of statistical inference from a limited amount of data. As you may have noticed, I intentionally set the parameters 'mu_a' and 'sd_a' small so that WAIC fails to select model 1.
# predictions based on the four models
We extract the posterior parameter information.
```{r}
ms <- rstan::extract(fit)
ms1 <- rstan::extract(fit_alt)
ms2 <- rstan::extract(fit_alt2)
ms3 <- rstan::extract(fit_alt3)
```
## sample 10 points from posterior predictive distribution
The draws from the predictive distribution are stored as 'pY' in each fitted object. For each 'pX' position, we take the 10 highest-density points of the kernel density of the draws (in decreasing order of density) to overlay on the plots below; see the helper sketch after this paragraph.
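The four nearly identical chunks below could be wrapped in a small helper such as the following (a sketch only; the name 'get_pred_matrices' is hypothetical and is not used in the chunks below):
```{r, eval=FALSE}
# collect, for each of the 100 pX positions, the n_top highest-density points
# of the posterior predictive draws, and rescale the densities row-wise
get_pred_matrices <- function(pY_draws, n_top = 10) {
  n_x <- ncol(pY_draws)
  Y <- matrix(nrow = n_x, ncol = n_top)
  Z <- matrix(nrow = n_x, ncol = n_top)
  for (j in 1:n_top) {
    for (i in 1:n_x) {
      p <- nth_pred_Y(pY_draws[, i], j)
      Y[i, j] <- p[1]
      Z[i, j] <- p[2]
    }
  }
  Z <- Z / apply(Z, 1, max)  # normalize densities within each pX position
  list(Y = Y, Z = Z)
}
# e.g.: pred <- get_pred_matrices(ms$pY); pred$Y; pred$Z
```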
```{r}
pred_Y <- matrix(nrow=100, ncol=10)
pred_Z <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms$pY[,i], j)
pred_Y[i,j] <- pY[1]
pred_Z[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z[i,] <- pred_Z[i,] / max(pred_Z[i,])
}
pred_Y1 <- matrix(nrow=100, ncol=10)
pred_Z1 <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms1$pY[,i], j)
pred_Y1[i,j] <- pY[1]
pred_Z1[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z1[i,] <- pred_Z1[i,] / max(pred_Z1[i,])
}
pred_Y2 <- matrix(nrow=100, ncol=10)
pred_Z2 <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms2$pY[,i], j)
pred_Y2[i,j] <- pY[1]
pred_Z2[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z2[i,] <- pred_Z2[i,] / max(pred_Z2[i,])
}
pred_Y3 <- matrix(nrow=100, ncol=10)
pred_Z3 <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms3$pY[,i], j)
pred_Y3[i,j] <- pY[1]
pred_Z3[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z3[i,] <- pred_Z3[i,] / max(pred_Z3[i,])
}
```
## plot of model1 prediction
```{r}
plt0 <- ggplot(df) +
geom_point(aes(X, Y, shape=factor(ID))) +
geom_point(aes(X, Y), pch=16, size=0.1, alpha=0.1, data=tdf) +
geom_abline(aes(slope=0.001, intercept=5), alpha=0.2) +
theme_classic() +
scale_shape_manual(values=c(0:data$G))
plt <- plt0
for (j in 1:10) {
plt <- plt +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16,
data=tibble(X=data$pX, Y=pred_Y[,j], Z=pred_Z[,j]))
}
plt + scale_colour_gradient(low="white", high="blue")
```
The solid line indicates the 'true' line. The blue prediction plot does not appear to match the line.
## plot of model 2 prediction
```{r}
plt1 <- plt0
for (j in 1:10) {
plt1 <- plt1 +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16,
data=tibble(X=data$pX, Y=pred_Y1[,j], Z=pred_Z1[,j]))
}
plt1 + scale_colour_gradient(low="white", high="red")
```
The red prediction points overlap the true line a little more than the prediction of model 1 does.
## plot of model 3 prediction
```{r}
plt2 <- plt0
for (j in 1:10) {
plt2 <- plt2 +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16,
data=tibble(X=data$pX, Y=pred_Y2[,j], Z=pred_Z2[,j]))
}
plt2 + scale_colour_gradient(low="white", high="orange")
```
The uncertainty of the prediction is smaller than that of models 1 and 2. The slope of the prediction is steeper than that of the true line.
## plot of model 4 prediction
```{r}
plt3 <- plt0
for (j in 1:10) {
plt3 <- plt3 +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16,
data=tibble(X=data$pX, Y=pred_Y3[,j], Z=pred_Z3[,j]))
}
plt3 + scale_colour_gradient(low="white", high="green")
```
The uncertainty of the prediction is smaller than that of models 1 and 2.
# 2 successful case
Here we show a case in which the model selection succeeds for the sample. We set larger 'mu_a', 'sd_a', and 'SD'.
```{r}
df2 <- generate_data(N=15, G=5, SD=10,
mu_a=1, sd_a=1,
mu_b=5, sd_b=11,
SEED=2789)
tdf2 <- generate_data(N=1000, G=100, SD=10,
mu_a=1, sd_a=1,
mu_b=5, sd_b=11,
SEED=2789)
data2 <- list(
N=nrow(df2),
G=df2$G %>% unique(),
X=df2$X,
Y=df2$Y,
ID=df2$ID,
pX=seq(0, 20, length.out=100))
ggplot(df2) +
geom_point(aes(X, Y, shape=factor(ID))) +
geom_point(aes(X, Y), pch=16, size=0.1, alpha=0.1, data=tdf2) +
theme_classic() +
scale_shape_manual(values=c(1:data$G))
```
# generated data for each group
```{r}
ggplot(df2) +
geom_point(aes(X, Y, shape=factor(ID))) +
theme_classic() +
scale_shape_manual(values=c(0:data2$G)) +
facet_wrap(~ID)
```
### sampling (model 1)
```{r}
# stanmodel <- stan_model(file="model_linear.stan")
# fitS <- sampling(stanmodel, data=data2, iter=2000, warmup=500)
# save(fitS, file="fit_model_linearS.rdata")
load("fit_model_linearS.rdata")
```
### sampling (model 2)
```{r}
# stanmodel_alt <- stan_model(file="model_linear_alt.stan")
# fit_altS <- sampling(stanmodel_alt, data=data2, iter=2000, warmup=500)
# save(fit_altS, file="fit_model_linear_altS.rdata")
load("fit_model_linear_altS.rdata")
```
### sampling (model 3)
```{r}
# stanmodel_alt2 <- stan_model(file="model_linear_alt2.stan")
# fit_alt2S <- sampling(stanmodel_alt2, data=data2, iter=2000, warmup=500)
# save(fit_alt2S, file="fit_model_linear_alt2S.rdata")
load("fit_model_linear_alt2S.rdata")
```
### sampling (model 4)
```{r}
# stanmodel_alt3 <- stan_model(file="model_linear_alt3.stan")
# fit_alt3S <- sampling(stanmodel_alt3, data=data2, iter=2000, warmup=500)
# save(fit_alt3S, file="fit_model_linear_alt3S.rdata")
load("fit_model_linear_alt3S.rdata")
```
```{r}
w0S <- get_waic(fitS)
w1S <- get_waic(fit_altS)
w2S <- get_waic(fit_alt2S)
w3S <- get_waic(fit_alt3S)
wxS <- c(w0S, w1S, w2S, w3S)
barplot(wxS, ylab = "WAIC", names.arg=paste("model", 1:4))
wxS
```
Model 1 shows markedly better predictability (the smallest WAIC) compared with the other three models.
# predictions based on the four models
```{r}
msS <- rstan::extract(fitS)
ms1S <- rstan::extract(fit_altS)
ms2S <- rstan::extract(fit_alt2S)
ms3S <- rstan::extract(fit_alt3S)
```
## sample 10 points from posterior predictive distribution
```{r}
pred_YS <- matrix(nrow=100, ncol=10)
pred_ZS <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(msS$pY[,i], j)
pred_YS[i,j] <- pY[1]
pred_ZS[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_ZS[i,] <- pred_ZS[i,] / max(pred_ZS[i,])
}
pred_Y1S <- matrix(nrow=100, ncol=10)
pred_Z1S <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms1S$pY[,i], j)
pred_Y1S[i,j] <- pY[1]
pred_Z1S[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z1S[i,] <- pred_Z1S[i,] / max(pred_Z1S[i,])
}
pred_Y2S <- matrix(nrow=100, ncol=10)
pred_Z2S <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms2S$pY[,i], j)
pred_Y2S[i,j] <- pY[1]
pred_Z2S[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z2S[i,] <- pred_Z2S[i,] / max(pred_Z2S[i,])
}
pred_Y3S <- matrix(nrow=100, ncol=10)
pred_Z3S <- matrix(nrow=100, ncol=10)
for (j in 1:10) {
for (i in 1:100) {
pY <- nth_pred_Y(ms3S$pY[,i], j)
pred_Y3S[i,j] <- pY[1]
pred_Z3S[i,j] <- pY[2]
}
}
for (i in 1:100){
pred_Z3S[i,] <- pred_Z3S[i,] / max(pred_Z3S[i,])
}
```
## plot of model 1 prediction
```{r}
plt0S <- ggplot(df2) +
geom_point(aes(X, Y, shape=factor(ID))) +
geom_point(aes(X, Y), pch=16, size=0.1, alpha=0.1, data=tdf2) +
geom_abline(aes(slope=1, intercept=5), alpha=0.2) +
theme_classic() +
scale_shape_manual(values=c(0:data$G))
pltS <- plt0S
for (j in 1:10) {
pltS <- pltS +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16, size=1,
data=tibble(X=data$pX, Y=pred_YS[,j], Z=pred_ZS[,j]))
}
pltS + scale_colour_gradient(low="white", high="blue")
```
## plot of model 2 prediction
```{r}
plt1S <- plt0S
for (j in 1:10) {
plt1S <- plt1S +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16, size=1,
data=tibble(X=data$pX, Y=pred_Y1S[,j], Z=pred_Z1S[,j]))
}
plt1S + scale_colour_gradient(low="white", high="red")
```
## plot of model 3 prediction
```{r}
plt2S <- plt0S
for (j in 1:10) {
plt2S <- plt2S +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16, size=1,
data=tibble(X=data$pX, Y=pred_Y2S[,j], Z=pred_Z2S[,j]))
}
plt2S + scale_colour_gradient(low="white", high="orange")
```
## plot of model 4 prediction
```{r}
plt3S <- plt0S
for (j in 1:10) {
plt3S <- plt3S +
geom_point(aes(X, Y, colour=Z),
alpha=0.5,
pch=16, size=1,
data=tibble(X=data$pX, Y=pred_Y3S[,j], Z=pred_Z3S[,j]))
}
plt3S + scale_colour_gradient(low="white", high="green")
```
# next problems
- compute the KL divergence and $\mathrm{WAIC} - \left(-\frac{1}{n}\sum_{i=1}^n \log q(X_i)\right)$ to examine when WAIC fails in model selection depending on the variability of the data (see the sketch after this list), cf. [the advice by G. Kuroki san](https://mobile.twitter.com/genkuroki/status/1089151066019487744)
- compute other criteria such as LOO-CV, WBIC, and DIC as the sample size and/or the number of groups change
- change the parameters and see how the posterior and predictive distribution transform
- change the seeds for the random generators of MCMC sampling and see the effect
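Regarding the first item, since $q$ is known in this simulation, the quantity $\mathrm{WAIC} - \left(-\frac{1}{n}\sum_{i=1}^n \log q(X_i)\right)$ can already be sketched for model 1 in the unsuccessful case (my own sketch; it reuses 'w0' and the generating parameters of 'df', and marginalizes the group-level parameters analytically):
```{r, eval=FALSE}
# true marginal density of Y given X under the generating process:
# Y = a*X + b + e, a ~ N(mu_a, sd_a), b ~ N(mu_b, sd_b), e ~ N(0, SD),
# hence Y | X ~ N(mu_a*X + mu_b, sqrt(sd_a^2*X^2 + sd_b^2 + SD^2))
log_q <- function(y, x, mu_a, sd_a, mu_b, sd_b, SD) {
  dnorm(y, mean = mu_a * x + mu_b,
        sd = sqrt(sd_a^2 * x^2 + sd_b^2 + SD^2), log = TRUE)
}
emp_entropy <- -mean(log_q(df$Y, df$X, mu_a = 0.001, sd_a = 0.01,
                           mu_b = 5, sd_b = 11, SD = 5))
w0 - emp_entropy  # WAIC of model 1 minus the empirical entropy term
```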
# references
## articles
1. Watanabe, S. 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11: 3571–3594.
2. Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6): 716–723.
## books and websites
1. The support site of the book (Mathematical Theory of Bayesian Statistics, CRC Press, 2018) by Prof. Watanabe,
http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/mt_bs.html
2. The introduction of WAIC and WBIC by Prof. Watanabe,
http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/waicwbic_e.html
3. The behavior of another criterion, LOO-CV, for several models (including a simple hierarchical model) is examined in this blog post (in Japanese) by K. Matsuura san,
http://statmodeling.hatenablog.com/entry/comparison-of-LOOCV-and-WAIC
4. Matsuura, K. 2016. Bayesian Statistical Modeling Using Stan and R. Wonderful R Series, Volume 2. Kyoritsu Shuppan Co., Ltd. [in Japanese]
5. Kuroki, G. 2017-. https://nbviewer.jupyter.org/gist/genkuroki/8a342d0b7b249e279dd8ad6ae283c5db
6. McElreath, R., 2016, Statistical Rethinking, CRC Press
## versions of this document
The gist URL,
https://gist.github.com/kagaya/57178d9e81d114106f6ffc7dd1ebf190
# session info
```{r}
sessionInfo()
```