Skip to content

Instantly share code, notes, and snippets.

@sims1253
Last active March 22, 2023 15:51
Show Gist options
  • Save sims1253/54cc09c2b1e404d6cee8e5619a62240f to your computer and use it in GitHub Desktop.
Save sims1253/54cc09c2b1e404d6cee8e5619a62240f to your computer and use it in GitHub Desktop.
---
title: "Calibration"
output: html_document
date: "2023-02-21"
---
---
title: "calibration"
format: html
editor: visual
---
# Setup
```{r}
library(brms)
library(bayesim)
library(bayesfam)
library(dplyr)
library(purrr)
library(patchwork)
library(ggplot2)
library(ggthemes)
set.seed(1235813)
```
```{r}
plot_dist <- function(dist, bounds, pars, prefix = "d", parnames = NULL,
package = NULL, user_theme = ggthemes::theme_tufte, ...) {
`%>%` <- dplyr::`%>%`
pos <- -1
if (!is.null(package)) {
pos <- asNamespace(package)
}
ddist <- get(paste0(prefix, dist), pos = pos, mode = "function")
df <- data.frame(x = seq(bounds[1], bounds[2], 0.001))
if (!is.null(parnames)) {
parnames <- paste0(parnames, " = ")
}
cnames <- rep(NA, length(pars))
for (i in seq_along(pars)) {
tmp <- do.call(ddist, c(list(df$x), pars[[i]], list(...)))
cnames[i] <- paste0("$", parnames, pars[[i]], "$", collapse = ", ")
df[paste0(parnames, pars[[i]], collapse = ", ")] <- tmp
}
df <- df %>%
tidyr::gather("pars", "dens", -x) %>%
dplyr::mutate(pars = factor(pars, unique(pars)))
gg <- ggplot2::ggplot(df, ggplot2::aes(x, dens, color = pars)) +
user_theme() +
ggplot2::geom_line(size = 1) +
ggplot2::scale_color_viridis_d(labels = unname(latex2exp::TeX(cnames))) +
ggplot2::labs(x = "x", y = "", color = "") +
ggplot2::theme(
axis.text.y = ggplot2::element_blank(),
axis.ticks.y = ggplot2::element_blank(),
axis.line.y = ggplot2::element_blank(),
legend.position = "bottom",
legend.text = ggplot2::element_text(size = 10)
)
if (prefix == "p") {
gg <- gg +
ggplot2::scale_y_continuous(breaks = c(0, 0.5, 1)) +
ggplot2::theme(
axis.ticks.y = ggplot2::element_line(),
axis.text.y = ggplot2::element_text(),
axis.line.y = ggplot2::element_line()
)
} else if (prefix == "q") {
gg <- gg +
ggplot2::scale_y_continuous() +
ggplot2::theme(
axis.ticks.y = ggplot2::element_line(),
axis.text.y = ggplot2::element_text(),
axis.line.y = ggplot2::element_line()
)
}
return(gg)
}
```
# Likelihoods
## Gamma
```{r}
p1 <- plot_dist(
dist = "gamma_mean",
bounds = c(0.00001, 20),
parnames = c("mu", "a"),
package = "bayesim",
pars = list(
c(1, 1),
c(10, 10),
c(10, 40)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Gamma") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p1
```
## Weibull
```{r}
p2 <- plot_dist(
dist = "weibull_median",
bounds = c(0.00001, 20),
parnames = c("mu", "k"),
package = "bayesim",
pars = list(
c(1, 1),
c(10, 4),
c(10, 8)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Weibull") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p2
```
## Lognormal
```{r}
p3 <- plot_dist(
dist = "lognormal",
bounds = c(0.00001, 20),
parnames = c("mu", "sigma"),
package = "bayesim",
pars = list(
c(log(1), 1),
c(log(10), 0.35),
c(log(10), 0.15)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Log-Normal") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p3
```
## Softplusnormal
```{r}
p4 <- plot_dist(
dist = "softplusnormal",
bounds = c(0.00001, 20),
parnames = c("mu", "sigma"),
package = "bayesim",
pars = list(
c(softplus(1), 2),
c(softplus(10), 4),
c(softplus(10), 2)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Softplus-Normal") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p4
```
## Frechet
```{r}
p5 <- plot_dist(
dist = "frechet_median",
bounds = c(0.0001, 20),
parnames = c("mu", "nu"),
package = "bayesim",
pars = list(
c(1, 2),
c(10, 5),
c(10, 10)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Fréchet") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p5
```
## Inverse Gaussian
```{r}
p6 <- plot_dist(
dist = "inv_gaussian",
bounds = c(0.0001, 20),
parnames = c("mu", "shape"),
package = "brms",
pars = list(
c(1, 1),
c(10, 10),
c(10, 100)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Inverse Gaussian") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p6
```
## Beta Prime
```{r}
p7 <- plot_dist(
dist = "betaprime",
bounds = c(0.0001, 20),
parnames = c("mu", "phi"),
package = "bayesim",
pars = list(
c(1, 1),
c(10, 20),
c(10, 50)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Beta prime") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p7
```
##Gompertz
```{r}
p8 <- plot_dist(
dist = "gompertz",
bounds = c(0.00001, 20),
parnames = c("mu", "eta"),
package = "bayesim",
pars = list(
c(1, 0.2),
c(10, 0.3),
c(10, 0.6)
),
user_theme = partial(theme_tufte, base_family = "")
) +
ggtitle("Gompertz") +
guides(color = "none") +
coord_cartesian(ylim = c(0, 0.4)) +
geom_vline(xintercept = 0, linetype = "dotted", color = "grey")
p8
```
## Combined
```{r}
(((p1 + p2)/(p5 + p6)/(p7 + p8) /(p3 + p4)) ) + theme(text=element_text(size=12))
ggsave("~/Pictures/dist_overvoew.png", width = 210, height = (297/4)*2.2, units = "mm", dpi = 300)
```
# Sim Setup
```{r}
RESULT_PATH <- "~/Documents/dr/foo"
NCORES <- 10 # physical cores -1 or less
CLUSTER_TYPE <- "FORK" # or PSOCK if you have Windows
SEED <- 1339
set.seed(SEED)
options(error = recover) # for easier debugging
DEBUG <- FALSE # unless you want every single simulation step written to disk
DATASET_N <- 10 # 5-10-ish for calibration, 200 for the final experiment
DATA_GEN_FUN = basedag_data
stan_pars <- list(
backend = "rstan",
cmdstan_path = NULL,
cmdstan_write_path = NULL,
warmup = 500,
iter = 1500,
chains = 1,
init = 0.1
)
metrics <- c(
"v_mean",
"v_sd",
"v_median",
"v_mad",
"v_pos_prob",
"v_quantiles",
"v_bias",
"v_rmse",
"v_mae",
"v_mse",
"v_true_percentile",
"divergent_transitions_rel",
"bad_pareto_ks",
"time_per_sample",
"rhat",
"ess_bulk",
"ess_tail",
"elpd_loo",
"elpd_test",
"rmse_loo",
"rmse_test",
"r2_loo",
"r2_test",
"data_gen",
"fit_gen"
)
VARS_OF_INTEREST = list(list(c("b_x")))
QUANTILES = list(list(seq(0.1, 0.9, length.out = 9)))
```
# Data Gen Setup
```{r}
data_generation_configuration <- expand.grid(
z1_x_coef = NA,
z1_y_coef = NA,
z2_y_coef = 0.5,
z3_x_coef = 0.8,
x_z4_coef = NA,
y_z4_coef = NA,
sigma_z1 = 0.5,
sigma_z2 = 0.5,
sigma_z3 = 0.5,
sigma_z4 = 0.5,
sigma_x = 0.5,
data_N = 100,
dataset_N = DATASET_N,
data_family = c(
"gamma",
"weibull",
"lognormal",
"softplusnormal",
"frechet",
"betaprime",
"gompertz"
),
data_link = c(
"log",
"softplus",
"identity"
),
lb = 0.000001,
ub = Inf,
resample = 1.3,
x_y_coef = c(NA, 0),
y_intercept = NA,
sigma_y = NA,
shape = c(
"ramp",
"asymmetric",
"symmetric"
),
noise_sd = 0.1,
stringsAsFactors = FALSE
)
data_generation_configuration <- filter(
data_generation_configuration,
!(data_link == "identity" &
data_family != "lognormal" &
data_family != "softplusnormal")
)
data_generation_configuration <- filter(
data_generation_configuration,
!(data_link != "identity" &
(data_family == "lognormal" | data_family == "softplusnormal"))
)
z1_x_coef_list <- list(
"log" = 0.6,
"softplus" = 1.2
)
z1_y_coef_list <- list(
"log" = 0.8,
"softplus" = 1.2
)
x_z4_coef_list <- list(
"log" = 0.5,
"softplus" = 0.5
)
y_z4_coef_list <- list(
"log" = 1,
"softplus" = 0.5
)
sigma_y_list <- list(
"gamma" = c(1, 10, 40),
"weibull" = c(1, 4, 8),
"lognormal" = c(1, 0.35, 0.15),
"softplusnormal" = c(2, 4, 2),
"frechet" = c(2, 5, 10),
"inverse.gaussian" = c(1, 10, 1000),
"betaprime" = c(1, 20, 50),
"gompertz" = c(0.2, 0.3, 0.6)
)
y_intercept_list <- list(
"log" = log(c(1, 10, 10)),
"softplus" = softplus(c(1, 10, 10))
)
x_y_coef_list <- list(
"log" = list(
"ramp" = 0.5,
"asymmetric" = 0.2,
"symmetric" = 0.1
),
"softplus" = list(
"ramp" = 0.9,
"asymmetric" = 1.4,
"symmetric" = 0.8
)
)
for (i in seq_len(nrow(data_generation_configuration))) {
family <- data_generation_configuration$data_family[[i]]
shape <- data_generation_configuration$shape[[i]]
switch(family,
"lognormal" = link <- "log",
"softplusnormal" = link <- "softplus",
link <- data_generation_configuration$data_link[[i]]
)
data_generation_configuration$z1_x_coef[[i]] <- z1_x_coef_list[[link]]
data_generation_configuration$z1_y_coef[[i]] <- z1_y_coef_list[[link]]
data_generation_configuration$x_z4_coef[[i]] <- x_z4_coef_list[[link]]
data_generation_configuration$y_z4_coef[[i]] <- y_z4_coef_list[[link]]
if (is.na(data_generation_configuration$x_y_coef[[i]])) {
data_generation_configuration$x_y_coef[[i]] <- x_y_coef_list[[link]][[shape]]
}
if (shape == "ramp") {
data_generation_configuration$sigma_y[[i]] <- sigma_y_list[[family]][[1]]
data_generation_configuration$y_intercept[[i]] <- y_intercept_list[[link]][[1]]
}
if (shape == "asymmetric") {
data_generation_configuration$sigma_y[[i]] <- sigma_y_list[[family]][[2]]
data_generation_configuration$y_intercept[[i]] <- y_intercept_list[[link]][[2]]
}
if (shape == "symmetric") {
data_generation_configuration$sigma_y[[i]] <- sigma_y_list[[family]][[3]]
data_generation_configuration$y_intercept[[i]] <- y_intercept_list[[link]][[3]]
}
}
data_generation_configuration$id <- as.numeric(
rownames(data_generation_configuration)
)
```
# Fit Gen Setup
```{r}
fit_configuration <- expand.grid(
fit_family = c(
"gompertz",
"gamma",
"weibull",
"lognormal",
"softplusnormal",
"frechet",
"betaprime",
"gaussian"
),
fit_link = c(
"log",
"softplus",
"identity"
),
formula = c(
"y ~ x + z1 + z2",
"y ~ x + z2",
"y ~ x + z1",
"y ~ x + z1 + z2 + z3",
"y ~ x + z1 + z2 + z4"
),
stringsAsFactors = FALSE
)
fit_configuration <- filter(
fit_configuration,
!(fit_link == "identity" &
fit_family != "gaussian" &
fit_family != "lognormal" &
fit_family != "softplusnormal" &
fit_family != "lognormal_custom")
)
fit_configuration <- filter(
fit_configuration,
!(fit_link != "identity" &
(fit_family == "lognormal" |
fit_family == "softplusnormal" |
fit_family == "lognormal_custom"))
)
fit_configuration$prior = c() # should stay empty for this
```
# Simulation
```{r}
start_time <- Sys.time()
result_df <- full_simulation(
data_gen_confs = data_generation_configuration,
data_gen_fun = DATA_GEN_FUN,
fit_confs = fit_configuration,
metrics = metrics,
ncores_simulation = 1,
cluster_type = CLUSTER_TYPE,
stan_pars = stan_pars,
seed = SEED,
result_path = RESULT_PATH,
debug = DEBUG,
vars_of_interest = VARS_OF_INTEREST,
quantiles = QUANTILES
)
end_time <- Sys.time()
print(end_time - start_time)
```
```{r}
cols <- c("formula", "data_family_transformed", "data_link_transformed", "shape", "zero_effect")
p1 <- result_df %>%
filter(fit_link_transformed == data_link_transformed,
fit_family_transformed == data_family_transformed) %>%
filter(divergents == 0) %>%
filter(rhat <= 1.01) %>%
filter(ess_bulk > 400) %>%
filter(ess_tail > 400) %>%
mutate(sig95 = as.numeric(pq_0.025 > 0 | pq_0.975 < 0),
zero_effect = ifelse(x_y_coef != 0, "FPR", "Power")) %>%
group_by(across(all_of(cols))) %>%
summarise(sig95 = mean(sig95)) %>%
ggplot() +
geom_bar(aes(data_family_transformed, x = sig95), stat = "identity") +
geom_vline(xintercept = 0.05) +
geom_vline(xintercept = 0.95) +
facet_grid(zero_effect + shape ~ formula + data_link_transformed) +
ggtitle("95% CI coverage of true x_y_coef") +
ylab("Formula") +
xlab("CI Calibration")
p1
```
Explanation:
- If the true effect (x_y_coef) = 0, having no 0 in the CI (sig95) is a false positive (or alpha error)
- If the true effect (x_y_coef) != 0, having no 0 in the CI (sig95) is a true positive (or power)
- As we use 95% CIs, we would like the alpha error to be 5% and the power to be 95%. This won't perfectly work but we try to kinda get there. Mainly by changing the coefficients of the individual parameters, eg. xy_coef.
- As this can only be fairly measured, we only compare models where link and family were the same as the data generating ones)
- We also know that some formulas are causally misspecified leading to reduced precision and bias.
- y~x+z1+z2 is the ideal formula. We want our parameters to be such that we get the 5% FPR (false positive rate) and 95% power. Larger effects are easier to find, but they exist in relation to the rest of the model so we have to find a balance between the effect of interest (xy_coef) and all the z-coefs. We used something between 0.2 and 0.8 in the other experiments, as a guide for the order of magnitute.
- y~x + z1 and y~ x+ z1 + z2 + z3 are unbiased but have less precision. We want them to be close to the 5% and 95% but they should naturally perform slightly worse (ie. higher FPR and lower power) than the ideal formula. No fiddling should be necessary after the ideal formula works.
- y ~ x + z2 and y ~ x + z1 + z2 + z4 are biased formulas. Here everything goes bad. We want them to show higher FPR than the oder three formulas. This will require fiddling with the related coefficients of z1 and z4. However we do not want either FPR or power to be 0 or 1, as we loose information about how good/bad we are when running into the boundaries. If in doubt, try aiming for an FPR over 20% and a power over 70% but don't sweat it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment