This gist shows how to estimate a doubly censored (i.e. daily data) and right truncated (i.e. due to epidemic phase) delay distribution using the brms package.
# Load packages
library(brms)
library(cmdstanr)
library(data.table) # here we use the development version of data.table; install it with data.table::update_dev_pkg()
library(purrr)
# Set up parallel cores
options(mc.cores = 4)
# Simulate some censored and truncated data
init_cases <- 100
growth_rate <- 0.1
max_t <- 20
samples <- 400
# note we actually won't end up with this many samples as some will be truncated
logmean <- 1.6
logsd <- 0.6
# Simulate the underlying outbreak structure assuming exponential growth
cases <- data.table(
  cases = (init_cases * exp(growth_rate * (1:max_t))) |>
    map_dbl(~ rpois(1, .)),
  time = 1:max_t
)
plot(cases$cases)
# Make a case line list
linelist <- cases |>
  DT(, .(id = 1:cases), by = time)
# Simulate the observation process for the line list
obs <- data.table(
  time = sample(linelist$time, samples, replace = FALSE),
  delay = rlnorm(samples, logmean, logsd)
) |>
  # Add a new ID
  DT(, id := 1:.N) |>
  # When the data would be observed
  DT(, obs_delay := time + delay) |>
  # Integerise the delay
  DT(, daily_delay := floor(delay)) |>
  # Day after the delay
  DT(, day_after_delay := ceiling(delay)) |>
  # Time observed for
  DT(, obs_time := max_t - time) |>
  # We don't know this exactly so need to censor it;
  # set to the midday point as the average across the day
  DT(, censored_obs_time := obs_time - 0.5) |>
  DT(, censored := "interval")
# Make event-based data for latent modelling
obs <- obs |>
  DT(, primary_event := floor(time)) |>
  DT(, secondary_event := floor(obs_delay)) |>
  DT(, max_t := max_t)
# Truncate observations
truncated_obs <- obs |>
  DT(obs_delay <= max_t)
double_truncated_obs <- truncated_obs |>
  # The lognormal family in brms does not support 0 so we also restrict to
  # daily delays of at least 1. This seems like it could be improved
  DT(daily_delay >= 1)
# Fit lognormal model with no corrections
naive_model <- brm(
  bf(daily_delay ~ 1, sigma ~ 1), data = double_truncated_obs,
  family = lognormal(), backend = "cmdstanr", adapt_delta = 0.9
)
# We see that the estimated log mean is biased by truncation;
# the sigma intercept needs to be exponentiated to return the log sd
summary(naive_model)
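# As a sketch (not part of the original gist, assuming the default brms naming
# of the distributional intercept), the log sd can be recovered with:
# exp(fixef(naive_model)["sigma_Intercept", "Estimate"])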
# Adjust for truncation
trunc_model <- brm(
  bf(daily_delay | trunc(lb = 1, ub = censored_obs_time) ~ 1, sigma ~ 1),
  data = double_truncated_obs, family = lognormal(),
  backend = "cmdstanr", adapt_delta = 0.9
)
# Getting closer to recovering our simulated estimates
summary(trunc_model)
# Correct for censoring
censor_model <- brm(
  bf(daily_delay | cens(censored, day_after_delay) ~ 1, sigma ~ 1),
  data = double_truncated_obs, family = lognormal(),
  backend = "cmdstanr", adapt_delta = 0.9
)
# Less close than truncation but better than naive model
summary(censor_model)
# Correct for double interval censoring and truncation
censor_trunc_model <- brm(
  bf(
    daily_delay | trunc(lb = 1, ub = censored_obs_time) +
      cens(censored, day_after_delay) ~ 1,
    sigma ~ 1
  ),
  data = double_truncated_obs, family = lognormal(), backend = "cmdstanr"
)
# This recovers the underlying distribution
# As the growth rate increases, and with short delays, we may still see a bias
# as we only have a censored observation time
summary(censor_trunc_model)
# Model censoring as a latent process (WIP)
# For this model we need to use a custom brms family and so
# the code is significantly more complex.
# Custom family for latent censoring and truncation
fit_latent_lognormal <- function(fn = brm, ...) {
  latent_lognormal <- custom_family(
    "latent_lognormal",
    dpars = c("mu", "sigma", "pwindow", "swindow"),
    links = c("identity", "log", "identity", "identity"),
    lb = c(NA, 0, 0, 0),
    ub = c(NA, NA, 1, 1),
    type = "real",
    vars = c("vreal1[n]", "vreal2[n]")
  )
  stan_funs <- "
  real latent_lognormal_lpdf(real y, real mu, real sigma, real pwindow,
                             real swindow, real sevent, real end_t) {
    real p = y + pwindow;
    real s = sevent + swindow;
    real d = s - p;
    real obs_time = end_t - p;
    return lognormal_lpdf(d | mu, sigma) - lognormal_lcdf(obs_time | mu, sigma);
  }
  "
  stanvars <- stanvar(block = "functions", scode = stan_funs)
  # Set up shared priors ---------------------------------------------------
  priors <- c(
    prior(uniform(0, 1), class = "b", dpar = "pwindow", lb = 0, ub = 1),
    prior(uniform(0, 1), class = "b", dpar = "swindow", lb = 0, ub = 1)
  )
  fit <- fn(family = latent_lognormal, stanvars = stanvars, prior = priors, ...)
  return(fit)
}
# Fit latent lognormal model
latent_model <- fit_latent_lognormal(
  bf(
    primary_event | vreal(secondary_event, max_t) ~ 1, sigma ~ 1,
    pwindow ~ 0 + as.factor(id), swindow ~ 0 + as.factor(id)
  ),
  data = truncated_obs, backend = "cmdstanr", fn = brm,
  adapt_delta = 0.95
)
# We should also see parameter recovery using this method, though
# run-times are much higher and the model is somewhat unstable.
summary(latent_model)
@seabbs commented Oct 17, 2022

The final model generates the following Stan code:

// generated with brms 2.18.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=-1,upper=2> cens[N];  // indicates censoring
  vector[N] rcens;  // right censor points for interval censoring
  real ub[N];  // upper truncation bounds
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real Intercept_sigma;  // temporary intercept for centered predictors
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += student_t_lpdf(Intercept | 3, 1.9, 2.5);
  lprior += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = rep_vector(0.0, N);
    mu += Intercept;
    sigma += Intercept_sigma;
    sigma = exp(sigma);
    for (n in 1:N) {
    // special treatment of censored data
      if (cens[n] == 0) {
        target += lognormal_lpdf(Y[n] | mu[n], sigma[n]) -
        lognormal_lcdf(ub[n] | mu[n], sigma[n]);
      } else if (cens[n] == 1) {
        target += lognormal_lccdf(Y[n] | mu[n], sigma[n]) -
        lognormal_lcdf(ub[n] | mu[n], sigma[n]);
      } else if (cens[n] == -1) {
        target += lognormal_lcdf(Y[n] | mu[n], sigma[n]) -
        lognormal_lcdf(ub[n] | mu[n], sigma[n]);
      } else if (cens[n] == 2) {
        target += log_diff_exp(
          lognormal_lcdf(rcens[n] | mu[n], sigma[n]),
          lognormal_lcdf(Y[n] | mu[n], sigma[n])
        ) -
        lognormal_lcdf(ub[n] | mu[n], sigma[n]);
      }
    }
  }
  // priors including constants
  target += lprior;
}

We are using "interval" censoring, which corresponds to the following part of the likelihood and matches our expectations:

target += log_diff_exp(
  lognormal_lcdf(rcens[n] | mu[n], sigma[n]),
  lognormal_lcdf(Y[n] | mu[n], sigma[n])
) - lognormal_lcdf(ub[n] | mu[n], sigma[n]);

As this is implemented in brms we can now estimate arbitrary delay models, for example across strata or with covariates. We could also swap out the distribution we are fitting or add other model structures as desired.
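A hedged sketch of what this might look like, assuming a hypothetical stratum column in the data (this is not code from the gist above):

strat_model <- brm(
  bf(
    daily_delay | trunc(lb = 1, ub = censored_obs_time) +
      cens(censored, day_after_delay) ~ 1 + stratum,
    sigma ~ 1 + stratum
  ),
  data = double_truncated_obs, family = lognormal(), backend = "cmdstanr"
)

Here both mu and sigma get a stratum effect; other brms formula components (random effects, splines, and so on) could be added in the same way, and the family swapped for another distribution.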

@seabbs commented Oct 18, 2022

Latent variable model Stan code:

// generated with brms 2.18.0
functions {
  real latent_lognormal_lpdf(real y, real mu, real sigma, real pwindow,
                             real swindow, real sevent, real end_t) {
    real p = y + pwindow;
    real s = sevent + swindow;
    real d = s - p;
    real obs_time = end_t - p;
    return lognormal_lpdf(d | mu, sigma)
           - lognormal_lcdf(obs_time | mu, sigma);
  }
}
data {
  int<lower=1> N; // total number of observations
  vector[N] Y; // response variable
  // data for custom real vectors
  array[N] real vreal1;
  // data for custom real vectors
  array[N] real vreal2;
  int<lower=1> K_pwindow; // number of population-level effects
  matrix[N, K_pwindow] X_pwindow; // population-level design matrix
  int<lower=1> K_swindow; // number of population-level effects
  matrix[N, K_swindow] X_swindow; // population-level design matrix
  int prior_only; // should the likelihood be ignored?
}
transformed data {
  
}
parameters {
  real Intercept; // temporary intercept for centered predictors
  real Intercept_sigma; // temporary intercept for centered predictors
  vector<lower=0, upper=1>[K_pwindow] b_pwindow; // population-level effects
  vector<lower=0, upper=1>[K_swindow] b_swindow; // population-level effects
}
transformed parameters {
  real lprior = 0; // prior contributions to the log posterior
  lprior += student_t_lpdf(Intercept | 3, 11, 5.9);
  lprior += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
  lprior += uniform_lpdf(b_pwindow | 0, 1)
            - 200
              * log_diff_exp(uniform_lcdf(1 | 0, 1), uniform_lcdf(0 | 0, 1));
  lprior += uniform_lpdf(b_swindow | 0, 1)
            - 200
              * log_diff_exp(uniform_lcdf(1 | 0, 1), uniform_lcdf(0 | 0, 1));
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] pwindow = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] swindow = rep_vector(0.0, N);
    mu += Intercept;
    sigma += Intercept_sigma;
    pwindow += X_pwindow * b_pwindow;
    swindow += X_swindow * b_swindow;
    sigma = exp(sigma);
    for (n in 1 : N) {
      target += latent_lognormal_lpdf(Y[n] | mu[n], sigma[n], pwindow[n], swindow[n], vreal1[n], vreal2[n]);
    }
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
  // actual population-level intercept
  real b_sigma_Intercept = Intercept_sigma;
}

The key things to note here are the custom likelihood:

  real latent_lognormal_lpdf(real y, real mu, real sigma, real pwindow,
                             real swindow, real sevent, real end_t) {
    real p = y + pwindow;
    real s = sevent + swindow;
    real d = s - p;
    real obs_time = end_t - p;
    return lognormal_lpdf(d | mu, sigma)
           - lognormal_lcdf(obs_time | mu, sigma);
  }

and the use of uniform priors on pwindow and swindow to provide the latent variable space for censored observations.
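As a rough sketch of what that latent space gives us, we can pull out the posterior draws of the window parameters (the exact coefficient names depend on how brms sanitises the as.factor(id) levels, so treat the pattern below as an assumption):

draws <- as_draws_df(latent_model)
pwindow_draws <- as.matrix(draws[, grep("^b_pwindow", names(draws))])
# posterior mean of the within-day position of each primary event
head(colMeans(pwindow_draws))

Adding these windows to the reported (floored) primary event times gives posterior draws of the latent continuous event times, which is what the custom lpdf does internally.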

@seabbs commented Oct 18, 2022

Note: potentially the censoring of the observed delays here is overly narrow:

# Day after observations
  DT(, day_after_delay := ceiling(delay)) |>

I think that you could argue it should be floor(delay) + 2 as both primary and secondary events can be censored. Looking at this nowish.

@parksw3 commented Oct 18, 2022

Hm, maybe we need to be more careful about what we mean by day_after_delay or censoring in general. Is it something we simulate vs what we assume when we fit vs what we observe? For example, if the secondary event occurs at t=4.5, then it's between 4 and 5. But what do we actually observe in real data? Does this get reported as day 4 or day 5? It's probably a bit random. Maybe somebody showed up to the testing clinic as it opened on day 4, so it gets reported on day 4. Or maybe this person showed up to the testing clinic as it closed on day 4, so it gets reported the next day on day 5. Let's say that this gets reported as day 5. What prior assumption do we use? Uniform between days 4 and 6? In this case, it is floor(delay) and floor(delay) + 2. But what if it gets reported as day 4? We probably want the prior to be between day 3 (floor(delay) - 1) and day 5 (floor(delay) + 1).

So I think using t_report - 1 and t_report + 1 is a good idea for the prior, but it's generally a good idea to distinguish these differences more clearly.

@parksw3 commented Oct 18, 2022

For now, I'm going to use ptime and stime to refer to the true timing of the primary and secondary events. Then, ptime_daily = floor(ptime) is how the primary event is recorded on a daily scale. Using floor because symptoms occurring at 2AM Monday (2/24) or 10PM Monday (22/24) should both be recorded as Monday (putting aside biases in people's memories). Then, for inference, we can use ptime_lwr = ptime_daily - 1 and ptime_upr = ptime_daily + 1 to put bounds.

@parksw3 commented Oct 18, 2022

Actually, I feel like ptime_lwr = ptime_daily - 1 and ptime_upr = ptime_daily + 1 contradicts my data generating assumption. I'm assuming that anything that happens between 12:01AM Monday and 11:59PM Monday should be recorded as happening on Monday. In this case, the appropriate assumption for the prior in our fit would be ptime_lwr = ptime_daily and ptime_upr = ptime_daily + 1.

Overall, all I'm saying is that we need to match the data generating assumption with the fitting assumption...
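To make that concrete, a small data.table sketch of the matched assumption for the primary event (hypothetical values and column names, not code from the gist above); the same applies to stime:

dt <- data.table(ptime = c(0.1, 0.9, 2.5))
# data generating assumption: the event is recorded as the day it falls in
dt[, ptime_daily := floor(ptime)]
# matched fitting assumption: the true time lies somewhere within that day
dt[, `:=`(ptime_lwr = ptime_daily, ptime_upr = ptime_daily + 1)]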

@seabbs commented Oct 21, 2022

Yes, I agree. I think for a generic use case we need to make some kind of simplifying assumption (the day is correct but we don't know when in the day), but for specific use cases there are many scenarios in which we have more information and/or there is something more complex going on (like onset actually being the date of the report at the clinic, or just being uncertain due to how symptoms are detected).

I am not sure the default expectation should be +-1 around the day, as I think most reporting frameworks are going to act more like the first suggestion (though as coded here the simulator is the latter and so this needs to be changed). Definitely something to address and discuss more generally though, as it's easy to imagine many settings where what you suggest is happening.

I like the suggestion to use ptime and stime. For my own peace of mind I'm going to overhaul this code tomorrow before playing more with yours.

Yes, I agree it contradicts your (though, as noted above, not currently my) data generating assumption.

Agree! Or maybe show what happens when they don't match (i.e. is bad censoring worse than no censoring?), though that could really be getting into the weeds.

@sbfnk commented Oct 26, 2022

I've updated this for how I think the censoring operates, see
https://gist.github.com/sbfnk/569ad82641d73286a355317e53d911cf

Catching up with the conversation above, it seems to me that the relevant censoring interval, assuming all dates are reported correctly (as discrete_time = floor(continuous_time)), is [max(discrete_delay - 1, 0), discrete_delay + 1], as that's the range of possible continuous delays that could lead to a discrete delay of discrete_delay (i.e. the range of values for t2 - t1 that is consistent with floor(t2) - floor(t1) == discrete_delay). Note that this means censoring for a discrete delay of 0 is a special case, which I think makes sense.
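In code this might look like the following sketch (not from the linked gist; the function name is just for illustration):

# possible continuous delays consistent with a reported daily delay,
# assuming both events are recorded as floor(continuous_time)
delay_censoring_interval <- function(discrete_delay) {
  list(
    lower = pmax(discrete_delay - 1, 0),
    upper = discrete_delay + 1
  )
}
delay_censoring_interval(0) # lower = 0, upper = 1: the special case noted above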

@parksw3 commented Oct 27, 2022

(I wrote this before reading @sbfnk's code and have now realised why it's useful if people don't want to deal with the latent approach.) I'm confused about why the censoring interval would ever need to span 2 days. If all dates are reported correctly, such that p_dis = floor(p_con) and s_dis = floor(s_con), then the censoring interval for the first event is p_dis to p_dis + 1 and for the second event is s_dis to s_dis + 1, regardless of what the delay is. Maybe I'm missing something, but framing the censoring in terms of p_dis and s_dis seems clearer to me than having to rely on the delay itself.

When p_dis - s_dis = 0, the bounds we want to put on s_con are between p_lat and s_dis + 1, where p_lat is the latent variable estimate of p_con. So we need a little more tweaking of the brms code than what @seabbs initially wrote.

I also chatted with Jonathan on this topic the other day and think we figured out why the uniform prior is both bad and good. Let's think of a generic case where all individuals who experience the first event between time p_1 and p_2 (neither of which need to be discrete values) get their primary event reported at time p_dis. Then, for a cohort of individuals who report their event at p_dis, the distribution of true event times should be proportional to the incidence between time p_1 and p_2. The same goes for the second event. So far we've been dealing with exponential cases with daily time steps, which is a special case of this (p_1 = floor(p_con) and p_2 = p_1 + 1), and assuming a uniform prior for the inference. In this case, both p_lat and s_lat give biased estimates of p_con and s_con in the same direction, and the biases cancel out such that s_lat - p_lat gives unbiased estimates of s_con - p_con!! So if we were dealing with a case where the primary event was happening during the growth phase and the secondary event was happening during the decay phase, we'll see the uniform bias pop out. Not sure how this works for censoring without the latent approach, but I will work out simulations very soon (sorry, need to send something to Bryan this week).
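A quick simulation sketch of that argument (my own illustration with assumed values, treating the latent windows as draws from the uniform prior rather than from a fitted posterior):

library(data.table)
set.seed(1)
r <- 0.2; n <- 1e5; max_day <- 50
# primary event times with density proportional to exp(r * t) on [0, max_day]
ptime <- log(runif(n, 1, exp(r * max_day))) / r
stime <- ptime + rlnorm(n, 1.6, 0.6)
sim <- data.table(ptime = ptime, stime = stime)
sim[, `:=`(p_dis = floor(ptime), s_dis = floor(stime))]
# uniform latent guesses for the within-day timing of each event
sim[, `:=`(p_lat = p_dis + runif(.N), s_lat = s_dis + runif(.N))]
sim[, .(
  bias_primary = mean(p_lat - ptime),                   # non-zero during growth
  bias_secondary = mean(s_lat - stime),                 # non-zero, same direction
  bias_delay = mean((s_lat - p_lat) - (stime - ptime))  # approximately zero
)]

During constant exponential growth both events sit late within their reporting day by the same amount on average, so the biases cancel in the delay; if the growth rate differed between the primary and secondary event windows, they would not.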

Also invited @sbfnk to the repo (https://github.com/parksw3/dynamicaltruncation).

@seabbs commented Oct 27, 2022

Nice - yes, I agree it's +-1 padding. It's useful to phrase it as the flooring operation applied to both dates. I also like the use of bpmodels as the simulator (I was actually just looking at this for the write-up).

So if we were dealing with a case where the primary event was happening during the growth phase and the secondary event was happening during the decay phase, we'll see the uniform bias pop out. Not sure how this works for censoring without the latent approach, but I will work out simulations very soon (sorry, need to send something to Bryan this week).

This is a really interesting point @parksw3!

@sbfnk see a crazily named branch of that repo for some fleshing out of the analysis skeleton.

@parksw3 commented Oct 27, 2022

I merged both branches so you can also look at the main branch.
