rasmusab/a-bayesian-model-to-calculate-whether-my-wife-is-pregnant.Rmd

## a-bayesian-model-to-calculate-whether-my-wife-is-pregnant.Rmd
---
title: "A Bayesian Model to Calculate Whether My Wife is Pregnant or Not"
author: "Rasmus Bååth"
output: html_document
---

```{r echo = FALSE}
library(knitr)
options(digits=2)
```

On the 21st of February, 2015, my wife had not had her period for 33 days, and as we were trying to conceive, this was good news! An average period is around a month, and if you are a couple trying to go triple, then a missing period is a good sign something is going on. But at 33 days, this was not yet a missing period, just a late one, so *how* good news was it? Pretty good, *really* good, or just *meh*?

To get at this I developed a simple Bayesian model that, given the number of days since your last period and your history of period onsets, calculates the probability that you are going to be pregnant this period cycle. In this post I will describe what data I used, the priors I used, the model assumptions, and how to fit it in R using importance sampling. And finally I show you why the result of the model really didn't matter in the end. Also I'll give you a handy script if you want to calculate this for yourself. :)

The data
----------------

During the last part of 2014 my wife kept a journal of her period onsets, which was good luck for me, else I would end up with [tiny data](/blog/2014/10/tiny-data-and-the-socks-of-karl-broman/) again. In total we had the dates for eight period onsets, but the data I used was not the onsets but the number of days between the onsets:

```{r}
period_onset <- as.Date(c("2014-07-02", "2014-08-02", "2014-08-29", "2014-09-25",
                          "2014-10-24", "2014-11-20", "2014-12-22", "2015-01-19"))
days_between_periods <- as.numeric(diff(period_onset))
```


```{r echo = FALSE, fig.with = 6, fig.height = 3.5 }
old_par = par(mar = c(4.2, 4.2, 1, 1))
plot(period_onset[-1], days_between_periods,
     xlim=as.numeric(c(min(period_onset[-1]), max(period_onset[-1] + 26))), ylim = c(20, 35),
     xlab = "Period onset date", ylab = "Days since last period",
     type = "b", col = "red1", pch  = 19, xaxt = "n")
axis(1, period_onset, format(period_onset, "%b %d"), cex.axis = 0.9)
text(max(period_onset[-1] + 24), 28, "?", cex = 2, col = "red1")
par(old_par)
```

So the onsets occur pretty regularly, hovering around a cycle of 28 days. The last onset was on the 19th of January, so on the 21st of February there had been 33 days since the last onset.

Conceiving a model
----------

I was constructing a model covering period cycles, pregnancies and infertility, and as such it was obviously going to make *huge* simplifications. Some general assumptions I made were:

* The woman and the man have no prior reasons for being infertile as a couple.
* The woman has regular periods.
* The couple trying to conceive are _actively_ trying to conceive. Say, two three times a week as recommended by [Wilcox et al. (2000)](http://dx.doi.org/10.1136/bmj.321.7271.1259).
* If there is pregnancy, there are no more periods.

Now to the specific assumptions I made:

* The number of days between periods (`days_between_periods`) is assumed to be normally distributed with unknown mean (`mean_period`) and standard deviation (`sd_period`).
* The probability of getting pregnant during a cycle is assumed to be `0.19` (more about where this number comes from below) *if* you are fertile as a couple (`is_fertile`). Unfortunately not all couples are fertile, and if you are not then the probability of getting pregnant is 0. If fertility is coded as 0-1 then this can be compactly written as `0.19 * is_fertile`.
* The probability of failing to conceive for a certain number of periods (`n_non_pregnant_periods`)
 is then `(1 - 0.19 * is_fertile)^n_non_pregnant_periods`
* Finally, if you are not going to be pregnant this cycle, the number of days from your last to your next period (`next_period`) is going to be more than the current number of days since the last period (`days_since_last_period`). That is, the probability of `next_period < days_since_last_period` is zero. This sounds strange because it is so obvious, but we're going to need it in the model.

That was basically it! But in order to fit this I was going to need a *likelihood function*, a function that, given fixed parameters and some data, calculates the probability of the data given those parameters or, more commonly, something proportional to a probability, that is, a *likelihood*. And as this *likelihood* can be extremely tiny I needed to calculate it on the log scale to avoid numerical problems. When crafting a log likelihood function in R, the general pattern is this:

* The function will take the data and the parameters as arguments.
* You initialize the likelihood to 1.0, corresponding to 0.0 on the log scale (`log_like <- 0.0`).
* Using the  probability density functions in R (such as `dnorm`, `dbinom` and `dpois`) you calculate the likelihoods of the different the parts of the model. You then multiply these likelihoods together. On the log scale this corresponds to adding the log likelihoods to `log_like`.
* To make the `d*` functions return log likelihoods just add the argument `log = TRUE`. Also remember that a likelihood of 0.0 corresponds to a log likelihood of `-Inf`.

So, a log likelihood function corresponding to the model above would then be:

```{r}
calc_log_like <- function(days_since_last_period, days_between_periods,
                          mean_period, sd_period, next_period,
                          is_fertile, is_pregnant) {
  n_non_pregnant_periods <- length(days_between_periods)
  log_like <- 0
  if(n_non_pregnant_periods > 0) {
    log_like <- log_like + sum( dnorm(days_between_periods, mean_period, sd_period, log = TRUE) )
  }
  log_like <- log_like + log( (1 - 0.19 * is_fertile)^n_non_pregnant_periods )
  if(!is_pregnant && next_period < days_since_last_period) {
    log_like <- -Inf
  }
  log_like
}
```

Here the data is the scalar `days_since_last_period` and the vector `days_between_periods`, and the rest of the arguments are the parameters to be estimated. Using this function I could now get the log likelihood for any data + parameter combination. However, I still only had half a model, I also needed priors!


Priors on periods, pregnancy and fertility
-------------------------------

To complete this model I needed priors on all the parameters. That is, I had to specify what information the model has before seeing the data. Specifically, I needed priors on `mean_period`, `sd_period`, `is_fertile`, and `is_pregnant` (while `next_period` is also a parameter, I didn't need to give it an explicit prior as its distribution is completely specified by `mean_period` and `sd_period`). I also needed to find a value for the probability of becoming pregnant in a cycle (which I set to `0.19` above). Did I use vague, "objective" priors here? No, I went looking in the fertility literature to something more informative!

For the distribution of the `days_between_periods` the parameters were `mean_period` and `sd_period`. Here I used estimates from the article *The normal variabilities of the menstrual cycle*
([Cole et al, 2009](http://dx.doi.org/10.1016/j.fertnstert.2007.11.073)) which measured the regularity of periods in 184 women aged 18-36 years. The grand mean number of days between periods was here 27.7 days, with the SD of the per participant mean being 2.4. The group SD of the number of days between periods was 1.6. Given these estimates I then decided to put a Normal(27.7, 2.4) distribution over `mean_period` and a Half-Normal distribution with mean 1.6 over `sd_period`, corresponding to a Half-Normal with a SD of 2.05. Here they are:

```{r echo=FALSE}
sample_from_prior <- function(n) {
  prior <- data.frame(mean_period = rnorm(n, 27.7, 2.4),
                      sd_period   = abs(rnorm(n, 0, 2.05)),
                      is_fertile  = rbinom(n, 1, 0.95))
  prior$is_pregnant <- rbinom(n, 1, 0.19 * prior$is_fertile)
  prior$next_period <- rnorm(n, prior$mean_period, prior$sd_period)
  prior$next_period[prior$is_pregnant == 1] <- NA
  prior
}
```

```{r fig.height=4, fig.width=9, echo=FALSE, warning=FALSE, message=FALSE}
prior <- sample_from_prior(100000)
library(ggplot2)
library(gridExtra)


mu_plot <- qplot(mean_period, data = prior, geom = "histogram", alpha = I(0.5), fill = I("#00BFC4"),xlim = c(20, 35)) +
  xlab("Mean number of days between periods") + ylab("Density") +
  theme(legend.position="none", axis.ticks.y = element_blank(), axis.text.y =  element_blank())

sigma_plot <- qplot(sd_period, data = prior, geom = "histogram",alpha = I(0.5),  fill = I("#00BFC4"), xlim = c(0, 7)) +
  xlab("SD of the number of days between periods") + ylab("Density") +
  theme( axis.ticks.y = element_blank(), axis.text.y =  element_blank())
# Funkar av nån anledning inte med main längre....
grid.arrange(top = "Prior mean and SD", mu_plot, sigma_plot, ncol = 2)
```

For the parameters `is_fertile` and `is_pregnant` I based the priors on *frequencies*. The proportion of couples that are fertile is tricky to define, as there different definitions of *infertility*. [Van Geloven et al.  (2013)](http://dx.doi.org/10.1093/humrep/des428) made a small literature review and got that between 2% and 5% of all couples could be considered infertile. As I've seen numbers as high as 10%, I decided to go with the higher end of this range and put a prior probability of 100% - 5% = 95% that a couple is fertile.

`is_pregnant` is a binary parameter standing for whether the couple are going get (or already are) pregnant the current cycle. The prior I used here was the probability of getting pregnant in a cycle. This probability is obviously 0.0 if the couple is infertile, but how large a proportion of active, fertile couples get pregnant in a period cycle? Unfortunately I didn't find a source that explicitly stated this, but I found something close. On page 53 in *Increased Infertility With Age in Men and Women* [Dunson et al. (2004)](http://dx.doi.org/10.1097/01.AOG.0000100153.24061.45) give the proportion of couples trying to conceive who did not get pregnant within 12 cycles, stratified by the age of the woman:

```{r}
prop_not_preg_12_cycles <- c("19-26 years" = 0.08,
                             "27-34 years" = 0.13,
                             "35-39 years" = 0.18)
```

Using some back-of-the-R-script calculations I calculated the probability to conceive in a cycle: As these proportions presumably include infertile couples I started by subtracting 0.05, the proportion of couples that I assumed are infertile. My wife was in the *27-34 years* bracket so the probability of us not conceiving within 12 cycles, given that we are fertile, was then 0.13 - 0.05. If *p* is is the probability of not getting pregnant during one cycle, then $p^{12} = 0.13 - 0.05$ is the probability of not getting pregnant during twelve cycles and, as *p* is positive, we have that $p = (0.135 - 0.05)^{1/12}$. The probability of getting pregnant in one cycle is then *1 - p* and the probabilities for the three age groups are:

```{r}
1 - (prop_not_preg_12_cycles - 0.05)^(1/12)
```

So that's where the 19% percent probability of conceiving came from in the log likelihood function above, and 19% is what I used as a prior for `is_pregnant`. Now I had priors for all parameters and I could construct a function that returned samples from the prior:

```{r echo=FALSE}
set.seed(12345)
```


```{r}
sample_from_prior <- function(n) {
  prior <- data.frame(mean_period = rnorm(n, 27.7, 2.4),
                      sd_period   = abs(rnorm(n, 0, 2.05)),
                      is_fertile  = rbinom(n, 1, 0.95))
  prior$is_pregnant <- rbinom(n, 1, 0.19 * prior$is_fertile)
  prior$next_period <- rnorm(n, prior$mean_period, prior$sd_period)
  prior$next_period[prior$is_pregnant == 1] <- NA
  prior
}
```

It takes one argument (`n`) and returns a `data.frame` with `n` rows, each row being a sample from the prior. Let's try it out:

```{r}
sample_from_prior(n = 4)
```

Notice that `is_pregnant` can only be `1` if `is_fertile` is `1`, and that there is no `next_period` if the couple `is_pregnant`.

Fitting the model using importance sampling
------------------

I had now collected the triforce of Bayesian statistics: The prior, the likelihood and the data. There are many algorithms I could have used to fit this model, but here a particularly convenient method was to use *importance sampling*. [I've written about importance sampling before](/blog/2013/12/shaping_up_laplace_approximation/), but let's recap: Importance sampling is a Monte Carlo that is *very easy* to setup and that can work well if (1) the parameters space is small and (2) the priors are not too dissimilar from the posterior. As my parameter space *was* small and because I used pretty informative priors I though importance sampling would suffice here. The three basic steps in importance sampling are:

1. Generate a large sample from the prior. (This I could do using `sample_from_prior`.)
2. Assign a weight to each draw from the prior that is proportional to the likelihood of the data given those parameters. (This I could do using `calc_log_like`.)
3. Normalize the weights to sum to one so that they now form a probability distribution over the prior sample. Finally, resample the prior sample according to this probability distribution. (This I could do using the R function `sample`.)

(Note that there are some variations to this procedure, but when used to fit a Bayesian model this is a common version of importance sampling.)

The result of using importance sampling is a new sample which, if the importance sampling worked OK, can be treated as a sample from the posterior. That is, it represents what the model knows after having seen the data. Since I already had defined `sample_from_prior` and `calc_log_like`, defining a function in R doing importance sampling was trivial:

```{r}
sample_from_posterior <- function(days_since_last_period, days_between_periods, n_samples) {
  prior <- sample_from_prior(n_samples)
  log_like <- sapply(1:n_samples, function(i) {
    calc_log_like(days_since_last_period, days_between_periods,
                  prior$mean_period[i], prior$sd_period[i], prior$next_period[i],
                  prior$is_fertile[i], prior$is_pregnant[i])
  })
  posterior <- prior[ sample(n_samples, replace = TRUE, prob = exp(log_like)), ]
  posterior
}
```

The result: probable pregnancy
--------------

So, on the 21st of February, 2015, my wife had not had her period for 33 days. Was this good news? Let's run the model and find out!

```{r echo=FALSE}
set.seed(12345)
```


```{r cache=TRUE}
post <- sample_from_posterior(33, days_between_periods, n_samples = 100000)
```

`post` is now a long data frame where the distribution of the parameter values represent the posterior information regarding those parameters.

```{r}
head(post)
```

Let's start by looking at the mean and standard deviation of the number of days between each period:

```{r fig.height=4, fig.width=9, echo=FALSE, warning=FALSE, message=FALSE}
prior <- sample_from_prior(100000)
library(ggplot2)
library(gridExtra)

df <- data.frame(value = c(prior$mean_period, post$mean_period, prior$sd_period, post$sd_period),
           param = rep(c("mean", "sd"), each = nrow(prior) * 2),
           source = rep(c("prior", "posterior", "prior", "posterior"), each = nrow(prior)))

mean_df <- data.frame(value = c(prior$mean_period, post$mean_period),
                      source = rep(c("prior", "posterior"), c( nrow(prior), nrow(post))))
mu_plot <- qplot(value, data = mean_df, geom = "histogram", fill = source,
                 alpha = I(0.5), xlim = c(20, 35), position="identity") +
  xlab("Mean number of days between periods") +  ylab("Density") +
  theme( legend.position="none", axis.ticks.y = element_blank(), axis.text.y =  element_blank()) +
   scale_fill_discrete(name="")
sd_df <- data.frame(value = c(prior$sd_period, post$sd_period),
                    source = rep(c("prior", "posterior"), c( nrow(prior), nrow(post))))
sigma_plot <- qplot(value, data = sd_df, geom = "histogram", fill = source,
                    alpha = I(0.5), xlim = c(0, 7), position="identity") +
  xlab("SD of the number of days between periods") + ylab("Density") +
  theme(legend.justification=c(1,1), legend.position=c(1,1),
        axis.ticks.y = element_blank(), axis.text.y =  element_blank()) +
   scale_fill_discrete(name="")
grid.arrange(mu_plot, sigma_plot, ncol = 2)
```

As expected the posteriors are more narrow than the priors and, looking at the posteriors, it's probable that the mean period cycle is around 29 days with a SD of 2-3 days. Now to the important questions: What's the probability that we are a fertile couple and what's the probability that we were pregnant on the 21st of February? To calculate this we can just take `post$is_fertile` and `post$is_pregnant` and calculate the proportion of `1`s in these vectors. A quick way of doing this is just to take the `mean`:

```{r}
mean(post$is_fertile)
mean(post$is_pregnant)
```

So it *was* pretty good news: It's very probable that we are a fertile couple and the probability that we were pregnant was `r paste0(round(mean(post$is_pregnant) * 100), "%")`! Using this model I could also see how the probability of us being pregnant would change if the period onset would  stay away a couple of days more:

```{r cache=TRUE}
post <- sample_from_posterior(34, days_between_periods, n_samples = 100000)
mean(post$is_pregnant)
```

```{r cache=TRUE}
post <- sample_from_posterior(35, days_between_periods, n_samples = 100000)
mean(post$is_pregnant)
```

Yeah, while we are at it, why not see how the probability of us being fertile and pregnant changed during the months we were trying to conceive:

```{r echo=FALSE}
# Time series
# est_by_t <- t(sapply(seq(1, 233, 1), function(t) {
#   t_days_between_periods <- days_between_periods[cumsum(days_between_periods) < t]
#   t_days_since_last_period <- t - sum(t_days_between_periods)
#   post <- sample_from_posterior(t_days_since_last_period, t_days_between_periods, n_samples = 100000)
#   c(t = t, colMeans(post, na.rm = TRUE))
# }))
#
# write.csv(est_by_t, "in_construction/preg_calc/est_by_t.csv")

est_by_t <- read.csv("est_by_t.csv")
```

```{r echo=FALSE, fig.height=7.3, fig.width=6}
old_par <- par(mar = c(2.5, 2.2, 2.5, 0.5), mfcol = c(2,1))
t_dates <- period_onset[1] - 1 + est_by_t[, "t"]

plot(t_dates, est_by_t[, "is_pregnant"], type = "l", ylim = c(0, 1),
     col = "blue3", lwd = 2, xaxt = "n", yaxt = "n", yaxs = "i",
     ylab = "", xlab = "", main = "Probability of being pregnant the current cycle")
axis(1, c(period_onset, t_dates[length(t_dates)] + 5 ), c(format(period_onset, "%b %d"), "?"), cex.axis = 0.8)
axis(2, seq(0, 1, 0.2), paste0(seq(0, 100, 20), "%"))

plot(t_dates, est_by_t[, "is_fertile"], type = "l", col = "green3", lwd = 2,
     xaxt = "n", yaxt = "n", yaxs = "i", ylim = c(0.70, 1),
     ylab = "", xlab = "", main = "Probability of being fertile")
axis(1, c(period_onset, t_dates[length(t_dates)] + 5 ), c(format(period_onset, "%b %d"), "?"), cex.axis = 0.8)
axis(2, seq(0, 1, 0.1), paste0(seq(0, 100, 10), "%"))

par(old_par)
```

So, this make sense. as the time since the last period gets longer the probability that we are going to be pregnant the current cycle increases, but as soon as there is a period onset that probability falls down to baseline again. We see the same pattern for the probability of being fertile, but for every period cycle we didn't get pregnant the probability of us being fertile gets slightly lower. Both these graphs are a bit jagged, but this is just due to the variability of the importance sampling algorithm. Also note that, while the graphs above are pretty, there is no real *use* looking at the probability over time, the only thing that's informative matters is the *current* probability.

Some critique of the model, but which doesn't really matter
------------------------------------------------

* It's of course possible to get much better priors than my back-of-the-envelope calculation here. There are also many more predictors that one could include, like the age of the man, health factors, and so on.
* The probability of getting pregnant each month could/should be made uncertain rather than giving it a fixed value, as I did. But I thought that would be one parameter to many given the little data I had.
* Nothing is *really* normally distributed and neither is the length between periods. Here I think that assumption works well enough, but there are much more complex models of period cycle length, for example, that by [Bortot et al (2010)](http://dx.doi.org/10.1093/biostatistics/kxq020).
* My model is so simple I suspect that you could almost fit it analytically. But I'm lazy and my computer is not, so importance sampling it was!

But all of this doesn't really matter now. And the probabilities I calculated do not matter either. Before-the-fact probabilities reflect uncertainty regarding an outcome or a statement, but after-the-fact there is no uncertainty left. What uncertainty the probabilities represented is gone. I'm certain that my wife and I *were* pregnant on the 21st of February, and I know this because one week ago, on the 29th of October, we received this little guy:

![](my_sleeping_baby.jpeg)

If probabilities matter to *you* you can find a script implementing this model [here](https://gist.github.com/rasmusab/1baf88af49012474a658) which you can run on your own data, or a friend's. In this post I left out the code creating the plots, but all of that can be found [here](https://gist.githubusercontent.com/rasmusab/d9c63fb1817a16c6630a/raw/dca8f1f1f7a9a7720793b85170104a4e6934efdd/a-bayesian-model-to-calculate-whether-my-wife-is-pregnant.Rmd). And, as this post was really just an excuse to post baby photos, here are some more of me and my son checking out the statistical literature:

![](baby_vs_bayes_vs_fisher_small.jpeg)

<a rel="nofollow" href="http://www.amazon.com/gp/product/0124058884/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0124058884&linkCode=as2&tag=sumsarnet-20&linkId=7ZXLZCQRGTD4SVND">Doing Bayesian Data Analysis</a> makes him a bit sleepy, but that's OK, he'll come around! Looking at Fisher's <a rel="nofollow" href="http://www.amazon.com/gp/product/B000O9J92M/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=B000O9J92M&linkCode=as2&tag=sumsarnet-20&linkId=ZQVUDBEQYBKRHXQ6">Statistical Methods for Research Workers</a> on the other hand makes him *furious*...

References
----------------

Bortot, P., Masarotto, G., & Scarpa, B. (2010). Sequential predictions of menstrual cycle lengths. *Biostatistics*, 11(4), 741-755. doi: [10.1093/biostatistics/kxq020](http://dx.doi.org/10.1093/biostatistics/kxq020)

Cole, L. A., Ladner, D. G., & Byrn, F. W. (2009). The normal variabilities of the menstrual cycle. *Fertility and sterility*, 91(2), 522-527. doi: [10.1016/j.fertnstert.2007.11.073](http://dx.doi.org/10.1016/j.fertnstert.2007.11.073)

Dunson, D. B., Baird, D. D., & Colombo, B. (2004). Increased infertility with age in men and women. *Obstetrics & Gynecology*, 103(1), 51-56. doi: [10.1097/01.AOG.0000100153.24061.45](http://dx.doi.org/10.1097/01.AOG.0000100153.24061.45)

Van Geloven, N., Van der Veen, F., Bossuyt, P. M. M., Hompes, P. G., Zwinderman, A. H., & Mol, B. W. (2013). Can we distinguish between infertility and subfertility when predicting natural conception in couples with an unfulfilled child wish?. *Human Reproduction*, 28(3), 658-665. doi: [10.1093/humrep/des428](http://dx.doi.org/10.1093/humrep/des428)

Wilcox, A. J., Dunson, D., & Baird, D. D. (2000). The timing of the “fertile window” in the menstrual cycle: day specific estimates from a prospective study. *Bmj*, 321(7271), 1259-1262. doi: [10.1136/bmj.321.7271.1259](http://dx.doi.org/10.1136/bmj.321.7271.1259), [pdf](http://www.researchgate.net/profile/David_Dunson/publication/12245875_The_timing_of_the_fertile_window_in_the_menstrual_cycle_day_specific_estimates_from_a_prospective_study/links/00463518b94bc50828000000.pdf)