--- | |
title: "Fitting with uncertainty" | |
author: "Michael Kuhn" | |
date: "24 Aug 2015" | |
output: html_document | |
--- | |
In this toy example, we assume that we've independently measured values $x$ and $y$ and want to find a linear relationship between them, accounting for measurement uncertainty. Each $x$ and $y$ value is assigned a different uncertainty, and the challenge is to take this information into account. A standard linear model will treat all points equally. | |
When we treat each measurement as a multivariate normal distribution, we can find a point along the proposed fitted line that is maximizing the probability density function (i.e. that has a maximum likelihood). Thus, given a slope and intercept, we virtually move all measurements to their most likely point along the line, and use the likelihood at this point. | |
```{r} | |
library(rstan) | |
library(ggplot2) | |
set.seed(42) | |
N <- 10 | |
``` | |
We create `r N` points: | |
```{r} | |
x0 <- seq(-5, 5, 10.1/N) | |
y0 <- x0 | |
sigma_x <- abs( rnorm( length(x0), 1) ) | |
x <- x0 + unlist(lapply(sigma_x, function(sigma) rnorm(1, 0, sigma))) | |
sigma_y <- abs( rnorm( length(y0), 1) ) | |
y <- y0 + unlist(lapply(sigma_y, function(sigma) rnorm(1, 0, sigma))) | |
data <- data.frame( x, sigma_x, y, sigma_y ) | |
data_list <- list( x=x, sigma_x=sigma_x, y=y, sigma_y = sigma_y, N=N) | |
ggplot( data, aes(x, y) ) + geom_point() + geom_abline(color="red") + geom_smooth(method="lm", se=F) + geom_errorbar(aes(ymin=y-sigma_y, ymax=y+sigma_y)) + geom_errorbarh(aes(xmin=x-sigma_x, xmax=x+sigma_x)) | |
``` | |
(Red: actual dependency, blue: simple linear fit) | |
The simple linear model in STAN: | |
```{r} | |
simple_code <- " | |
data { | |
int<lower=0> N; | |
vector[N] x; | |
vector[N] y; | |
} | |
parameters { | |
real slope; | |
real intercept; | |
real<lower=0> sigma; | |
} | |
model { | |
slope ~ normal(0, 10); | |
intercept ~ normal(0, 10); | |
sigma ~ cauchy(0, 25); | |
y ~ normal(x*slope+intercept, sigma); | |
} | |
" | |
``` | |
And another model, taking uncertainty into account. Here, based on the proposed slope and intercept, the probability density function is maximized for each point: | |
```{r} | |
uncertainty_code <- " | |
data { | |
int<lower=0> N; | |
vector[N] x; | |
vector[N] y; | |
vector<lower=0>[N] sigma_x; | |
vector<lower=0>[N] sigma_y; | |
} | |
parameters { | |
real slope; | |
real intercept; | |
} | |
transformed parameters { | |
vector[N] xx; | |
vector[N] yy; | |
vector[N] sx2; | |
vector[N] sy2; | |
sx2 <- sigma_x .* sigma_x; | |
sy2 <- sigma_y .* sigma_y; | |
// find maximum PDF given slope and intercept | |
xx <- ( (sy2 .* x) + slope * sx2 .* ( y - intercept ) ) ./ ( slope * sx2 + sy2); | |
yy <- slope * xx + intercept; | |
} | |
model { | |
slope ~ normal(0, 10); | |
intercept ~ normal(0, 10); | |
xx ~ normal(x, sigma_x); | |
yy ~ normal(y, sigma_y); | |
} | |
" | |
``` | |
Let's generate fits: | |
```{r results="hide"} | |
fit <- stan(model_code = uncertainty_code, data = data_list) | |
fit_simple <- stan(model_code = simple_code, data = data_list) | |
``` | |
Comparing the parameter distributions, taking uncertainty into account produces a much better fit: | |
```{r} | |
slopes <- do.call(cbind, extract(fit, "slope")) | |
intercepts <- do.call(cbind, extract(fit, "intercept")) | |
slopes_simple <- do.call(cbind, extract(fit_simple, "slope")) | |
intercepts_simple <- do.call(cbind, extract(fit_simple, "intercept")) | |
df_coef <- rbind( | |
data.frame( kind = "simple", slope = slopes_simple, intercept = intercepts_simple ), | |
data.frame( kind = "uncertainty", slope = slopes, intercept = intercepts ) | |
) | |
ggplot(df_coef, aes(slope, color=kind)) + geom_density() | |
ggplot(df_coef, aes(intercept, color=kind)) + geom_density() | |
``` | |
Here are some sample fits: | |
```{r} | |
samples <- sample.int(length(slopes), 50) | |
df_lines <- data.frame(slope=slopes[samples], intercept=intercepts[samples]) | |
ggplot( data, aes(x, y) ) + geom_point() + geom_abline(data=df_lines, aes(slope=slope, intercept=intercept), color="grey") + coord_fixed() + geom_abline(color="red") + geom_smooth(method="lm", se=F) | |
``` | |
And a view of the "moved" points: | |
```{r} | |
xx <- do.call(cbind, extract(fit, "xx")) | |
yy <- do.call(cbind, extract(fit, "yy")) | |
df_points <- data.frame( x=c(x, xx[samples[1],]), y=c(y, yy[samples[1],]), kind = rep( c("measured", "inferred"), each=N), group = rep(1:N,2) ) | |
ggplot( df_points, aes(x, y, group=group)) + geom_point(aes(color=kind)) + geom_path() + geom_abline(color="green") + coord_fixed() | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment