Created
August 24, 2015 13:03
-
-
Save mkuhn/a286115d15ec550a35cc to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Fitting with uncertainty" | |
author: "Michael Kuhn" | |
date: "24 Aug 2015" | |
output: html_document | |
--- | |
In this toy example, we assume that we've independently measured values $x$ and $y$ and want to find a linear relationship between them, accounting for measurement uncertainty. Each $x$ and $y$ value is assigned a different uncertainty, and the challenge is to take this information into account. A standard linear model will treat all points equally. | |
When we treat each measurement as a multivariate normal distribution, we can find a point along the proposed fitted line that is maximizing the probability density function (i.e. that has a maximum likelihood). Thus, given a slope and intercept, we virtually move all measurements to their most likely point along the line, and use the likelihood at this point. | |
```{r} | |
library(rstan) | |
library(ggplot2) | |
set.seed(42) | |
N <- 10 | |
``` | |
We create `r N` points: | |
```{r} | |
x0 <- seq(-5, 5, 10.1/N) | |
y0 <- x0 | |
sigma_x <- abs( rnorm( length(x0), 1) ) | |
x <- x0 + unlist(lapply(sigma_x, function(sigma) rnorm(1, 0, sigma))) | |
sigma_y <- abs( rnorm( length(y0), 1) ) | |
y <- y0 + unlist(lapply(sigma_y, function(sigma) rnorm(1, 0, sigma))) | |
data <- data.frame( x, sigma_x, y, sigma_y ) | |
data_list <- list( x=x, sigma_x=sigma_x, y=y, sigma_y = sigma_y, N=N) | |
ggplot( data, aes(x, y) ) + geom_point() + geom_abline(color="red") + geom_smooth(method="lm", se=F) + geom_errorbar(aes(ymin=y-sigma_y, ymax=y+sigma_y)) + geom_errorbarh(aes(xmin=x-sigma_x, xmax=x+sigma_x)) | |
``` | |
(Red: actual dependency, blue: simple linear fit) | |
The simple linear model in STAN: | |
```{r} | |
simple_code <- " | |
data { | |
int<lower=0> N; | |
vector[N] x; | |
vector[N] y; | |
} | |
parameters { | |
real slope; | |
real intercept; | |
real<lower=0> sigma; | |
} | |
model { | |
slope ~ normal(0, 10); | |
intercept ~ normal(0, 10); | |
sigma ~ cauchy(0, 25); | |
y ~ normal(x*slope+intercept, sigma); | |
} | |
" | |
``` | |
And another model, taking uncertainty into account. Here, based on the proposed slope and intercept, the probability density function is maximized for each point: | |
```{r} | |
uncertainty_code <- " | |
data { | |
int<lower=0> N; | |
vector[N] x; | |
vector[N] y; | |
vector<lower=0>[N] sigma_x; | |
vector<lower=0>[N] sigma_y; | |
} | |
parameters { | |
real slope; | |
real intercept; | |
} | |
transformed parameters { | |
vector[N] xx; | |
vector[N] yy; | |
vector[N] sx2; | |
vector[N] sy2; | |
sx2 <- sigma_x .* sigma_x; | |
sy2 <- sigma_y .* sigma_y; | |
// find maximum PDF given slope and intercept | |
xx <- ( (sy2 .* x) + slope * sx2 .* ( y - intercept ) ) ./ ( slope * sx2 + sy2); | |
yy <- slope * xx + intercept; | |
} | |
model { | |
slope ~ normal(0, 10); | |
intercept ~ normal(0, 10); | |
xx ~ normal(x, sigma_x); | |
yy ~ normal(y, sigma_y); | |
} | |
" | |
``` | |
Let's generate fits: | |
```{r results="hide"} | |
fit <- stan(model_code = uncertainty_code, data = data_list) | |
fit_simple <- stan(model_code = simple_code, data = data_list) | |
``` | |
Comparing the parameter distributions, taking uncertainty into account produces a much better fit: | |
```{r} | |
slopes <- do.call(cbind, extract(fit, "slope")) | |
intercepts <- do.call(cbind, extract(fit, "intercept")) | |
slopes_simple <- do.call(cbind, extract(fit_simple, "slope")) | |
intercepts_simple <- do.call(cbind, extract(fit_simple, "intercept")) | |
df_coef <- rbind( | |
data.frame( kind = "simple", slope = slopes_simple, intercept = intercepts_simple ), | |
data.frame( kind = "uncertainty", slope = slopes, intercept = intercepts ) | |
) | |
ggplot(df_coef, aes(slope, color=kind)) + geom_density() | |
ggplot(df_coef, aes(intercept, color=kind)) + geom_density() | |
``` | |
Here are some sample fits: | |
```{r} | |
samples <- sample.int(length(slopes), 50) | |
df_lines <- data.frame(slope=slopes[samples], intercept=intercepts[samples]) | |
ggplot( data, aes(x, y) ) + geom_point() + geom_abline(data=df_lines, aes(slope=slope, intercept=intercept), color="grey") + coord_fixed() + geom_abline(color="red") + geom_smooth(method="lm", se=F) | |
``` | |
And a view of the "moved" points: | |
```{r} | |
xx <- do.call(cbind, extract(fit, "xx")) | |
yy <- do.call(cbind, extract(fit, "yy")) | |
df_points <- data.frame( x=c(x, xx[samples[1],]), y=c(y, yy[samples[1],]), kind = rep( c("measured", "inferred"), each=N), group = rep(1:N,2) ) | |
ggplot( df_points, aes(x, y, group=group)) + geom_point(aes(color=kind)) + geom_path() + geom_abline(color="green") + coord_fixed() | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment