mkuhn/fitting_with_uncertainty.Rmd

## fitting_with_uncertainty.Rmd
---
title: "Fitting with uncertainty"
author: "Michael Kuhn"
date: "24 Aug 2015"
output: html_document
---

In this toy example, we assume that we've independently measured values $x$ and $y$ and want to find a linear relationship between them, accounting for measurement uncertainty. Each $x$ and $y$ value is assigned a different uncertainty, and the challenge is to take this information into account. A standard linear model will treat all points equally.

When we treat each measurement as a multivariate normal distribution, we can find a point along the proposed fitted line that is maximizing the probability density function (i.e. that has a maximum likelihood). Thus, given a slope and intercept, we virtually move all measurements to their most likely point along the line, and use the likelihood at this point.

```{r}

library(rstan)
library(ggplot2)

set.seed(42)

N <- 10


```

We create `r N` points:

```{r}


x0 <- seq(-5, 5, 10.1/N)
y0 <- x0

sigma_x <- abs( rnorm( length(x0), 1) )
x <- x0 + unlist(lapply(sigma_x, function(sigma) rnorm(1, 0, sigma)))

sigma_y <- abs( rnorm( length(y0), 1) )
y <- y0 + unlist(lapply(sigma_y, function(sigma) rnorm(1, 0, sigma)))

data <- data.frame( x, sigma_x, y, sigma_y )

data_list <- list( x=x, sigma_x=sigma_x, y=y, sigma_y = sigma_y, N=N)

ggplot( data, aes(x, y) ) + geom_point() + geom_abline(color="red") + geom_smooth(method="lm", se=F) + geom_errorbar(aes(ymin=y-sigma_y, ymax=y+sigma_y)) + geom_errorbarh(aes(xmin=x-sigma_x, xmax=x+sigma_x))

```

(Red: actual dependency, blue: simple linear fit)

The simple linear model in STAN:

```{r}

simple_code <- "
data {
  int<lower=0> N;

  vector[N] x;
  vector[N] y;
}

parameters {
  real slope;
  real intercept;
  real<lower=0> sigma;
}

model {
  slope ~ normal(0, 10);
  intercept ~ normal(0, 10);
  sigma ~ cauchy(0, 25);

  y ~ normal(x*slope+intercept, sigma);
}
"

```

And another model, taking uncertainty into account. Here, based on the proposed slope and intercept, the probability density function is maximized for each point:

```{r}

uncertainty_code <- "
data {
  int<lower=0> N;

  vector[N] x;
  vector[N] y;
  vector<lower=0>[N] sigma_x;
  vector<lower=0>[N] sigma_y;
}

parameters {
  real slope;
  real intercept;
}

transformed parameters {
  vector[N] xx;
  vector[N] yy;

  vector[N] sx2;
  vector[N] sy2;

  sx2 <- sigma_x .* sigma_x;
  sy2 <- sigma_y .* sigma_y;

  // find maximum PDF given slope and intercept
  xx <- ( (sy2 .* x) + slope * sx2 .* ( y - intercept ) ) ./ ( slope * sx2 + sy2);
  yy <- slope * xx + intercept;
}

model {
  slope ~ normal(0, 10);
  intercept ~ normal(0, 10);

  xx ~ normal(x, sigma_x);
  yy ~ normal(y, sigma_y);
}
"

```

Let's generate fits:

```{r results="hide"}
fit <- stan(model_code = uncertainty_code, data = data_list)
fit_simple <- stan(model_code = simple_code, data = data_list)
```

Comparing the parameter distributions, taking uncertainty into account produces a much better fit:

```{r}

slopes <- do.call(cbind, extract(fit, "slope"))
intercepts <- do.call(cbind, extract(fit, "intercept"))

slopes_simple <- do.call(cbind, extract(fit_simple, "slope"))
intercepts_simple <- do.call(cbind, extract(fit_simple, "intercept"))

df_coef <- rbind(
  data.frame( kind = "simple", slope = slopes_simple, intercept = intercepts_simple ),
  data.frame( kind = "uncertainty", slope = slopes, intercept = intercepts )
)

ggplot(df_coef, aes(slope, color=kind)) + geom_density()
ggplot(df_coef, aes(intercept, color=kind)) + geom_density()
```

Here are some sample fits:

```{r}
samples <- sample.int(length(slopes), 50)
df_lines <- data.frame(slope=slopes[samples], intercept=intercepts[samples])

ggplot( data, aes(x, y) ) + geom_point() + geom_abline(data=df_lines, aes(slope=slope, intercept=intercept), color="grey") + coord_fixed() + geom_abline(color="red") + geom_smooth(method="lm", se=F)
```

And a view of the "moved" points:

```{r}
xx <- do.call(cbind, extract(fit, "xx"))
yy <- do.call(cbind, extract(fit, "yy"))

df_points <- data.frame( x=c(x, xx[samples[1],]), y=c(y, yy[samples[1],]), kind = rep( c("measured", "inferred"), each=N), group = rep(1:N,2)  )

ggplot( df_points, aes(x, y, group=group)) + geom_point(aes(color=kind)) + geom_path() + geom_abline(color="green") + coord_fixed()
```
	---
	title: "Fitting with uncertainty"
	author: "Michael Kuhn"
	date: "24 Aug 2015"
	output: html_document
	---

	In this toy example, we assume that we've independently measured values $x$ and $y$ and want to find a linear relationship between them, accounting for measurement uncertainty. Each $x$ and $y$ value is assigned a different uncertainty, and the challenge is to take this information into account. A standard linear model will treat all points equally.

	When we treat each measurement as a multivariate normal distribution, we can find a point along the proposed fitted line that is maximizing the probability density function (i.e. that has a maximum likelihood). Thus, given a slope and intercept, we virtually move all measurements to their most likely point along the line, and use the likelihood at this point.

	```{r}

	library(rstan)
	library(ggplot2)

	set.seed(42)

	N <- 10


	```

	We create `r N` points:

	```{r}


	x0 <- seq(-5, 5, 10.1/N)
	y0 <- x0

	sigma_x <- abs( rnorm( length(x0), 1) )
	x <- x0 + unlist(lapply(sigma_x, function(sigma) rnorm(1, 0, sigma)))

	sigma_y <- abs( rnorm( length(y0), 1) )
	y <- y0 + unlist(lapply(sigma_y, function(sigma) rnorm(1, 0, sigma)))

	data <- data.frame( x, sigma_x, y, sigma_y )

	data_list <- list( x=x, sigma_x=sigma_x, y=y, sigma_y = sigma_y, N=N)

	ggplot( data, aes(x, y) ) + geom_point() + geom_abline(color="red") + geom_smooth(method="lm", se=F) + geom_errorbar(aes(ymin=y-sigma_y, ymax=y+sigma_y)) + geom_errorbarh(aes(xmin=x-sigma_x, xmax=x+sigma_x))

	```

	(Red: actual dependency, blue: simple linear fit)

	The simple linear model in STAN:

	```{r}

	simple_code <- "
	data {
	int<lower=0> N;

	vector[N] x;
	vector[N] y;
	}

	parameters {
	real slope;
	real intercept;
	real<lower=0> sigma;
	}

	model {
	slope ~ normal(0, 10);
	intercept ~ normal(0, 10);
	sigma ~ cauchy(0, 25);

	y ~ normal(x*slope+intercept, sigma);
	}
	"

	```

	And another model, taking uncertainty into account. Here, based on the proposed slope and intercept, the probability density function is maximized for each point:

	```{r}

	uncertainty_code <- "
	data {
	int<lower=0> N;

	vector[N] x;
	vector[N] y;
	vector<lower=0>[N] sigma_x;
	vector<lower=0>[N] sigma_y;
	}

	parameters {
	real slope;
	real intercept;
	}

	transformed parameters {
	vector[N] xx;
	vector[N] yy;

	vector[N] sx2;
	vector[N] sy2;

	sx2 <- sigma_x .* sigma_x;
	sy2 <- sigma_y .* sigma_y;

	// find maximum PDF given slope and intercept
	xx <- ( (sy2 .* x) + slope * sx2 .* ( y - intercept ) ) ./ ( slope * sx2 + sy2);
	yy <- slope * xx + intercept;
	}

	model {
	slope ~ normal(0, 10);
	intercept ~ normal(0, 10);

	xx ~ normal(x, sigma_x);
	yy ~ normal(y, sigma_y);
	}
	"

	```

	Let's generate fits:

	```{r results="hide"}
	fit <- stan(model_code = uncertainty_code, data = data_list)
	fit_simple <- stan(model_code = simple_code, data = data_list)
	```

	Comparing the parameter distributions, taking uncertainty into account produces a much better fit:

	```{r}

	slopes <- do.call(cbind, extract(fit, "slope"))
	intercepts <- do.call(cbind, extract(fit, "intercept"))

	slopes_simple <- do.call(cbind, extract(fit_simple, "slope"))
	intercepts_simple <- do.call(cbind, extract(fit_simple, "intercept"))

	df_coef <- rbind(
	data.frame( kind = "simple", slope = slopes_simple, intercept = intercepts_simple ),
	data.frame( kind = "uncertainty", slope = slopes, intercept = intercepts )
	)

	ggplot(df_coef, aes(slope, color=kind)) + geom_density()
	ggplot(df_coef, aes(intercept, color=kind)) + geom_density()
	```

	Here are some sample fits:

	```{r}
	samples <- sample.int(length(slopes), 50)
	df_lines <- data.frame(slope=slopes[samples], intercept=intercepts[samples])

	ggplot( data, aes(x, y) ) + geom_point() + geom_abline(data=df_lines, aes(slope=slope, intercept=intercept), color="grey") + coord_fixed() + geom_abline(color="red") + geom_smooth(method="lm", se=F)
	```

	And a view of the "moved" points:

	```{r}
	xx <- do.call(cbind, extract(fit, "xx"))
	yy <- do.call(cbind, extract(fit, "yy"))

	df_points <- data.frame( x=c(x, xx[samples[1],]), y=c(y, yy[samples[1],]), kind = rep( c("measured", "inferred"), each=N), group = rep(1:N,2) )

	ggplot( df_points, aes(x, y, group=group)) + geom_point(aes(color=kind)) + geom_path() + geom_abline(color="green") + coord_fixed()
	```