---
title: "Detecting Changes in Histograms"
description: |
  Detecting Useful Changes in Histograms with small to many bins
author:
  - name: Saptarshi Guha
    affiliation: Product Metrics
date: "`r Sys.Date()`"
output:
  radix::radix_article:
    toc: true
---
```{r main, child = '/home/sguha/mz/template_radix.Rmd'}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Background
At Mozilla, a lot of our measurements are stored as histograms. For example,
consider
[`TIME_TO_DOM_COMPLETE`](https://telemetry.mozilla.org/probe-dictionary/?search=time_to_dom_complete_ms&searchtype=in_name&channel=nightly),
which measures the time from NavigationStart
to domComplete. The actual time is, however, bucketed into one of 100 bins (per
profile). These histograms can then be aggregated to the profile level
(i.e. we compute a normalized histogram for the profile).

> The word *normalized* here means the final histogram is a density. By converting
> into a density, profiles with more observations are treated the same as
> profiles with few observations.

The densities are then averaged across profiles. We use equal weighting across
profiles (i.e. $\frac{1}{n}$ weight), treating all profiles equally. The
resulting density is a mixture across profiles and, as can be seen
[here](https://mzl.la/2zipuUm), it appears to be a mixture of two broad types
of densities.
> Is this mixture arising from different types of profiles, or do profiles
> experience a mixture of `TIME_TO_DOM_COMPLETE`? I believe Tim Smith has a plot
> of the density of the average `X` (for some `X`, ask Tim which one) experienced
> by a profile (i.e. he reduced every profile to one value, the mean of
> `TIME_TO_DOM_COMPLETE`) and this is very mixturish. Hence I assume that, rather than
> a profile experiencing a mixture, we have different types of profiles. Possibly
> some hidden variable is at work here.
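
As a concrete (toy) illustration of the aggregation described above, the sketch
below builds unit densities from made-up per-profile counts and averages them
with equal $\frac{1}{n}$ weights. The data and object names here are purely
illustrative, not real telemetry.

```{r unitdensity, echo=TRUE}
# Toy sketch: per-profile bucket counts -> unit densities -> equal-weight mixture
toy <- list(
  heavy = c(b0 = 40, b1 = 10, b2 = 0),  # a profile with many observations
  light = c(b0 = 1,  b1 = 1,  b2 = 3)   # a profile with few observations
)
unit_density <- function(counts) counts / sum(counts)    # histogram -> density
densities <- lapply(toy, unit_density)
mixture <- Reduce(`+`, densities) / length(densities)    # equal (1/n) weights
mixture
```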
## The Goal
There are a few scenarios, but all involve comparing two groups:

1. in an experiment with two or more branches, how can I compare histogram `H`
   for branch `T` vs branch `C`?
2. for different versions of Firefox, say the first $n$ days of Fx58 vs the first
   $n$ days of Fx59, how can we compare `T` and `C` for histogram `H`?
3. for builds of Nightly, a time series of sorts, how can we compare builds?
   For example, today's build vs yesterday's build, or today's vs the last 3 builds?
## Some problems
- we have decided not to aggregate a profile's histogram into one value. From an
  engineer's point of view, they are likely interested in seeing the distribution
  of values experienced by a user, not the profile's average value
- though the example histogram above has 100 bins, well populated, many among the
  [Top
  Tier](https://docs.google.com/spreadsheets/d/1557qp6KBPBdf9CsH-GICs6_BqH0zu-njmRYoflWkQHE/edit#gid=0)
  have 20 buckets and it doesn't make sense to treat these as continuous
  variables
- some histograms don't have many bins populated, or the bins are very sparsely populated

Can we create a methodology that can compare multinomial distributions (i.e. can
we compare pmfs) and identify when the distributions are not equivalent?
**What is Equivalence?** Now is a good time to introduce equivalence. We won't
articulate the definition here, but will describe the motivation. We want to
detect a difference when

- a few bins have very different proportions
- many bins have different proportions (which need not be very big individually)
- we might (or might not) treat all bins equally (i.e. the tail bins, or bins
  with low $p$, might get less weight; that is, differences in the tails, where the
  mass is small to begin with, count for less)
## Some Approaches
We thought of two approaches before this (i.e. tweaking existing methodologies).
### Chi Squared
- A standard option is the Chi Squared goodness of fit test. The problem with
  this is that it rejects for even the smallest difference for datasets our size
- this approach could be modified by using a very, very small p-value, but this
  is hacky and not directly tied to what sort of practically significant
  difference we are looking for
- it's quite difficult to come up with a null distribution of pmfs (probability mass
  functions) that look 'equivalent' to the test distribution

The Chi Squared formula $\sum \frac{(E_i-O_i)^2}{E_i}$, where $O$ is the Test counts and $E$
is the Control counts, can be restated (assuming equal sample sizes, $N$) as $N\sum
(p_{cj}-p_{tj})\frac{ p_{cj}-p_{tj}}{p_{cj}}$, i.e. the sum of the
relative differences in bin proportions weighted by the actual bin proportion
difference.
<aside>
larger discrepancies are often found in the tails (larger sampling error) but they
are down-weighted because $p_{cj}-p_{tj}$ will be smaller there
</aside>
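
A quick numeric check of this restatement; the bin proportions below are
assumptions chosen only for illustration.

```{r chisqcheck, echo=TRUE}
# With equal sample sizes N, sum((O - E)^2 / E) equals N * sum((pc - pt)^2 / pc)
N  <- 1e5
pc <- c(0.50, 0.30, 0.15, 0.05)   # control bin proportions (made up)
pt <- c(0.48, 0.31, 0.16, 0.05)   # test bin proportions (made up)
O  <- N * pt                      # 'observed' (Test) counts
E  <- N * pc                      # 'expected' (Control) counts
c(classic  = sum((O - E)^2 / E),
  restated = N * sum((pc - pt) * (pc - pt) / pc))   # the two should agree
```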
### Multinomial Logit
- use a multinomial logit model for the log odds of $\text{X in bucket } j$
  relative to $\text{X in bucket } 0$, and test whether the branch indicator is significant or not
- then test whether the individual coefficients for buckets are above some threshold

It's kind of hacky again, and there is a need to test many coefficients and then
surface them according to the equivalence definition above.
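
A minimal sketch of what this could look like, using `nnet::multinom` on toy
aggregated counts; the data frame and counts below are assumptions for
illustration, not real telemetry, and the chunk is not evaluated.

```{r mlogit, echo=TRUE, eval=FALSE}
# Sketch: grouped multinomial logit with bucket 0 as the reference category.
# The 'branchT' coefficients measure the shift in log odds of landing in
# bucket j (vs bucket 0) for Test relative to Control.
library(nnet)
d <- expand.grid(bucket = factor(0:3), branch = c("C", "T"))
d$count <- c(500, 300, 150, 50,   # Control counts by bucket (made up)
             480, 310, 160, 50)   # Test counts by bucket (made up)
fit <- multinom(bucket ~ branch, weights = count, data = d, trace = FALSE)
summary(fit)$coefficients         # inspect the per-bucket branch coefficients
```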
## Bootstrapping a Custom Test Statistic
The approach we take is to bootstrap a custom test statistic, of which there
are many varieties. We will compare two histograms with $J$ bins, with $p_{jt}$
being the proportion in the $j$'th bin for Test and $p_{jc}$ the bin proportion
for Control. They both sum to 1 and there are $N_t$ and $N_c$ observations in
each.

We consider a bin to be different if $|\frac{p_{jt}-p_{jc}}{p_{jc}}|>0.05$,
i.e. a bin is different if there is at least a 5% relative change in the bin proportion. The 5% is
taken from past Shield study experience as a typical useful difference the
investigator sought.
### 0
The Chi Squared approach is fairly reasonable, but we can modify it to include
a signal as described above. Define $T_0$ as
$$
T_0 = \sum_j I_{ | \frac{ p_{cj}-p_{tj}}{p_{cj}}|>0.05} (p_{cj}-p_{tj})\frac{ p_{cj}-p_{tj}}{p_{cj}}
$$
$T_0$ is the standard Chi Squared formula modified to only include those bins
where the relative cell differences are greater than 5%.
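
A direct translation of $T_0$ into R (a sketch only; `pt` and `pc` are assumed
to be vectors of Test and Control bin proportions).

```{r t0def, echo=TRUE}
# T0: chi-squared-style sum restricted to bins whose relative difference
# exceeds the 5% threshold.
T0 <- function(pt, pc, thr = 0.05) {
  rel <- (pc - pt) / pc
  sum(ifelse(abs(rel) > thr, (pc - pt) * rel, 0))
}
```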
### 1
Define $CRR_j = |\frac{p_{jt}-p_{jc}}{p_{jc}}|$ (i.e. the absolute value of the
relative risk minus one).
Then the statistic $T_{1u}$ is
$$
T_{1u} = \frac{1}{J} \sum_{j=1}^{j=J} I_{ CRR_j > 0.05 } CRR_j
$$
and $T_{1w}$ as
$$
T_{1w} = \sum_{j=1}^{j=J} \frac{p_{jc} + p_{jt}}{2}I_{ CRR_j > 0.05 } CRR_j
$$
$T_{1u}$ computes the average relative difference for those bins where this
relative difference is more than 5% (in other words, it computes the average
complementary [relative risk](https://en.wikipedia.org/wiki/Relative_risk) for
the bins that exceed the threshold). Each bin is treated equally, i.e. the
rightmost bin gets as much weight as the center of the distribution. One side effect
of this is that in the tails the relative change is often large (because of
small cell sizes), and we might be less interested in these bins with small
counts and small proportion contribution to the distribution.

In other words, for a change to be useful in the tail of the distribution the
relative difference must be very large; hence $T_{1w}$, which weights the
useful relative differences by the average bin proportion, adjusts for this.
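
Sketches of both statistics, following the definitions above; `pt` and `pc` are
again assumed to be vectors of Test and Control bin proportions.

```{r t1def, echo=TRUE}
# CRR: complementary relative risk per bin
CRR <- function(pt, pc) abs((pt - pc) / pc)

# T1u: average CRR over bins exceeding the threshold (each bin weighted 1/J)
T1u <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  mean(ifelse(crr > thr, crr, 0))
}

# T1w: as T1u, but each bin weighted by its average proportion (pc + pt)/2
T1w <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  sum(ifelse(crr > thr, (pc + pt) / 2 * crr, 0))
}
```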
#### Adjustments
The relative change can often be 'noisy', leading to a wide bootstrap
distribution. One adjustment (with no theoretical backing :) is to take the log of
the relative change, i.e.
$$
T_{1lu} = \frac{1}{J} \sum_{j=1}^{j=J} I_{ CRR_j > 0.05 } log(CRR_j)
$$
and $T_{1lw}$ as
$$
T_{1lw} = \sum_{j=1}^{j=J} \frac{p_{jc} + p_{jt}}{2} I_{ CRR_j > 0.05 } log(CRR_j)
$$
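
The log variants, reusing the `CRR` helper sketched above; this is only a
sketch of the formulas exactly as written.

```{r t1logdef, echo=TRUE}
# Log-of-relative-change variants of T1u and T1w
T1lu <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  mean(ifelse(crr > thr, log(crr), 0))
}
T1lw <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  sum(ifelse(crr > thr, (pc + pt) / 2 * log(crr), 0))
}
```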
> **Adjusting for Empty Cells**
> Because some cells might be empty, we need to adjust for the proportions being
> estimated as zero. An easy fix is to use a Bayesian approach (a
> Multinomial-Dirichlet model with a non-informative prior) to estimate the bin
> probabilities and avoid zero estimates.
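
One way this could be implemented is a posterior-mean estimate under a
symmetric Dirichlet prior; the choice $\alpha = 0.5$ below is just one common
non-informative option, assumed here for the sketch rather than prescribed by
the text.

```{r dirichlet, echo=TRUE}
# Posterior-mean bin probabilities under a symmetric Dirichlet(alpha) prior;
# avoids zero estimates for empty cells.
smooth_props <- function(counts, alpha = 0.5) {
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}
smooth_props(c(10, 0, 5, 0))   # no bin is estimated as exactly zero
```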
### 2
The relative change can be 'wild', i.e. in some sense it has a wide variance. One
approach to stabilizing this is to use the log odds. This, however, means we need
to change our 'significance' threshold from the 5% change in proportion to something on the odds scale.
The following graph plots $p$ on the x-axis and the corresponding odds ratio for a
5% change in $p$. The second graph plots the other direction, i.e. for a 5%
change in the odds ratio (note that when the odds ratio is 1, Test and Control are the
same), what are the corresponding values of $p$.
> !!!! INSERT GRAPH
Given $logodds(j) = \log( p_{jc}/(1-p_{jc}))-\log( p_{jt}/(1-p_{jt}) )$, we have
$$
T_{2u} = \frac{1}{J} \sum_{j=1}^{j=J} I_{ |logodds(j)| > \log(1.05) } logodds(j)
$$
and $T_{2w}$ as
$$
T_{2w} = \sum_{j=1}^{j=J} \frac{1}{2}(p_{jt}+p_{jc}) I_{ |logodds(j)| > \log(1.05) } logodds(j)
$$
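
Sketches of $T_{2u}$ and $T_{2w}$ as defined above; the threshold $\log(1.05)$
encodes a 5% change on the odds scale, and `pt`/`pc` are again assumed to be
bin-proportion vectors.

```{r t2def, echo=TRUE}
# Log odds ratio per bin (Control vs Test), matching the definition above
logodds <- function(pt, pc) log(pc / (1 - pc)) - log(pt / (1 - pt))

T2u <- function(pt, pc, thr = log(1.05)) {
  lo <- logodds(pt, pc)
  mean(ifelse(abs(lo) > thr, lo, 0))
}
T2w <- function(pt, pc, thr = log(1.05)) {
  lo <- logodds(pt, pc)
  sum(ifelse(abs(lo) > thr, (pt + pc) / 2 * lo, 0))
}
```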
## Bootstrapping the Test
The standard approach: given two samples, `T` and `C`, of sample sizes $N_t$ and
$N_c$, we

1. estimate the cell proportions (either simple frequentist, and run the risk of zero
   estimates, or use a simple non-informative Bayesian approach)
2. compute the observed statistic $L$ (one of the above)

Now we compute the null distribution:

1. pool the two datasets together
2. resample $N_t+N_c$ observations
3. randomly assign them to `T` and `C`
4. compute the statistic
5. repeat steps 2-4 many, many times to get the bootstrap null distribution

We then compute the probability of seeing a statistic as large as the observed one
under this null distribution and, if this is very small, we reject. This implies it is very unlikely to have seen such a large
value of our observed statistic assuming that `T` and `C` were the same.

We can also identify the 'hotspots', i.e. the Test bins which are different, and
because of the thresholding we need not really worry about multiple comparison
issues (since we enforce the large threshold of at least 5%, the flagged bins are most
likely real and unlikely to be false positives). A sketch of the resampling
procedure is given below.
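
This is one possible implementation, working from raw bin counts; `counts_t`
and `counts_c` are hypothetical count vectors, and `T0`/`smooth_props` are the
helpers sketched earlier. Drawing both resampled groups from the pooled,
smoothed proportions plays the role of 'pool, resample and randomly assign'.

```{r bootnull, echo=TRUE}
# Pooled-resampling null distribution for a chosen statistic; larger values
# of the statistic are taken to indicate a bigger difference.
boot_null <- function(counts_t, counts_c, stat = T0, B = 1000, alpha = 0.5) {
  Nt  <- sum(counts_t); Nc <- sum(counts_c)
  obs <- stat(smooth_props(counts_t, alpha), smooth_props(counts_c, alpha))
  pool <- smooth_props(counts_t + counts_c, alpha)   # pooled bin probabilities
  null <- replicate(B, {
    bt <- as.vector(rmultinom(1, Nt, pool))          # resampled 'Test' counts
    bc <- as.vector(rmultinom(1, Nc, pool))          # resampled 'Control' counts
    stat(smooth_props(bt, alpha), smooth_props(bc, alpha))
  })
  list(observed = obs, p.value = mean(null >= obs))
}
```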
## Examples
```{pydbx x0,cache=TRUE}
histo55 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v55")
histo58 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v58")
histo59 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v59")
from pyspark.sql import Row

def explodeUs(ver):
    # Explode a profile's unit density (a bucket -> proportion map) into
    # one Row per (client, version, bucket).
    def t(x):
        c = x.client_id
        ud = x.unit_density
        if ud is not None:
            for k in ud:
                f = {'cid': c, 'v': ver, 'b': k, 'n': ud[k]}
                yield Row(**f)
    return t

hall = sc.union([histo55.rdd.flatMap(explodeUs('v55')),
                 histo58.rdd.flatMap(explodeUs('v58')),
                 histo59.rdd.flatMap(explodeUs('v59'))]).toDF()
hall.createOrReplaceTempView("hall")
# Hash the client id and write the long-format (cid, version, bucket, proportion) table
hall2 = spark.sql(""" select crc32(cid) as cid,v,b,n from hall """)
hall2.repartition(100).write.csv("s3://mozilla-metrics/user/sguha/tmp/CRT/units_TIME_TO_DOM_LOADING_MS.csv",mode='overwrite')
```
```{r x1, dependson='x0',cache=TRUE,include=FALSE}
library(data.table)  # fread / data.table syntax used below
library(bit64)
system('rm -rf /tmp/xx; aws s3 sync s3://mozilla-metrics/user/sguha/tmp/CRT/units_TIME_TO_DOM_LOADING_MS.csv/ /tmp/xx')
h <- fread('cat /tmp/xx/p*')
setnames(h, c('cid','v','b','n'))
```
```{r x2,dependson='x1',cache=TRUE,include=FALSE}
hs <- h[,list(p = mean(n)), by=list(v,b)]
```
We will try and see how our tests compare for

- versions 55 and 58 (ought to reject!)
- versions 58 and 59 (ought to **not** reject)

(Note: the code to aggregate user histograms into unit densities can be found [here](https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/18639/command/18683).)
```{r x3,dependson='x2',layout="l-body-outset",cache=TRUE,fig.width=10,fig.height=3}
library(lattice)  # xyplot
j <- hs[v=='v55',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 55',
data=j)
j <- hs[v=='v58',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 58',
data=j)
j <- hs[v=='v59',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 59',
data=j)
```
## Some Simulations
Taking version 55 as a starting point, we will

1. change the 0 bucket by 5% (from `r hs[v=='v55',][b==0,round(p*100,3)]`% to `r hs[v=='v55',][b==0,round(p*1.05*100,3) ]`%)
2. change the bottom 10% of buckets by 5% (corresponding to times of `r paste( hs[v=='v55',][p< quantile(p,0.10) ,][,b],collapse=', ')`)

and check to see that the tests correctly reject.

What is the best way to aggregate the data? We have converted every profile's
many histograms into one normalized histogram (the unit densities); these densities
are then averaged to get the plots you see above. The obvious bootstrap resampling
methods that come to mind are

1. resample profiles (and their corresponding unit densities) and then compute
   the overall density (see the sketch below)
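
A sketch (not evaluated) of this profile-level resampling, assuming the
long-format table `h` (columns `cid`, `v`, `b`, `n`) loaded earlier; the
per-bin aggregation mirrors the `mean(n)` used for the plots above.

```{r profileboot, echo=TRUE, eval=FALSE}
# Resample profiles (with replacement) within a version and recompute the
# overall density for one bootstrap replicate.
resample_density <- function(h, ver) {
  cids <- h[v == ver, unique(cid)]
  take <- data.table(cid = sample(cids, length(cids), replace = TRUE))
  # duplicate a profile's rows once per time it was sampled
  hb <- merge(h[v == ver], take, by = "cid", allow.cartesian = TRUE)
  hb[, list(p = mean(n)), by = b][order(b), ]
}
```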
### Test 0: Modified Chi Squared
```{r test0,dependson='x2',cache=TRUE,include=FALSE}