---
title: Detecting Changes in Histograms
description: |
  Detecting Useful Changes in Histograms with small to many bins
author:
  - name: Saptarshi Guha
    affiliation: Product Metrics
date: "`r Sys.Date()`"
output:
  radix::radix_article:
    toc: true
---

```{r main, child = '/home/sguha/mz/template_radix.Rmd'}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Background

At Mozilla, a lot of our measurements are stored as histograms. For example,
consider
[`TIME_TO_DOM_COMPLETE`](https://telemetry.mozilla.org/probe-dictionary/?search=time_to_dom_complete_ms&searchtype=in_name&channel=nightly),
which measures the time from NavigationStart
to domComplete. The actual time is, however, bucketed into one of 100 bins (per
profile). These histograms can then be aggregated to the profile level
(i.e. we compute a normalized histogram for the profile).

> The word *normalized* here means the final histogram is a density. By converting
> to a density, profiles with many observations are treated the same as
> profiles with few observations.

The densities are then combined across profiles. We use equal weighting across
profiles (i.e. $\frac{1}{n}$ weight), treating all profiles equally. The
resulting density is a mixture across profiles and, as can be seen
[here](https://mzl.la/2zipuUm), it appears to be a mixture of two broad types
of densities.
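A minimal sketch of this aggregation (the toy table below is illustrative; the column names `cid`, `b`, `n` mirror the data used later but the values are made up):

```{r unit-density-sketch, eval=FALSE}
library(data.table)
## toy input: one row per (profile, bin) with raw counts
h0 <- data.table(cid = c(1,1,1, 2,2), b = c(0,50,100, 0,50), n = c(10,5,5, 2,2))
## normalize within profile: each profile's histogram becomes a density
h0[, p := n/sum(n), by = cid]
## equal 1/n weighting across profiles: bins a profile never saw contribute 0
nprof <- uniqueN(h0$cid)
mix <- h0[, .(p = sum(p)/nprof), by = b]
```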
> Is this mixture arising from different types of profiles? Or do profiles
> experience a mixture of `TIME_TO_DOM_COMPLETE`? I believe Tim Smith has a plot
> of the density of the average `X` (for some `X`, ask Tim which one) experienced
> by a profile (i.e. he reduced every profile to one value, the mean, of
> `TIME_TO_DOM_COMPLETE`) and this looks very much like a mixture. Hence I assume that, rather than
> a profile experiencing a mixture, we have different types of profiles. Possibly
> some hidden variable is at work here.
## The Goal

There are a few scenarios, but all involve comparing two groups:

(1) in an experiment with two or more branches, how can we compare histogram `H`
for branch `T` vs branch `C`?
(2) for different versions of Firefox, say the first $n$ days of Fx58 vs the first
$n$ days of Fx59, how can we compare `T` and `C` for histogram `H`?
(3) for builds of Nightly, a time series of sorts, how can we compare builds?
For example, today's build vs yesterday's? Or today's vs the last 3 builds?
## Some problems

- we have decided not to aggregate a profile's histogram into one value. From an
engineer's point of view, they are likely interested in seeing the distribution
of values experienced by a user, not the profile's average value
- though the example histogram above has 100 bins, well populated, many among the
[Top
Tier](https://docs.google.com/spreadsheets/d/1557qp6KBPBdf9CsH-GICs6_BqH0zu-njmRYoflWkQHE/edit#gid=0)
have 20 buckets, and it doesn't make sense to treat these as continuous
variables
- some histograms don't have many bins populated, or the bins are very sparsely populated

Can we create a methodology that can compare multinomial distributions (i.e. can
we compare pmfs) and identify when the distributions are not equivalent?
**What is Equivalence?** Now is a good time to introduce equivalence. We won't
articulate the definition
here, but will describe the motivation. We want to detect when

- a few bins have very different proportions
- many bins have different proportions (which need not be very big)
- we might (or might not) treat all bins equally (i.e. the tail bins or bins
with low $p$ might get less weight; that is, differences in the tails, where the
mass is small to begin with, count for less)
## Some Approaches

We thought of two approaches before this (i.e. tweaking existing methodologies).

### Chi Squared

- A standard option is the Chi Squared goodness of fit test. The problem with
this is that it rejects for even the smallest difference for datasets our size
- this approach could be modified by using a very, very small p-value, but this
is hacky and not directly tied to what sort of practically significant
difference we are looking for
- it's quite difficult to come up with a null distribution of pmfs (probability mass
functions) that look 'equivalent' to the test distribution

The Chi Squared formula $\sum_j \frac{(E_j-O_j)^2}{E_j}$, where $O$ is the Test counts and $E$
is the Control counts, can be restated (assuming equal sample sizes, $N$) as
$N\sum_j (p_{jc}-p_{jt})\frac{p_{jc}-p_{jt}}{p_{jc}}$, i.e. the sum of the
relative differences in bin proportions, each weighted by the actual bin proportion
difference.

<aside>
larger discrepancies are often found in the tails (larger sampling error) but they
are down weighted because $p_{jc}-p_{jt}$ will be smaller there
</aside>
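A quick numeric check of this restatement, with toy counts and equal sample sizes:

```{r chisq-identity, eval=FALSE}
E <- c(400, 300, 200, 100)      # Control counts (the 'expected')
O <- c(380, 310, 220, 90)       # Test counts (the 'observed')
N <- sum(E)                     # equal sample sizes assumed
pc <- E/N; pt <- O/N
sum((E - O)^2/E)                # classic Chi Squared form
N * sum((pc - pt)^2/pc)         # restated form; gives the same value
```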
### Multinomial Logit

- use a multinomial logit model of $P(\text{X in bucket j}|\text{X in bucket
0})$ and test whether the branch indicator is significant or not
- then test if individual coefficients for buckets are above some threshold

It's kind of hacky again, and there is a need to test many coefficients and then
surface them according to the equivalence definition above.
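For concreteness, a sketch of what this could look like with `nnet::multinom`; the data layout and the `count` column are hypothetical, and this is not the approach taken below:

```{r mlogit-sketch, eval=FALSE}
library(nnet)
## hypothetical aggregated layout: one row per (branch, bucket) with a count
d <- expand.grid(branch = c("C", "T"), bucket = factor(0:19))
d$count <- rpois(nrow(d), 100)
## bucket 0 is the reference category; one branch coefficient per other bucket
fit <- multinom(bucket ~ branch, data = d, weights = count)
summary(fit)$coefficients[, "branchT"]
```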
## Bootstrapping a Custom Test Statistic

The approach we take is to bootstrap a custom test statistic, of which there
are many varieties. We will compare two histograms with $J$ bins, with $p_{jt}$
being the proportion in the $j$'th bin for Test and $p_{jc}$ the bin proportion
for Control. They both sum to 1 and there are $N_t$ and $N_c$ observations in
each.

We consider a bin to be different if $|\frac{p_{jt}-p_{jc}}{p_{jc}}|>0.05$,
i.e. a bin is different if there is a 5% change in the bin proportion. The 5% is
taken from past Shield study experience as a typical useful difference the
investigator sought.
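As a sketch, this rule in code, with `pt` and `pc` the Test and Control bin proportion vectors:

```{r different-bins-sketch, eval=FALSE}
## relative change per bin, and the bins that clear the 5% threshold
rel_change <- function(pt, pc) abs((pt - pc)/pc)
different_bins <- function(pt, pc, thresh = 0.05) which(rel_change(pt, pc) > thresh)
```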
### 0

The Chi Squared approach is fairly reasonable, but we can modify it to include
the signal described above. Define $T_0$ as

$$
T_0 = \sum_j I_{|\frac{p_{jc}-p_{jt}}{p_{jc}}|>0.05} (p_{jc}-p_{jt})\frac{p_{jc}-p_{jt}}{p_{jc}}
$$

$T_0$ is the standard Chi Squared formula modified to only include those bins
where the relative cell differences are greater than 5%.
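A sketch of $T_0$ in R, following the formula above:

```{r t0-sketch, eval=FALSE}
## Chi Squared style sum restricted to bins whose relative change exceeds 5%
T0 <- function(pt, pc, thresh = 0.05) {
  rel <- (pc - pt)/pc
  sum(((pc - pt)*rel)[abs(rel) > thresh])
}
```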
### 1

Define $CRR_j = |\frac{p_{jt}-p_{jc}}{p_{jc}}|$ (i.e. the absolute value of the relative
risk minus one).

Then the statistic $T_{1u}$ is

$$
T_{1u} = \frac{1}{J} \sum_{j=1}^{J} I_{CRR_j > 0.05} CRR_j
$$

and $T_{1w}$ is

$$
T_{1w} = \sum_{j=1}^{J} \frac{p_{jc} + p_{jt}}{2} I_{CRR_j > 0.05} CRR_j
$$

$T_{1u}$ computes the average relative difference for those bins where this
relative difference is more than 5% (in other words, it computes the average
complementary [relative risk](https://en.wikipedia.org/wiki/Relative_risk) for
the bins that exceed a threshold). Each bin is treated equally, i.e. the right
most bin gets as much weight as the center of the distribution. One side effect
of this is that the relative change is often large in the tails (because of
small cell sizes), and we might be less interested in these bins with small
counts and small proportion contribution to the distribution.

In other words, for a change to be useful in the tail of the distribution the
relative difference must be very large; hence $T_{1w}$, which weights the
useful relative differences by the average bin proportion, adjusts for this.
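Sketches of both statistics in R:

```{r t1-sketch, eval=FALSE}
T1u <- function(pt, pc, thresh = 0.05) {
  crr <- abs((pt - pc)/pc)
  mean(crr * (crr > thresh))               # 1/J times the thresholded sum
}
T1w <- function(pt, pc, thresh = 0.05) {
  crr <- abs((pt - pc)/pc)
  sum(((pc + pt)/2) * crr * (crr > thresh))  # weight by average bin proportion
}
```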
#### Adjustments

The relative change can often be 'noisy', leading to a wide bootstrap
distribution. One adjustment (with no theoretical backing :) is to take the log of the
relative change, i.e.

$$
T_{1lu} = \frac{1}{J} \sum_{j=1}^{J} I_{CRR_j > 0.05} \log(CRR_j)
$$

and $T_{1lw}$ as

$$
T_{1lw} = \sum_{j=1}^{J} \frac{p_{jc} + p_{jt}}{2} I_{CRR_j > 0.05} \log(CRR_j)
$$
> **Adjusting for Empty Cells**
> Because some cells might be empty, we need to adjust for the proportions being
> estimated as zero. An easy fix is to use a Bayesian approach, a
> Multinomial-Dirichlet prior (non-informative version), to estimate bin
> probabilities and avoid zero estimates.
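A sketch of that fix, using the posterior mean under a symmetric Dirichlet prior; the Jeffreys-style $\alpha = 1/2$ is one non-informative choice:

```{r dirichlet-sketch, eval=FALSE}
## posterior mean of bin probabilities under a Dirichlet(alpha, ..., alpha)
## prior; never returns exact zeros, even for empty cells
smooth_p <- function(counts, alpha = 0.5) {
  (counts + alpha)/(sum(counts) + alpha*length(counts))
}
```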
### 2

The relative change can be 'wild', i.e. in some sense it has a wide variance. One
approach to stabilizing this is using the log odds. This however means we need to
move our 'significance' threshold from the 5% proportion scale to something on the odds
scale.
The following graph plots $p$ on the x-axis and the corresponding odds ratio for a
5% change in $p$. The second graph plots the other direction, i.e. for a 5%
change in the odds ratio (note when the odds ratio is 1, Test and Control are the
same), what are the corresponding $p$.

> !!!! INSERT GRAPH

Given $logodds(j) = \log(p_{jc}/(1-p_{jc}))-\log(p_{jt}/(1-p_{jt}))$, we have

$$
T_{2u} = \frac{1}{J} \sum_{j=1}^{J} I_{|logodds(j)| > \log(1.05)} logodds(j)
$$

and $T_{2w}$ as

$$
T_{2w} = \sum_{j=1}^{J} \frac{1}{2}(p_{jt}+p_{jc}) I_{|logodds(j)| > \log(1.05)} logodds(j)
$$
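A sketch of both versions, with the threshold on the log odds scale as in the formulas above:

```{r t2-sketch, eval=FALSE}
T2 <- function(pt, pc, thresh = log(1.05), weighted = FALSE) {
  lo <- log(pc/(1 - pc)) - log(pt/(1 - pt))          # log odds ratio per bin
  w <- if (weighted) (pt + pc)/2 else rep(1/length(pt), length(pt))
  sum((w*lo)[abs(lo) > thresh])                      # thresholded weighted sum
}
```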
## Bootstrapping the Test

The standard approach: given two samples, `T` and `C`, of sample sizes $N_t$ and
$N_c$, we

1. estimate cell proportions (either simple frequentist, and run the risk of zero
estimates, or use a simple non-informative Bayesian approach)
2. compute the observed statistic $L$ (one of the above)

Now we compute the null distribution:

1. pool the two datasets together
2. resample $N_t+N_c$ observations
3. randomly assign to `T` and `C`
4. compute the statistic
5. repeat 1-4 many, many times to get the bootstrap null distribution

Compute the probability that we would see our observed statistic, and if this is
very small we reject. This implies it is very unlikely to have seen such a large
value of our observed statistic assuming that `T` and `C` were the same.

We can also identify the 'hotspots', i.e. the Test bins which are different, and
because of the thresholding we need not really worry about multiple comparison
issues (since we enforce the large threshold of at least 5%, flagged bins are most
likely true differences and unlikely to be false).
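A sketch of the whole procedure, reusing the hypothetical `smooth_p` and `T1u` helpers from above; `x_t` and `x_c` are vectors of per-observation bucket labels:

```{r boot-sketch, eval=FALSE}
counts_by_bin <- function(x, bins) as.integer(table(factor(x, levels = bins)))

boot_null <- function(x_t, x_c, stat, bins, B = 1000) {
  pooled <- c(x_t, x_c); nt <- length(x_t); n <- length(pooled)
  replicate(B, {
    s <- sample(pooled, n, replace = TRUE)             # resample Nt + Nc
    grp <- sample(rep(c(TRUE, FALSE), c(nt, n - nt)))  # random T/C assignment
    stat(smooth_p(counts_by_bin(s[grp], bins)),
         smooth_p(counts_by_bin(s[!grp], bins)))
  })
}

## null <- boot_null(x_t, x_c, T1u, bins = sort(unique(c(x_t, x_c))))
## pval <- mean(null >= observed)   # reject if very small
```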
## Examples

```{pydbx x0,cache=TRUE}
histo55 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v55")
histo58 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v58")
histo59 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v59")

from pyspark.sql import Row

def explodeUs(ver):
    # flatten a profile's unit density map into one Row per (client, bin)
    def t(x):
        c = x.client_id
        ud = x.unit_density
        if ud is not None:
            for k in ud:
                f = {'cid': c, 'v': ver, 'b': k, 'n': ud[k]}
                yield Row(**f)
    return t

hall = sc.union([histo55.rdd.flatMap(explodeUs('v55')),
                 histo58.rdd.flatMap(explodeUs('v58')),
                 histo59.rdd.flatMap(explodeUs('v59'))
                 ]).toDF()
hall.createOrReplaceTempView("hall")
# hash client ids to 32-bit integers to keep the exported CSV small
hall2 = spark.sql(""" select crc32(cid) as cid,v,b,n from hall """)
hall2.repartition(100).write.csv("s3://mozilla-metrics/user/sguha/tmp/CRT/units_TIME_TO_DOM_LOADING_MS.csv",mode='overwrite')
```
```{r x1, dependson='x0',cache=TRUE,include=FALSE}
library(data.table)
library(bit64)
## pull the exploded unit densities from S3 and read into one data.table
system('rm -rf /tmp/xx; aws s3 sync s3://mozilla-metrics/user/sguha/tmp/CRT/units_TIME_TO_DOM_LOADING_MS.csv/ /tmp/xx')
h <- fread('cat /tmp/xx/p*')
setnames(h, c('cid','v','b','n'))
```

```{r x2,dependson='x1',cache=TRUE,include=FALSE}
## equal-weight mixture density: average the unit densities across profiles
hs <- h[,list(p = mean(n)), by=list(v,b)]
```
We will try and see how our tests compare for

- versions 55 and 58 (ought to reject!)
- versions 58 and 59 (ought to **not** reject)

(Note: the code to aggregate user histograms into unit densities can be found [here](https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/18639/command/18683).)
```{r x3,dependson='x2',layout="l-body-outset",cache=TRUE,fig.width=10,fig.height=3}
j <- hs[v=='v55',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 55',
       data=j)
j <- hs[v=='v58',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 58',
       data=j)
j <- hs[v=='v59',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 59',
       data=j)
```
## Some Simulations

Taking version 55 as a starting point, we will

1. change the 0 bucket by 5% (from `r hs[v=='v55',][b==0,round(p*100,3)]`% to `r hs[v=='v55',][b==0,round(p*1.05*100,3) ]`%)
2. change the bottom 10% of buckets by 5% (corresponding to times of `r paste( hs[v=='v55',][p< quantile(p,0.10) ,][,b],collapse=', ')`)

and check that the tests correctly reject.
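A sketch of the first perturbation; renormalizing so the result is still a density is our assumption, since the text above only states the 5% bump:

```{r perturb-sketch, eval=FALSE}
## p: version 55 bin proportions ordered by bin, with bucket 0 first
perturb0 <- function(p, delta = 0.05) {
  p[1] <- p[1]*(1 + delta)   # bump the 0 bucket by 5%
  p/sum(p)                   # renormalize (assumption)
}
```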
What is the best way to aggregate the data? We have converted every profile's
many histograms into one normalized histogram (the unit densities); these densities
are then averaged to get the plots you see above. The obvious bootstrap resampling
methods that come to mind are

1. resample profiles (and their corresponding unit densities) and then compute
the overall density, as sketched below
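A sketch of that resampling with the `h` table from above (`cid`, `v`, `b`, `n` columns), one version at a time:

```{r resample-sketch, eval=FALSE}
## resample profiles with replacement, then recompute the equal-weight
## mixture density from the resampled unit densities
resample_density <- function(h1) {
  cids <- unique(h1$cid)
  pick <- data.table(cid = sample(cids, length(cids), replace = TRUE))
  h1[pick, on = "cid", allow.cartesian = TRUE][, .(p = sum(n)/length(cids)), by = b]
}
## e.g. resample_density(h[v == 'v55'])
```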
### Test 0: Modified Chi Squared

```{r x4,dependson='x2',cache=TRUE,include=FALSE}