---
title: "Detecting Changes in Histograms"
description: |
  Detecting Useful Changes in Histograms with small to many bins
author:
  - name: Saptarshi Guha
    affiliation: Product Metrics
date: "`r Sys.Date()`"
output:
  radix::radix_article:
    toc: true
---
```{r main, child = '/home/sguha/mz/template_radix.Rmd'}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Background
At Mozilla, a lot of our measurements are stored as histograms. For example,
consider
[`TIME_TO_DOM_COMPLETE`](https://telemetry.mozilla.org/probe-dictionary/?search=time_to_dom_complete_ms&searchtype=in_name&channel=nightly),
which measures the time from NavigationStart
to domComplete. The actual time is, however, bucketed into one of 100 bins (per
profile). These histograms can then be aggregated to the profile level
(i.e. we compute a normalized histogram for the profile).

> The word *normalized* here means the final histogram is a density. By converting
> into a density, profiles with more observations are treated the same as
> profiles with few observations.

The densities are then averaged across profiles. We use equal weighting across
profiles (i.e. $\frac{1}{n}$ weight), treating all profiles equally. The
resulting density is a mixture across profiles and, as can be seen
[here](https://mzl.la/2zipuUm), it appears to be a mixture of two broad types
of densities.
> Is this mixture arising from different types of profiles, or do profiles
> experience a mixture of `TIME_TO_DOM_COMPLETE`? I believe Tim Smith has a plot
> of the density of the average `X` (for some `X`, ask Tim which one) experienced
> by a profile (i.e. he reduced every profile to one value, the mean of
> `TIME_TO_DOM_COMPLETE`) and this is very mixturish. Hence I assume that, rather than
> a profile experiencing a mixture, we have different types of profiles. Possibly
> some hidden variable is at work here.
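
As a concrete (toy) illustration of the aggregation described above, the sketch
below builds unit densities from made-up per-profile counts and averages them
with equal $\frac{1}{n}$ weights. The data and object names here are purely
illustrative, not real telemetry.

```{r unitdensity, echo=TRUE}
# Toy sketch: per-profile bucket counts -> unit densities -> equal-weight mixture
toy <- list(
  heavy = c(b0 = 40, b1 = 10, b2 = 0),  # a profile with many observations
  light = c(b0 = 1,  b1 = 1,  b2 = 3)   # a profile with few observations
)
unit_density <- function(counts) counts / sum(counts)    # histogram -> density
densities <- lapply(toy, unit_density)
mixture <- Reduce(`+`, densities) / length(densities)    # equal (1/n) weights
mixture
```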
## The Goal
There are a few scenarios, but all involve comparing two groups:

1. in an experiment with two or more branches, how can I compare histogram `H`
   for branch `T` vs branch `C`?
2. for different versions of Firefox, say the first $n$ days of Fx58 vs the first
   $n$ days of Fx59, how can we compare `T` and `C` for histogram `H`?
3. for builds of Nightly, a time series of sorts, how can we compare builds?
   For example, today's build vs yesterday's build, or today's vs the last 3 builds?
## Some problems
- we have decided not to aggregate a profile's histogram into one value. From an
  engineer's point of view, they are likely interested in seeing the distribution
  of values experienced by a user, not the profile's average value
- though the example histogram above has 100 bins, well populated, many among the
  [Top
  Tier](https://docs.google.com/spreadsheets/d/1557qp6KBPBdf9CsH-GICs6_BqH0zu-njmRYoflWkQHE/edit#gid=0)
  have 20 buckets and it doesn't make sense to treat these as continuous
  variables
- some histograms don't have many bins populated, or the bins are very sparsely populated

Can we create a methodology that can compare multinomial distributions (i.e. can
we compare pmfs) and identify when the distributions are not equivalent?
**What is Equivalence?** Now is a good time to introduce equivalence. We won't
articulate the definition here, but will describe the motivation. We want to
detect a difference when

- a few bins have very different proportions
- many bins have different proportions (which need not be very big individually)
- we might (or might not) treat all bins equally (i.e. the tail bins, or bins
  with low $p$, might get less weight; that is, differences in the tails, where the
  mass is small to begin with, count for less)
## Some Approaches
We thought of two approaches before this (i.e. tweaking existing methodologies).
### Chi Squared
- A standard option is the Chi Squared goodness of fit test. The problem with
  this is that it rejects for even the smallest difference for datasets our size
- this approach could be modified by using a very, very small p-value, but this
  is hacky and not directly tied to what sort of practically significant
  difference we are looking for
- it's quite difficult to come up with a null distribution of pmfs (probability mass
  functions) that look 'equivalent' to the test distribution

The Chi Squared formula $\sum \frac{(E_i-O_i)^2}{E_i}$, where $O$ is the Test counts and $E$
is the Control counts, can be restated (assuming equal sample sizes, $N$) as $N\sum
(p_{cj}-p_{tj})\frac{ p_{cj}-p_{tj}}{p_{cj}}$, i.e. the sum of the
relative differences in bin proportions weighted by the actual bin proportion
difference.
<aside>
larger discrepancies are often found in the tails (larger sampling error) but they
are down-weighted because $p_{cj}-p_{tj}$ will be smaller there
</aside>
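
A quick numeric check of this restatement; the bin proportions below are
assumptions chosen only for illustration.

```{r chisqcheck, echo=TRUE}
# With equal sample sizes N, sum((O - E)^2 / E) equals N * sum((pc - pt)^2 / pc)
N  <- 1e5
pc <- c(0.50, 0.30, 0.15, 0.05)   # control bin proportions (made up)
pt <- c(0.48, 0.31, 0.16, 0.05)   # test bin proportions (made up)
O  <- N * pt                      # 'observed' (Test) counts
E  <- N * pc                      # 'expected' (Control) counts
c(classic  = sum((O - E)^2 / E),
  restated = N * sum((pc - pt) * (pc - pt) / pc))   # the two should agree
```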
### Multinomial Logit
- use a multinomial logit model for the log odds of $\text{X in bucket } j$
  relative to $\text{X in bucket } 0$, and test whether the branch indicator is significant or not
- then test whether the individual coefficients for buckets are above some threshold

It's kind of hacky again, and there is a need to test many coefficients and then
surface them according to the equivalence definition above.
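
A minimal sketch of what this could look like, using `nnet::multinom` on toy
aggregated counts; the data frame and counts below are assumptions for
illustration, not real telemetry, and the chunk is not evaluated.

```{r mlogit, echo=TRUE, eval=FALSE}
# Sketch: grouped multinomial logit with bucket 0 as the reference category.
# The 'branchT' coefficients measure the shift in log odds of landing in
# bucket j (vs bucket 0) for Test relative to Control.
library(nnet)
d <- expand.grid(bucket = factor(0:3), branch = c("C", "T"))
d$count <- c(500, 300, 150, 50,   # Control counts by bucket (made up)
             480, 310, 160, 50)   # Test counts by bucket (made up)
fit <- multinom(bucket ~ branch, weights = count, data = d, trace = FALSE)
summary(fit)$coefficients         # inspect the per-bucket branch coefficients
```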
## Bootstrapping a Custom Test Statistic
The approach we take is to bootstrap a custom test statistic, of which there
are many varieties. We will compare two histograms with $J$ bins, with $p_{jt}$
being the proportion in the $j$'th bin for Test and $p_{jc}$ the bin proportion
for Control. They both sum to 1 and there are $N_t$ and $N_c$ observations in
each.

We consider a bin to be different if $|\frac{p_{jt}-p_{jc}}{p_{jc}}|>0.05$,
i.e. a bin is different if there is at least a 5% relative change in the bin proportion. The 5% is
taken from past Shield study experience as a typical useful difference the
investigator sought.
### 0
The Chi Squared approach is fairly reasonable, but we can modify it to include
a signal as described above. Define $T_0$ as
$$
T_0 = \sum_j I_{ | \frac{ p_{cj}-p_{tj}}{p_{cj}}|>0.05} (p_{cj}-p_{tj})\frac{ p_{cj}-p_{tj}}{p_{cj}}
$$
$T_0$ is the standard Chi Squared formula modified to only include those bins
where the relative cell differences are greater than 5%.
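
A direct translation of $T_0$ into R (a sketch only; `pt` and `pc` are assumed
to be vectors of Test and Control bin proportions).

```{r t0def, echo=TRUE}
# T0: chi-squared-style sum restricted to bins whose relative difference
# exceeds the 5% threshold.
T0 <- function(pt, pc, thr = 0.05) {
  rel <- (pc - pt) / pc
  sum(ifelse(abs(rel) > thr, (pc - pt) * rel, 0))
}
```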
### 1
Define $CRR_j = |\frac{p_{jt}-p_{jc}}{p_{jc}}|$ (i.e. the absolute value of the
relative risk minus one).
Then the statistic $T_{1u}$ is
$$
T_{1u} = \frac{1}{J} \sum_{j=1}^{j=J} I_{ CRR_j > 0.05 } CRR_j
$$
and $T_{1w}$ as
$$
T_{1w} = \sum_{j=1}^{j=J} \frac{p_{jc} + p_{jt}}{2}I_{ CRR_j > 0.05 } CRR_j
$$
$T_{1u}$ computes the average relative difference for those bins where this
relative difference is more than 5% (in other words, it computes the average
complementary [relative risk](https://en.wikipedia.org/wiki/Relative_risk) for
the bins that exceed the threshold). Each bin is treated equally, i.e. the
rightmost bin gets as much weight as the center of the distribution. One side effect
of this is that in the tails the relative change is often large (because of
small cell sizes), and we might be less interested in these bins with small
counts and small proportion contribution to the distribution.

In other words, for a change to be useful in the tail of the distribution the
relative difference must be very large; hence $T_{1w}$, which weights the
useful relative differences by the average bin proportion, adjusts for this.
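
Sketches of both statistics, following the definitions above; `pt` and `pc` are
again assumed to be vectors of Test and Control bin proportions.

```{r t1def, echo=TRUE}
# CRR: complementary relative risk per bin
CRR <- function(pt, pc) abs((pt - pc) / pc)

# T1u: average CRR over bins exceeding the threshold (each bin weighted 1/J)
T1u <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  mean(ifelse(crr > thr, crr, 0))
}

# T1w: as T1u, but each bin weighted by its average proportion (pc + pt)/2
T1w <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  sum(ifelse(crr > thr, (pc + pt) / 2 * crr, 0))
}
```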
#### Adjustments
The relative change can often be 'noisy', leading to a wide bootstrap
distribution. One adjustment (with no theoretical backing :) is to take the log of
the relative change, i.e.
$$
T_{1lu} = \frac{1}{J} \sum_{j=1}^{j=J} I_{ CRR_j > 0.05 } log(CRR_j)
$$
and $T_{1lw}$ as
$$
T_{1lw} = \sum_{j=1}^{j=J} \frac{p_{jc} + p_{jt}}{2} I_{ CRR_j > 0.05 } log(CRR_j)
$$
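
The log variants, reusing the `CRR` helper sketched above; this is only a
sketch of the formulas exactly as written.

```{r t1logdef, echo=TRUE}
# Log-of-relative-change variants of T1u and T1w
T1lu <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  mean(ifelse(crr > thr, log(crr), 0))
}
T1lw <- function(pt, pc, thr = 0.05) {
  crr <- CRR(pt, pc)
  sum(ifelse(crr > thr, (pc + pt) / 2 * log(crr), 0))
}
```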
> **Adjusting for Empty Cells**
> Because some cells might be empty, we need to adjust for the proportions being
> estimated as zero. An easy fix is to use a Bayesian approach (a
> Multinomial-Dirichlet model with a non-informative prior) to estimate the bin
> probabilities and avoid zero estimates.
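
One way this could be implemented is a posterior-mean estimate under a
symmetric Dirichlet prior; the choice $\alpha = 0.5$ below is just one common
non-informative option, assumed here for the sketch rather than prescribed by
the text.

```{r dirichlet, echo=TRUE}
# Posterior-mean bin probabilities under a symmetric Dirichlet(alpha) prior;
# avoids zero estimates for empty cells.
smooth_props <- function(counts, alpha = 0.5) {
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}
smooth_props(c(10, 0, 5, 0))   # no bin is estimated as exactly zero
```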
### 2
The relative change can be 'wild', i.e. in some sense it has a wide variance. One
approach to stabilizing this is to use the log odds. This, however, means we need
to change our 'significance' threshold from the 5% change in proportion to something on the odds scale.
The following graph plots $p$ on the x-axis and the corresponding odds ratio for a
5% change in $p$. The second graph plots the other direction, i.e. for a 5%
change in the odds ratio (note that when the odds ratio is 1, Test and Control are the
same), what are the corresponding values of $p$.
> !!!! INSERT GRAPH
Given $logodds(j) = \log( p_{jc}/(1-p_{jc}))-\log( p_{jt}/(1-p_{jt}) )$, we have
$$
T_{2u} = \frac{1}{J} \sum_{j=1}^{j=J} I_{ |logodds(j)| > \log(1.05) } logodds(j)
$$
and $T_{2w}$ as
$$
T_{2w} = \sum_{j=1}^{j=J} \frac{1}{2}(p_{jt}+p_{jc}) I_{ |logodds(j)| > \log(1.05) } logodds(j)
$$
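
Sketches of $T_{2u}$ and $T_{2w}$ as defined above; the threshold $\log(1.05)$
encodes a 5% change on the odds scale, and `pt`/`pc` are again assumed to be
bin-proportion vectors.

```{r t2def, echo=TRUE}
# Log odds ratio per bin (Control vs Test), matching the definition above
logodds <- function(pt, pc) log(pc / (1 - pc)) - log(pt / (1 - pt))

T2u <- function(pt, pc, thr = log(1.05)) {
  lo <- logodds(pt, pc)
  mean(ifelse(abs(lo) > thr, lo, 0))
}
T2w <- function(pt, pc, thr = log(1.05)) {
  lo <- logodds(pt, pc)
  sum(ifelse(abs(lo) > thr, (pt + pc) / 2 * lo, 0))
}
```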
## Bootstrapping the Test
The standard approach: given two samples, `T` and `C`, of sample sizes $N_t$ and
$N_c$, we

1. estimate the cell proportions (either simple frequentist, and run the risk of zero
   estimates, or use a simple non-informative Bayesian approach)
2. compute the observed statistic $L$ (one of the above)

Now we compute the null distribution:

1. pool the two datasets together
2. resample $N_t+N_c$ observations
3. randomly assign them to `T` and `C`
4. compute the statistic
5. repeat steps 2-4 many, many times to get the bootstrap null distribution

We then compute the probability of seeing a statistic as large as the observed one
under this null distribution and, if this is very small, we reject. This implies it is very unlikely to have seen such a large
value of our observed statistic assuming that `T` and `C` were the same.

We can also identify the 'hotspots', i.e. the Test bins which are different, and
because of the thresholding we need not really worry about multiple comparison
issues (since we enforce the large threshold of at least 5%, the flagged bins are most
likely real and unlikely to be false positives). A sketch of the resampling
procedure is given below.
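
This is one possible implementation, working from raw bin counts; `counts_t`
and `counts_c` are hypothetical count vectors, and `T0`/`smooth_props` are the
helpers sketched earlier. Drawing both resampled groups from the pooled,
smoothed proportions plays the role of 'pool, resample and randomly assign'.

```{r bootnull, echo=TRUE}
# Pooled-resampling null distribution for a chosen statistic; larger values
# of the statistic are taken to indicate a bigger difference.
boot_null <- function(counts_t, counts_c, stat = T0, B = 1000, alpha = 0.5) {
  Nt  <- sum(counts_t); Nc <- sum(counts_c)
  obs <- stat(smooth_props(counts_t, alpha), smooth_props(counts_c, alpha))
  pool <- smooth_props(counts_t + counts_c, alpha)   # pooled bin probabilities
  null <- replicate(B, {
    bt <- as.vector(rmultinom(1, Nt, pool))          # resampled 'Test' counts
    bc <- as.vector(rmultinom(1, Nc, pool))          # resampled 'Control' counts
    stat(smooth_props(bt, alpha), smooth_props(bc, alpha))
  })
  list(observed = obs, p.value = mean(null >= obs))
}
```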
## Examples
```{pydbx x0,cache=TRUE}
histo55 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v55")
histo58 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v58")
histo59 = spark.read.parquet("s3://mozilla-metrics/user/cdowhygelund/CRT/data/TIME_TO_DOM_LOADING_MS_v59")
from pyspark.sql import Row

def explodeUs(ver):
    # Explode a profile's unit density (a bucket -> proportion map) into
    # one Row per (client, version, bucket).
    def t(x):
        c = x.client_id
        ud = x.unit_density
        if ud is not None:
            for k in ud:
                f = {'cid': c, 'v': ver, 'b': k, 'n': ud[k]}
                yield Row(**f)
    return t

hall = sc.union([histo55.rdd.flatMap(explodeUs('v55')),
                 histo58.rdd.flatMap(explodeUs('v58')),
                 histo59.rdd.flatMap(explodeUs('v59'))]).toDF()
hall.createOrReplaceTempView("hall")
# Hash the client id and write the long-format (cid, version, bucket, proportion) table
hall2 = spark.sql(""" select crc32(cid) as cid,v,b,n from hall """)
hall2.repartition(100).write.csv("s3://mozilla-metrics/user/sguha/tmp/CRT/units_TIME_TO_DOM_LOADING_MS.csv",mode='overwrite')
```
```{r x1, dependson='x0',cache=TRUE,include=FALSE}
library(data.table)  # fread / data.table syntax used below
library(bit64)
system('rm -rf /tmp/xx; aws s3 sync s3://mozilla-metrics/user/sguha/tmp/CRT/units_TIME_TO_DOM_LOADING_MS.csv/ /tmp/xx')
h <- fread('cat /tmp/xx/p*')
setnames(h, c('cid','v','b','n'))
```
```{r x2,dependson='x1',cache=TRUE,include=FALSE}
hs <- h[,list(p = mean(n)), by=list(v,b)]
```
We will try and see how our tests compare for

- versions 55 and 58 (ought to reject!)
- versions 58 and 59 (ought to **not** reject)

(Note: the code to aggregate user histograms into unit densities can be found [here](https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/18639/command/18683).)
```{r x3,dependson='x2',layout="l-body-outset",cache=TRUE,fig.width=10,fig.height=3}
library(lattice)  # xyplot
j <- hs[v=='v55',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 55',
data=j)
j <- hs[v=='v58',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 58',
data=j)
j <- hs[v=='v59',][order(b),]
xyplot( p ~ log(b+1), type='h',lwd=2, xlab='log (1+time)',ylab='prob',main='TIME_TO_DOM_LOADING_MS for 59',
data=j)
```
## Some Simulations
Taking version 55 as a starting point, we will

1. change the 0 bucket by 5% (from `r hs[v=='v55',][b==0,round(p*100,3)]`% to `r hs[v=='v55',][b==0,round(p*1.05*100,3) ]`%)
2. change the bottom 10% of buckets by 5% (corresponding to times of `r paste( hs[v=='v55',][p< quantile(p,0.10) ,][,b],collapse=', ')`)

and check to see that the tests correctly reject.

What is the best way to aggregate the data? We have converted every profile's
many histograms into one normalized histogram (the unit densities); these densities
are then averaged to get the plots you see above. The obvious bootstrap resampling
methods that come to mind are

1. resample profiles (and their corresponding unit densities) and then compute
   the overall density (see the sketch below)
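
A sketch (not evaluated) of this profile-level resampling, assuming the
long-format table `h` (columns `cid`, `v`, `b`, `n`) loaded earlier; the
per-bin aggregation mirrors the `mean(n)` used for the plots above.

```{r profileboot, echo=TRUE, eval=FALSE}
# Resample profiles (with replacement) within a version and recompute the
# overall density for one bootstrap replicate.
resample_density <- function(h, ver) {
  cids <- h[v == ver, unique(cid)]
  take <- data.table(cid = sample(cids, length(cids), replace = TRUE))
  # duplicate a profile's rows once per time it was sampled
  hb <- merge(h[v == ver], take, by = "cid", allow.cartesian = TRUE)
  hb[, list(p = mean(n)), by = b][order(b), ]
}
```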
### Test 0: Modified Chi Squared
```{r test0,dependson='x2',cache=TRUE,include=FALSE}