---
output:
  html_document:
    keep_md: true
---

Approaches to Estimating Sample Sizes and Power

Background

As a data scientist at Mozilla, one of your responsibilities will be to analyze (and/or help design) a Shield experiment. A Shield experiment is a platform to gather data under different hypotheses and (more often than not) test whether the hypotheses are meaningfully different. The experiment will require your input in determining the sample size. The requisite sample size depends on several factors, starting with: is it an observational study or an experiment with several hypothesis tests? Since an observational study produces confidence intervals that are statistically equivalent to hypothesis tests, we will ignore the distinction between the two and frame the discussion in terms of hypothesis tests.

The requirements for a hypothesis test (let's assume a single one) are

  • **What is the metric under study?**

    Get a precise definition of this metric, as it will inform everything below. Consider active ticks, a scalar that measures the 'active' length of a session (the time during which the user interacted with the browser); each tick corresponds to 5 seconds. For an experiment that monitors subjects for two weeks (across Test and Control branches), two valid metrics are active hours/profile/day and active hours/profile. The two have different distributional properties (e.g. skew, tail thickness, outlier handling, and transformations). In this discussion we will assume the first definition.

  • **What is the meaningful difference we want to detect?**

    For example, is the PM interested in a 5% change (between Test and Control) in active ticks/profile/day? Or an absolute difference of 10 minutes in active ticks/profile/day? Nearly all power calculators use absolute differences, so to convert relative differences into absolute differences you'll need some existing data (as in the sketch after this list). If this is a brand new probe, existing data from an engineer's laptop will have to suffice!

    In some Shield documents the request will be of the form What is the suitable sample size for statistical significance? This question cannot be answered as posed, and the humble data scientist needs to push back and ask what meaningful differences the investigators would like to detect. Parameters like power and Type 1 error will be yours to choose (unless the investigator wishes to weigh in).
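
As a purely illustrative sketch of that conversion (the baseline value below is assumed, not real data):

# convert the relative difference a PM asks for into the absolute difference
# most power calculators expect
baseline_mean   <- 3.0        # assumed mean active hours/profile/day from existing data
relative_change <- 0.05       # the 5% relative change of interest
absolute_change <- baseline_mean * relative_change
absolute_change               # the delta to feed into a power calculation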

Data Aggregations

Assuming you can get sample data before the experiment, ideally try to replicate the conditions of the experiment and consider the methods of aggregation:

  • if the metric is active ticks/profile, when extracting sample data use a two week period.
  • replicate the filters (e.g. channel, locale and os)
  • Consider uris/hour/profile: for a profile we can compute this as total uris over the time period divided by total hours, or we could compute it at a daily level across the time period, or not compute the ratio at all and instead detect changes in uris visited while controlling for hours used.
  • Histograms can be aggregated in two ways:
    • for a histogram H, aggregate to one histogram per profile and summarize it as a mean to arrive at one value per profile (see the sketch after this list). This would be interpreted as the typical H a profile would experience. In this case, the distribution of the summary across profiles is very smooth and one can use tests that assume the underlying data is continuous.
    • summarize to one histogram per branch, e.g. instead of summarizing a profile's histogram as a mean, take the average of every profile's histogram in the branch. This represents the distribution of values the profiles experience. Often these histograms are very discrete (e.g. 20 buckets). We will ignore this form of aggregation for now and focus on the first one.
  • This sample data will provide a typical mean and standard deviation (and other summaries) for the metric as it would likely be observed in the Test group.
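
A minimal sketch of the first form of histogram aggregation, using a hypothetical long-format table of histogram buckets per profile (the column names and values are made up for illustration):

library(data.table)

# hypothetical data: one row per (profile, bucket) with counts accumulated over the period
hist_long <- data.table(
  cid    = c(1, 1, 1, 2, 2),
  bucket = c(1, 5, 10, 1, 5),    # bucket values of histogram H
  count  = c(3, 2, 1, 4, 1)      # counts in each bucket
)

# aggregate to one histogram per profile and summarize it as a mean,
# arriving at one value per profile
per_profile <- hist_long[, .(hmean = sum(bucket * count) / sum(count)), by = cid]
per_profile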

Some Existing Approaches

Active Ticks Example (t-test)

Let's get some data: active hours (derived from active ticks) per profile for a span of one week on release in the US, and whether the profile used the browser for more than 6 active hours in the week (i.e. 3 days × 2 hrs per day):

s = spark.sql("""
select client_id as cid,
       cast(sum(active_ticks)*5/3600.0 as float) as ahours,
       case when (sum(active_ticks)*5/3600) >= 6 then 1 else 0 end as avguseq
from main_summary
where app_name = 'Firefox' and sample_id='42' and country='US'
and submission_date_s3>='20180801' and submission_date_s3<='20180807'
group by 1
limit 30000
""").toPandas()
##       cid    ahours avguseq
##  1:  9265  0.891667       0
##  2: 13648 27.261110       1
##  3:   589 11.197222       1
##  4: 19518  6.693056       1
##  5: 18758  6.134722       1
##  6: 18796  3.295833       0
##  7:  8700  0.038889       0
##  8:  6563  0.213889       0
##  9: 11235  3.005556       0
## 10:  1050  0.893056       0

For skewed data like active hours, when using parametric power calculators (e.g. assuming normal or t distributions), transform to the log scale.

To compute a sample size to detect differences of at least 5% with 90% power and 1% Type I error ($\alpha$), we can use the following R code (we are using the pwr package and more details can be found here)

library(pwr)
library(data.table)

s       <- as.data.table(s)                    # the per-profile sample pulled above
s[, logmean := log(ahours + 1)]                # log transform to tame the skew
tmean   <- s[, mean(logmean)]
tmean5p <- s[, mean(log(ahours * 1.05 + 1))]   # what a 5% relative lift would look like
poolsd  <- s[, sd(logmean)]
pwr.t.test(n = NULL
           , d           = (tmean5p - tmean)/poolsd
           , sig.level   = 0.01
           , power       = 0.9
           , type        = 'two.sample'
           , alternative = 'greater')
## 
##      Two-sample t test power calculation 
## 
##               n = 38642.36
##               d = 0.02595644
##       sig.level = 0.01
##           power = 0.9
##     alternative = greater
## 
## NOTE: n is number in *each* group

Note

  1. Without the log transformation, the above sample size is 76,994, which is much larger. In both cases, the sample size is per branch.
  2. The assumption here is that there is a shift in mean and that the pooled standard deviation is essentially that of Test. We can rephrase the parameter d as Cohen's d, which is the difference in means (sometimes called the effect size) divided by the pooled standard deviation. In other words, we are looking for differences on the standard deviation scale. This is potentially less intuitive to many non-data folks. See cohen.ES for conventional values.
  3. If the required sample size is too large (easily possible on Nightly), choose a different power.
  4. So far, we have stuck to frequentist methods. Surely Bayesians have other ways?
  5. If you're not inclined to use R, the app G*Power is a very good program to choose values for you. You'll still need to compute prior means etc.
  6. In pwr.t.test our effect size is Cohen's d. For a test of proportions it is something else (a transformation of the difference of proportions). The pwr package has functions to convert a Test and a Control proportion into an effect size (ES.h). The vignette suggests conventional effect sizes which are easy to choose, but they don't mean much to an investigator who wants to know what that effect size is on the original scale. If you do choose conventional effect sizes, try to provide actual examples of the differences that correspond to them (see the sketch below). From the pwr vignette (linked above),
This is a crucial part of using the pwr package correctly: You must provide
an effect size on the expected scale. 
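
To make a conventional effect size concrete, one option is to translate it back to the proportion scale by inverting the arcsine transformation; a sketch starting from an assumed baseline proportion (the "small" value is Cohen's convention, everything else is illustrative):

library(pwr)

# invert h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)) for an assumed baseline proportion
p1      <- 0.16                                              # assumed baseline proportion
h_small <- cohen.ES(test = "p", size = "small")$effect.size  # Cohen's "small" h = 0.2
p2      <- sin(asin(sqrt(p1)) + h_small / 2)^2
c(p1 = p1, p2 = p2)   # compare the two to see how large a "small" effect really is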

Testing Proportions

Tests of proportions sometimes require smaller samples, but it depends on where the proportion lies and on the meaningful difference to be detected; in the example below the required sample is actually much larger. Continuing the example above, consider the requirement to detect a 2% relative increase in the proportion of profiles using the browser for more than 6 active hours in the week.

The larger proportion is the first argument to ES.h when alternative = "greater", and the sample sizes are per branch.
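
The call that produced the output below is not shown in the gist; a sketch consistent with it, using the baseline proportion from the sample (p1 = 0.1586) and a 2% relative increase, would be:

library(pwr)

p1 <- 0.1586      # baseline: proportion of profiles with more than 6 active hours in the week
p2 <- p1 * 1.02   # a 2% relative increase
pwr.2p.test(h = ES.h(p2, p1), sig.level = 0.01, power = 0.9, alternative = "greater")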

## p1 is 0.1586, a 2% increase looks like 0.1618
## 
##      Difference of proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.008649451
##               n = 347986
##       sig.level = 0.01
##           power = 0.9
##     alternative = greater
## 
## NOTE: same sample sizes

Multiple Comparisons

More often than not, Shield experiments will compare many different things, e.g. measure A for branch 1 vs branch 2, branch 1 vs branch 3 and so on, and the same for measure B. There are ways to handle this, e.g. Tukey's Range Test (but that covers all possible comparisons); a simple approach is

  • compute sample sizes for A for branch 1 vs branch 2, but with the Type 1 error ($\alpha$) adjusted using a simple Bonferroni correction (see the sketch after this list).
  • repeat for measure B
  • choose the larger sample size
  • if the sample size is too big, try a smaller but still acceptable power and/or a larger but still acceptable meaningful difference.
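
For example, with three pairwise comparisons per metric (the number of comparisons here is assumed), the earlier t-test calculation can simply be repeated with a Bonferroni-adjusted $\alpha$:

library(pwr)

n_comparisons <- 3                              # assumed: branch 1 vs 2, 1 vs 3, 2 vs 3
pwr.t.test(d           = 0.026,                 # effect size from the active hours example above
           sig.level   = 0.01 / n_comparisons,  # Bonferroni-adjusted Type 1 error
           power       = 0.9,
           type        = 'two.sample',
           alternative = 'greater')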

Anova

The presence of multiple branches might encourage you to use power tests for ANOVA, e.g. power.anova.test. However, the underlying test there asks whether the branches differ at all and is not designed to check pairwise differences between branches. Moreover the effect here (if using the pwr package) is not expressed as a relative difference in means but is connected to the variance.

In summary, tread carefully when using power.anova.test.
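
To see the parametrization of power.anova.test, here is a minimal sketch with assumed branch means and within-branch variance (all values are illustrative):

# the effect is specified through between-group and within-group variances,
# not through a relative difference in means
branch_means <- c(2.00, 2.00, 2.05)   # assumed per-branch means (e.g. on the log scale)
power.anova.test(groups      = length(branch_means),
                 between.var = var(branch_means),
                 within.var  = 1.2^2, # assumed within-branch variance
                 sig.level   = 0.01,
                 power       = 0.9)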

Example

Consider Tim's power analysis for the Webrender experiment. There are about 10 histograms, each of which was converted to one value per profile. Tim observed that taking a log transform brought the distributions close enough to normal (symmetric, without very thick tails).

His work follows the above closely, but he uses the base function power.t.test, which might be a bit easier to use (the same principles apply).
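
For reference, power.t.test takes a raw difference in means and a standard deviation rather than a standardized effect size; a sketch with assumed values (not Tim's numbers):

power.t.test(delta       = 0.03,   # assumed difference in log-scale means between branches
             sd          = 1.2,    # assumed pooled standard deviation on the log scale
             sig.level   = 0.01,
             power       = 0.9,
             type        = 'two.sample',
             alternative = 'one.sided')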

Empirical Power Analysis

So far we have assumed the metrics follow a parametric distribution. You can instead use non-parametric tests, and to compute sample sizes under a non-parametric setup the easiest option is to download G*Power.

However, given we have prior measurements, we can also simulate the power for a given test. The idea is: if we apply a t-test to data that is wildly different from normal, how different is the nominal power from the actual power? We can estimate the actual power from simulation studies. For examples of different approaches, see Tim's work here and some work by our interns here.

Briefly, Tim had data from a Poisson distribution and the objective was to detect a $\Delta$ relative difference in the mean using the Mann-Whitney statistic. In his example, he fixes a sample, draws test and control samples from it with replacement, and effectively increases the points in the control by a factor of $1+\Delta$. He found that with a sample size of 1000 per branch the power is ~30%. By looping over sample sizes with $\Delta$ and the Type 1 error fixed, he computed the observed power.

For the record, G*Power gave a power similar to the simulated value for a sample size of 1000 (G*Power can compute power for Mann-Whitney tests).

But importantly, note that the Mann-Whitney test, which assumes the underlying distribution is continuous, has been applied here to discrete data. In such a case it is a good idea to simulate your power as Tim did (a sketch of such a simulation follows).
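
A minimal sketch of such an empirical power simulation, assuming Poisson-distributed data with a relative shift applied to the test branch (the values of $\lambda$, $\alpha$, $\Delta$ and the number of replicates are illustrative, and this is not Tim's exact code):

# draw test and control samples, apply a relative shift to test, and record how often
# the Mann-Whitney (Wilcoxon rank-sum) test rejects
set.seed(42)
empirical_power <- function(n, delta, alpha = 0.01, nsim = 1000, lambda = 2) {
  rejections <- replicate(nsim, {
    control <- rpois(n, lambda)
    test    <- rpois(n, lambda) * (1 + delta)   # relative shift in the test branch
    wilcox.test(test, control, alternative = "greater")$p.value < alpha
  })
  mean(rejections)
}

# estimated power for a 5% relative shift at a few per-branch sample sizes
sapply(c(1000, 5000, 10000), empirical_power, delta = 0.05)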

Reporting results

Don't just report the p-value. Also report the means of both groups, the difference (the effect size) and the standardized effect size (the difference divided by the pooled standard deviation). p-values will generally decrease with increasing sample size.
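
A small helper along these lines makes the habit easy (test and control here are hypothetical per-branch vectors, e.g. of log active hours):

# report group means, the raw difference, the standardized difference and the p-value
report_results <- function(test, control) {
  nt <- length(test); nc <- length(control)
  pooled_sd <- sqrt(((nt - 1) * var(test) + (nc - 1) * var(control)) / (nt + nc - 2))
  list(mean_test    = mean(test),
       mean_control = mean(control),
       difference   = mean(test) - mean(control),
       cohens_d     = (mean(test) - mean(control)) / pooled_sd,
       p_value      = t.test(test, control, alternative = 'greater')$p.value)
}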

Complicated Models

You're fitting a mixed-effects model (e.g. with lme4's glmer), something like

glmer(result ~ branch + active_hours + os + (1 | profile_id), data = foo, family = poisson)

in other words, modeling count data on branch (e.g. Test and Control), active hours and os, with multiple observations per client (so there could be correlation within a client). How does one choose the sample size then?

I'm going to go out on a limb and say that if you consider the data as consisting of

client_id, branch, result

and compute sample sizes based on this (for a given power, Type 1 error etc.), you will have an upper bound on the sample size (because in the full model some of the error is absorbed by the other covariates, so the branch effect is estimated more precisely and a smaller sample size suffices).

Once again, you can run simulations based on the above model and the statistic for testing the branch coefficient (a sketch follows). This is clearly a lot of effort.
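
A rough sketch of such a simulation, with every parameter (branch effect, per-profile variability, observations per profile, number of replicates) assumed purely for illustration:

library(lme4)

# power by simulation for the branch coefficient in a mixed-effects Poisson model
simulate_once <- function(n_per_branch, n_obs = 7, branch_effect = 0.05, sd_profile = 0.5) {
  n_profiles <- 2 * n_per_branch
  d <- data.frame(
    profile_id = rep(seq_len(n_profiles), each = n_obs),
    branch     = rep(c(0, 1), each = n_per_branch * n_obs),
    intercept  = rep(rnorm(n_profiles, sd = sd_profile), each = n_obs)
  )
  d$result <- rpois(nrow(d), lambda = exp(0.5 + d$intercept + branch_effect * d$branch))
  fit <- glmer(result ~ branch + (1 | profile_id), data = d, family = poisson)
  coef(summary(fit))["branch", "Pr(>|z|)"] < 0.01   # reject at alpha = 0.01?
}

# empirical power at one candidate per-branch sample size (few replicates to keep it quick)
mean(replicate(50, simulate_once(n_per_branch = 500)))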

How Long to Run the Experiment?

If you anticipate day-of-week effects will affect conclusions, then monitor a profile for a week (at least).

Also, based on the filters and enrollment criteria, you can run some queries on the data to estimate how long it will take to enroll the requisite numbers. The sample sizes determined above are after filters have been applied.
