---
output:
  html_document:
    keep_md: true
---

Approaches to Estimating Sample Sizes and Power

Background

As a data scientist at Mozilla, one of your responsibilities will be to analyze (and/or help design) a Shield experiment. A Shield experiment is a platform to gather data under different hypotheses and (more often than not) test whether the hypotheses are meaningfully different. The experiment will require your input in determining the sample size. The requisite sample size depends on several factors, starting with: is it an observational study or an experiment with several hypothesis tests? Since an observational study produces confidence intervals that are statistically equivalent to hypothesis tests, we will ignore the distinction between the two and frame the discussion in terms of hypothesis tests.

The requirements for a hypothesis test (let's assume a single one) are

  • **What is the metric under study?**

    Get a precise definition of this metric, as it will inform everything below. Consider active ticks, a scalar that measures the 'active' length of a session (the time during which the user interacted with the browser); each tick corresponds to 5 seconds. For an experiment that monitors subjects for two weeks (across Test and Control branches), two valid metrics are active hours/profile/day and active hours/profile. The two have different distributional properties (e.g. skew, tail thickness, outlier handling, and transformations). In this discussion we will assume the first definition.

  • **What is the meaningful difference we want to detect?**

    For example, is the PM interested in a 5% change (between Test and Control) in active ticks/profile/day? Or an absolute difference of 10 minutes in active ticks/profile/day? Nearly all power calculators use absolute differences, so to convert relative differences into absolute differences you'll need some existing data (as in the sketch after this list). If this is a brand new probe, existing data from an engineer's laptop will have to suffice!

    In some Shield documents the request will be of the form What is the suitable sample size for statistical significance? This question cannot be answered as posed, and the humble data scientist needs to push back and ask what meaningful differences the investigators would like to detect. Parameters like power and Type 1 error will be yours to choose (unless the investigator wishes to weigh in).
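
As a purely illustrative sketch of that conversion (the baseline value below is assumed, not real data):

# convert the relative difference a PM asks for into the absolute difference
# most power calculators expect
baseline_mean   <- 3.0        # assumed mean active hours/profile/day from existing data
relative_change <- 0.05       # the 5% relative change of interest
absolute_change <- baseline_mean * relative_change
absolute_change               # the delta to feed into a power calculation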

Data Aggregations

Assuming you can get sample data before the experiment, ideally try to replicate the conditions of the experiment and consider the methods of aggregation:

  • if the metric is active ticks/profile, when extracting sample data use a two week period.
  • replicate the filters (e.g. channel, locale and os)
  • Consider uris/hour/profile: for a profile we can compute this as total uris over the time period divided by total hours, or we could compute it at a daily level across the time period, or not compute the ratio at all and instead detect changes in uris visited while controlling for hours used.
  • Histograms can be aggregated in two ways:
    • for a histogram H, aggregate to one histogram per profile and summarize it as a mean to arrive at one value per profile (see the sketch after this list). This would be interpreted as the typical H a profile would experience. In this case, the distribution of the summary across profiles is very smooth and one can use tests that assume the underlying data is continuous.
    • summarize to one histogram per branch, e.g. instead of summarizing a profile's histogram as a mean, take the average of every profile's histogram in the branch. This represents the distribution of values the profiles experience. Often these histograms are very discrete (e.g. 20 buckets). We will ignore this form of aggregation for now and focus on the first one.
  • This sample data will provide a typical mean and standard deviation (and other summaries) for the metric as it would likely be observed in the Test group.
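
A minimal sketch of the first form of histogram aggregation, using a hypothetical long-format table of histogram buckets per profile (the column names and values are made up for illustration):

library(data.table)

# hypothetical data: one row per (profile, bucket) with counts accumulated over the period
hist_long <- data.table(
  cid    = c(1, 1, 1, 2, 2),
  bucket = c(1, 5, 10, 1, 5),    # bucket values of histogram H
  count  = c(3, 2, 1, 4, 1)      # counts in each bucket
)

# aggregate to one histogram per profile and summarize it as a mean,
# arriving at one value per profile
per_profile <- hist_long[, .(hmean = sum(bucket * count) / sum(count)), by = cid]
per_profile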

Some Existing Approaches

Active Ticks Example (t-test)

Let's get some data: active hours (derived from active ticks) per profile for a span of one week on release in the US, and whether the profile used the browser for more than 6 active hours in the week (i.e. 3 days × 2 hrs per day):

s = spark.sql("""
select client_id as cid,
       cast(sum(active_ticks)*5/3600.0 as float) as ahours,
       case when (sum(active_ticks)*5/3600) >= 6 then 1 else 0 end as avguseq
from main_summary
where app_name = 'Firefox' and sample_id='42' and country='US'
and submission_date_s3>='20180801' and submission_date_s3<='20180807'
group by 1
limit 30000
""").toPandas()
##       cid    ahours avguseq
##  1:  9265  0.891667       0
##  2: 13648 27.261110       1
##  3:   589 11.197222       1
##  4: 19518  6.693056       1
##  5: 18758  6.134722       1
##  6: 18796  3.295833       0
##  7:  8700  0.038889       0
##  8:  6563  0.213889       0
##  9: 11235  3.005556       0
## 10:  1050  0.893056       0

For skewed data like active hours, when using parametric power calculators (e.g. assuming normal or t distributions), transform to the log scale.

To compute a sample size to detect differences of at least 5% with 90% power and 1% Type I error ($\alpha$), we can use the following R code (we are using the pwr package and more details can be found here)

library(pwr)
library(data.table)

s       <- as.data.table(s)                    # the per-profile sample pulled above
s[, logmean := log(ahours + 1)]                # log transform to tame the skew
tmean   <- s[, mean(logmean)]
tmean5p <- s[, mean(log(ahours * 1.05 + 1))]   # what a 5% relative lift would look like
poolsd  <- s[, sd(logmean)]
pwr.t.test(n = NULL
           , d           = (tmean5p - tmean)/poolsd
           , sig.level   = 0.01
           , power       = 0.9
           , type        = 'two.sample'
           , alternative = 'greater')
## 
##      Two-sample t test power calculation 
## 
##               n = 38642.36
##               d = 0.02595644
##       sig.level = 0.01
##           power = 0.9
##     alternative = greater
## 
## NOTE: n is number in *each* group

Note

  1. Without the log transformation, the above sample size is 76,994, which is much larger. In both cases, the sample size is per branch.
  2. The assumption here is that there is a shift in mean and that the pooled standard deviation is essentially that of Test. We can rephrase the parameter d as Cohen's d, which is the difference in means (sometimes called the effect size) divided by the pooled standard deviation. In other words, we are looking for differences on the standard deviation scale. This is potentially less intuitive to many non-data folks. See cohen.ES for conventional values.
  3. If the required sample size is too large (easily possible on Nightly), choose a different power.
  4. So far, we have stuck to frequentist methods. Surely Bayesians have other ways?
  5. If you're not inclined to use R, the app G*Power is a very good program to choose values for you. You'll still need to compute prior means etc.
  6. In pwr.t.test our effect size is Cohen's d. For a test of proportions it is something else (a transformation of the difference of proportions). The pwr package has functions to convert a Test and a Control proportion into an effect size (ES.h). The vignette suggests conventional effect sizes which are easy to choose, but they don't mean much to an investigator who wants to know what that effect size is on the original scale. If you do choose conventional effect sizes, try to provide actual examples of the differences that correspond to them (see the sketch below). From the pwr vignette (linked above),
This is a crucial part of using the pwr package correctly: You must provide
an effect size on the expected scale. 
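
To make a conventional effect size concrete, one option is to translate it back to the proportion scale by inverting the arcsine transformation; a sketch starting from an assumed baseline proportion (the "small" value is Cohen's convention, everything else is illustrative):

library(pwr)

# invert h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)) for an assumed baseline proportion
p1      <- 0.16                                              # assumed baseline proportion
h_small <- cohen.ES(test = "p", size = "small")$effect.size  # Cohen's "small" h = 0.2
p2      <- sin(asin(sqrt(p1)) + h_small / 2)^2
c(p1 = p1, p2 = p2)   # compare the two to see how large a "small" effect really is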

Testing Proportions

Tests of proportions sometimes require smaller samples, but it depends on where the proportion lies and on the meaningful difference to be detected; in the example below the required sample is actually much larger. Continuing the example above, consider the requirement to detect a 2% relative increase in the proportion of profiles using the browser for more than 6 active hours in the week.

The larger proportion is the first argument to ES.h when alternative = "greater", and the sample sizes are per branch.
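
The call that produced the output below is not shown in the gist; a sketch consistent with it, using the baseline proportion from the sample (p1 = 0.1586) and a 2% relative increase, would be:

library(pwr)

p1 <- 0.1586      # baseline: proportion of profiles with more than 6 active hours in the week
p2 <- p1 * 1.02   # a 2% relative increase
pwr.2p.test(h = ES.h(p2, p1), sig.level = 0.01, power = 0.9, alternative = "greater")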

## p1 is 0.1586, a 2% increase looks like 0.1618
## 
##      Difference of proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.008649451
##               n = 347986
##       sig.level = 0.01
##           power = 0.9
##     alternative = greater
## 
## NOTE: same sample sizes

Multiple Comparisons

More often than not, Shield experiments will compare many different things, e.g. measure A for branch 1 vs branch 2, branch 1 vs branch 3 and so on, and the same for measure B. There are ways to handle this, e.g. Tukey's Range Test (but that covers all possible comparisons); a simple approach is

  • compute sample sizes for A for branch 1 vs branch 2, but with the Type 1 error ($\alpha$) adjusted using a simple Bonferroni correction (see the sketch after this list).
  • repeat for measure B
  • choose the larger sample size
  • if the sample size is too big, try a smaller but still acceptable power and/or a larger but still acceptable meaningful difference.
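
For example, with three pairwise comparisons per metric (the number of comparisons here is assumed), the earlier t-test calculation can simply be repeated with a Bonferroni-adjusted $\alpha$:

library(pwr)

n_comparisons <- 3                              # assumed: branch 1 vs 2, 1 vs 3, 2 vs 3
pwr.t.test(d           = 0.026,                 # effect size from the active hours example above
           sig.level   = 0.01 / n_comparisons,  # Bonferroni-adjusted Type 1 error
           power       = 0.9,
           type        = 'two.sample',
           alternative = 'greater')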

Anova

The presence of multiple branches might encourage you to use power tests for ANOVA, e.g. power.anova.test. However, the underlying test there asks whether the branches differ at all and is not designed to check pairwise differences between branches. Moreover the effect here (if using the pwr package) is not expressed as a relative difference in means but is connected to the variance.

In summary, tread carefully when using power.anova.test.
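
To see the parametrization of power.anova.test, here is a minimal sketch with assumed branch means and within-branch variance (all values are illustrative):

# the effect is specified through between-group and within-group variances,
# not through a relative difference in means
branch_means <- c(2.00, 2.00, 2.05)   # assumed per-branch means (e.g. on the log scale)
power.anova.test(groups      = length(branch_means),
                 between.var = var(branch_means),
                 within.var  = 1.2^2, # assumed within-branch variance
                 sig.level   = 0.01,
                 power       = 0.9)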

Example

Consider Tim's power analysis for the Webrender experiment. There are about 10 histograms, each of which was converted to one value per profile. Tim observed that taking a log transform brought the distributions close enough to normal (symmetric, without very thick tails).

His work follows the above closely, but he uses the base function power.t.test, which might be a bit easier to use (the same principles apply).
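
For reference, power.t.test takes a raw difference in means and a standard deviation rather than a standardized effect size; a sketch with assumed values (not Tim's numbers):

power.t.test(delta       = 0.03,   # assumed difference in log-scale means between branches
             sd          = 1.2,    # assumed pooled standard deviation on the log scale
             sig.level   = 0.01,
             power       = 0.9,
             type        = 'two.sample',
             alternative = 'one.sided')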

Empirical Power Analysis

So far we have assumed the metrics follow a parametric distribution. You can instead use non-parametric tests, and to compute sample sizes under a non-parametric setup the easiest option is to download G*Power.

However, given we have prior measurements, we can also simulate the power for a given test. The idea is: if we apply a t-test to data that is wildly different from normal, how different is the nominal power from the actual power? We can estimate the actual power from simulation studies. For examples of different approaches, see Tim's work here and some work by our interns here.

Briefly, Tim had data from a Poisson distribution and the objective was to detect a $\Delta$ relative difference in the mean using the Mann-Whitney statistic. In his example, he fixes a sample, draws test and control samples from it with replacement, and effectively increases the points in the control by a factor of $1+\Delta$. He found that with a sample size of 1000 per branch the power is ~30%. By looping over sample sizes with $\Delta$ and the Type 1 error fixed, he computed the observed power.

For the record, G*Power gave a power similar to the simulated value for a sample size of 1000 (G*Power can compute power for Mann-Whitney tests).

But importantly, note that the Mann-Whitney test, which assumes the underlying distribution is continuous, has been applied here to discrete data. In such a case it is a good idea to simulate your power as Tim did (a sketch of such a simulation follows).
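
A minimal sketch of such an empirical power simulation, assuming Poisson-distributed data with a relative shift applied to the test branch (the values of $\lambda$, $\alpha$, $\Delta$ and the number of replicates are illustrative, and this is not Tim's exact code):

# draw test and control samples, apply a relative shift to test, and record how often
# the Mann-Whitney (Wilcoxon rank-sum) test rejects
set.seed(42)
empirical_power <- function(n, delta, alpha = 0.01, nsim = 1000, lambda = 2) {
  rejections <- replicate(nsim, {
    control <- rpois(n, lambda)
    test    <- rpois(n, lambda) * (1 + delta)   # relative shift in the test branch
    wilcox.test(test, control, alternative = "greater")$p.value < alpha
  })
  mean(rejections)
}

# estimated power for a 5% relative shift at a few per-branch sample sizes
sapply(c(1000, 5000, 10000), empirical_power, delta = 0.05)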

Reporting results

Don't just report the p-value. Also report the means of both groups, the difference (the effect size) and the standardized effect size (the difference divided by the pooled standard deviation). p-values will generally decrease with increasing sample size.
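
A small helper along these lines makes the habit easy (test and control here are hypothetical per-branch vectors, e.g. of log active hours):

# report group means, the raw difference, the standardized difference and the p-value
report_results <- function(test, control) {
  nt <- length(test); nc <- length(control)
  pooled_sd <- sqrt(((nt - 1) * var(test) + (nc - 1) * var(control)) / (nt + nc - 2))
  list(mean_test    = mean(test),
       mean_control = mean(control),
       difference   = mean(test) - mean(control),
       cohens_d     = (mean(test) - mean(control)) / pooled_sd,
       p_value      = t.test(test, control, alternative = 'greater')$p.value)
}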

Complicated Models

You're fitting a mixed-effects model (e.g. with lme4's glmer), something like

glmer(result ~ branch + active_hours + os + (1 | profile_id), data = foo, family = poisson)

in other words, modeling count data on branch (e.g. Test and Control), active hours and os, with multiple observations per client (so there could be correlation within a client). How does one choose the sample size then?

I'm going to go out on a limb and say that if you consider the data as consisting of

client_id, branch, result

and compute sample sizes based on this (for a given power, Type 1 error etc.), you will have an upper bound on the sample size (because in the full model some of the error is absorbed by the other covariates, so the branch effect is estimated more precisely and a smaller sample size suffices).

Once again, you can run simulations based on the above model and the statistic for testing the branch coefficient (a sketch follows). This is clearly a lot of effort.
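
A rough sketch of such a simulation, with every parameter (branch effect, per-profile variability, observations per profile, number of replicates) assumed purely for illustration:

library(lme4)

# power by simulation for the branch coefficient in a mixed-effects Poisson model
simulate_once <- function(n_per_branch, n_obs = 7, branch_effect = 0.05, sd_profile = 0.5) {
  n_profiles <- 2 * n_per_branch
  d <- data.frame(
    profile_id = rep(seq_len(n_profiles), each = n_obs),
    branch     = rep(c(0, 1), each = n_per_branch * n_obs),
    intercept  = rep(rnorm(n_profiles, sd = sd_profile), each = n_obs)
  )
  d$result <- rpois(nrow(d), lambda = exp(0.5 + d$intercept + branch_effect * d$branch))
  fit <- glmer(result ~ branch + (1 | profile_id), data = d, family = poisson)
  coef(summary(fit))["branch", "Pr(>|z|)"] < 0.01   # reject at alpha = 0.01?
}

# empirical power at one candidate per-branch sample size (few replicates to keep it quick)
mean(replicate(50, simulate_once(n_per_branch = 500)))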

How Long to Run the Experiment?

If you anticipate day-of-week effects will affect conclusions, then monitor a profile for a week (at least).

Also, based on the filters and enrollment criteria, you can run some queries on the data to estimate how long it will take to enroll the requisite numbers. The sample sizes determined above are after filters have been applied.
