As a data scientist at Mozilla, one of your responsibilities will be to analyze (and/or help design) a Shield experiment. A Shield experiment is a platform for gathering data under different hypotheses and (more often than not) testing whether the hypotheses are meaningfully different. The experiment will require your input in determining the sample size. The requisite sample size depends on several factors, starting with: is it an observational study or an experiment with several hypothesis tests? Since an observational study will have confidence intervals which are statistically equivalent to hypothesis tests, we shall ignore the distinction between hypothesis tests and observational designs and discuss hypothesis tests.
The requirements for a hypothesis test (let's assume a single one) are:
- **What is the metric under study?** Get a precise definition of this metric, as it will inform everything below. Consider active ticks, a scalar that measures the 'active' (the user interacted with the browser) length of the session. Each tick corresponds to 5 seconds. For an experiment that monitors subjects for two weeks (across a Test and a Control branch), two valid metrics are `active hours/profile/day` and `active hours/profile`. Both have different distributional properties (e.g. skew, tail thickness, outlier handling, and transformations). In this discussion we will assume the first definition.
- **What is the meaningful difference we want to detect?** For example, is the PM interested in a 5% change (between Test and Control) in `active ticks/profile/day`? Or an absolute difference of 10 minutes in `active ticks/profile/day`? Nearly all power calculators use absolute differences. In order to convert relative differences into absolute differences you'll need some existing data (see the toy sketch after this list). If this is a brand new probe, existing data from an engineer's laptop would have to suffice! In some Shield documents the request will be of the form "What is the suitable sample size for statistical significance?" This question cannot be answered, and the humble data scientist needs to push back and ask what meaningful differences the investigators would like to detect. Parameters like power and Type 1 error will be yours to choose (unless the investigator wishes to weigh in).
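As a toy illustration of that conversion (the baseline below is made up, not taken from real data):

baseline <- 2.4                     # hypothetical baseline: active hours/profile/day
rel_diff <- 0.05                    # the 5% relative change of interest
abs_diff <- baseline * rel_diff     # 0.12 hours, i.e. about 7 minutes
abs_diff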
Assuming you can get sample data before the experiment, ideally try to replicate the conditions of the experiment and consider the methods of aggregation:
- if the metric is `active ticks/profile`, when extracting sample data use a two-week period.
- replicate the filters (e.g. `channel`, `locale` and `os`).
- Consider `uris/hour/profile`: for a profile we can compute this as `total uris over time period` divided by `total hours`, or we could compute the ratio at a daily level across the time period, or not compute the ratio at all and instead detect changes in `uris` visited while controlling for hours used (see the toy sketch after this list).
- Histograms can be aggregated in two ways:
  - for a histogram `H`, aggregate to one histogram per profile and summarize this as a mean to arrive at one value per profile. This would be interpreted as the typical `H` a profile would experience. In this case, the distribution of the summary across profiles is very smooth and one can use tests based on the assumption that the underlying data is continuous.
  - summarize to one histogram per branch, i.e. instead of summarizing a profile's histogram as a mean, take the average of every profile's histogram in the branch. This represents the distribution of values a profile experiences. Often these histograms are very discrete (e.g. 20 buckets). We will ignore this form of aggregation for now and focus on the first one.
- This sample data will provide a typical mean and standard deviation (and other summaries) for the metric that would likely be observed in the Test group.
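To make the ratio aggregations concrete, here is a toy sketch; the table `daily` and its columns are hypothetical:

library(data.table)

## hypothetical per-profile, per-day data
daily <- data.table(cid   = rep(1:3, each = 7),
                    uris  = rpois(21, 50),
                    hours = runif(21, 0.5, 5))

## (a) one ratio per profile over the whole period: total uris / total hours
per_profile <- daily[, .(rate = sum(uris) / sum(hours)), by = cid]

## (b) compute the ratio daily, then average it per profile
per_day <- daily[, .(rate = mean(uris / hours)), by = cid]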
Let's get some data: active ticks per profile (converted to hours) for a span of one week on the release channel in the US, along with whether the profile used the browser for at least 6 hours in the week (i.e. 3 days x 2 hrs per day).
spark.sql("""
select client_id as cid,
cast(sum(active_ticks)*5/3600.0 as float) as ahours,
case when (sum(active_ticks)*5/3600) >= 6 then 1 else 0 end as avguseq
from main_summary
where app_name = 'Firefox' and sample_id='42' and country='US'
and submission_date_s3>='20180801' and submission_date_s3<='20180807'
group by 1
limit 30000
""").toPandas()
## cid ahours avguseq
## 1: 9265 0.891667 0
## 2: 13648 27.261110 1
## 3: 589 11.197222 1
## 4: 19518 6.693056 1
## 5: 18758 6.134722 1
## 6: 18796 3.295833 0
## 7: 8700 0.038889 0
## 8: 6563 0.213889 0
## 9: 11235 3.005556 0
## 10: 1050 0.893056 0
For skewed data like active hours, when using parametric power calculators (e.g. assuming normal or t distributions), transform to the log scale.
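As a quick check, assuming the sample pulled above is available in R as a data frame `s`:

## raw scale: heavily right-skewed (mean well above the median)
summary(s$ahours)
## log scale: much closer to symmetric
summary(log(s$ahours + 1))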
To compute a sample size to detect differences of at least 5% with 90% power and 1% Type I error, we use the `pwr` package (more details can be found here):
library(pwr)
library(data.table)

## convert the sample pulled above to a data.table
s <- as.data.table(s)
s[, logmean := log(ahours + 1)]               # log transform (add 1 to handle zeros)

tmean   <- s[, mean(logmean)]                 # observed mean on the log scale
tmean5p <- s[, mean(log(ahours * 1.05 + 1))]  # mean under a 5% relative lift
poolsd  <- s[, sd(logmean)]                   # pooled standard deviation

pwr.t.test(n = NULL
         , d = (tmean5p - tmean) / poolsd
         , sig.level = 0.01
         , power = 0.9
         , type = 'two.sample'
         , alternative = 'greater')
##
## Two-sample t test power calculation
##
## n = 38642.36
## d = 0.02595644
## sig.level = 0.01
## power = 0.9
## alternative = greater
##
## NOTE: n is number in *each* group
Note
- Without taking a log transformation, the above sample size is 76,994 which is much larger. In both cases, sample size is per branch.
- The assumption here is that there is a shift in mean and that the pooled standard deviation is essentially that of the Test group. We can rephrase the parameter `d` in the form of Cohen's D, which is the difference in means (sometimes called the effect size) divided by the pooled standard deviation. In other words, we are looking for differences on the standard deviation scale. This is potentially less intuitive to many non-data folks. See `cohen.ES` for conventional values.
- If the sample size is too large (easily possible in Nightly), choose a different power.
- So far, we have stuck to frequentist methods. Surely Bayesians have other ways?
- If you're not inclined to use R, the G*Power application is a very good program for choosing these values. You'll still need to compute prior means etc.
- In `pwr.t.test` our effect size is Cohen's D. For a test of proportions it is something else (a transformation of the difference of proportions). The `pwr` package has functions to convert a Test and Control proportion into an effect size (`ES.h`). The vignette suggests conventional effect sizes which are easy to choose, but they don't mean much to an investigator who wants to know what that effect size is on the original scale. If you do choose to use conventional effect sizes, try to provide actual examples of the differences that correspond to those effect sizes (see the short example after this list). From the `pwr` vignette (linked above),
> This is a crucial part of using the pwr package correctly: You must provide an effect size on the expected scale.
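For example, `cohen.ES` returns the conventional values:

library(pwr)
cohen.ES(test = "t", size = "small")    # conventional 'small' effect for a t-test: d = 0.2
cohen.ES(test = "p", size = "medium")   # conventional 'medium' effect for proportions: h = 0.5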
For a test of proportions, the required sample size depends heavily on where the proportion lies and on the meaningful difference to be detected; in this example it turns out to be much larger than for the continuous metric. In the above, consider the requirement to see a 2% relative increase in the proportion of profiles using the browser for at least 6 active hours in the week.
The larger proportion is the first argument to `ES.h` when `alternative = "greater"`, and the sample sizes are per branch.
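The output below presumably came from a call along these lines; a sketch, assuming the Control proportion of roughly 0.1586 observed in the sample above:

library(pwr)
p2 <- 0.1586            # proportion of profiles with >= 6 active hours (Control)
p1 <- p2 * 1.02         # a 2% relative increase
pwr.2p.test(h = ES.h(p1, p2),
            sig.level = 0.01,
            power = 0.9,
            alternative = 'greater')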
## p1 is 0.1586, a 2% increase looks like 0.1618
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.008649451
## n = 347986
## sig.level = 0.01
## power = 0.9
## alternative = greater
##
## NOTE: same sample sizes
More often than not, Shield experiments will be comparing many different things, e.g. measure A for branch 1 vs branch 2, branch 1 vs branch 3, and so on, and the same for measure B. There are ways to handle this, e.g. Tukey's Range Test (but that covers all possible comparisons); a simple approach is:
- compute sample sizes for measure A for branch 1 vs branch 2, but with the Type 1 error ($\alpha$) adjusted using a simple Bonferroni correction (see the sketch after this list)
- repeat for measure B
- choose the larger sample size
- if sample size is too big, try a smaller but still acceptable power and/or larger but still acceptable meaningful difference.
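A sketch of the adjustment, reusing the quantities from the earlier `pwr.t.test` call and assuming, say, three pairwise comparisons for measure A:

m <- 3                                    # number of comparisons for measure A
pwr.t.test(d = (tmean5p - tmean) / poolsd,
           sig.level = 0.01 / m,          # Bonferroni-adjusted Type 1 error
           power = 0.9,
           type = 'two.sample',
           alternative = 'greater')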
An example of multiple comparisons would be offering a translation service across 3 different locales. The Test group receives offers to translate the page and the Control group doesn't. The investigator wishes to see if offering translation services increases daily usage by 5%. You can think of this as 3 different tests (across 3 locales). You can also consider a regression model with covariates, which might lead to smaller sample sizes (see "Complicated Models" below).
The presence of multiple branches might encourage you to use power tests for ANOVA, e.g. `power.anova.test`. However, the underlying test there is whether the branches are significantly different at all; it is not designed to check pairwise differences between branches. Moreover, the effect here (if using the `pwr` package) is not expressed as a relative difference in means but is connected to the variance. In summary, tread carefully when using `power.anova.test`.
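For reference, this is roughly how `power.anova.test` is parameterized; the branch means below are hypothetical values on the log scale, and the effect enters only through the variance of those means, not through any pairwise difference:

group_means <- c(1.00, 1.02, 1.05)             # hypothetical branch means (log scale)
power.anova.test(groups      = length(group_means),
                 between.var = var(group_means),
                 within.var  = poolsd^2,       # within-group variance from the sample
                 sig.level   = 0.01,
                 power       = 0.9)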
Consider Tim's power analysis for the Webrender Experiment. There are about 10 histograms, each of which was converted to one value per profile. Tim observed that taking a log transform brought the distributions close enough to normal (symmetry and not very thick tails).
His work follows the above closely, but he uses the base function `power.t.test`, which might be a bit easier to use (the same principles apply).
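For comparison, a sketch of the equivalent call with the base function, reusing the quantities from the `pwr.t.test` example (note that `power.t.test` takes the raw difference and standard deviation rather than Cohen's D):

power.t.test(delta       = tmean5p - tmean,   # absolute difference on the log scale
             sd          = poolsd,
             sig.level   = 0.01,
             power       = 0.9,
             type        = 'two.sample',
             alternative = 'one.sided')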
So far we have assumed the metrics follow a parametric distribution. You can instead use non-parametric tests; to compute sample sizes under a non-parametric set-up, the easiest option is to download G*Power.
However, given we have prior measurements, we can also simulate the power for a given test. The idea is: if we apply a t-test to data that is wildly different from normal, how different is the nominal power from the real power? We can estimate the real power from simulation studies. For examples of different approaches, see Tim's work here and some work by our interns here.
Briefly, Tim had data from a Poisson distribution and the objective there was to detect a difference between the branches.
For the record, G*Power gave similar power to the simulated value for a sample of size 1000 (G*Power can compute power for Mann-Whitney tests).
But note, importantly, that the Mann-Whitney test, which assumes the underlying distribution is continuous, has been applied here to discrete data. In such cases it is a good idea to simulate your power, as Tim did.
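A minimal sketch of power by simulation (the Poisson rate, the 5% lift, and the use of a Mann-Whitney test below are illustrative assumptions, not Tim's exact setup):

set.seed(1)

## estimated power: how often does the test reject at level alpha
## when the Test branch truly has a 5% higher Poisson mean?
sim_power <- function(n, lambda = 4, lift = 1.05, alpha = 0.01, nsim = 2000) {
  rejections <- replicate(nsim, {
    control <- rpois(n, lambda)
    test    <- rpois(n, lambda * lift)
    wilcox.test(test, control, alternative = "greater")$p.value < alpha
  })
  mean(rejections)
}

sim_power(n = 1000)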
Don't just report the p-value. Also report the means of both groups, the difference (effect size) and the other definition of effect size (difference divided by pooled standard deviation). p-values will generally decrease with increasing sample size.
You're fitting something like

glmmPQL(result ~ branch + active_hours + os, random = ~ 1 | profile_id, data = foo, family = 'poisson')

i.e. a mixed-effects Poisson model (e.g. `glmmPQL` from MASS, or equivalently `glmer` from lme4): count data modeled on branch (e.g. Test and Control) and active hours, with multiple observations per client (and thus possible correlation within a client)? How does one choose the sample size then?
I'm going to go out on a limb and say that if you consider the data as just `client_id, branch, result` and compute sample sizes based on this (for a given power, Type 1 error, etc.), you will have an upper bound on the sample size (because in the fuller model, error moves into the other covariates, so your branch effect becomes more precise and a smaller sample suffices).
Once again, you can run simulations based on the above model and the statistic for testing the branch coefficient. This is clearly a lot of effort.
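As a rough sketch of that simulation approach (the rate, lift, and profile-level variance below are made-up values, and the model is simplified to a branch term only):

library(lme4)

## estimate power for the branch coefficient of a Poisson GLMM by simulation
simulate_branch_power <- function(n_per_branch, n_days = 14, lift = 1.05,
                                  base_rate = 5, sd_profile = 0.5,
                                  alpha = 0.01, nsim = 200) {
  rejections <- replicate(nsim, {
    n <- 2 * n_per_branch
    d <- data.frame(
      profile = factor(rep(seq_len(n), each = n_days)),
      branch  = rep(c("control", "test"), each = n_per_branch * n_days)
    )
    re <- rnorm(n, 0, sd_profile)                # per-profile random intercept
    mu <- exp(log(base_rate) + log(lift) * (d$branch == "test") +
              re[as.integer(d$profile)])
    d$result <- rpois(nrow(d), mu)
    fit <- glmer(result ~ branch + (1 | profile), data = d, family = poisson)
    coef(summary(fit))["branchtest", "Pr(>|z|)"] < alpha
  })
  mean(rejections)
}

## e.g. simulate_branch_power(n_per_branch = 500)   # slow, but gives a power estimate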
If you anticipate day of week effects will affect conclusions, then monitor a profile for a week (at least).
Also, based on the filters and enrollment criteria, you can run some queries on the data to estimate how long it will take to enroll the requisite numbers. The sample sizes determined above are after filters have been applied.