Power analyses are important in experimental study design so that the researcher has an idea of how many subjects are needed to keep Type II error acceptably low, and most NIH funding applications require sample size and power calculations. In microbiome studies power analyses can be difficult not just because the true effect size is unknown, but also because the composition of the microbiome in the control and experimental groups (i.e. beta diversity) is generally unknown. However, if the study intends to compare groups using pairwise distances and PERMANOVA, then we can use methods from the R package micropower (Kelly et al. 2015) to perform power analyses, as long as we have a previously available dataset characteristic of one of the groups (most likely the control) and some idea of the effect size. The basic idea behind micropower is to simulate distance matrices given population parameters computed from previous studies, then simulate a range of effect sizes, rarefaction levels, etc. to estimate PERMANOVA power from the simulated distance matrices.
micropower is a great package that I've used to perform power analyses for researchers planning 16S and whole genome shotgun metagenomics experiments. However, it comes without a tutorial or much documentation, so it's not entirely clear how to use it. What follows is a walkthrough of such an analysis, as well as what you will need to consider for your own power analysis.
We will be using data from the Human Microbiome Project for this tutorial (Methé et al. 2012). The Human Microbiome Project has alleviated some of the difficulty of finding accessible, well-characterized data by providing 16S and whole genome shotgun data on 300 healthy human subjects from the US across multiple body sites.
In particular, we will (1) use the HMP1 Metadata Project Catalog to find shotgun metagenomic gut samples, and then (2) match the samples' SRA IDs to the abundance tables provided in the shotgun community profiling database HMSCP. These abundance tables are what we will use to compute the necessary population parameters for the power analysis, after transforming them into an OTU table.
This entire analysis can be reproduced by cloning the repository at http://github.com/compbiocore/metagenomics-power-analyses-tutorial and executing the R notebook (tutorial.Rmd) there. If you don't already have the packages micropower, devtools, and others installed, then execute the commands below to install the packages.
install.packages('devtools')
devtools::install_github("brendankelly/micropower")
install.packages('knitr')
install.packages('kableExtra')
install.packages('dplyr')
install.packages('ggplot2')
# the 'parallel' package ships with base R, so it does not need to be installed
Afterwards we load the packages we need. We also set a random seed so as to make the results from this notebook fully reproducible. Additionally, we load the package parallel to allow us to run operations like lapply split across multiple cores (via mclapply). If your machine doesn't have multiple cores you can omit the package, but everything will take longer, so in that case we recommend reducing the number of simulations to get through the tutorial.
library(micropower)
## Loading required package: permute
library(knitr)
library(kableExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(parallel)
cores = detectCores()
set.seed(515087345)
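As noted above, mclapply relies on forking, which does not help on a single-core machine and is unavailable on Windows. A small wrapper like the following, which is our own sketch and not part of the tutorial, falls back to a serial lapply in those cases:

# Hypothetical helper (not in the original tutorial): run FUN over X in parallel
# when fork-based parallelism is available, otherwise fall back to plain lapply().
safe_lapply <- function(X, FUN, n_cores = parallel::detectCores()) {
  if (.Platform$OS.type == "unix" && n_cores > 1) {
    parallel::mclapply(X, FUN, mc.cores = n_cores)
  } else {
    lapply(X, FUN)  # serial fallback: single core, or Windows (no forking)
  }
}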
We will assume that you have cloned the entire repository as well and thus have the same path to the data. If not, then modify the variable PATH to point to the path of the data.
PATH='data'
The analysis in general consists of the following steps. First, we decide on the distance metric we will be using. micropower accommodates both unweighted and weighted Jaccard distances as well as unweighted and weighted UniFrac distances (Lozupone and Knight 2005). A discussion of which distance measure to choose is at the end of this post.
Then we find an appropriate dataset and wrangle the data into an OTU table. The necessary input for micropower is the within-group mean and standard deviation of pairwise distances. These can be computed from OTU tables using the chosen distance metric. Thus any dataset that is used should either provide within-group means and standard deviations already, provide a within-group distance matrix, or provide data that can be transformed into an OTU table.
After that we run three different OTU table simulations to determine different parameters. In the first, we simulate OTU tables at different levels of subsampling (rarefaction) in order to find the subsampling level that reproduces the within-group mean. In the second, we simulate OTU tables with different numbers of OTUs at that rarefaction level in order to find the number of OTUs that reproduces the within-group standard deviation. Finally, we simulate OTU tables over a range of effect sizes at the chosen rarefaction level and number of OTUs. This last step lets us answer questions relating power, sample size, and effect size: for example, how much power we will have given an effect size and sample size, the minimum detectable effect size for a given power level and sample size, or how many samples we need to achieve at least x power while detecting an effect size of y.
In our hypothetical experiment for this tutorial we are considering 30 samples each in the control and experimental groups, a sequencing depth of 1000 per OTU bin, and a PERMANOVA test using the weighted Jaccard distance. The question we seek to answer is whether we have at least 90% power to detect differences between our control and experimental groups.
To start with, we have provided the project catalog from HMP as described above. You can download this yourself from the HMP site by clicking on "HMP metagenomic samples" under Browse datasets, clicking on "View Project Catalog" at the bottom, and then clicking on the "Save to CSV" button. From here we filter for samples where the collection site is listed as gastrointestinal_tract (gut) and the gene count is greater than 0 (shotgun metagenomics as opposed to 16S) to get a list of SRA IDs. This results in 147 samples.
# getting sample IDs
project_catalog<-read.csv(file.path(PATH,"project_catalog.csv"))
kable(head(project_catalog), caption="A few rows of values and column names in project catalog.") %>% kable_styling()
Sequence.Read.Archive.ID | NCBI.Project.ID | HMP.Isolation.Body.Site | HMP.Isolation.Body.Subsite | Gene.Count | IMG.HMP.ID | Sequencing.Center | Addition.Date | Last.Modification.Date | X | X.1 |
---|---|---|---|---|---|---|---|---|---|---|
SRS011105 | 48479 | airways | Nares | 4432 | 7000000387 | Broad Institute | 2011-04-25T20:00:00 | 2011-08-15T20:00:00 | NA | NA |
SRS011132 | 48479 | airways | Nares | 316 | 7000000030 | Broad Institute | 2011-04-25T20:00:00 | 2011-08-15T20:00:00 | NA | NA |
SRS011263 | 48479 | airways | Nares | 108 | 7000000542 | Broad Institute | 2011-04-25T20:00:00 | 2011-08-15T20:00:00 | NA | NA |
SRS011397 | 48479 | airways | Nares | 8467 | 7000000209 | Broad Institute | 2011-04-25T20:00:00 | 2011-08-15T20:00:00 | NA | NA |
SRS012291 | 48479 | airways | Nares | 37 | 7000000717 | Broad Institute | 2011-04-25T20:00:00 | 2011-08-15T20:00:00 | NA | NA |
SRS012663 | 48479 | airways | Nares | 4460 | 7000000424 | Broad Institute | 2011-04-25T20:00:00 | 2011-08-15T20:00:00 | NA | NA |
IDs <- select(project_catalog, Sequence.Read.Archive.ID, HMP.Isolation.Body.Site, Gene.Count) %>%
filter(HMP.Isolation.Body.Site=='gastrointestinal_tract' & Gene.Count>0) %>%
.$Sequence.Read.Archive.ID
length(IDs)
## [1] 147
With the list of SRA IDs we can then download the abundance tables associated with those SRA IDs. Here we have provided them in the GitHub repo for you, but if you want to download your own data, the HMP project provides an Aspera client to do so.
The abundance tables from HMP need to be set up so that columns are samples (SRA IDs), rows are reference names, and entries are the number of reads per sample per reference, i.e. like an OTU table. The information provided in each sample's abundance table is depth, breadth, and total reference bases: depth is the number of times the reference was covered, breadth is the percentage of reference bases that were covered (given as a number out of 100), and total reference bases is the length of the genome. The sequencing reads are 100 bases long, so to get the number of reads covering a reference we take depth × total reference bases × breadth / (100 × 100). We then join the 147 samples together as columns.
# read each sample's abundance table and convert depth/breadth/genome length to read counts
tsvs<-lapply(IDs, function(x) read.csv(file.path(PATH,paste0(x,'.tsv')),sep='\t') %>% transmute(Reference.Name=Reference.Name,Nreads=Depth*Breadth*Total.reference.bases/10000))
# join the per-sample tables into one OTU-like table keyed on Reference.Name
otu_table<-Reduce(function(...) full_join(...,by="Reference.Name"), tsvs)
colnames(otu_table) <- c("Reference.Name",as.character(IDs))
otu_table[is.na(otu_table)] <- 0 # convert NAs to 0 (reference absent in that sample)
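As a quick sanity check of the reads formula with made-up numbers (purely illustrative, not from HMP): a 1,000,000-base reference covered over half its length (breadth = 50) at depth 2, with 100-base reads, gives

# depth * total reference bases * breadth / (100 * 100)
2 * 1e6 * 50 / (100 * 100) # = 10,000 reads (depth 2 over 500,000 covered bases, 100 bases per read)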
Now that we have an OTU table we can compute the pairwise weighted Jaccard distances with calcWJstudy. Then we take the mean and standard deviation of those distances to get the necessary population parameters.
jaccard<-calcWJstudy(otu_table[,2:148])
m<-mean(jaccard) #0.7676824
s<-sd(jaccard) #0.1317751
Once we have determined population parameters for within-group distances, we now use hashMean to simulate a list of OTU tables at different levels of subsampling (rare_levels) so that we can determine what level of subsampling produces OTU tables with the same mean pairwise distance, 0.7676824.
#weight_hm<-hashMean(rare_levels=runif(1000,0,1),rep_per_level=10,otu_number=1000,sequence_depth=1000)
#means<-mclapply(weight_hm, function(x) (mean(lowerTriDM(calcWJstudy(x)))),mc.cores=cores)
#names(means) <- sapply(strsplit(names(means),"_"),FUN=function(x) {x[[1]]})
#save(means,file="means.Rdata")
load("means.Rdata")
mean_df <- data.frame(subsampling=as.numeric(names(means)),means=as.numeric(means),target=abs(as.numeric(means)-m))
subsample <- mean_df[which(mean_df$target==min(mean_df$target)),]$subsampling
qplot(data=mean_df,x=subsampling,y=means)
In more detail:

- hashMean simulates a list of OTU tables with different levels of subsampling (rare_levels). We draw 1000 subsampling levels uniformly from [0,1] here.
- rep_per_level indicates the number of replicates per subsampling level. Thus we have 1000 x 10 = 10000 simulated OTU tables.
- otu_number indicates the number of simulated OTUs, and sequence_depth indicates the average number of sequence counts per OTU bin. The names of the list are the subsampling levels followed by the index number.
- We use mclapply to parallelize computing the Jaccard distances of the OTU tables with calcWJstudy and then the means of those distances.
- To save time, we loaded the means from an Rdata object.
- mean_df is a dataframe with three columns: the subsampling level (taken from names(means)), the within-group mean of the Jaccard distance for the OTU table generated at that subsampling level, and the absolute difference between the simulated within-group mean and the true within-group mean.
- From here we take the subsampling level that corresponds to the minimum absolute difference between the simulated mean and the true within-group mean. In theory we could fit an exponential decay or loess regression (you can do so yourself with fit <- nls(means ~ SSasymp(subsampling, yf, y0, log_alpha), data = mean_df); see the sketch after this list), but the fit is not the best. Since we are just trying to get an estimate, this works as well.
- qplot provides a plot of how the within-group mean drops as the subsampling level increases.
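The curve-fitting alternative mentioned above would look roughly like the sketch below. It assumes the asymptotic (exponential decay) model fits well enough, which, as noted, is not really the case here, and inverts the fitted curve at the target within-group mean m.

# Optional sketch (not run in the tutorial): fit an exponential-decay model to the
# simulated means and invert it at the target within-group mean m.
fit <- nls(means ~ SSasymp(subsampling, yf, y0, log_alpha), data=mean_df)
co <- coef(fit)
# SSasymp: means = yf + (y0 - yf) * exp(-exp(log_alpha) * subsampling)
subsample_fit <- -log((m - co["yf"]) / (co["y0"] - co["yf"])) / exp(co["log_alpha"])
subsample_fit # compare with the grid-search estimate in subsample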
Once we have the level of subsampling ($subsample = 7.7291369 \times 10^{-4}$), we use hashSD to simulate a list of OTU tables with different numbers of OTUs at our specific subsampling level, to determine what number of OTUs produces OTU tables with the same standard deviation of pairwise distances. These are the additional parameters that will be used to simulate OTU tables under a range of effect sizes. More on the theory behind this and its validation is available in the paper.
#weight_hsd<-hashSD(rare_depth=subsample,otu_number_range=10^runif(n = 100,min = 1,max = 5),sim_number=30, sequence_depth=1000)
#sds<-mclapply(weight_hsd, function(x) (sd(lowerTriDM(calcWJstudy(x)))), mc.cores=cores)
#names(sds) <- sapply(names(sds), function(x) substring(x,4))
#save(sds,file="sds.Rdata")
load("sds.Rdata")
sds_df <- data.frame(otunum=as.numeric(names(sds)),sd=as.numeric(sds),target=abs(as.numeric(sds)-s))
otunum <- sds_df[which(sds_df$target==min(sds_df$target)),]$otunum
qplot(data=sds_df,x=otunum,y=sd) + geom_smooth(method="lm") + scale_x_log10() + scale_y_log10() + xlab("log10(otunum)") + ylab("log10(sd)")
[Figure: scatter plot of within-group means vs. subsampling level with weighted Jaccard distance, from hashMean]
In more detail:

- hashSD simulates a list of OTU tables with different numbers of OTUs sampled.
- rare_depth is the subsampling level, and should be set to the subsampling level estimated earlier.
- otu_number_range is the range of OTU numbers to test.
- sim_number is the number of simulated subjects per OTU number, and should correspond to the expected number of subjects in your study.
- sequence_depth is the same as in hashMean, and should remain the same.
- The Jaccard distance computation can take a long time, so to save time sds has already been saved and loaded.
- Similarly to above, we compute the standard deviation for each OTU table from hashSD and then pick the OTU number that corresponds most closely to the known standard deviation.
- The plot here is done on a log scale and fitted with a linear model, which models the relationship between standard deviation and OTU number well. We could have used lm and predict instead if we had wanted to (see the sketch after this list).
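That lm-and-predict route would look roughly like the sketch below. It assumes the log-log linear relationship shown in the plot holds well enough to invert, and solves for the OTU number at the target within-group standard deviation s.

# Optional sketch (not run in the tutorial): fit sd against OTU number on the
# log10 scale, then invert the line at the target standard deviation s.
sd_fit <- lm(log10(sd) ~ log10(otunum), data=sds_df)
co <- coef(sd_fit)
otunum_fit <- 10^((log10(s) - co[1]) / co[2]) # solve log10(s) = intercept + slope * log10(otunum)
otunum_fit # compare with the grid-search estimate in otunum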
Finally, the OTU number and the rarefaction level get passed to simPower, along with the number of subjects in each group and a range of effect sizes, in order to simulate a list of OTU tables. We then compute the weighted Jaccard distance for each simulated OTU table one more time. We can check here that the means and standard deviations of these OTU tables correspond approximately to the mean and standard deviation we had for our control group.
sp<-simPower(group_size_vector=c(30,30), otu_number=otunum, rare_depth=subsample,sequence_depth=1000,effect_range=seq(0,0.3,length.out=100))
wj<-mclapply(sp,function(x) calcWJstudy(x),mc.cores=cores)
# check if mean & sd is approximately OK
mean(as.numeric(mclapply(wj, mean, mc.cores=4)))
## [1] 0.7271514
mean(as.numeric(mclapply(wj, sd, mc.cores=4)))
## [1] 0.1690322
We can now compute PERMANOVA power with bootstrapping for each simulated distance matrix, given the number of subjects to sample from each group and the Type I error rate. Finally, we can convert our list of PERMANOVA powers into a dataframe with a column for (log) effect size and a column for (log) power, fit a linear model, and use it to predict the smallest detectable effect size for a given power level. We can also plot the relationship between effect size and power for a given number of subjects per group (here 30).
bp30<-bootPower(wj, boot_number=10, subject_group_vector=c(30,30),alpha=0.05)
bp30_model <- subset(bp30, power < 0.95 & power > 0.2)
bp30_model <- data.frame(log_omega2=log10(bp30_model$simulated_omega2),log_power=log10(bp30_model$power))
bp30_model <- subset(bp30_model, log_omega2>-Inf)
bp30_lm <- lm(log_omega2 ~ log_power, data=bp30_model)
power80 <- 10^predict(bp30_lm, newdata=data.frame(log_power=log10(0.8)))
power90 <- 10^predict(bp30_lm, newdata=data.frame(log_power = log10(0.9)))
ggplot2::qplot(data=bp30_model,x=log_power,y=log_omega2) + geom_smooth(method="lm") + ggtitle("log10(Omega2) to log10(Power) with 30 Subjects/Group") + xlab("log10(Power)") + ylab("log10(Omega2)")
It turns out we have 80% power to detect an effect size of 0.0127314, and 90% power to detect an effect size of 0.0167005. Is this good enough? Well, it falls within the lower end of the range of catalogued effect sizes from the micropower paper (Kelly et al. 2015), so probably. But of course it depends on what you're actually looking to study; see "How do I know the effect size for my study?" below. That is, however, all the steps needed in micropower to actually perform a power analysis for a microbiome study using PERMANOVA.
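If the question is instead how many subjects per group are needed, one option (our own extension, not part of the original tutorial) is to repeat the bootPower fit over a few candidate group sizes and compare the minimum effect size detectable at 90% power. The sketch below reuses the same bootPower, lm, and predict recipe as above and assumes the candidate sizes do not exceed the 30 subjects per group that were simulated; re-running the bootstrap for each size can take a while.

# Hypothetical extension: minimum detectable effect size at 90% power for a few
# candidate group sizes, using the distance matrices already simulated in wj.
group_sizes <- c(15, 20, 25, 30)
min_es_90 <- sapply(group_sizes, function(n) {
  bp <- bootPower(wj, boot_number=10, subject_group_vector=c(n, n), alpha=0.05)
  bp <- subset(bp, power < 0.95 & power > 0.2 & simulated_omega2 > 0)
  fit <- lm(log10(simulated_omega2) ~ log10(power), data=bp)
  10^predict(fit, newdata=data.frame(power=0.9))
})
data.frame(group_size=group_sizes, detectable_omega2_at_90pct_power=min_es_90)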
Well that’s always the difficulty of power analyses, isn’t it? Luckily
in the micropower
paper there is a table cataloguing effect sizes with
different distance measures and different comparisons that can be used
as a point of reference. If the minimum effect size able to be detected
at a x power falls within the range of catalogued effect sizes then it
is a good bet that your study has enough power with the proposed number
of subjects. With a pilot study or a prior study that can be used as a
proxy, as long as you have OTU tables or distance matrices, you can use
the calcOmega2
function from micropower
to estimate the effect size
as well.
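If you prefer not to rely on calcOmega2, or want to cross-check it, the effect size can also be computed by hand from a PERMANOVA fit on pilot data. The sketch below is our own illustration: dm (a distance matrix over pilot samples) and grp (their group labels) are hypothetical placeholders, not objects created in this tutorial, and vegan's adonis2 is used as a stand-in to obtain the sums of squares that go into omega-squared.

# Hypothetical illustration: estimate omega-squared from pilot data with two or
# more groups. dm is a distance matrix and grp a factor of group labels.
library(vegan)
estimate_omega2 <- function(dm, grp) {
  fit <- adonis2(dm ~ grp, permutations=99)
  ss_between <- fit$SumOfSqs[1]
  ss_total <- sum(fit$SumOfSqs[1:2])
  ms_within <- fit$SumOfSqs[2] / fit$Df[2]
  # omega^2 = (SS_between - df_between * MS_within) / (SS_total + MS_within)
  (ss_between - fit$Df[1] * ms_within) / (ss_total + ms_within)
}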
micropower allows for two distance measures, Jaccard and UniFrac, each in two variants: unweighted or weighted. Jaccard and UniFrac distances are also the most commonly used in microbiome studies. UniFrac is a phylogeny-based distance metric that accounts for the fraction of branch length in a phylogenetic tree that is unique to one of the two sets of taxa, while the Jaccard distance is a non-phylogenetic distance metric between two sets. In general, if a phylogenetic tree is available then UniFrac is a better choice. The choice between unweighted and weighted comes down to what you're trying to measure in your study: unweighted measures consider only the presence or absence of taxa, while weighted measures take relative abundance into account as well.
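To make the unweighted/weighted distinction concrete, the snippet below contrasts a presence/absence Jaccard distance with an abundance-based one on the OTU table built earlier. It is illustrative only: it uses vegan's vegdist as a stand-in, and vegan's quantitative "jaccard" option is not the same weighted Jaccard formula that micropower's calcWJstudy implements.

# Illustration only: binary vs. abundance-based Jaccard on the same samples.
# vegdist expects samples as rows, so transpose the OTU table (dropping the
# Reference.Name column).
library(vegan)
counts <- t(otu_table[, -1])
uj <- vegdist(counts, method="jaccard", binary=TRUE) # presence/absence only
qj <- vegdist(counts, method="jaccard") # quantitative (abundance-aware) variant
c(mean(uj), mean(qj))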
I like micropower specifically because (1) it deals with the PERMANOVA test, which is commonly used in microbiome studies, and (2) you only need abundance tables for one group, which usually means you can just use control data from HMP. I also like the HMP package (Rosa et al. 2012), but in order to do power and sample size calculations with it you need to have abundance tables for all groups. It is also meant for power analyses with the Dirichlet-multinomial mixture model.
Kelly, Brendan J., Robert Gross, Kyle Bittinger, Scott Sherrill-Mix, James D. Lewis, Ronald G. Collman, Frederic D. Bushman, and Hongzhe Li. 2015. “Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btv183.
Lozupone, Catherine, and Rob Knight. 2005. “UniFrac: A new phylogenetic method for comparing microbial communities.” Applied and Environmental Microbiology. https://doi.org/10.1128/AEM.71.12.8228-8235.2005.
Methé, Barbara A., Karen E. Nelson, Mihai Pop, Heather H. Creasy, Michelle G. Giglio, Curtis Huttenhower, Dirk Gevers, et al. 2012. “A framework for human microbiome research.” Nature. https://doi.org/10.1038/nature11209.
Rosa, Patricio S. la, J. Paul Brooks, Elena Deych, Edward L. Boone, David J. Edwards, Qin Wang, Erica Sodergren, George Weinstock, and William D. Shannon. 2012. “Hypothesis Testing and Power Calculations for Taxonomic-Based Human Microbiome Data.” PLoS ONE. https://doi.org/10.1371/journal.pone.0052078.