The following is a guide on how to perform a phylogenetic principal component analysis.
The phylogenetic principal component analysis, short pPCA, has been proposed by Revell in 2009 and Jombart et al. 2010.
Jombart also implemented pPCA in the adephylo R-package.
pPCA takes the phylogenetic non-independence among species means into account (see Polly et al. (2013)).
To estimate phylogenetic PCs, two steps must be performed.
To construct the phylogenetic tree, we use the .fasta
files as input.
Output should be an object with a .tre
file extension.
To perform pPCA we use two R-functions: ape::read.tree
to import the phylogenetic tree from 1., and phylobase::phylo4d
and adephylo::ppca
to run the pPCA.
Here is an introduction on how to import tree data into R.
## From https://www.rdocumentation.org/packages/ape/versions/5.2/topics/read.tree
s <- "owls(((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);"
cat(s, file = "ex.tre", sep = "\n")
tree_owls <- ape::read.tree("ex.tre")
plot(tree_owls)
Next, we apply the function phylobase::phylo4d
to combine the tree tree_owls
and the phenotype data Y
. Y
: n × m data matrix containing the data for m traits (in our case SNPs) measured in n species (=individuals)
vir_4d <- phylobase::phylo4d(tree_owls, Y)
Lastly, we apply the function adephylo::ppca
to calculate the pPCA. The main usage is vir_pca <- adephylo::ppca(x = vir_4d, scale = FALSE, scannf = FALSE, nfposi = 16, method = "Abouheif")
, where
x
: is a phylo4d object (fromphylobase::phylo4d
)nfposi
: number of eigenvalues (= number of PCs)
A list of objects is returned, and $li
are the pPCA we are interested in.
li
: PCAs (corresponds toprcomp(Y)$x
)c1
: loadings (corresponds toprcomp(Y)$rotation
)eig
: eigenvalues (corresponds toprcomp(Y)$sdev^2
)
Let's make a small example with an artificial Y
matrix and tree_owls
from above.
n <- 4 ## four species (as in tree.owls
m <- 100 ## 100 traits (in our case these would be SNPs)
Y <- matrix(sample(c(0, 1, 2), n*m, replace = TRUE), nrow = n ) ## sample random 0s, 1s and 2s. 0s and 1s won't work with option corr, only cov!
rownames(Y) <- c("Strix_aluco", "Asio_otus", "Athene_noctua", "Tyto_alba")
## Y <- matrix(sample(c(0,1), (n*m), replace = TRUE), nrow = 4 )
## make sure that the same order as tree!
vir_4d <- phylobase::phylo4d(tree_owls, Y)
vir_pca <- adephylo::ppca(vir_4d, scale = FALSE, scannf = FALSE, nfposi = 2, method = "Abouheif")$li
- Both packages (
phylobase
andadephylo
) have a number of dependencies (e.g. thefs
package). Make sure to install the necessary tools. - Only include individuals that are in both datasets, the tree and the phenotype.
- If you have missings in your
Y
matrix, you need to impute them (or remove these individuals). A simple imputation is to replace missing values by the row mean. - You should choose the number of PCs
nfposi
depending on your dataset.
ind <- which(is.na(Y), arr.ind = TRUE)
Y[ind] <- rowMeans(Y[,-1], na.rm = TRUE)[ind[,1]]
Now that you have the phylogentic PCs, you can include them as predictors in your regression model, see Naret et al. 2018.
The example above only has four observations, but for the sake of showing code we can still write down a regression model (it won't run though because of the low sample size).
dat <- data.frame(weight = rnorm(nrow(vir_pca), mean = 170),
some_predictor = rnorm(nrow(vir_pca), mean = 0),
vir_pca)
dat
# weight some_predictor PC1 PC2
# Strix_aluco 171.9021 1.3546809 -4.373603 1.076891
# Asio_otus 168.6747 0.4932644 -4.384902 -1.851824
# Athene_noctua 171.3471 -0.2525620 3.193547 5.946482
# Tyto_alba 170.3702 0.6291904 5.564957 -5.171549
model <- lm(weight ~ some_predictor + PC1 + PC2 , data = dat)
- pPCA has been used and documented by Ruffell et al. (2014), see details in appendix.
- A summary on the method is provided in Polly et al. (2013).
- Implementation of the pPCA algorithm in R by Liam Revell does not work well (returned complex numbers) (Revell 2012, available in the package phytools).