sinarueeger/phylo-pca.md

## phylo-pca.md

      
    Raw
  

              phylo-pca.md
            
          
    Phylogenetic principal component analysis (in R)

The following is a guide on how to perform a phylogenetic principal component analysis.
The phylogenetic principal component analysis, short pPCA, has been proposed by Revell in 2009 and Jombart et al. 2010.
Jombart also implemented pPCA in the adephylo R-package.
pPCA takes the phylogenetic non-independence among species means into account (see Polly et al. (2013)).

To estimate phylogenetic PCs, two steps must be performed.


Construct phylogenetic tree


Perform pPCA


Include in linear regression


1. Construct phylogenetic tree

To construct the phylogenetic tree, we use the .fasta files as input.
Output should be an object with a .tre file extension.
2. Perform pPCA

To perform pPCA we use two R-functions: ape::read.tree to import the phylogenetic tree from 1., and phylobase::phylo4d and  adephylo::ppca to run the pPCA.
Here is an introduction on how to import tree data into R.
## From https://www.rdocumentation.org/packages/ape/versions/5.2/topics/read.tree
s <- "owls(((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);"
cat(s, file = "ex.tre", sep = "\n")
tree_owls <- ape::read.tree("ex.tre")
plot(tree_owls)

Next, we apply the function phylobase::phylo4d to combine the tree tree_owls and the phenotype data Y. Y: n × m data matrix containing the data for m traits (in our case SNPs) measured in n species (=individuals)
vir_4d <- phylobase::phylo4d(tree_owls, Y)

Lastly, we apply the function adephylo::ppca to calculate the pPCA. The main usage is vir_pca <- adephylo::ppca(x = vir_4d, scale = FALSE, scannf = FALSE, nfposi = 16, method = "Abouheif"), where

x: is a phylo4d object (from phylobase::phylo4d)
nfposi: number of eigenvalues (= number of PCs)

A list of objects is returned, and $li are the pPCA we are interested in.

li: PCAs (corresponds to prcomp(Y)$x)
c1: loadings (corresponds to prcomp(Y)$rotation)
eig: eigenvalues (corresponds to prcomp(Y)$sdev^2)

Let's make a small example with an artificial Y matrix and tree_owls from above.
n <- 4 ## four species (as in tree.owls
m <- 100 ## 100 traits (in our case these would be SNPs)
Y <- matrix(sample(c(0, 1, 2), n*m, replace = TRUE), nrow = n ) ## sample random 0s, 1s and 2s. 0s and 1s won't work with option corr, only cov!
rownames(Y) <- c("Strix_aluco", "Asio_otus", "Athene_noctua", "Tyto_alba")
## Y <- matrix(sample(c(0,1), (n*m), replace = TRUE), nrow = 4 )
## make sure that the same order as tree!

vir_4d <- phylobase::phylo4d(tree_owls, Y) 
vir_pca <- adephylo::ppca(vir_4d, scale = FALSE, scannf = FALSE, nfposi = 2, method = "Abouheif")$li


Note


Both packages (phylobase and adephylo) have a number of dependencies (e.g. the fs package). Make sure to install the necessary tools.
Only include individuals that are in both datasets, the tree and the phenotype.
If you have missings in your Y matrix, you need to impute them (or remove these individuals). A simple imputation is to replace missing values by the row mean.
You should choose the number of PCs nfposi depending on your dataset.

ind <- which(is.na(Y), arr.ind = TRUE)
Y[ind] <- rowMeans(Y[,-1], na.rm = TRUE)[ind[,1]]

3. Include in linear regression

Now that you have the phylogentic PCs, you can include them as predictors in your regression model, see Naret et al. 2018.
The example above only has four observations, but for the sake of showing code we can still write down a regression model (it won't run though because of the low sample size).
dat <- data.frame(weight = rnorm(nrow(vir_pca), mean = 170), 
                  some_predictor = rnorm(nrow(vir_pca), mean = 0),
                  vir_pca)
dat
#                 weight some_predictor       PC1       PC2
# Strix_aluco   171.9021      1.3546809 -4.373603  1.076891
# Asio_otus     168.6747      0.4932644 -4.384902 -1.851824
# Athene_noctua 171.3471     -0.2525620  3.193547  5.946482
# Tyto_alba     170.3702      0.6291904  5.564957 -5.171549


model <- lm(weight ~ some_predictor + PC1 + PC2 , data = dat)

Notes


pPCA has been used and documented by Ruffell et al. (2014), see details in appendix.
A summary on the method is provided in Polly et al. (2013).
Implementation of the pPCA algorithm in R by Liam Revell does not work well (returned complex numbers) (Revell 2012, available in the package phytools).