Skip to content

Instantly share code, notes, and snippets.

@sinarueeger
Last active July 18, 2022 23:15
Show Gist options
  • Save sinarueeger/718c7ec3cee473ad0beed4feea4d0e24 to your computer and use it in GitHub Desktop.
Save sinarueeger/718c7ec3cee473ad0beed4feea4d0e24 to your computer and use it in GitHub Desktop.
Estimating phylogenetic PCs

Phylogenetic principal component analysis (in R)

The following is a guide on how to perform a phylogenetic principal component analysis.

The phylogenetic principal component analysis, short pPCA, has been proposed by Revell in 2009 and Jombart et al. 2010.

Jombart also implemented pPCA in the adephylo R-package.

pPCA takes the phylogenetic non-independence among species means into account (see Polly et al. (2013)).

To estimate phylogenetic PCs, two steps must be performed.

  1. Construct phylogenetic tree

  2. Perform pPCA

  3. Include in linear regression

1. Construct phylogenetic tree

To construct the phylogenetic tree, we use the .fasta files as input.

Output should be an object with a .tre file extension.

2. Perform pPCA

To perform pPCA we use two R-functions: ape::read.tree to import the phylogenetic tree from 1., and phylobase::phylo4d and adephylo::ppca to run the pPCA.

Here is an introduction on how to import tree data into R.

## From https://www.rdocumentation.org/packages/ape/versions/5.2/topics/read.tree
s <- "owls(((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);"
cat(s, file = "ex.tre", sep = "\n")
tree_owls <- ape::read.tree("ex.tre")
plot(tree_owls)

Next, we apply the function phylobase::phylo4d to combine the tree tree_owls and the phenotype data Y. Y: n × m data matrix containing the data for m traits (in our case SNPs) measured in n species (=individuals)

vir_4d <- phylobase::phylo4d(tree_owls, Y)

Lastly, we apply the function adephylo::ppca to calculate the pPCA. The main usage is vir_pca <- adephylo::ppca(x = vir_4d, scale = FALSE, scannf = FALSE, nfposi = 16, method = "Abouheif"), where

  • x: is a phylo4d object (from phylobase::phylo4d)
  • nfposi: number of eigenvalues (= number of PCs)

A list of objects is returned, and $li are the pPCA we are interested in.

  • li: PCAs (corresponds to prcomp(Y)$x)
  • c1: loadings (corresponds to prcomp(Y)$rotation)
  • eig: eigenvalues (corresponds to prcomp(Y)$sdev^2)

Let's make a small example with an artificial Y matrix and tree_owls from above.

n <- 4 ## four species (as in tree.owls
m <- 100 ## 100 traits (in our case these would be SNPs)
Y <- matrix(sample(c(0, 1, 2), n*m, replace = TRUE), nrow = n ) ## sample random 0s, 1s and 2s. 0s and 1s won't work with option corr, only cov!
rownames(Y) <- c("Strix_aluco", "Asio_otus", "Athene_noctua", "Tyto_alba")
## Y <- matrix(sample(c(0,1), (n*m), replace = TRUE), nrow = 4 )
## make sure that the same order as tree!

vir_4d <- phylobase::phylo4d(tree_owls, Y) 
vir_pca <- adephylo::ppca(vir_4d, scale = FALSE, scannf = FALSE, nfposi = 2, method = "Abouheif")$li

Note

  • Both packages (phylobase and adephylo) have a number of dependencies (e.g. the fs package). Make sure to install the necessary tools.
  • Only include individuals that are in both datasets, the tree and the phenotype.
  • If you have missings in your Y matrix, you need to impute them (or remove these individuals). A simple imputation is to replace missing values by the row mean.
  • You should choose the number of PCs nfposi depending on your dataset.
ind <- which(is.na(Y), arr.ind = TRUE)
Y[ind] <- rowMeans(Y[,-1], na.rm = TRUE)[ind[,1]]

3. Include in linear regression

Now that you have the phylogentic PCs, you can include them as predictors in your regression model, see Naret et al. 2018.

The example above only has four observations, but for the sake of showing code we can still write down a regression model (it won't run though because of the low sample size).

dat <- data.frame(weight = rnorm(nrow(vir_pca), mean = 170), 
                  some_predictor = rnorm(nrow(vir_pca), mean = 0),
                  vir_pca)
dat
#                 weight some_predictor       PC1       PC2
# Strix_aluco   171.9021      1.3546809 -4.373603  1.076891
# Asio_otus     168.6747      0.4932644 -4.384902 -1.851824
# Athene_noctua 171.3471     -0.2525620  3.193547  5.946482
# Tyto_alba     170.3702      0.6291904  5.564957 -5.171549

model <- lm(weight ~ some_predictor + PC1 + PC2 , data = dat)

Notes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment