Created
December 1, 2015 19:14
-
-
Save jgx65/bd5d1fc7f16bd6de8161 to your computer and use it in GitHub Desktop.
Vignette for population differentiation on SNP data
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Calculating genetic differentiation and clustering methods from SNP data" | |
output: pdf_document | |
--- | |
Introduction | |
======================================================== | |
In this vignette, we will discuss how to assess population genetic structure | |
from SNP data at population level. We will estimate $F_{st}$ per population, | |
Pairwise $F_{st}$, AMOVA (Hierarchical $F_{st}$). We will finally assess the | |
genetic structure at individual level assuming that we do not know populations | |
using a multivariate analysis. | |
The dataset used for those analysis concerns the plant: lodgepole pine (*Pinus | |
contorta*, *Pinaceae*). You can have more information on this data set and the | |
species on the web site of A. Eckert: (http://eckertdata.blogspot.fr/). But | |
here the dataset is used as a test dataset with no idea of interpreting the | |
results in a biological way. We will work on a subset of the dataset to make | |
the calculations faster. | |
Resources/Packages | |
======================================================== | |
```{r,packages, message=FALSE} | |
library(adegenet) | |
library(hierfstat) | |
``` | |
Workflow | |
======================================================== | |
### Import data | |
The data are stored in a text file (genotype=AA..). We will import the dataset | |
in R as a data frame, and then convert the SNP data file into | |
[genind](http://www.inside-r.org/packages/cran/adegenet/docs/.valid.genind) | |
objects. | |
Dataset "Master_Pinus_data_genotype.txt" can be downloaded | |
[here](https://github.com/NESCent/popgenInfo/tree/master/data/Master_Pinus_data_genotype.txt). | |
The text file is a matrix of (550 rows x 3086 columns). It contains 4 extra | |
columns: first column is the label of the individuals, the three other are | |
description of the region, all the other columns are for the genotypes as (AA or | |
AT...). | |
When you import the data, you need to be in the right directory. | |
```{r, data_import_df_show,eval=FALSE} | |
Mydata <- read.table("Master_Pinus_data_genotype.txt", header = T) | |
``` | |
```{r, data_import_df_do, echo = FALSE} | |
# You should be in the right directory | |
Mydata <- read.table("../data/Master_Pinus_data_genotype.txt", header = T) | |
``` | |
```{r, data_manipulate} | |
ind <- as.character(Mydata$tree_id) #use later with adegent (individual labels) | |
population <- as.character(Mydata$state) #use later with adegenet (population labels) | |
county <- Mydata$county | |
dim(Mydata) #550 individuals x 3082 SNPs | |
``` | |
Data conversion | |
To convert MydataR to genind objects (adegent), the object locus needs to | |
contain only genotypes. We decrease the number of snp to make the calculations | |
faster and keep only 20 SNPs in the object locus. We convert Mydata1 in a fstat | |
object (Mydata2) | |
```{r,data_conversion} | |
locus <- Mydata[, 5:24] # simpler | |
(Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind, pop = population,sep="")) | |
Mydata2 <- genind2hierfstat(Mydata1) | |
``` | |
## $F_{st}$ (observed and expected heterozygosity) (package hierfstat) | |
```{r,Basicstatiscis_with Fst} | |
basic.stats(Mydata1) # Fst following Nei (1987) on genind object | |
basic.stats(Mydata2) # same as previous, on hierfstat object | |
wc(Mydata1) # Weir and Cockrham estimate | |
``` | |
### Hierarchical $F_{st}$ tests (=AMOVA for SNP dataset) | |
The function `varcomp.glob` produces a Hierarchical Fst (=AMOVA for SNPs or | |
biallelic markers) It is possible to make permutations on the different levels: | |
The function `test.g` tests the effect of the population on genetic | |
differentiation. Individuals are randomly permuted among states. The states | |
influence genetic differentiation at 5% level. With the function | |
`test.between`, the counties are permuted among states. The states influence | |
significantly genetic structuring. | |
```{r,Hierarchical_Fst} | |
loci <- Mydata2[, -1] # Remove the population column | |
varcomp.glob(levels = data.frame(population, county), loci, diploid = TRUE) | |
test.g(loci, level = population) | |
test.between(loci, test.lev = population, rand.unit = county, nperm = 100) | |
``` | |
### Pairwise $F_{st}$ (package hierfstat) | |
```{r,pairewise_Fst} | |
genet.dist(Mydata1,method="WC84") | |
# No test at the moment | |
``` | |
## Unsupervised clustering | |
We don't know the populations and we are looking for. As recommended by T. | |
Jombart, with the function `gpr` we used the maximum possible number of PCA | |
axis which is 20 here. See detailed tutorial on this method for more information | |
(http://adegenet.r-forge.r-project.org/) In this example, we used | |
`choose.n.clust = FALSE` but it is nice to use the option `TRUE` and then you | |
will be able to choose the number of clusters. | |
```{r,clustering_without_a_priori} | |
# using Kmeans and DIAPC in adegenet | |
grp <- find.clusters(Mydata1, max.n.clust = 10, n.pca = 20, choose.n.clust = FALSE) | |
names(grp) | |
grp$grp | |
``` | |
The K means procedure detected 5 groups. We will use this number of group in | |
the discriminant analysis (function `dapc`). On your own dataset, you need to | |
spend more time to estimate the number of cluster. | |
```{r,description_of_clusters} | |
dapc1 <- dapc(Mydata1, grp$grp, n.pca = 20, n.da = 6) | |
scatter(dapc1) # plot of the group | |
``` | |
Five groups were finally detected and represented by the discriminant analysis. | |
The 11 states are not the good level for population analysis. But we need to | |
remember that we used only 20 SNPs to conduct the analysis. | |
Conclusions | |
======================================================== | |
### What did we learn today? | |
In this vignette, we learned how to calculate $F_{st}$ in existing populations | |
and to investigate the effect of population structure on genetic differentiation | |
from hierarchical $F_{st}$ analysis (like AMOVA in the case of SNP). We also run | |
a multivariate analysis to investigate the genetic structure of the data at | |
individual level assuming no population. | |
### What is next? | |
You may now want to move to the estimation of genetic distances (See vignette | |
DistancesSNP, coming soon) | |
# Contributors | |
- Stephanie Manel | |
Reference | |
======================================================== | |
Eckert, A. J., A. D. Bower, S. C. Gonzalez-Martinez, J. L. Wegrzyn, G. Coop and D. B. Neale. 2010. Back to nature: Ecological genomics of loblolly pine (Pinus taeda, Pinaceae). Molecular Ecology 19: 3789-3805. | |
Thierry de Meus, Jerome Goudet "A step-by-step tutorial to use HierFstat to analyse populations hierarchically structured at multiple levels.", Infect. Genet. Evol., vol. 7, no. 6, 2007 | |
# Session Information | |
Information about the R session used for this tutorial. | |
```{r, sessioninfo} | |
options(width = 100) | |
devtools::session_info() | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@hlapp @smanel This file should run with no error, let me know if it does. Thanks!