Skip to content

Instantly share code, notes, and snippets.

@jgx65
Created December 1, 2015 19:14
Show Gist options
  • Save jgx65/bd5d1fc7f16bd6de8161 to your computer and use it in GitHub Desktop.
Save jgx65/bd5d1fc7f16bd6de8161 to your computer and use it in GitHub Desktop.
Vignette for population differentiation on SNP data
---
title: "Calculating genetic differentiation and clustering methods from SNP data"
output: pdf_document
---
Introduction
========================================================
In this vignette, we will discuss how to assess population genetic structure
from SNP data at population level. We will estimate $F_{st}$ per population,
Pairwise $F_{st}$, AMOVA (Hierarchical $F_{st}$). We will finally assess the
genetic structure at individual level assuming that we do not know populations
using a multivariate analysis.
The dataset used for those analysis concerns the plant: lodgepole pine (*Pinus
contorta*, *Pinaceae*). You can have more information on this data set and the
species on the web site of A. Eckert: (http://eckertdata.blogspot.fr/). But
here the dataset is used as a test dataset with no idea of interpreting the
results in a biological way. We will work on a subset of the dataset to make
the calculations faster.
Resources/Packages
========================================================
```{r,packages, message=FALSE}
library(adegenet)
library(hierfstat)
```
Workflow
========================================================
### Import data
The data are stored in a text file (genotype=AA..). We will import the dataset
in R as a data frame, and then convert the SNP data file into
[genind](http://www.inside-r.org/packages/cran/adegenet/docs/.valid.genind)
objects.
Dataset "Master_Pinus_data_genotype.txt" can be downloaded
[here](https://github.com/NESCent/popgenInfo/tree/master/data/Master_Pinus_data_genotype.txt).
The text file is a matrix of (550 rows x 3086 columns). It contains 4 extra
columns: first column is the label of the individuals, the three other are
description of the region, all the other columns are for the genotypes as (AA or
AT...).
When you import the data, you need to be in the right directory.
```{r, data_import_df_show,eval=FALSE}
Mydata <- read.table("Master_Pinus_data_genotype.txt", header = T)
```
```{r, data_import_df_do, echo = FALSE}
# You should be in the right directory
Mydata <- read.table("../data/Master_Pinus_data_genotype.txt", header = T)
```
```{r, data_manipulate}
ind <- as.character(Mydata$tree_id) #use later with adegent (individual labels)
population <- as.character(Mydata$state) #use later with adegenet (population labels)
county <- Mydata$county
dim(Mydata) #550 individuals x 3082 SNPs
```
Data conversion
To convert MydataR to genind objects (adegent), the object locus needs to
contain only genotypes. We decrease the number of snp to make the calculations
faster and keep only 20 SNPs in the object locus. We convert Mydata1 in a fstat
object (Mydata2)
```{r,data_conversion}
locus <- Mydata[, 5:24] # simpler
(Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind, pop = population,sep=""))
Mydata2 <- genind2hierfstat(Mydata1)
```
## $F_{st}$ (observed and expected heterozygosity) (package hierfstat)
```{r,Basicstatiscis_with Fst}
basic.stats(Mydata1) # Fst following Nei (1987) on genind object
basic.stats(Mydata2) # same as previous, on hierfstat object
wc(Mydata1) # Weir and Cockrham estimate
```
### Hierarchical $F_{st}$ tests (=AMOVA for SNP dataset)
The function `varcomp.glob` produces a Hierarchical Fst (=AMOVA for SNPs or
biallelic markers) It is possible to make permutations on the different levels:
The function `test.g` tests the effect of the population on genetic
differentiation. Individuals are randomly permuted among states. The states
influence genetic differentiation at 5% level. With the function
`test.between`, the counties are permuted among states. The states influence
significantly genetic structuring.
```{r,Hierarchical_Fst}
loci <- Mydata2[, -1] # Remove the population column
varcomp.glob(levels = data.frame(population, county), loci, diploid = TRUE)
test.g(loci, level = population)
test.between(loci, test.lev = population, rand.unit = county, nperm = 100)
```
### Pairwise $F_{st}$ (package hierfstat)
```{r,pairewise_Fst}
genet.dist(Mydata1,method="WC84")
# No test at the moment
```
## Unsupervised clustering
We don't know the populations and we are looking for. As recommended by T.
Jombart, with the function `gpr` we used the maximum possible number of PCA
axis which is 20 here. See detailed tutorial on this method for more information
(http://adegenet.r-forge.r-project.org/) In this example, we used
`choose.n.clust = FALSE` but it is nice to use the option `TRUE` and then you
will be able to choose the number of clusters.
```{r,clustering_without_a_priori}
# using Kmeans and DIAPC in adegenet
grp <- find.clusters(Mydata1, max.n.clust = 10, n.pca = 20, choose.n.clust = FALSE)
names(grp)
grp$grp
```
The K means procedure detected 5 groups. We will use this number of group in
the discriminant analysis (function `dapc`). On your own dataset, you need to
spend more time to estimate the number of cluster.
```{r,description_of_clusters}
dapc1 <- dapc(Mydata1, grp$grp, n.pca = 20, n.da = 6)
scatter(dapc1) # plot of the group
```
Five groups were finally detected and represented by the discriminant analysis.
The 11 states are not the good level for population analysis. But we need to
remember that we used only 20 SNPs to conduct the analysis.
Conclusions
========================================================
### What did we learn today?
In this vignette, we learned how to calculate $F_{st}$ in existing populations
and to investigate the effect of population structure on genetic differentiation
from hierarchical $F_{st}$ analysis (like AMOVA in the case of SNP). We also run
a multivariate analysis to investigate the genetic structure of the data at
individual level assuming no population.
### What is next?
You may now want to move to the estimation of genetic distances (See vignette
DistancesSNP, coming soon)
# Contributors
- Stephanie Manel
Reference
========================================================
Eckert, A. J., A. D. Bower, S. C. Gonzalez-Martinez, J. L. Wegrzyn, G. Coop and D. B. Neale. 2010. Back to nature: Ecological genomics of loblolly pine (Pinus taeda, Pinaceae). Molecular Ecology 19: 3789-3805.
Thierry de Meus, Jerome Goudet "A step-by-step tutorial to use HierFstat to analyse populations hierarchically structured at multiple levels.", Infect. Genet. Evol., vol. 7, no. 6, 2007
# Session Information
Information about the R session used for this tutorial.
```{r, sessioninfo}
options(width = 100)
devtools::session_info()
```
@jgx65
Copy link
Author

jgx65 commented Dec 1, 2015

@hlapp @smanel This file should run with no error, let me know if it does. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment