jgx65/DifferentiationSNP.Rmd

## DifferentiationSNP.Rmd
---
title: "Calculating genetic differentiation and clustering methods from SNP data"
output: pdf_document
---

Introduction
========================================================

In this vignette, we will discuss how to assess population genetic structure
from SNP data at population level. We will estimate $F_{st}$ per population,
Pairwise $F_{st}$, AMOVA (Hierarchical $F_{st}$). We will finally assess the
genetic structure at individual level assuming that we do not know populations
using a multivariate analysis.


The dataset used for those analysis concerns the plant: lodgepole pine (*Pinus
contorta*, *Pinaceae*). You can have more information on this data set and the
species on the web site of A. Eckert:  (http://eckertdata.blogspot.fr/). But
here the dataset is used as a test dataset with no idea of interpreting the
results in a biological way. We will work on a subset of the dataset to make
the calculations faster.


Resources/Packages
========================================================


```{r,packages, message=FALSE}
library(adegenet)
library(hierfstat)
```


Workflow
========================================================
### Import data

The data are stored in a text file (genotype=AA..). We will  import the dataset
in R as a data frame, and then  convert the SNP data file into
[genind](http://www.inside-r.org/packages/cran/adegenet/docs/.valid.genind)
objects.

Dataset "Master_Pinus_data_genotype.txt" can be downloaded
[here](https://github.com/NESCent/popgenInfo/tree/master/data/Master_Pinus_data_genotype.txt).

The  text file is a matrix of (550 rows x 3086 columns). It contains 4 extra
columns: first column is the label of the individuals, the three other are
description of the region, all the other columns are for the genotypes as (AA or
AT...).

When you import the data, you need to be in the right directory.

```{r, data_import_df_show,eval=FALSE}

Mydata <- read.table("Master_Pinus_data_genotype.txt", header = T)
```
```{r, data_import_df_do, echo = FALSE}
# You should be in the right directory
Mydata <- read.table("../data/Master_Pinus_data_genotype.txt", header = T)
```

```{r, data_manipulate}
ind <- as.character(Mydata$tree_id) #use later with adegent (individual labels)
population <- as.character(Mydata$state) #use later with adegenet (population labels)
county <- Mydata$county
dim(Mydata) #550 individuals x 3082 SNPs
```

Data conversion

To convert MydataR to genind objects (adegent), the object locus needs to
contain only genotypes. We decrease the number of snp to make the calculations
faster and keep only 20 SNPs in the object locus. We convert Mydata1 in a fstat
object (Mydata2)

```{r,data_conversion}

locus <- Mydata[, 5:24] # simpler
(Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind, pop = population,sep=""))
Mydata2 <- genind2hierfstat(Mydata1)


```

## $F_{st}$ (observed and expected heterozygosity)  (package hierfstat)

```{r,Basicstatiscis_with Fst}
basic.stats(Mydata1) # Fst following Nei (1987) on genind object
basic.stats(Mydata2) # same as previous, on hierfstat object
wc(Mydata1) # Weir and Cockrham estimate
```

### Hierarchical $F_{st}$ tests (=AMOVA for SNP dataset)
The function `varcomp.glob` produces a Hierarchical Fst (=AMOVA for SNPs or
biallelic markers) It is possible to make permutations on the different levels:
The function `test.g` tests the effect of the population on genetic
differentiation. Individuals are randomly permuted among states. The states
influence  genetic differentiation at 5% level. With the function
`test.between`, the counties are permuted among states. The states influence
significantly genetic structuring.


```{r,Hierarchical_Fst}
loci <- Mydata2[, -1] # Remove the population column
varcomp.glob(levels = data.frame(population, county), loci, diploid = TRUE)
test.g(loci, level = population)
test.between(loci, test.lev = population, rand.unit = county, nperm = 100)
```


### Pairwise $F_{st}$ (package hierfstat)

```{r,pairewise_Fst}
genet.dist(Mydata1,method="WC84")
# No test at the moment
```


## Unsupervised clustering

We don't know the populations and we are looking for. As recommended by T.
Jombart, with the function `gpr` we used the maximum  possible number of PCA
axis which is 20 here. See detailed tutorial on this method for more information
(http://adegenet.r-forge.r-project.org/) In this example, we used
`choose.n.clust = FALSE` but it is nice to use the option `TRUE` and then you
will be able to choose the number of clusters.

```{r,clustering_without_a_priori}
# using Kmeans and DIAPC in adegenet
grp <- find.clusters(Mydata1, max.n.clust = 10, n.pca = 20, choose.n.clust = FALSE)
names(grp)
grp$grp
```

The K means procedure detected 5 groups. We will use this number of  group in
the discriminant analysis (function `dapc`). On your own dataset, you need to
spend more time to estimate the number of cluster.

```{r,description_of_clusters}
dapc1 <- dapc(Mydata1, grp$grp, n.pca = 20, n.da = 6)
scatter(dapc1) # plot of the group
```

Five groups were finally detected and represented by the discriminant analysis.
The 11 states are not the good level for population analysis. But we need to
remember that we used only 20 SNPs to conduct the analysis.

Conclusions
========================================================

### What did we learn today?

In this vignette, we learned how to calculate $F_{st}$ in existing populations
and to investigate the effect of population structure on genetic differentiation
from hierarchical $F_{st}$ analysis (like AMOVA in the case of SNP). We also run
a multivariate analysis to investigate the genetic structure of the data at
individual level assuming no population.

### What is next?

You may now want to move to the estimation of genetic distances (See vignette
DistancesSNP, coming soon)

# Contributors

- Stephanie Manel

Reference
========================================================

Eckert, A. J., A. D. Bower, S. C. Gonzalez-Martinez, J. L. Wegrzyn, G. Coop and D. B. Neale. 2010. Back to nature: Ecological genomics of loblolly pine (Pinus taeda, Pinaceae). Molecular Ecology 19: 3789-3805.


Thierry de Meus, Jerome Goudet "A step-by-step tutorial to use HierFstat to analyse populations hierarchically structured at multiple levels.", Infect. Genet. Evol., vol. 7, no. 6, 2007


# Session Information

Information about the R session used for this tutorial.

```{r, sessioninfo}
options(width = 100)
devtools::session_info()
```
	---
	title: "Calculating genetic differentiation and clustering methods from SNP data"
	output: pdf_document
	---

	Introduction
	========================================================

	In this vignette, we will discuss how to assess population genetic structure
	from SNP data at population level. We will estimate $F_{st}$ per population,
	Pairwise $F_{st}$, AMOVA (Hierarchical $F_{st}$). We will finally assess the
	genetic structure at individual level assuming that we do not know populations
	using a multivariate analysis.


	The dataset used for those analysis concerns the plant: lodgepole pine (*Pinus
	contorta, Pinaceae*). You can have more information on this data set and the
	species on the web site of A. Eckert: (http://eckertdata.blogspot.fr/). But
	here the dataset is used as a test dataset with no idea of interpreting the
	results in a biological way. We will work on a subset of the dataset to make
	the calculations faster.



	Resources/Packages
	========================================================



	```{r,packages, message=FALSE}
	library(adegenet)
	library(hierfstat)
	```


	Workflow
	========================================================
	### Import data

	The data are stored in a text file (genotype=AA..). We will import the dataset
	in R as a data frame, and then convert the SNP data file into
	[genind](http://www.inside-r.org/packages/cran/adegenet/docs/.valid.genind)
	objects.

	Dataset "Master_Pinus_data_genotype.txt" can be downloaded
	[here](https://github.com/NESCent/popgenInfo/tree/master/data/Master_Pinus_data_genotype.txt).

	The text file is a matrix of (550 rows x 3086 columns). It contains 4 extra
	columns: first column is the label of the individuals, the three other are
	description of the region, all the other columns are for the genotypes as (AA or
	AT...).

	When you import the data, you need to be in the right directory.

	```{r, data_import_df_show,eval=FALSE}

	Mydata <- read.table("Master_Pinus_data_genotype.txt", header = T)
	```
	```{r, data_import_df_do, echo = FALSE}
	# You should be in the right directory
	Mydata <- read.table("../data/Master_Pinus_data_genotype.txt", header = T)
	```

	```{r, data_manipulate}
	ind <- as.character(Mydata$tree_id) #use later with adegent (individual labels)
	population <- as.character(Mydata$state) #use later with adegenet (population labels)
	county <- Mydata$county
	dim(Mydata) #550 individuals x 3082 SNPs
	```

	Data conversion

	To convert MydataR to genind objects (adegent), the object locus needs to
	contain only genotypes. We decrease the number of snp to make the calculations
	faster and keep only 20 SNPs in the object locus. We convert Mydata1 in a fstat
	object (Mydata2)

	```{r,data_conversion}

	locus <- Mydata[, 5:24] # simpler
	(Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind, pop = population,sep=""))
	Mydata2 <- genind2hierfstat(Mydata1)


	```

	## $F_{st}$ (observed and expected heterozygosity) (package hierfstat)

	```{r,Basicstatiscis_with Fst}
	basic.stats(Mydata1) # Fst following Nei (1987) on genind object
	basic.stats(Mydata2) # same as previous, on hierfstat object
	wc(Mydata1) # Weir and Cockrham estimate
	```

	### Hierarchical $F_{st}$ tests (=AMOVA for SNP dataset)
	The function `varcomp.glob` produces a Hierarchical Fst (=AMOVA for SNPs or
	biallelic markers) It is possible to make permutations on the different levels:
	The function `test.g` tests the effect of the population on genetic
	differentiation. Individuals are randomly permuted among states. The states
	influence genetic differentiation at 5% level. With the function
	`test.between`, the counties are permuted among states. The states influence
	significantly genetic structuring.


	```{r,Hierarchical_Fst}
	loci <- Mydata2[, -1] # Remove the population column
	varcomp.glob(levels = data.frame(population, county), loci, diploid = TRUE)
	test.g(loci, level = population)
	test.between(loci, test.lev = population, rand.unit = county, nperm = 100)
	```


	### Pairwise $F_{st}$ (package hierfstat)

	```{r,pairewise_Fst}
	genet.dist(Mydata1,method="WC84")
	# No test at the moment
	```


	## Unsupervised clustering

	We don't know the populations and we are looking for. As recommended by T.
	Jombart, with the function `gpr` we used the maximum possible number of PCA
	axis which is 20 here. See detailed tutorial on this method for more information
	(http://adegenet.r-forge.r-project.org/) In this example, we used
	`choose.n.clust = FALSE` but it is nice to use the option `TRUE` and then you
	will be able to choose the number of clusters.

	```{r,clustering_without_a_priori}
	# using Kmeans and DIAPC in adegenet
	grp <- find.clusters(Mydata1, max.n.clust = 10, n.pca = 20, choose.n.clust = FALSE)
	names(grp)
	grp$grp
	```

	The K means procedure detected 5 groups. We will use this number of group in
	the discriminant analysis (function `dapc`). On your own dataset, you need to
	spend more time to estimate the number of cluster.

	```{r,description_of_clusters}
	dapc1 <- dapc(Mydata1, grp$grp, n.pca = 20, n.da = 6)
	scatter(dapc1) # plot of the group
	```

	Five groups were finally detected and represented by the discriminant analysis.
	The 11 states are not the good level for population analysis. But we need to
	remember that we used only 20 SNPs to conduct the analysis.

	Conclusions
	========================================================

	### What did we learn today?

	In this vignette, we learned how to calculate $F_{st}$ in existing populations
	and to investigate the effect of population structure on genetic differentiation
	from hierarchical $F_{st}$ analysis (like AMOVA in the case of SNP). We also run
	a multivariate analysis to investigate the genetic structure of the data at
	individual level assuming no population.

	### What is next?

	You may now want to move to the estimation of genetic distances (See vignette
	DistancesSNP, coming soon)

	# Contributors

	- Stephanie Manel

	Reference
	========================================================

	Eckert, A. J., A. D. Bower, S. C. Gonzalez-Martinez, J. L. Wegrzyn, G. Coop and D. B. Neale. 2010. Back to nature: Ecological genomics of loblolly pine (Pinus taeda, Pinaceae). Molecular Ecology 19: 3789-3805.


	Thierry de Meus, Jerome Goudet "A step-by-step tutorial to use HierFstat to analyse populations hierarchically structured at multiple levels.", Infect. Genet. Evol., vol. 7, no. 6, 2007


	# Session Information

	Information about the R session used for this tutorial.

	```{r, sessioninfo}
	options(width = 100)
	devtools::session_info()
	```