@JDSwenson
Created October 17, 2022 21:02

Stacks Pipeline v2

See description of sstacks in this google group exchange: https://groups.google.com/g/stacks-users/c/qZV9RF_eOp4

See Julian's response on this post about where to get values for number of loci, SNPs, etc from Stacks: https://groups.google.com/g/stacks-users/c/A99aMiqLAyQ/m/gv-2pbiwDgAJ

TO DO

  1. Pick up at (or around) line 453
    • Add script that uses reduced catalog
    • Finish with vcftools script and then direct to SNP filter GitHub
    • Change the numbers of scripts (e.g. 04_run_ustacks.sh -> 03_run_ustacks.sh)
    • Remove the pointless part of the ustacks script that lists files at the top
  2. Adjust individual heterozygosity calculations during parameter optimization (don't need to do a separate stacks run each time; can just use vcf tools or add calculations to heatmap script)
  3. Make script that extracts metrics a separate script during 03_process_data and 04_run_ustacks
  4. Change requested hours for the jobs
  5. Remove r80 files from PCA and Missingness (i.e. so they only run on the full files)

This is a pipeline for processing RAD-Seq data and determining optimal parameters for assembling loci, calling SNPs, and calculating population statistics using the program Stacks. The pipeline assumes we are starting with raw data from the sequencer (i.e. demultiplexed by Illumina barcode).

Note: if at any point a script fails (e.g. calling from the wrong directory, forgetting to activate the atropos environment), it is very important to delete the associated log file before re-running. Otherwise, any script that extracts information from the log file will also extract information for the failed run.
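For example, if 02_process_data.sh failed partway through, you would clear out its log files before resubmitting (the exact file names should match the #BSUB -e and -oo paths at the top of the failed script):

$ rm ./logs/02_process_data.err ./logs/02_process_data.log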

Workspace setup:

First, cd into the directory where you plan to store your Stacks files.

Unless otherwise indicated, this "main directory" - the parent directory to all the folders we're about to create - is where you should be when you run all the scripts in this pipeline. All scripts use relative paths and assume you're running them from the main directory.

#For example
$ mkdir example_dir
$ cd example_dir #This is your "main directory" for the workflow

Step 1: Install the following programs

i. Atropos (for trimming)

#load anaconda
$ module load anaconda3/2019.03

#create new environment for atropos and dependencies
$ conda create -n atropos

#prerequisite to activating the environment for the first time
$ conda init bash

#activate conda environment with atropos
$ conda activate atropos

#install atropos and dependencies
$ conda install -c bioconda atropos

ii. clustOpt (for parameter optimization)

# Download scripts from McCartney‐Melstad Github into your working directory
$ git clone https://github.com/atcg/clustOpt.git

1. Open the missingnessHeatMaps.R script (located in clustOpt/resources) in your favorite text editor program (e.g. nano, emacs, atom) and edit according to the instructions below.

1a) Edit package loading

#!/usr/bin/env Rscript

#suppressPackageStartupMessages(library(SNPRelate))
#library(pheatmap)
#library(grImport2)
#suppressPackageStartupMessages(library(dendextend))

# Modified loading
library(pheatmap, lib="/project/uma_lisa_komoroske/bin")
library(grImport2, lib="/project/uma_lisa_komoroske/bin")
library(gdsfmt,lib="/project/uma_lisa_komoroske/bin")
library(SNPRelate,lib="/project/uma_lisa_komoroske/bin")
library(dendextend,lib="/project/uma_lisa_komoroske/bin")

1b) Turn on heat map row and column names

  • Find the call to pheatmap() and set the "show_rownames" and "show_colnames" flags to T (example below):
makeVectorHeatmap <- function(missingnessMatrix, dendrogramz, name) {
    # Make the heat map SVG
    heatmapFile <- paste(name, ".heatmap.svg", sep = "")
    svg(heatmapFile, height=4, width=4.44)
    heatmapReordered <- missingnessMatrix[labels(dendrogramz), labels(dendrogramz)]
    
    pheatmap(heatmapReordered, breaks=mat_breaks, cluster_rows=F, cluster_cols=F, show_rownames=T, show_colnames=T, border_color = F)
    
    dev.off()

1c) Save the script and exit nano (or whatever text editor you're using)

2. Open the script ./clustOpt/vcfToPCAvarExplained.R and edit package loading to call the packages from our local bin:

#!/usr/bin/env Rscript

# Load SNPRelate
#suppressPackageStartupMessages(library(SNPRelate))

# Modified loading
library(gdsfmt,lib="/project/uma_lisa_komoroske/bin")
library(SNPRelate,lib="/project/uma_lisa_komoroske/bin")

3. Move the clustOpt directory into the metrics folder: mv ./clustOpt ./metrics

Step 2: Save scripts and set up the file structure

i. Make a directory to store your scripts using mkdir scripts

ii. Copy the BASH scripts at the end of this page through metrics4Rmarkdown.sh into separate text files in the folder scripts. For ease, you may want to give them the same names (e.g. 01_mkdirs.sh, 02_prelim_QC.sh, etc.). No need to copy the R scripts in the section after.

iii. From the main directory, run bash ./scripts/01_mkdirs.sh. This will set up the directory structure; you should see the following folders:

00_RAW_data 01_trimmed_RAW_data 02_process_out 03_clone_filter_out 04_ustacks_out 05_parameter_opt 06_denovo_map_out 07_FINAL_populations_out logs metrics scripts

There will also be folders within 05_parameter_opt for each parameter value (1-9), and subfolders for each value to store r80 output files (see parameter optimization section).
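For example, you can quickly confirm the parameter-optimization subfolders were created (listing truncated here for brevity):

$ ls ./05_parameter_opt/
clone_filter_test  logs  m3_M1_n1  m3_M2_n2  ...  m3_M9_n9  vcf_files
$ ls ./05_parameter_opt/m3_M1_n1/
ind_het  r80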

Step 3: Create barcode file and link to raw data

i. In the main directory, create a tab-delimited barcode file that includes the barcodes and sample names for every sample run in the lane:

[barcode]<tab>[sample1_name]
[barcode]<tab>[sample2_name]
[barcode]<tab>[sample3_name]
etc ...

When Stacks demultiplexes the samples, it will name each according to the names given in this barcode file, appending a .1 or .2 to the end depending on whether it is the read with the barcode/cut site (1) or the paired read (2). (Note, in the bestrad protocol, the barcode and cut site could be on either read.)
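For example, a barcode file entry for a hypothetical sample named sample1_name would yield the following demultiplexed files in 02_process_out (plus the corresponding .rem. "remainder read" files):

02_process_out/sample1_name.1.fq.gz
02_process_out/sample1_name.2.fq.gz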

ii. Link to the raw sequencing data in the 00_RAW_data folder. You must use the absolute path to the raw reads. Assuming you are working from your main directory within the lab project folder, you can use this as an example (substituting your name/folder for the brackets): $ ln -s /project/uma_lisa_komoroske/[exact location_of_raw_reads]/*fq.gz 00_RAW_data/

I: Data QC and cleaning

Before we can proceed with data cleaning and analysis, it is important to first understand our data so that we can prepare it for Stacks. Specifically, the first step in the Stacks pipeline expects that the barcodes will occupy the first 6-8 bases of the sequence, followed immediately by the cut site. If either the barcode or cut site is positioned differently in a read, that read will be dropped.
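A quick way to spot-check this on a lane before demultiplexing (adapted from the 02_prelim_QC.sh script at the end of this page; the lane file name is a placeholder, and positions 11-16 assume an 8-base barcode preceded by the bestrad GG described below) is to count the most common 6-mers at the expected cut-site positions in a subset of reads:

$ zcat ./00_RAW_data/lane1_1.fq.gz | head -100000 | awk '/^@/{getline; print}' | cut -c 11-16 | sort | uniq -c | sort -rn | head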

A quick note about bestrad: The Best RAD protocol (see Ali et al. 2016) does two things that can cause problems if not accounted for:

  1. Potentially transposes the reads. The cut site and barcode can be located on either the forward or reverse read. Stacks will handle this and transpose the reads if we feed it the --bestrad flag. In terms of preliminary QC, it means that we need to check both reads for cut sites and/or barcodes.
  2. Adds a GG to the beginning of each read. This places the barcodes in positions other than where Stacks expects and will result in many (or most) reads being unnecessarily dropped during demultiplexing. We can rectify this by trimming the first two bases from every read to position the barcode at the beginning of the read where Stacks expects it (see the example below).
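For reference, the trimming step in the process_data script below handles this with atropos; run on a single hypothetical lane file, the command looks like:

$ atropos trim --cut 2 -se ./00_RAW_data/lane1_1.fq.gz -o ./01_trimmed_RAW_data/lane1_1.fq.gz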

I.1. Trim, demultiplex, clean, and filter data

We can demultiplex, clean, and check for the cut site on our reads using the program process_radtags in Stacks. If we enable the --bestrad flag, the program will check both the forward and reverse reads for the cut site and barcode and transpose if necessary (see this question on the Stacks Google Group about dealing with barcodes and this question about the extra GG at the beginning of each read).

Then, we can filter for PCR clones using the program clone_filter, which is included with Stacks. PCR duplicates are unwanted artifacts from library prep. Among other issues, PCR duplicates can inflate metrics related to coverage, which may lead to overconfidence in SNP calling.

Before running:

  • Specify the barcode file at the top of the script 02_process_data.sh
  • Activate the atropos environment with:
$ module load anaconda3/2019.03
$ conda activate atropos

RUN SCRIPT $ bsub < ./scripts/02_process_data.sh (submits to the long queue)

This script will:
  i. Run atropos to trim the first two bases from every read
  ii. Run process_radtags to demultiplex and clean the data
  iii. Run clone_filter to filter PCR clones
  iv. Produce a summary of metrics

Assumes:

  • DNA was cut with sbfI
    • If it wasn't, change the -e flag in process_radtags
  • Best RAD protocol was used to generate libraries
    • If it wasn't, remove the --bestrad flag from process_radtags
  • Barcodes are positioned inline on just one read (inline_null)
    • If they weren't, change the --inline_null flag in process_radtags. See the process_radtags documentation for other barcode options
  • Files are gzipped
    • If they aren't, change or remove the -i flag from clone_filter. See clone_filter documentation for more info

Expected output

  1. Filtered, cleaned, demultiplexed reads, in 03_clone_filter_out
  2. Metrics from sample filtering in metrics/filter_metrics.csv
  3. Lane metrics in metrics/lane_metrics.csv

To see the number of reads dropped due to missing barcode, missing cut site, or low quality, run: $ head -20 logs/02_process_data.err

II: Selecting a Subset of Data for Parameter Optimization

The primary parameters that govern the formation of 'stacks' of loci in Stacks must be optimized for each species. These parameters are:
  • -m: minimum coverage (identical reads) required to form a stack (allele) in each individual
  • -M: Number of mismatches allowed between stacks (alleles) when collapsing into loci in each individual
  • -n: Number of mismatches allowed between loci among individuals when constructing the catalog
Depending on factors such as natural heterozygosity, genome duplications, etc., the appropriate threshold for these parameters will vary among species. To empirically determine this threshold, we run Stacks on a representative subset of individuals using various settings of parameters M and n, and then compare the output following McCartney‐Melstad et al. (2019). We hold m at its default value of 3, as this value works well with a variety of taxa (Paris et al. 2017).

For more information about parameter optimization, see the papers linked above as well as Rochette & Catchen 2017.

3. Run ustacks on full dataset to determine coverage for each sample

Running Stacks on the full dataset with every parameter setting would be very computationally intensive. Therefore, we want to identify a subset of individuals that we can use for parameter optimization. According to Rochette & Catchen (2017), this subset should a) be representative of the hypothesized population structure, and b) have sufficient coverage for high confidence SNP calls (>25x). We can generate metrics of coverage following data cleaning and PCR duplicate removal with ustacks.

ustacks is the first component of the Stacks pipeline. It uses single-end reads (paired-end reads come in later) to identify putative alleles within individuals (governed by the -m parameter) and then collapses these alleles into putative loci (based on the -M parameter), again within individuals. In the process, ustacks provides metrics of coverage for each sample. It is this coverage metric that we are most interested in right now, as it will help us identify an appropriate subset of data for parameter optimization.
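As a minimal illustration (the full loop over all samples is in 04_run_ustacks.sh at the end of this page; the sample name here is just an example, and -m 3 / -M 2 are the Stacks defaults), a single-sample ustacks call looks like:

$ module load stacks/2.3d
$ ustacks -f ./03_clone_filter_out/Sbf_Rbo-00027_bead.1.fq.gz -i 1 -m 3 -M 2 -o ./04_ustacks_out -p 8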

Before running:

If the demultiplexed/filtered samples include multiple species (e.g. if samples from multiple projects were sequenced together), make sure to move all the samples that are not of immediate interest out of the 03_clone_filter_out folder. From this point forward in the protocol, you should only be working with samples from a single project that can be analyzed together.

RUN SCRIPT $ bsub < ./scripts/03_run_ustacks.sh

Expected output

  1. Output files from ustacks will be in the folder 04_ustacks_out
  2. File with coverage metrics for all samples: metrics/coverage_metrics.csv

Optional step If you'd like to automatically link to the samples with the highest coverage for parameter optimization (ignoring population), then save the script below and run it from the main directory with

$ bash [script].sh

If you have multiple populations represented in your dataset (which will usually be the case), ignore this step.

#!/bin/bash

# Number of samples you'd like to link to
num_samps=

### LINK TO FILES WITH HIGHEST COVERAGE 

### Sort top samples by coverage, as above & store in hs.txt
sort -nrk3 -t"," ./metrics/coverage_metrics.csv | head -$num_samps | cut -d"," -f1 > hs.txt

### Make directory a variable
dir=$(pwd)
dir2="/03_clone_filter_out/"

### Create links
while IFS= read -r line
do
ln -s $dir$dir2$line* ./05_parameter_opt/clone_filter_test/
done < hs.txt

When run, this script will write the names of the samples with the highest coverage to hs.txt and create links to those files in 05_parameter_opt/clone_filter_test. There is another script to emulate this step for multiple populations in the next section.


4. Examine metrics to determine appropriate subset of data for parameter optimization

1. Transfer all of the files in the metrics folder to your local computer using a program like Filezilla or WinSCP.

2. Import the files filter_metrics.csv and coverage_metrics.csv into R and merge the dataframes by the sample_names column. Here's an example script (it assumes you're in the same directory as the files):

coverage <- read.csv("coverage_metrics.csv", header=TRUE)
filter <- read.csv("filter_metrics.csv", header=TRUE)
sample_metrics <- merge(filter, coverage, by="sample_names")

3. Examine the filter and coverage metrics. Ideas of useful ways to examine the data include:

a) Add a column for population and examine coverage by population with a simple box plot. If you have a column named population, then you can create this plot with

library(ggplot2)
ggplot(data = sample_metrics, aes(x=factor(population), y=(mean_coverage))) +
  geom_boxplot()

b) Examine the number of reads that are lost when assembling initial stacks (governed by the m parameter in ustacks) by subtracting the column reads_retained_by_ustacks from the column output_reads_from_clone_filter.

c) Look at the mean and range of the percent_clones column. If there are any potential lab factors that could have contributed to variability in this metric, consider adding a column for this factor and plotting percent clones against it.

4. Once you've generated useful plots and/or summary statistics from your data, consult with Lisa (or your PI) to determine an appropriate subset of data for parameter optimization.

The first chunk of the parameter optimization R markdown below contains code to visualize coverage and PCR duplicates by population.

In general, we want to select samples with high coverage that are representative of our overall dataset. For population structure, this would mean including at least a couple samples from each population. A good number of overall samples to shoot for is 12. You can include as many as you'd like, but run time will suffer (and parameter optimization is the most computationally intensive part of the pipeline, so it'll take a while).
For more information about selecting samples for parameter optimization, see Rochette & Catchen 2017 and Paris et al. 2017.

III: Parameter optimization (Part 1)

Having identified an appropriate subset of data on which to test Stacks with different parameter settings, now we want to run denovo_map.pl multiple times on the subset, varying M and n. denovo_map.pl combines the various components of the Stacks pipeline, making it easy to run in one go (see denovo_map documentation for more info).

The components comprising denovo_map.pl are:

  • ustacks: assembles alleles into loci within each sample using single-end data
  • cstacks: creates a catalog of loci across the population
  • sstacks: matches samples back against the catalog
  • tsv2bam: transposes data so they're organized by loci rather than sample
  • gstacks: assembles paired-end contigs, calls SNPs/haplotypes and genotypes each sample
  • populations: calculates population statistics and exports data in a variety of formats
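For example, a single parameter-optimization run at M = n = 2 (mirroring the loop in the parameter test script at the end of this page) looks like:

$ module load stacks/2.3d
$ denovo_map.pl -T 15 -m 3 -M 2 -n 2 -o ./05_parameter_opt/m3_M2_n2 --popmap ./popmap.txt --samples ./05_parameter_opt/clone_filter_test --paired -X "populations: --vcf"

The -X flag passes options through to individual pipeline components - here, telling populations to also write a vcf file.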

A bit more about parameter optimization ... When alleles are clustered into individual loci (ustacks), and again when individual loci are merged into catalog loci (cstacks), there is a risk that paralogs will be incorrectly merged. This 'undersplitting' of loci can artificially inflate heterozygosity at a locus, leading to undesirable outcomes such as high numbers of loci being filtered out of the dataset (if the --max_obs_het flag is set) or contributing to erroneous population statistic estimates. Thus, it is important to optimize the parameters that control locus assembly (M & n) so the resulting data are robust and reliable.

Importantly, in Stacks, filtering stringency increases with lower numbers: a value of M=n=1 means only one mismatch is allowed before loci are considered distinct and split, while M=n=9 means sequences with as many as 9 mismatches can still be merged together into a locus. This is the opposite of other pipelines like PyRAD, where lower numbers mean less stringency. Thus, the danger of setting M=n too low in Stacks is that loci will be oversplit and we will lose power in our analysis (since natural variation at a locus will cause the alleles to be split into separate loci for analysis), and the danger of setting M=n too high is that paralogs will be merged into one locus and fixed bases in paralogous loci will be called as SNPs. Which values are appropriate for these parameters varies among species and populations; therefore, we must empirically determine the optimal parameter values for each dataset.

4. Run Stacks on subset of data using different parameter settings

Before running:

  • If you automatically linked to files with the highest coverage, then skip ahead to RUN SCRIPT.
  • If you did NOT automatically link to the files with the highest coverage in the step above (most common), then you need to specify the subset of samples that will be used for parameter optimization, and follow steps i and ii below.

i. Write the prefixes of the files you'd like to use for parameter optimization (the part of the file name before .1 or .2) in one column of a text file named hs.txt and save this in the main directory.

example hs.txt

Sbf_Rbo-00027_bead
Sbf_Rbo-00038_bead
Sbf_Rbo-00025_bead
Sbf_Rbo-00045_bead
Sbf_Rbo-00042_bead
Sbf_Rbo-00043_bead
Sbf_Rbo-00030_salt
Sbf_Rbo-00029_salt

ii. Save the following script and run from the main directory with bash [script].sh.

#!/bin/bash

### Make directory a variable
dir=$(pwd)
dir2="/03_clone_filter_out/"

### Create links
while IFS= read -r line
do
ln -s $dir$dir2$line* ./05_parameter_opt/clone_filter_test/
done < hs.txt

This will create symbolic links to the filtered/cleaned subset of samples in the folder 05_parameter_opt/clone_filter_test, which are called in the next script.

Alternatively, you can manually link to or copy the files. The point is that we want the files for our subset of samples from the 03_clone_filter_out folder to also be in the folder 05_parameter_opt/clone_filter_test.

RUN SCRIPT $ bsub < ./scripts/04_parameter_test.sh

If you hit an error before this script finishes running, then you'll need to delete the partial files that were output by Stacks before running again. This can be done easily with the following lines of code (it will complain that it can't delete directories, but that's good - we don't want the directories to be deleted!):

rm 05_parameter_opt/m3*/*
rm 05_parameter_opt/logs/*
rm 05_parameter_opt/vcf_files/*
rm metrics/m*.csv
rm ./logs/05_parameter_test.*

Expected output

  • The folder ./metrics will be filled with .csv files containing sample and population metrics for each parameter setting, as well as .het files specifying individual heterozygosity.
  • The folder 05_parameter_opt/vcf_files will be filled with .vcf files for each parameter setting.

5. Visualize data to determine optimal parameters for the dataset

Having run Stacks on a subset of our data with different parameter settings guiding locus formation, we will now use a combination of the methods outlined in McCartney‐Melstad et al (2019) and Paris et. al. (2017) to determine which parameters are optimal for correctly assembling loci in our species. This includes examining the following metrics:

  • Number of total SNPs
  • Number of r80 loci
  • Individual heterozygosity
  • Cumulative variance in PCAs
  • Data missingness vs pairwise genetic similarity

RUN SCRIPT $ bsub < ./scripts/metrics4Rmarkdown.sh

Expected Output (all in the metrics folder):
  • clustOpt (directory)
  • vcflist.txt
  • PCAplot[2-9].txt
  • PCAvarExplained_[2-9].txt

The next several steps must be carried out using R. The easiest way to do this is to move the needed files to your personal computer and run the Stacks_parameter_optimization.Rmd on your local PC or Mac with RStudio.

Transfer all the files from the metrics folder into your local working directory. These will be referenced in the section below.
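If you prefer the command line to Filezilla/WinSCP, an scp call run from your local machine will also work (username, host, and project path here are placeholders):

$ scp -r username@cluster.example.edu:/project/uma_lisa_komoroske/[your_main_directory]/metrics ./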

Metric 1: Total number of SNPs

Files needed: ./metrics/m3_M{x}_n{x}_population_metrics.csv

Expectation: When the clustering threshold is too high (i.e. low M=n value), loci will be oversplit and we expect to find fewer SNPs. When the clustering threshold is too low (i.e. high M=n value), we expect that paralogs will be incorrectly collapsed into loci. This inflates heterozygosity at the locus and may lead to intrinsic filtering of paralogs down the line. This led McCartney-Melstad et al. (2019) to deduce that the number of SNPs will rise as loci become less oversplit, and then fall once they start to be undersplit, making the optimal parameter value the one that gives the highest number of SNPs. However, in Stacks, the max_obs_het filter must be manually set for Stacks to filter loci based on inflated heterozygosity levels (see Julian Catchen's response to a Google question for more info). We will graph SNPs along with the r80 loci below.

Metric 2: r80 loci

Files needed: ./metrics/m3_M{x}_n{x}_population_metrics_r80.csv

Expectation: Paris et al. (2017) explain that broadly shared polymorphic loci are unlikely to be derived from paralogous or repetitive sequence. Therefore, selecting loci for downstream analysis that are found in 80% of samples likely represents an efficient method of filtering the dataset for informative (and real!) variable regions.

RUN SCRIPT: Chunks corresponding to Metric 1 and Metric 2 in Stacks_parameter_optimization R markdown file

Expected output:

  • The output from this chunk will plot total SNPs and r80 loci (i.e. loci found in 80% of individuals) by parameter value, similar to the graph below.

For total SNPs, also see McCartney-Melstad (2019) - Figure 3b, but note that they have downstream heterozygosity filters so the number of SNPs drops after hitting a pinnacle. This may not occur in Stacks unless the --max_obs_het flag is set in the populations program.

For r80 SNPs, also see Paris et. al. (2017) - Figure 3b.

Optimizing the parameter We expect that the optimal clustering thresholds for a dataset will be at or near the maximum number of total SNPs and r80 SNPs. Therefore, in the example graph above, we might infer that a value of 2 or above would be optimal for M and n.

Metric 3: Observed heterozygosity

Files needed: ./metrics/m3_M{x}_n{x}.het

Expectation: Similar to the total number of SNPs, we expect that heterozygosity will be lowest when clustering thresholds are most stringent (M=n=1). As the threshold is relaxed, we expect heterozygosity to increase. However, assuming we set the max_obs_het filter, we expect heterozygosity to drop again when paralogous loci are collapsed together and thus have very high heterozygosity (see Julian Catchen's explanation of the filters in the populations program). If we do not set --max_obs_het or another similar flag, then we may expect to see an asymptote instead of a drop, since these loci won't be filtered.

RUN SCRIPT: Chunk corresponding to Metric 3: Individual heterozygosity in Stacks_parameter_optimization R markdown file

Expected output: The output from this chunk will produce boxplots showing the heterozygosity across samples by parameter value.

Also see McCartney-Melstad (2019) - Figure 3a.

Optimizing the parameter Because both oversplitting and undersplitting will result in lower levels of heterozygosity (assuming the --max_obs_het filter is set), the optimal parameter value should be near the maximum value. The above graph, for instance, suggests that a value of 8 may be appropriate (note this graph comes from a different species (GD) than the graphs used for the previous two metrics (Rbo)).

Metric 4: Cumulative variance in PCAs

Files needed: PCAplot[2-9].txt (output from the metrics4Rmarkdown.sh script)

Expectation: We would expect the fraction of the summed variance explained by the first few principal components to increase up until loci are oversplit. At this point, the proportion of variance explained by those principal components should decrease.

RUN SCRIPT: R script for Metric 4: Cumulative variance in PCAs

Expected output: The output from this chunk will produce scatter plots showing the variance explained by the first 2-5, and then 6-9 PCs.

Optimizing the parameter In the above graphs, the first 6 PCs suggest a value of M|n=2 is optimal for the dataset. More PCs suggest a value of M|n=1, but this may be due to a large number of loci only being present in one individual (i.e. missing data). We do not filter for missing data during parameter optimization, so be careful with results that suggest M|n=1 unless you have reason to suspect that there is little genetic variation in your study group. Also see McCartney-Melstad (2019) - Figure 3c.

Metric 5: Data missingness vs. pairwise genetic similarity

Files needed: vcflist.txt. Files corresponding to this metric are produced by the metrics4Rmarkdown.sh script and should be in your metrics folder; they are the pdfs that end in .heatmap.

Expectation: We expect that the optimal value of M|n will be the value that minimizes the correlation between data missingness and pairwise genetic similarity i.e. the heatmap that has the most blue.

Expected output: This will produce heat maps that show pairwise data missingness where rows/columns represent individual samples. The dendrograms represent samples clustered by genetic similarity (SNP identity by state). For this example, we are evaluating data missingness at values of 4 (A) and 6 (B). The colors indicate missingness fractions as shown in the legend, ranging from 0.2 (20% missing data) to 0.5 (50% missing data).

A) m3_M4_n4

B) m3_M6_n6

Also see McCartney-Melstad (2019) - Figure 4a and 4b
Optimizing the parameter

From our example, we find that a value of 4 results in overall lower data missingness (average 30%), compared to the value of 6 (average 35%). When examining this parameter, look at the heat maps for all values and find the one that has the most blue and least red.

This metric can also help us recognize when an individual (or two) has a lot of missing data relative to the others. Individuals with high degrees of missing data should not be included in parameter optimization, as we've found that one sample with lots of missingness can have a major influence on the optimal parameter values.

In the case shown above, we can see that individuals Sbf_Gd19_bead and Sbf_Gd6_bead have high levels of missing data relative to the others. As such, it may be wise to back up and re-run the parameter optimization script, replacing these individuals with two others that have less missing data. We likely will not know this a priori so we may have to run different subsets of samples through parameter optimization until we find a combination where all individuals have equivalent amounts of missing data.

#JDS ended here (07/13/2021): pick up with reduced catalog script, then SNP filter

IV. Parameter optimization (Part 2)

Taking the above steps to optimize parameters M and n should help minimize the number of paralogs that are combined into loci, but some may still slip through. Fortunately, there are methods for identifying and filtering these charlatan loci based on 1) deviations from expected proportions of heterozygous individuals in a population, and 2) allelic read ratios (see section 7 for more details). Before we can filter, however, we need to figure out what values for heterozygosity and read ratios are appropriate for our system - we don't want to accidentally throw away the honest loci along with the imposters!

6. Re-run Stacks on the full dataset using optimal parameter settings for M and n

Now that we've optimized parameter values and done our best to ensure that Stacks produces real SNPs from real loci, let's get into the weeds a bit more about what Stacks is actually doing. As mentioned above, there are several distinct steps involved in forming loci and calling SNPs:

ustacks is the first program run in the de novo pipeline. The first key parameter given to ustacks is m, which tells Stacks how many identical reads are required to form an allele. Reads that are not included in the formation of primary stacks are set aside and incorporated later as "secondary reads". At this step, ustacks filters repetitive regions by identifying stacks with coverage more than 2 standard deviations above the mean and discarding them, along with stacks that are one nucleotide apart from these very high coverage stacks.

Once the primary stacks have been formed (according to the m filter) and merged (according to the M filter), secondary reads that did not meet the criterion of being identical to those used during the formation of primary stacks are brought back into the picture. Secondary reads are matched against the primary stacks with a mismatch rate governed by the N parameter, which is not a parameter we use during optimization; rather, we keep it at its default setting of M + 2. After the secondary reads are incorporated, ustacks calls SNPs by examining one nucleotide at a time using a maximum likelihood framework.

gstacks starts by assembling a consensus sequence at each locus using de Bruijn graphs and the sequences for each locus from each individual. The sequence contigs that maximize the cross-sample coverage at each locus are chosen as the reference sequence. These consensus sequences are output in the file catalog.fa.

SNPs are called using a Bayesian genotype caller. This method leverages information from all individuals in the metapopulation to construct a prior that is fed into the genotyping model to call SNPs for each individual separately.

Before running:

  • Specify the optimal values for M and n directly in the script 06_run_denovo_map.sh
  • Create a population map
    • The population map is a two-column tab-delimited population map for Stacks. Column one should have the sample prefix (i.e. the part of the sample name before .1.fq) and column 2 should specify the population to which the sample putatively belongs (see example below)
example popmap_full.txt
#Note the population can be specified as a number, character string, or combination.
Sbf_Rbo-00027_bead<tab>1
Sbf_Rbo-00038_bead     1
Sbf_Rbo-00025_bead     1
Sbf_Rbo-00045_bead     Chesapeake
Sbf_Rbo-00042_bead     Chesapeake
Sbf_Rbo-00043_bead     3
Sbf_Rbo-00030_salt     3
Sbf_Rbo-00029_salt     3
See section 4.3.2 of the Stacks manual for more about population maps.

If you have loads of samples and don't want to manually type all the names into the population map, then use the below script to grab the prefixes and store them in a text file in the first column.

$ ls 03_clone_filter_out/ | awk -F "." '{print $1}' | uniq > popmap_full.txt
  • Manually go through popmap_full.txt and enter the population in the second column
  • Note: popmap_full.txt is different than popmap.txt, as the latter includes just a subset of data for the parameter optimization step. Make sure that you are using the former here.
  • Now, specify the population map file directly in the script 06_run_denovo_map.sh (the popmap variable at the top). The script's default is pop_map.txt, so if you named your file popmap_full.txt as in the above code, update the variable accordingly.

RUN SCRIPT $ bsub < ./scripts/06_run_denovo_map.sh

Expected output
  1. This script runs the main Stacks pipeline and there will be a lot of files at the end corresponding to the different pipeline components output to the folder 06_denovo_map_out. These files will be used as input to the populations program in step 8.
  2. For our purposes here, the most important output file is populations.snps.vcf which will also be in the 06_denovo_map_out folder. It is this file that we use as input to HDplot below.

7. Filter data for paralogs

To determine the values we will use for paralog filtering and then conduct the filtering, we will use HDplot. HDplot was originally proposed by McKinney et al. (2017), but we will use a version that was adapted by Edgardo M. Ortiz and redubbed paralog-finder.
HDplot/paralog-finder identifies putative paralogs using two metrics: 1) expected proportion of heterozygous individuals within a population (H), and 2) within heterozygotes, the read ratio deviation (D) for each allele. These metrics are calculated for each locus; then we plot the results, visually inspect the graphs, and determine the values that are appropriate for our system. We will then create a "blacklist" of putative paralogs based on this information that will be excluded from our downstream analyses.
This process occurs in three steps:

Step 1: RUN SCRIPT $ bsub < ./scripts/07_run_HDplot.sh

Expected output: (both in main directory)
  • The image file graph1_D_vs_Het.png.
  • A file with calculated read depth ratios at each locus, called population.depthsBias

Step 2: Transfer the image file to your home computer and open it

It should look something like this:

Interpreting the graph

The main region of interest in this image is the point cloud. The marginal box plots along the sides help us understand the distribution of the data (since point clouds can be deceiving). There do not presently exist clear recommendations for setting values of D and H to accurately eliminate the maximum number of paralogs, so it is wise to be conservative -- hopefully with so many different steps for minimizing and filtering paralogs, we will capture the majority.
Here, we can use D to eliminate loci that are outliers in terms of the read ratio deviation. For instance, looking at the box plot on the right of the above graph, we can see that 95% of heterozygous loci have a D of |4| or less. Thus, we would pick limits for D between -4 and 4 to retain 95% of loci and eliminate the 5% that are most likely to represent paralogous loci.
To decide on a value for H, we want to capture the dense cloud of points and eliminate the diffuse points. In the graph above, it is difficult to tell where the cloud of points disperses, but a reasonable limit to set would be 0.65, as it looks like the main point cloud (i.e. the points that are not the sporadic vertical bars) is quite diffuse at that point. It is helpful to have an a priori idea of expected heterozygosity for your species or population and to know a bit about their evolutionary history. For instance, the above graph was generated for a teleost, which had a teleost-specific genome duplication event. This fact is reflected in the presence of "diverged duplicates" in the graph, which can be seen in the vertical bar at an H of 1.0.
For more details about how to interpret these graphs, see McKinney et. al. (2017), and the Paralog-Finder GitHub. Additionally, Garrett McKinney, who created HDplot, has said he is happy to help interpret graphs - he said many people have had difficulty interpreting the graphs and he welcomes emails about it.

Step 3: Run below code chunk, specifying D and H as arguments.

$ module load python/2.7.9
$ python ./scripts/blacklist_paralogs.py -i ./population.depthsBias --maxH [H] --minD [-D] --maxD [+D]
Arguments:

  • -i ./population.depthsBias -- the file of read depth ratios produced in Step 1 above (output as the hidden file .depthsBias and renamed by the script)
  • --maxH [H] -- maximum value of heterozygosity allowed at a locus. Default is 0.6
  • --minD [-D] -- lower limit of D. Default is -7
  • --maxD [+D] -- upper limit of D. Default is 7

Expected output:

  • paralogs.blacklist: This is a blacklist of putatively paralogous loci to be excluded from our full populations analysis based on the read ratio deviation D and expected heterozygosity H.
  • singletons.whitelist: This is the "whitelist" of loci that we will use to calculate population statistics i.e. the loci that made it through our filters.

Both of these lists will be output to the folder from which the script was run. These will be passed to the populations program and run on the full dataset using our optimized values for M and n.

8. Re-run populations, specifying list of putative paralogs

Before running:

  • Specify the population map file directly in the script 08_run_populations.sh. If you named it pop_map.txt, then no need to do anything - that’s the default.
  • Consider adding the following flags, which are not set as defaults:
    • -r 0.8: This flag will only use r80 loci to calculate population statistics.
    • --max-obs-het [H]: Just to make sure the paralog filter removed all the loci with greater than expected heterozygosity ratings, it doesn't hurt to re-specify that value here.
    • See the populations manual for other potential flags, especially regarding output file format.
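If you need to write or sanity-check 08_run_populations.sh yourself, a minimal sketch of the core call might look like the following (output folder, thread count, and the exact flag set here are assumptions; adjust them to your dataset and the notes above):

#Re-run populations on the full dataset, excluding putative paralogs
module load stacks/2.3d
populations -t 8 -P ./06_denovo_map_out -O ./07_FINAL_populations_out --popmap ./pop_map.txt -B ./paralogs.blacklist -r 0.8 --max-obs-het 0.7 --vcf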

RUN SCRIPT $ bsub < ./scripts/08_run_populations.sh

Expected output

This will calculate population statistics using loci that have been assembled with our optimal parameter values and data that have been filtered for paralogs. In other words, this is the step where we (finally) get results we can use!

The output will be given in the specified output format (default for our script is vcf). More detailed descriptions of the files and specific fields can be found in the populations manual and the Stacks manual section 6.6.

V. SNP Filtering

9. Light filtering with vcftools (and LD calculations with bcftools for reference-based workflows)

Here, we will do some light filtering to remove SNPs that are missing very high amounts of data and/or are uninformative. If using a reference-based workflow, we will also filter for LD.

for de novo workflows (no LD filter)

vcftools --vcf input.vcf --max-missing 0.5 --maf 0.01 --min-meanDP 10 --max-meanDP 1000 --recode --mac 3 --out output.vcf

Here, the options are:

  • --max-missing: maximum missingness for a locus. Bounded between 0 and 1, where 0 means sites that are missing from every sample are retained, and 1 means sites that are missing from any sample are removed.
  • --maf: minor allele frequency. Can set this to a low value to make sure we don't miss rare alleles.
  • --min-meanDP, --max-meanDP: Minimum and maximum average read depth allowed for a locus.
  • --mac: Minor allele count. Set this to 3 to ensure that any included allele is present in at least 2 individuals.

for reference-based workflows (pipe to bcftools for LD filter)

vcftools --vcf input.vcf --max-missing 0.5 --maf 0.01 --min-meanDP 10 --max-meanDP 1000 --recode --mac 3 --stdout | bcftools +prune -m 0.5 -w 100000 - -Ov -o output.vcf

Additional options for bcftools:

  • -m: maximum R2/D' value for LD. Any site with a value above this will be removed
  • -w: The window over which to calculate LD. Defaults to 100kb
  • -Ov: output as a vcf file

10. SNP filter in R

Now that we have our set of (lightly filtered) SNPs, we'll do some more robust filtering in R. For this, we use the Rmarkdown file named general_snp_filtering_markdown_template.Rmd from our lab GitHub page. Instructions and expectations for running the script are included in the .Rmd file.

Expected output

At the end of the SNP filter script, you will have two single-column text files:

  1. The list of samples that passed the filters
  2. The list of loci that passed the filters

These whitelists will be passed to bcftools along with the original vcf file that was output from vcftools in step 9.

11. Create filtered vcf file for population genetics

bcftools will generate a new filtered vcf file that includes only the samples and loci that passed the filtering criteria. There are a few steps that need to be run to prepare the original vcf file for filtering. These steps do not play well in a pipe so they have to be run individually, but it's all straightforward and takes < 5 minutes to code and run. These can easily be run from an interactive queue.
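If you want an interactive session on the cluster for these steps, something like the following works on LSF (the queue name and time limit here are assumptions - check your cluster's interactive-queue settings):

$ bsub -q interactive -W 2:00 -Is bash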

Step 11.1: First, load bcftools: module load bcftools/1.15

Step 11.2: Now, zip the vcf file: bcftools view -O z [InFile].vcf -o [OutFile].vcf.gz

Step 11.3: Index the zipped vcf file (note that you do not need to specify an output file here): bcftools index [InFile].vcf.gz

Step 11.4: Finally, subset the zipped vcf file to only retain SNPs and samples that passed filters and output an unzipped vcf for population genetics.

bcftools view -S Whitelist_indivs.txt -R Whitelist_loci.txt [InFile].vcf.gz -o [out]_filtered.vcf

This all runs quickly, so it can be run on the interactive queue.

Expected output

The output from this will be a filtered vcf file that is suitable for population genetics analyses. If using the naming convention above, this will be the file with the suffix "filtered.vcf".

BASH Scripts

01_mkdirs.sh

#!/bin/bash

mkdir 00_RAW_data
mkdir 01_trimmed_RAW_data
mkdir 02_process_out
mkdir 03_clone_filter_out
mkdir 04_ustacks_out
mkdir 05_parameter_opt
mkdir 05_parameter_opt/logs
mkdir 05_parameter_opt/clone_filter_test
mkdir 05_parameter_opt/vcf_files
mkdir 06_denovo_map_out
mkdir 07_FINAL_populations_out
mkdir logs
mkdir metrics

mv ./clustOpt ./metrics

#Move the paralog-finder scripts into the scripts folder
#(assumes the paralog-finder repository has been cloned into the working directory; see the HDplot/paralog-finder section)
mv ./paralog-finder/HDplot_process_vcf.py ./scripts
mv ./paralog-finder/blacklist_paralogs.py ./scripts
mv ./paralog-finder/HDplot_graphs.R ./scripts
mv ./paralog-finder/ ./scripts

for i in $(seq 1 9)

do

mkdir ./05_parameter_opt/m3_M${i}_n${i}
mkdir ./05_parameter_opt/m3_M${i}_n${i}/r80
mkdir ./05_parameter_opt/m3_M${i}_n${i}/ind_het

done

02_prelim_QC.sh

#!/bin/bash

#BSUB -q long
#BSUB -W 20:00
#BSUB -R rusage[mem=16000]
#BSUB -n 4
#BSUB -R span[hosts=1]
#BSUB -e ./logs/02_prelim_QC.err
#BSUB -oo ./logs/02_prelim_QC.log

#READ BEFORE RUNNING
#This script assumes: 
#1) raw sequence files are gzipped.
#2) barcodes are 8 bases long
#3) raw data is in 00_RAW_data folder, paired end reads ending in 1.fq.gz

#Where do we expect the cut site? With an 8-base barcode and best rad protocol we expect sites 11-16
cut_loc="11-16"

#What is the cut sequence we expect to find? For Sbf it's "TGCAG". Make last character a wild card because process_radtags will rescue cut sites where the last character is different
cut_seq="TGCA."

#Lines to subset
lines=100000

for file in ./00_RAW_data/*1.fq.gz
do

prefix=${file%%1.fq.gz}

#To check whole file
#lines=$(wc -l $file)

echo $file

#Check cut site in positions 11-16 in both forward and reverse reads
cut_freq_1=$(zcat ${prefix}1.fq.gz | head -$lines | awk '/^@/{getline; print}' | cut -c 11-16 | sort | uniq -c | grep "$cut_seq" | awk '{sum+=$1} END{print sum}')
cut_freq_2=$(zcat ${prefix}2.fq.gz | head -$lines | awk '/^@/{getline; print}' | cut -c 11-16 | sort | uniq -c | grep "$cut_seq" | awk '{sum+=$1} END{print sum}')

#Get total number of reads. Could also be done by passing to wc -l 
tot_freq=$(zcat ${prefix}1.fq.gz | head -$lines | awk '/^@/{getline; print}' | cut -c 11-16 | sort | uniq -c | awk '{sum+=$1} END{print sum}')

cut_freq_tot=$((cut_freq_1 + cut_freq_2))

#Check percent of reads with cut site in expected position on either read - divide total found cut sites from both directions by total reads in one direction
cut_perc=$(echo "scale=2; $cut_freq_tot/$tot_freq * 100" | bc)

#Compare number of reads that start with [G,N][G,N] to total number of reads. Check for GG on both reads (since bestrad can switch them)
GG1=$(zcat $file | head -$lines | grep "^[G,N][G,N]" | wc -l)
GG2=$(zcat ${prefix}2.fq.gz | head -$lines | grep "^[G,N][G,N]" | wc -l)
tot=$(zcat $file | head -$lines | grep "^@" | wc -l)

GG=$((GG1 + GG2))
GGperc=$(echo "scale=2; $GG / $tot *100" | bc)
echo "Cut site found in $cut_perc% of reads"
echo "Reads that start with GG (or NG, GN, or NN): $GG"
echo "Total reads analyzed: $tot"
echo "Percent reads that start with GG: $GGperc%"

done

03_process_data.sh

#!/bin/bash
#run process_radtags and clone_filter

#BSUB -q long
#BSUB -W 100:00
#BSUB -R rusage[mem=16000]
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -e ./logs/03_process_data.err
#BSUB -oo ./logs/03_process_data.log

##Need to specify barcode file
barcode_file="./RADtest_barcodes2.txt"

#Run atropos to trim first two bases
for file in ./00_RAW_data/*.fq.gz
do
sample=${file##./00_RAW_data/}
atropos trim --cut 2 -se $file -o 01_trimmed_RAW_data/${sample}
done

wait

#Run process_radtags
module load stacks/2.3d

for file in ./01_trimmed_RAW_data/*1.fq.gz
do
read_name=${file%%1.fq.gz}
process_radtags -1 $file  -2 ${read_name}2.fq.gz -b $barcode_file -o 02_process_out -c -q -r --inline_null -e sbfI --bestrad
done

wait

#Move "remainder reads" into separate folder
mkdir ./02_process_out/remainder_reads
mv ./02_process_out/*.rem.* ./02_process_out/remainder_reads

#Run clone_filter
for file in ./02_process_out/*.1.fq.gz
do
sample=${file%%.1.fq.gz}
clone_filter -1 ${sample}.1.fq.gz -2 ${sample}.2.fq.gz -i gzfastq -o ./03_clone_filter_out
done

wait

#Rename clone_filter output to remove second number from end - Read 1
for file in ./03_clone_filter_out/*.1.1.fq.gz
do
sample=${file%%.1.1.fq.gz}
mv $file ${sample}.1.fq.gz
done

#Read 2
for file in ./03_clone_filter_out/*.2.2.fq.gz
do
sample=${file%%.2.2.fq.gz}
mv $file ${sample}.2.fq.gz
done

wait

##Extract filter metrics to export to R
touch filt_metrix.txt
###Run on the .err file from process_radtags
grep -A 4 "total sequences" logs/03_process_data.err > filt_metrix.txt

###Run on the .err file from clone_filter
grep -A 1 -e "Reading data from" -e "clone reads" logs/03_process_data.err >> filt_metrix.txt

#total sequences - per lane
tot_seq=$(awk -F" " '/total sequences/ {print $1}' filt_metrix.txt)

#barcode not found drops - per lane
bc_nf=$(awk -F" " '/barcode not found/ {print $1}' filt_metrix.txt)

#low quality read drops - per lane
lq_rd=$(awk -F" " '/low quality/ {print $1}' filt_metrix.txt)

#RAD cutsite not found drops - per lane
no_RAD=$(awk -F" " '/RAD cutsite/ {print $1}' filt_metrix.txt)

#Retained reads - per lane
Retained=$(awk -F" " '/retained reads/ {print $1}' filt_metrix.txt)

#Store sample names as variable
clone_names=$(awk -F "/" '/fq.gz/ { found++ } found > 1 {print $3}' filt_metrix.txt | cut -d'.' -f1)

#extract input reads to clone_filter - per sample
in_filter=$(awk -F" " '/pairs/ {print $1}' filt_metrix.txt)

#extract output reads from clone_filter - per sample
out_filter=$(awk -F" " '/pairs/ {print $6}' filt_metrix.txt)

#extract percent clones - per sample
per_clones=$(awk -F" " '/clone reads/ {print $16}' filt_metrix.txt | cut -d'%' -f1)

touch ./metrics/lane_metrics.csv
echo "Total sequences","barcode not found drops","low quality read drops","RAD cutsite not fo\und drops","Reads Retained" > ./metrics/lane_metrics.csv
echo $tot_seq,$bc_nf,$lq_rd,$no_RAD,$Retained >> ./metrics/lane_metrics.csv

touch ./metrics/filt_metrics.csv
echo "sample_names "$clone_names > ./metrics/filt_metrics.csv
echo "input_reads_to_clone_filter "$in_filter >> ./metrics/filt_metrics.csv
echo "output_reads_from_clone_filter "$out_filter >> ./metrics/filt_metrics.csv
echo "percent_clones "$per_clones >> ./metrics/filt_metrics.csv

sed -i "s/ /,/g" ./metrics/filt_metrics.csv

#transpose to put metrics in columns
awk 'BEGIN { FS=OFS="," }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : O\
RS)
        }
    }
}' ./metrics/filt_metrics.csv > ./metrics/filter_metrics.csv

rm ./filt_metrix.txt
rm ./metrics/filt_metrics.csv

The example script above includes the following flags for process_radtags:

  • -1: forward read
  • -2: reverse read
  • -b: tab-delimited text file with barcode info
  • -c: clean the sequences by removing reads with uncalled bases
  • -q: discard reads with low quality scores
  • -r: rescue barcodes and RAD tags if they differ from expectations by one base
  • --inline_null: specifies barcode position (see link above for other options)
  • -e: restriction enzyme used
  • --bestrad: specifies that the bestrad protocol was used, which means the barcode could be on the forward or reverse read

See here for additional options that can be passed to process_radtags and see here for details of additional options that can be passed to clone_filter.

04_run_ustacks.sh

#!/bin/bash
#run ustacks

#BSUB -q long
#BSUB -W 100:00
#BSUB -R rusage[mem=16000]
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -e ./logs/04_run_ustacks.err
#BSUB -oo ./logs/04_run_ustacks.log

#Specify the number of samples to order by coverage in output file and potentially link to.
num_samps=6

#Run ustacks on the whole dataset to determine which samples will be used for parameter optimization
module load stacks/2.3d

counter=1

for file in ./03_clone_filter_out/*1.fq.gz
do
ustacks -f $file -i $counter -o ./04_ustacks_out -p 8
counter=$((counter + 1))
done

wait

#get metrics of interest from the .err file
touch cov.txt

#Build simple text file for extracting metrics information later with awk
#Get sample number, input reads, coverage, and percent of reads used for stacks 
grep -e "Input file" -e "Final coverage" logs/04_run_ustacks.err > cov.txt

##Extract metrics with awk
#Store sample names as variable
sample_names=$(awk -F'/' '/Input file/ {print $3}' cov.txt | cut -d'.' -f1)

#extract number of reads used by ustacks - per sample
reads_used=$(awk -F "=" '/Final coverage/ {print $5}' cov.txt | cut -d'(' -f1)

#extract mean coverage - per sample
mean_cov=$(awk -F "=" '/Final coverage/ {print $2}' cov.txt | cut -d';' -f1)

touch ./metrics/cov_metrics.csv
echo "sample_names "$sample_names > ./metrics/cov_metrics.csv
echo "reads_retained_by_ustacks "$reads_used >> ./metrics/cov_metrics.csv
echo "mean_coverage "$mean_cov >> ./metrics/cov_metrics.csv

sed -i "s/ /,/g" ./metrics/cov_metrics.csv

#transpose to put metrics in columns
awk 'BEGIN { FS=OFS="," }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}' ./metrics/cov_metrics.csv > ./metrics/coverage_metrics.csv

rm ./metrics/cov_metrics.csv
rm ./cov.txt

#Print samples with most coverage (# specified by num_samps variable at top)
echo "The $nums_samps samples with the highest coverage are:"
sort -nrk3 -t"," ./metrics/coverage_metrics.csv | head -$num_samps | cut -d"," -f1,4

05_parameter_test.sh

#!/bin/bash
#Run the entire stacks pipeline on a subset of samples with various parameter settings

#BSUB -q long
#BSUB -W 80:00
#BSUB -R rusage[mem=16000]
#BSUB -n 15
#BSUB -R span[hosts=1]
#BSUB -e logs/05_parameter_test.err
#BSUB -oo logs/05_parameter_test.log

###READ BEFORE RUNNING
#Assumes sample prefixes are stored in hs.txt
#Specify upper limit of M & n parameters to test with params variable
params=9

#Specify number of samples
num_samps=$(awk 'END {print NR}' hs.txt)

#Make popmap file for parameter optimization (all from same population)
sed "s/$/\t1/" ./hs.txt > ./popmap.txt

#Make popmap file for individual heterozygosity (all from different populations)
awk '$2=NR' OFS="\t" hs.txt > ./het_popmap.txt

#Run denovo_map with every parameter setting
module load stacks/2.3d

for var in $(seq 1 $params)
do

#Run denovo_map for each parameter setting
denovo_map.pl -T 15 -m 3 -M $var -n $var -o ./05_parameter_opt/m3_M${var}_n${var} --popmap ./popmap.txt --samples ./05_parameter_opt/clone_filter_test --paired -X "populations: --vcf" -X "populations: --max-obs-het 0.7"

#Get r80 loci
populations -t 15 -P ./05_parameter_opt/m3_M${var}_n${var} -O ./05_parameter_opt/m3_M${var}_n${var}/r80 --popmap ./popmap.txt --vcf -r 0.8 --max-obs-het 0.7

#Get individual heterozygosity
populations -t 15 -P ./05_parameter_opt/m3_M${var}_n${var} -O ./05_parameter_opt/m3_M${var}_n${var}/ind_het --popmap ./het_popmap.txt --max-obs-het 0.7

done

wait

mkdir 05_parameter_opt/vcf_files

#Gather vcf files from different parameter settings runs
#Do we need vcf files from r80 run?
for i in $(seq 1 $params)
do

haps_file=$(ls ./05_parameter_opt/m3_M${i}_n${i}/populations.haps.vcf)
snps_file=$(ls ./05_parameter_opt/m3_M${i}_n${i}/populations.snps.vcf)
cp $haps_file ./05_parameter_opt/vcf_files/m3_M${i}_n${i}.haps.vcf
cp $snps_file ./05_parameter_opt/vcf_files/m3_M${i}_n${i}.snps.vcf

r80_haps=$(ls ./05_parameter_opt/m3_M${i}_n${i}/r80/populations.haps.vcf)
r80_snps=$(ls ./05_parameter_opt/m3_M${i}_n${i}/r80/populations.snps.vcf)
cp $r80_haps ./05_parameter_opt/vcf_files/m3_M${i}_n${i}_r80.haps.vcf
cp $r80_snps ./05_parameter_opt/vcf_files/m3_M${i}_n${i}_r80.snps.vcf

done

wait

for var in $(seq 1 $params)
do
cp ./05_parameter_opt/m3_M${var}_n${var}/denovo_map.log ./05_parameter_opt/logs/m3_M${var}_n${var}.log
cp ./05_parameter_opt/m3_M${var}_n${var}/r80/populations.log ./05_parameter_opt/logs/m3_M${var}_n${var}_r80.log
done

wait

# get metrics of interest from the log files that are output by the program for each parameter setting and r80

for var in $(seq 1 $params)
do

#Build simple text file for extracting metrics information later with awk for full parameter setting run
##Get sample number, input reads, coverage, and percent of reads used for stacks

file=./05_parameter_opt/logs/m3_M${var}_n${var}.log
log=${file%%.log}
output=./metrics/m3_M${var}_n${var}

touch ${log}_pop_metrix.txt

grep -e "Sample . of ." -e "reads; formed" -e "Final coverage" $file  > ${log}_pop_metrix.txt

#Get depth of coverage for all samples
##Change the number after -A to the number of samples
grep -A $num_samps "Depths of Coverage" $file >> ${log}_pop_metrix.txt

#Get number of loci in final catalog
grep "Final catalog" $file >> ${log}_pop_metrix.txt

#Get number of genotyped loci, effective per-sample coverage, sites (bases) per locus, and phasing
grep -A 3 "Genotyped" $file >> ${log}_pop_metrix.txt

#Get number of sites (bases) per locus
grep "Mean genotyped" $file >> ${log}_pop_metrix.txt

#Get population summary stats
grep -A 1 "Population summary" $file >> ${log}_pop_metrix.txt

##Extract metrics with awk
#Store sample names as variable
sample_names=$(awk -F"\'" '/Sample/ {print $2}' ${log}_pop_metrix.txt)

#Get reads input to stacks
in_stacks=$(awk -F" " '/Loaded/ {print $2}' ${log}_pop_metrix.txt)

#extract number of reads used by ustacks - per sample
reads_used=$(awk -F "=" '/Final coverage/ {print $5}' ${log}_pop_metrix.txt | cut -d'(' -f1)

#extract mean coverage - per sample
mean_cov=$(awk -F "=" '/Final coverage/ {print $2}' ${log}_pop_metrix.txt | cut -d';' -f1)

#extract number of loci in catalog - per experiment
cat_loci=$(awk -F " " '/Final catalog/ {print $4}' ${log}_pop_metrix.txt)

#extract number of loci genotyped - per experiment
gen_loci=$(awk -F " " '/Genotyped/ {print $2}' ${log}_pop_metrix.txt)

#Effective per-sample coverage - per experiment
eff_cov=$(awk -F "=" '/effective per-sample/ {print $2}' ${log}_pop_metrix.txt | cut -d'x' -f1)

#Minimum coverage - per experiment
min_cov=$(awk -F "=" '/effective per-sample/ {print $4}' ${log}_pop_metrix.txt | cut -d'x' -f1)

#Maximum coverage - per experiment
max_cov=$(awk -F "=" '/effective per-sample/ {print $5}' ${log}_pop_metrix.txt | cut -d'x' -f1)

#Mean number of sites per locus - per experiment
sites_per=$(awk -F ": " '/mean number/ {print $2}' ${log}_pop_metrix.txt)

#percent reads with phasing - per experiment
phased=$(awk -F "(" '/a consistent phasing/ {print $2}' ${log}_pop_metrix.txt | cut -d'%' -f1)

#Mean genotyped sites per locus - per experiment
mean_gen_loc=$(awk -F " " '/Mean genotyped/ {print $6}' ${log}_pop_metrix.txt | cut -d'b' -f1)

#samples represented per locus - per experiment
samp_loc=$(awk -F " " '/samples per locus/ {print $2}' ${log}_pop_metrix.txt)

#pi - per experiment
pi=$(awk -F " " '/samples per locus/ {print $7}' ${log}_pop_metrix.txt | cut -d';' -f1)

#polymorphic sites - per experiment
SNPs=$(awk -F "/" '/samples per locus/ {print $5}' ${log}_pop_metrix.txt | cut -d';' -f1)

touch ${output}_sample_metrics.csv
echo "sample_names "$sample_names > ${output}_sample_metrics.csv
echo "total_input_reads_to_stacks "$in_stacks >> ${output}_sample_metrics.csv
echo "reads_retained_by_ustacks "$reads_used >> ${output}_sample_metrics.csv
echo "mean_coverage "$mean_cov >> ${output}_sample_metrics.csv

sed -i "s/ /,/g" ${output}_sample_metrics.csv

touch ${output}_population_metrics.csv
echo "catalog loci","loci genotyped","effective per sample coverage","min coverage","max coverage","mean sites per locus","percent reads with phasing","mean genotyped sites per locus","samples genotyped per locus",pi,SNPs > ${output}_population_metrics.csv
echo $cat_loci","$gen_loci","$eff_cov","$min_cov","$max_cov","$sites_per","$phased","$mean_gen_loc","$samp_loc","$pi","$SNPs >> ${output}_population_metrics.csv

rm ${log}_pop_metrix.txt

########## Extract r80 SNPs and loci ##########
#Get number of r80 loci
loci=$(grep -e "Kept" ./05_parameter_opt/logs/m3_M${var}_n${var}_r80.log | awk -F" " '{print $2}')

#r80 SNPs
SNPs=$(grep -e "Kept" ./05_parameter_opt/logs/m3_M${var}_n${var}_r80.log | awk -F" " '{print $14}')

touch ${output}_population_metrics_r80.csv
echo "loci","SNPs" > ${output}_population_metrics_r80.csv
echo $loci","$SNPs >> ${output}_population_metrics_r80.csv

done
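
#Optional (a suggested convenience, not part of the original pipeline): combine the
#per-parameter population metrics written above into a single table for quick comparison.
#Assumes every ${output}_population_metrics.csv shares the same header and that the
#M1/n1 run exists (the parameter loop starts at 1).
summary=./metrics/population_metrics_all.csv
echo "parameters,$(head -n 1 ./metrics/m3_M1_n1_population_metrics.csv)" > $summary

for var in $(seq 1 $params)
do
echo "M${var}_n${var},$(tail -n +2 ./metrics/m3_M${var}_n${var}_population_metrics.csv)" >> $summary
done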

######## format individual heterozygosity for import into R #########
for var in $(seq 1 $params)

do

file=./05_parameter_opt/m3_M${var}_n${var}/ind_het/populations.sumstats_summary.tsv

variant_lines=$((num_samps+2))
sed -n "2,${variant_lines}p" $file | sed "s/^#\s//" > ./metrics/m3_M${var}_n${var}_variant_sites.tsv

all_loci_start=$((variant_lines+2))
all_loci_end=$((all_loci_start + num_samps))
sed -n "$all_loci_start,${all_loci_end}p" $file | sed "s/^#\s//" > ./metrics/m3_M${var}_n${var}_all_sites.tsv

done
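
#Optional (a suggested check, not part of the original pipeline): report mean observed
#heterozygosity per parameter setting from the variant-site summaries created above.
#Assumes the sumstats summary header carried into the *_variant_sites.tsv files contains
#an "Obs_Het" column -- inspect the header and adjust the column name if yours differs.
for var in $(seq 1 $params)
do
awk -F"\t" 'NR==1 {for (i=1; i<=NF; i++) if ($i=="Obs_Het") col=i; next}
col {sum+=$col; n++}
END {if (n) printf "m3_M%d_n%d mean Obs_Het: %.4f\n", m, m, sum/n}' \
m=${var} ./metrics/m3_M${var}_n${var}_variant_sites.tsv
done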

06_run_denovo_map.sh

#!/bin/bash
#Run the entire Stacks pipeline on the full dataset with the optimal parameters

#BSUB -q long
#BSUB -W 72:00
#BSUB -R rusage[mem=16000]
#BSUB -n 20
#BSUB -R span[hosts=1]
#BSUB -e ./logs/06_run_denovo_map.err
#BSUB -oo ./logs/06_run_denovo_map.log

#Specify optimal parameter values and popmap
M=3
n=3
popmap=./pop_map.txt

#Run Stacks
module load stacks/2.3d

denovo_map.pl -T 20 -m 3 -M $M -n $n -o ./06_denovo_map_out --popmap $popmap --samples ./03_clone_filter_out --paired -X "populations: --vcf"
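
#denovo_map.pl writes its log (denovo_map.log) and the populations output, including
#populations.snps.vcf, to ./06_denovo_map_out; the HDplot step below reads that vcf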

07_run_HDplot.sh

#!/bin/bash

#BSUB -q short
#BSUB -W 4:00
#BSUB -R rusage[mem=1000]
#BSUB -n 1
#BSUB -e logs/07_run_HDplot.err
#BSUB -oo logs/07_run_HDplot.log

module load python/2.7.9
module load R/3.6.0

#Calculate read ratio deviation and heterozygosity
python ./scripts/HDplot_process_vcf.py -i ./06_denovo_map_out/populations.snps.vcf

wait

#Plot above metrics
Rscript ./scripts/HDplot_graphs.R -i ./.depthsBias --minD -80 --maxD 80

# Rename depthsBias so it's not hidden
mv ./.depthsBias ./population.depthsBias
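
The final populations run below expects a blacklist of putative paralogous loci (paralogs.blacklist) and, optionally, a whitelist of loci to keep (singletons.whitelist); both are plain text files with one catalog locus ID per line. The cutoffs come from inspecting the HDplot graphs, and the snippet below is only a hedged sketch: the column numbers for locus ID, heterozygosity (H), and read-ratio deviation (D), and the thresholds (0.6 and ±7), are placeholders to check against your own population.depthsBias before using.

# Inspect the header to confirm which columns hold the locus ID, H, and D
$ head -1 ./population.depthsBias

# Sketch only: write locus IDs that exceed your chosen H or |D| cutoffs to the blacklist.
# Replace $1, $9, and $11 with the correct column numbers from the header above.
$ awk -F"\t" 'NR>1 && ($9 > 0.6 || $11 > 7 || $11 < -7) {print $1}' ./population.depthsBias | sort -n | uniq > ./paralogs.blacklist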

08_run_populations.sh

#!/bin/bash
#BSUB -q long
#BSUB -W 30:00
#BSUB -R rusage[mem=16000]
#BSUB -n 20
#BSUB -R span[hosts=1]
#BSUB -e logs/08_run_populations.err
#BSUB -oo logs/08_run_populations.log

#Specify popmap
popmap=./pop_map.txt

module load stacks/2.3d

populations -t 20 -P ./06_denovo_map_out -O ./07_FINAL_populations_out --popmap $popmap --vcf -B ./paralogs.blacklist -W ./singletons.whitelist --hwe --fstats
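
After this final populations run, the output VCF is typically filtered further before analysis. The command below is only a minimal sketch of that step with vcftools (the same module loaded by metrics4Rmarkdown.sh); the thresholds and output prefix are placeholders to adjust to your data, and it assumes populations wrote populations.snps.vcf into ./07_FINAL_populations_out.

$ module load vcftools/0.1.14

# Sketch only -- adjust thresholds to your data
$ vcftools --vcf ./07_FINAL_populations_out/populations.snps.vcf \
    --max-missing 0.8 --maf 0.05 --minDP 5 \
    --recode --recode-INFO-all --out ./07_FINAL_populations_out/populations.snps.filtered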

metrics4Rmarkdown.sh

#!/bin/bash

#BSUB -q long
#BSUB -W 8:00
#BSUB -R rusage[mem=1000]
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -e ./logs/metrics_prep.err
#BSUB -oo ./logs/metrics_prep.log

module load R/3.6.0 
module load gcc/8.1.0
module load perl/5.18.1 
module load vcftools/0.1.14 

## Make a text file that lists the absolute paths to the SNP vcf files
cd ./05_parameter_opt/vcf_files
ls -d "$PWD"/*snps.vcf > ../../metrics/vcf_all.txt
cd ../../metrics/

# Remove the vcf file paths that we don't need (haps.vcf or r80.snps.vcf)
grep -v "haps\.vcf$" vcf_all.txt | grep -v "r80\.snps\.vcf$" > vcflist.txt

# Return back to main directory
cd ..

#########################################################
######## Metric 4: Cumulative variance in PCAs ##########
#########################################################

# The numerical argument after the vcf list sets the number of PCs; write each result to a new file
Rscript ./metrics/clustOpt/vcfToPCAvarExplained.R ./metrics/vcflist.txt 2 > ./metrics/PCAvarExplained_2.txt
Rscript ./metrics/clustOpt/vcfToPCAvarExplained.R ./metrics/vcflist.txt 3 > ./metrics/PCAvarExplained_3.txt
Rscript ./metrics/clustOpt/vcfToPCAvarExplained.R ./metrics/vcflist.txt 4 > ./metrics/PCAvarExplained_4.txt
Rscript ./metrics/clustOpt/vcfToPCAvarExplained.R ./metrics/vcflist.txt 5 > ./metrics/PCAvarExplained_5.txt
Rscript ./metrics/clustOpt/vcfToPCAvarExplained.R ./metrics/vcflist.txt 7 > ./metrics/PCAvarExplained_7.txt
Rscript ./metrics/clustOpt/vcfToPCAvarExplained.R ./metrics/vcflist.txt 9 > ./metrics/PCAvarExplained_9.txt

# for each of the PCAvarExplained files created above
for file in 2 3 4 5 7 9

do

# Remove the extra text, keep only what is needed for plotting, and write to a new file
sed 's/^.*vcf_files//g' ./metrics/PCAvarExplained_${file}.txt | sed 's/^.//' | sed 's/\.snps\.vcf://g' > ./metrics/PCAplot${file}.txt

done


##################################################################
### Metric 5: Data missingness vs. pairwise genetic similarity ###
##################################################################

# move back into the clustOpt space (necessary for the next step to run within that folder)
cd metrics/clustOpt/

# This will run and call on missingnessHeatMaps.R script to produce the pdf heat maps 
perl ./vcfMissingness.pl --vcflist ../vcflist.txt

# pause for 30 minutes to give the perl script enough time to finish
# writing the PDF missingness heat maps ('wait' does not work here)
sleep 30m

# move back out of the clustOpt space and into the main directory
cd ../../

# Copy the output files from the vcfMissingness.pl script. They are written to the directory
# where the vcf files are located; copy them to the metrics folder so everything is stored
# together for the downstream steps.
# cp -v ./05_parameter_opt/vcf_files/*HM* ./metrics
cp -v ./05_parameter_opt/vcf_files/*pdf ./metrics

# Print out correlations from log file 
grep -i -e '^Correlation' ./logs/metrics_prep.log > ./metrics/missingnessHeatMaps_correlations.txt


R Markdown to plot data for parameter optimization

Assumes all files were successfully output into the metrics folder following parameter optimization.
  1. Be sure you ran the metrics4Rmarkdown.sh script successfully before moving on to the R Markdown; it will not work otherwise.
  2. Copy the "metrics" folder from your main directory to your local computer (e.g. with scp; see the example after this list). This is much easier to run locally, and the file sizes are not large.
  3. Clone the R Markdown locally. Be sure to set the working directory of your R Markdown to run from within the metrics folder.
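
For step 2, the copy can be done with scp (or rsync) from your local machine; the hostname and remote path below are placeholders to replace with your own:

$ scp -r username@cluster.example.edu:/path/to/example_dir/metrics .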

Notes to consider for future

  • Might be able to use the --rm-pcr-duplicates flag in denovo_map.pl and remove the clone_filter step entirely (see the sketch below).
  • Is it a better idea to trim the first two bases from each read, or to add GG to the beginning of each barcode? (see Julian's comment here)
  • Thoughts about checking for barcodes? Wouldn't be exact (due to mismatches), but maybe a rough estimate?
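
For the first note above, a run that skips clone_filter might look something like the sketch below. This is speculative: check denovo_map.pl --help in your Stacks module to confirm --rm-pcr-duplicates is supported (it is not available in every 2.x release), and point --samples at the trimmed, un-deduplicated reads (the directory name below is a placeholder).

$ denovo_map.pl -T 20 -m 3 -M 3 -n 3 -o ./06_denovo_map_out --popmap ./pop_map.txt \
    --samples ./trimmed_reads --paired --rm-pcr-duplicates -X "populations: --vcf"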