Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save justincbagley/8f1a263790609ed5275db065aa96b42b to your computer and use it in GitHub Desktop.
Save justincbagley/8f1a263790609ed5275db065aa96b42b to your computer and use it in GitHub Desktop.
This Gist describes how to process and analyze SNPs from ddRAD tag loci in SNAPP (BEAST2)

Method for Analyzing ddRADseq Data in SNAPP (BEAST v2.4++)

Justin C. Bagley, September 11, 2017, Richmond, VA, USA

This markdown note describes how I used several software programs to process and eventually analyze SNPs from ddRAD tag loci (contigs) in SNAPP (Bryant et al. 2012), which is implemented in BEAST (Drummond et al. 2012; Bouckaert et al. 2014) and is of broad interest in evolutionary biology for inferring species trees (e.g. Demos et al. 2015; Stange et al. 2017). I provide a perspective based on my experiences analyzing data generated using Next-Generation Sequencing on ddRADseq genomic libraries prepped for several species/lineages of Neotropical freshwater fishes from the Brazilian Cerrado (Central Brazil).

My account is given in first person and represents merely one way to analyze data in SNAPP; there are other approaches, and other documents (e.g. this BFD* tutorial; Leaché and Ogilvie 2016) also present a general approach. However, all of the brief SNAPP guides and tutorials that are currently available require the user to consult the manual, A Rought Guide to SNAPP, written by Bouckaert and Bryant (2012; available here or here). Since SNAPP is amply covered by Bryant et al. (2012), Leaché and Ogilvie (2016), and other papers, I'll skip the introduction to SNAPP and assume that the reader is acquainted with the details of the method, as well as the input and output of SNAPP runs conducted in BEAST.

Data Preparation

1. Fastq files.

  • I converted raw NGS data in .bam format to .fastq format, using samtools (link). I did this while working on Ion Torrent data, and so this step may not be applicable to cases involving other sequencing platforms or non-second-generation sequencers. Separate .fastq files from multiple runs were concatenated together.

2. FastQC.

  • I conducted quality checks on the fastq files using the software program, FastQC.

3. Assembly and SNP calling in pyRAD, then processing pyRAD output.

  • I ran fastq files in pyRAD under different parameter settings, then chose a final pyRAD run, and its corresponding output assembly, stats, and output files in various phylogenetic and population genetic formats (e.g. ".phy", ".nex", ".str", ".geno", ".snps").
  • I specified for pyRAD to output all single nucleotide polymorphisms (SNPs) in Phylip format, and also an unlinked .snps file to work from.
  • Before going further, I removed the outgroup from each datafile so that only ingroup samples were included in subsequent analyses (so I had the original SNPs file and the new file with no outgroups).

4. Phyrnomics R package. I installed the phrynomics R package (devtools::install_github("bbanbury/phrynomics")) and loaded the full set of SNP loci (with variable and non-variable sites), which had the ".snps" file extension, into phyrnomics, where I then

    1. removed nonbinary sites (RemoveNonBinary function),
    1. identified and pulled out a single SNP from each locus to use in the analyses (TakeSingleSNPfromEachLocus),
    1. translated bases to appropriate format/coding (TranslateBases), and 4) wrote modified SNPs to a NEXUS file with missing data coded as ?'s.
  • I also took advantage of this opportunity to convert my original SNPs data to MrBayes and RAxML format within phrynomics.
  • All of my R code for step #4 here is in a file named "Hh_and_Hyp_Phrynomics_R_code... txt" in the bin folder I usually use as my R working directory...

Input File Generation

5. Discrete NEXUS data.

  • I then modified the NEXUS file so that the datatype was specified as integer, i.e. by setting FORMAT DATATYPE=integerdata symbols="012" gap=- MISSING=?; in the nexus header. This way, the data was set to be discrete.

6. Tip names / taxon labels.

  • I modified each individual taxon name so that it was preceded by a code with the format "Species_Taxon", where Species = the hypothetical species delimitation of the taxon. I did this because SNAPP expects this format, and it will automatically pull the Species data from the filenames and use it to help set up the species tree and population parameters in the analysis, e.g. branch lengths and thetas. The final datafile was obviously a nexus file with extension ".nex".

7. Make input files in BEAUti.

  • I opened a new SNAPP template in BEAUti v2.4.2 and added the final SNAPP-formatted data file as an alignment, then set up different runs with different priors/models and saved them as different XML files (input files for BEAST), with different names.

Running SNAPP

8. Organize SNAPP run folders/scripts and submit jobs.

  • I used my shell script SNAPPRunner.sh to automate creating run folders and queue submission shell scripts for five runs of each XML file, each starting from a different random starting seed selected by calling on python to generate a random number.
  • SNAPPRunner also was used to submit all jobs to the job queue on a remote supercomputing cluster.

Tricky Stuff

One of the tricks to getting SNAPP to work for you is to experiment with different models and prior settings, in order to find the most suitable priors for your given dataset. For example, for some datasets, the default SNAPP settings will work fine and result in nicely converged runs with parameter traces that reached stationarity. However, I'm pretty sure that for many other datasets, the SNAPP defaults will not produce good results. For some of my analyes, fixing parameter u and/or parameter v, or setting uninformative uniform priors within reasonable bounds, was necessary to achieve good sampling properties (nice, "fuzzy caterpillar" parameter traces and good ESS scores). For datasets that "worked well", I have seen good chain mixing behavior over 1 million samples of MCMC simulations (which took 2 weeks to run on phylogenomic data):

References

  • Bouckaert R, Heled J, Künert D, Vaughan TG, Wu CH, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST2: a software platform for Bayesian evolutionary analysis. PLoS Computational Biology, 10, e1003537.
  • Bouckaert R, Bryant D (2012) A rough guide to SNAPP. Available at: https://github.com/BEAST2-Dev/SNAPP/releases/download/v1.2.0/SNAPP.pdf.
  • Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A (2012) Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution, 29, 1917–1932.
  • Demos TC, Peterhans JCK, Joseph TA, Robinson JD, Agwanda B, Hickerson MJ (2015) Comparative population genomics of African montane forest mammals support population persistence across a climatic gradient and Quaternary climatic cycles. PLoS One 10: e0131800.
  • Drummond AJ, Suchard MA, Xie D, Rambaut A (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29, 1969-1973.
  • Leaché A, Ogilvie HA (2016) Bayes Factor Delimitation of Species (* with genomic data; BFD*): A Tutorial and Worked Example. Available at: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2016/01/BFDstar-tutorial1.pdf.
  • Stange M, Sánchez-Villagra MR, Salzburger W, Matschiner M (2017) Bayesian divergence-time estimation with genome-wide SNP data of sea catfishes (Ariidae) supports Miocene closure of the Panamanian Isthmus. bioRxiv, 102129.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment