Notes on Analyzing ddRADseq SNP Data in SNAPP (BEAST v2.4++)

This markdown note describes how I used several software programs to process SNPs from ddRAD tag loci and eventually analyze them in SNAPP (Bryant et al. 2012). The data were generated by next-generation sequencing of ddRADseq genomic libraries prepared for several species/lineages of Neotropical freshwater fishes from the Brazilian Cerrado (Central Brazil).

My account is given in first person and represents merely one way to analyze data in SNAPP; there are other approaches, and other documents (e.g. the Leaché et al. BFD* tutorial) also present a general workflow. However, all of the brief SNAPP guides and tutorials currently available require the user to consult the manual, A Rough Guide to SNAPP, written by Bouckaert and Bryant. Since SNAPP is amply covered by Bryant et al. (2012), Leaché et al. (2014), and other papers, I'll skip an introduction to SNAPP and assume the reader is acquainted with the details of the method, its input, and its output.

Data Preparation

  1. I converted raw NGS data in .bam format to .fastq format using samtools (example commands for steps 1-3 are sketched after this list). This was done on Ion Torrent data and may not be applicable to other platforms or non-second-generation sequencers.

  2. I conducted quality checks on the fastq files using the program FastQC.

  3. pyRAD assembly and processing of the pyRAD output:

    • I ran the fastq files through pyRAD under different parameter settings, then chose a final pyRAD run and its corresponding assembly, stats, and output files in various phylogenetic and population genetic formats (e.g. ".phy", ".nex", ".str", ".geno", ".snps").
    • I specified for pyRAD to output all single nucleotide polymorphisms (SNPs) in phylip format, and also an unlinked .snps file to work from.
    • Before going further, I removed the outgroup from each datafile so that only ingroup samples were included in subsequent analyses (so I had the original SNPs file and the new file with no outgroups).
  4. I downloaded the phrynomics R package and loaded the full set of SNP loci (with variable and non-variable sites), which had the ".snps" file extension, into phrynomics, where I then

      1. removed nonbinary sites ("RemoveNonBinary"),
      2. identified and pulled out a single SNP from each locus to use in the analyses ("TakeSingleSNPfromEachLocus"),
      3. translated bases to the appropriate format/coding ("TranslateBases"), and
      4. wrote the modified SNPs to a NEXUS file with missing data coded as ?'s.
    • I also took advantage of this opportunity to convert my original SNPs data to MrBayes and RAxML format within phrynomics.
    • All of my R code for step #4 here is in a file named "Hh_and_Hyp_Phrynomics_R_code... txt" in the bin folder I usually use as my R working directory (a generic sketch of these phrynomics steps also appears after this list).
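
For concreteness, the commands below sketch steps 1-3 in generic form. The file names, output folder, and pyRAD step numbers are placeholders, and exact invocations differ by software version (e.g. older samtools releases call the fastq subcommand `bam2fq`), so treat this as a schematic rather than the exact commands I ran.

```bash
# Step 1: BAM -> FASTQ with samtools (placeholder file names)
samtools fastq sample1.bam > sample1.fastq

# Step 2: quality checks with FastQC, reports written to a folder
mkdir -p fastqc_reports
fastqc sample1.fastq --outdir fastqc_reports

# Step 3: a pyRAD assembly run driven by a params file (steps 1-7)
pyrad -p params.txt -s 1234567
```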
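
And here is a generic sketch of the phrynomics calls in step #4, based only on the function names mentioned above and in the comment thread below. The file names are placeholders, the `$data`/`$snpdata` accessors and the `WriteSNP()` arguments reflect my reading of the package and may differ between phrynomics versions, and this is not the actual code in my "Hh_and_Hyp_Phrynomics_R_code" file.

```r
## Schematic only -- check each function's help page for the exact
## arguments in your phrynomics version.
library(phrynomics)

snps <- ReadSNP("mydata.snps")                     # full SNP matrix from pyRAD
snps <- RemoveNonBinary(snps$data)                 # 1) drop non-binary sites
snps <- TakeSingleSNPfromEachLocus(snps)$snpdata   # 2) keep a single SNP per locus
snps <- TranslateBases(snps)                       # 3) recode bases for SNAPP
WriteSNP(snps, file = "snapp_data.nex", format = "nexus")  # 4) write the NEXUS file
```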

Input File Generation

  1. I then modified the NEXUS file so that the datatype was specified as integer, i.e. by setting `FORMAT DATATYPE=integerdata symbols="012" gap=- MISSING=?;` in the NEXUS header. This way the data were treated as discrete (see the toy example header after this list).

  2. I modified each individual taxon name so that it was preceded by a code with the format "Species_Taxon", where Species is the hypothetical species assignment of that individual. I did this because SNAPP expects this format: it will automatically pull the Species part from the taxon names and use it to help set up the species tree and population parameters in the analysis, e.g. branch lengths and thetas. The final data file was a NEXUS file with the extension ".nex".

  3. I opened a new SNAPP template in BEAUti 2.4.2 and added the final SNAPP-formatted data file as an alignment, then set up runs with different priors/models and saved each as a separately named XML input file for BEAST.
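
As a toy illustration of steps 1-2 (the dimensions, taxon labels, and matrix below are invented, not my real data), the resulting NEXUS file looks something like this:

```
#NEXUS
BEGIN DATA;
    DIMENSIONS NTAX=4 NCHAR=6;
    FORMAT DATATYPE=integerdata symbols="012" gap=- MISSING=?;
    MATRIX
    SpeciesA_ind1    012?10
    SpeciesA_ind2    002?12
    SpeciesB_ind1    211?00
    SpeciesB_ind2    2-0101
    ;
END;
```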

Running SNAPP

  1. I used my shell script "BEASTRunner_newRand.sh" to automate creating run folders and queue-submission shell scripts for five runs of each XML file, each starting from a different random seed generated by calling Python for a random number (a generic sketch of this idea is given after this list).

  2. I then submitted all of the jobs to the supercomputer at once, by logging in with passwordless ssh access and pasting in the qsub commands for every job in one go.
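
The script below is not BEASTRunner_newRand.sh itself, just a minimal sketch of what steps 1-2 describe; the XML names, PBS resource line, and the `beast` launcher path are placeholders for whatever your cluster uses.

```bash
#!/bin/bash
# Sketch only: make five run folders per XML, each with a Python-generated
# random seed and a queue-submission script.
for xml in snapp_model1.xml snapp_model2.xml; do
  for rep in 1 2 3 4 5; do
    run_dir="${xml%.xml}_run${rep}"
    mkdir -p "$run_dir"
    cp "$xml" "$run_dir/"
    seed=$(python -c "import random; print(random.randint(1, 100000000))")
    cat > "$run_dir/submit.sh" <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=4,walltime=72:00:00
cd "\$PBS_O_WORKDIR"
beast -seed ${seed} -threads 4 ${xml}
EOF
  done
done

# Then, on the cluster, all jobs can be submitted in one go, e.g.:
#   for d in snapp_model*_run*; do (cd "$d" && qsub submit.sh); done
```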

Tricky Stuff

One of the tricks to getting SNAPP to work for you is to experiment with different models and prior settings in order to find the most suitable priors for your dataset. For some datasets, the default SNAPP settings will work fine and result in nicely converged runs with parameter traces that reach stationarity. In my experience, however, for many other datasets the defaults will not produce good results. For some of my analyses, fixing u and/or v, or setting uninformative uniform priors within reasonable bounds, was necessary to achieve good sampling properties (nice, "fuzzy caterpillar" parameter traces and good ESS scores).

References

  • Bouckaert et al. (2014)
  • Bouckaert and Bryant. A Rough Guide to SNAPP.
  • Bryant et al. (2012)
  • Demos et al. (2015)
  • Drummond et al. (2012)
  • Leaché et al. (2014)
  • Stange et al. (2017). bioRxiv.
@k-sanchez

Hi Justin, I'm having this problem when I try to translate my phylip alignment to the nexus binary format:

```r
# Prepare data for SNAPP
snps <- ReadSNP("alignment.snps.phy")
snps <- RemoveNonBinary(snps$data)
```

```
In matrix(value, n, p) :
  data length [5046] is not a sub-multiple or multiple of the number of columns [4]
```

Do you know what could be the problem? It looks like a problem with the matrices, but I don't know how to solve it.
Thanks!!


npsonis commented Apr 10, 2020

Dear kis3990,
I had the same problem, and I found that in the .phy input file you have to comment out (with a hashtag, #) the first line, where the info about the data is given. So in your case, if you have for example 4 samples, that line becomes `#4 5046`.
Otherwise the first line is considered part of the data, with the name of the first sample being "4".
Best
