Skip to content

Instantly share code, notes, and snippets.

@brantfaircloth
Created November 9, 2011 22:58
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save brantfaircloth/47e03463db0573c4252f to your computer and use it in GitHub Desktop.
Save brantfaircloth/47e03463db0573c4252f to your computer and use it in GitHub Desktop.
2011-Fairclothetal-SysBiol-Notes

Identifying UCEs in other organisms

  • get count of all UCEs identified that were non-dupes:

    sqlite> select count(*) from cons, blast where blast.id = cons.id and blast.matches = 1 and duplicate = 0;
    count(*)  
    ----------
    5599      
  • get distance btw. UCE loci in set above w/:

    python other-organisms/get_distances_in_chicken.py
  • create a new table in the probe dbase for the ids and count of all "conservation" sureselect probes:

    sqlite> create table sureselect_all_probe_counts as select seq, id, count(seq) as cnt, 
    sqlite> data_source from sureselect where data_source = 'conservation' group by seq order by id;
    
    sqlite> select count(*) from sureselect_all_probe_source;
    count(*)  
    ----------
    5138
  • get the probe sequences from the dbase for all the conserved probes:

    python assembly/get_sureselect_probes_or_cons.py --output all-probes.fasta --all --probes probe.sqlite
  • move that into MS folder in Dropbox:

    mv all-probes.fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers
  • [lastz] this file against several genomes (used --huge option on anoCar2 because of chromo/scaffold number). The output files are spread btw. snp/lastz and matches/lastz directories:

    * mm9
    * hg19
    * anoCar2
    * taeGut1
    * galGal3
    * melGal2
    * croPor1 (really 0.1)
    * monDom5
    * ornAna1
    * xenTro2
    
    python share/run_lastz.py \
        --target /Volumes/Data/Genomes/taeGut1/taeGut1.2bit \
        --query all-probes.fasta --nprocs 8 \
        --output lastz/all-probes-to-taeGut1.lastz
  • check probes for dupes:

    python share/easy_lastz.py \
        --target all-probes.fasta --query all-probes.fasta \
        --identity 85 --output all-to-all.lastz
  • generate probe counts for critters in matches/lastz:

    python other-organisms/get_probes_matches_from_lastz.py ./ \
        all-probes.fasta test.sqlite \
        --dupefile all-to-all.lastz
  • data will reside in data/. split files into appropriate directories - some into snp/ and some into matches/

Primate tree test

  • update genomes by dloading new build, if avaialble
  • run lastz of probes against genomes:

    python align/un_multiple_lastzs.py \
        --output=/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates \
        --probefile=all-probes.fasta \
        --chromolist="hg19","gorGor3","ponAbe2","rheMac2","mm9" \
        --readlist="nomLeu1","panTro3","calJac3","papAnu1","otoGar3"
  • rename all the files to similar names used previously. In ipython:

    [shutil.copy(i,"all-probes-to-{0}".format(os.path.basename(i).split('_')[-1])) for i in glob.glob('*.lastz')]
  • first move SNP-related and initial db-related stuff to its own SNP folder. Put these data in "primates" folder.
  • Then (ran w/ 200 AND 150), build "fake" data sets from the extant genomes that are just like the data we collect from UCE loci:

    for i in {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","calJac3","otoGar3"}; 
        do \
        python share/get_fake_velvet_contigs_from_genomes.py  \
        /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-$i.lastz \
        /Volumes/Data/Genomes/$i/$i.2bit \
        --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
        --fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/$i-150.fasta --flank 150;
        done;
  • I originally ran the above w/ panTro3 and papAnu1, but it failed for panTro3 (some chromo names have 3 parts) and papAnu1 (chromo names have many parts) because they have crazy-ass name structures. Updated code and ran, instead, w/:

    python share/get_fake_velvet_contigs_from_genomes.py  \
            /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-panTro3.lastz \
            /Volumes/Data/Genomes/panTro3/panTro3.2bit \
            --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
            --fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/panTro3-150.fasta \
            --flank 150 --name-components 3
    
    python share/get_fake_velvet_contigs_from_genomes.py  \
        /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-papHam1.lastz \
        /Volumes/Data/Genomes/papHam1/papHam1.2bit \
        --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
        --fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/papHam1-150.fasta \
        --flank 150 --splitchar None
  • make symlinks to fastas in contigs for each of {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","panTro3","calJac3","papAnu1","otoGar3"}:

    ln -s ../fake-velvet-fasta/mm9-200.fasta mus-musculus.fasta
  • re-lastz probes against "contigs" and store group matches in db:

    python assembly/match_contigs_to_probes.py \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-probes.fasta \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db \
        --dupefile ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz
  • run get_match_counts on reptile data to build format file:

    python assembly/get_match_counts.py \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-match-count.config \
        'Primates' \
        --output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out
  • add asterisks to primate-loci-for-fasta.out for each organism
  • turn those data into a giant fasta for alignment:

    python assembly/get_fastas_from_match_counts.py \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
        ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out \
        --output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/primates.fasta
  • align those puppies:

    python align/eqcap_align.py \
        --input primates.fasta --output nexus \
        --species 10 --trim-running --multiprocessing --processors 7
  • within ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/ convert these to phylip so we can run mrAIC on them:

    python align/convert_nexus_to_phylip.py \
        --input nexus/ --output phylip/ --shorten-name --positions '-2,-1'
  • estimate subs models:

    python align/run_mraic.py --input phylip --output aicc
  • remove the locus name from each nexus file:

    python align/remove_locus_name_from_nexus_lines.py \
        --input nexus --output nexus-rename
  • prep the mrbayes file:

    python align/nexus_to_concat.py \
        --models ../output_models.txt \
        --aligns ../nexus-rename/ \
        --concat primate-each-part.nex \
        --mr-bayes --interleave --unlink
  • crank up some ec2. setup mrbayes per: https://gist.github.com/950010, then:

    ec2-run-instances --key ec2-keypair ami-349d645d \
        --instance-type=c1.xlarge --availability-zone=us-east-1b \
        --block-device-mapping '/dev/sda2=ephemeral0' \
        --block-device-mapping '/dev/sda3=ephemeral1' 
  • sync up nexus file:

    rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
        ./ ec2-user@ec2-67-202-26-236.compute-1.amazonaws.com:/mnt/data/primate-rate-variation/
  • after starting in mpi-mode, set ngen = 5,000,000 and run w/ mcmc. run. wait.
  • connect to instance:

    ssh -i ec2-keypair ec2-user@xx-yy-zz.compute-1.amazonaws.com
  • get summary data across nexus (trimmed alignments):

    python align/get_cons_region_align_stats.py \
        --input nexus-rename --output primate-align-stats.csv
  • open up R, then:

    # this gives U-shaped figure
    primates = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/stats/primate-trimmed-align-stats.csv',header=TRUE, sep = ',')
    nobase = primates[primates$greaterthanonediff >0.0,]
    nooutlier = nobase[nobase$greaterthanonediff <= 0.10,]
    ggplot(nooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff))   geom_point(size=3)   scale_colour_gradientn(colour = rainbow(5))   scale_y_continuous('Frequency of variant bases\n')   scale_x_continuous('\nDistance from center of alignment')   theme_bw()   opts(legend.position = "none")
    pdf('name.pdf')
    # remove some gridlines
    p
    g = ggplotGrob(p)
    grid.ls(g)
    grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
    grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
    dev.off()

Bird deep tree from in vitro data

  • unzip data from sequencer
  • clean adapter contamination and trim using sliding window:

    python assembly/process_reads.py \
        --sample-map Illumina_Data/SamplesDirectories.csv Bin_012
  • several tags were parsed incorrectly (not at all). go ahead and re-parse the unknown file to see if we can pull out necessary reads:

    python assembly/reparse_tags.py --input s_5_sequence.txt --tag 'TCTGCC' --output anser-erythropus.fastq
    python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GCATCC' --output zanclostomus-javanicus.fastq
    python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GTAATC' --output nyctibius-grandis.fastq
  • put these in Bin_unk and run through trimming/assembly as above. Then get coverage and read lengths, etc. as above.
  • upload data on individual basis to ec2 (need a machine w/ shitloads of RAM) - ZONE us-east-1:

    ec2-run-instances --key ec2-keypair ami-349d645d --region us-east-1\
        --instance-type=m2.2xlarge \
        --block-device-mapping '/dev/sda2=ephemeral0' \
        --block-device-mapping '/dev/sda3=ephemeral1'
  • built velvet (max_kmer = 96) and assoc. code (bioperl, etc.), then assemble using VelvetOptimiser.pl (see stats folder for each Bin for actual run code):

    ~/src/velvet_1.1.04/contrib/VelvetOptimiser-2.1.7/VelvetOptimiser.pl \
        -s 59 -e 79 -f '-short -fastq genus-species.fastq' \
        -o '-amos_file yes' && tar cvjf ../done/genus-species.tar.bz2 auto_*/
  • assembled reads and zipped up assemblies and contigs as above, downloaded those. generated checksums and checked those and uploaded to webserver
  • for several species, had to tweak contig generation. See readme.md in retry folder
  • determine size of contigs (unzipped):

    for i in contigs/*; do echo $i; faSize $i; done
  • determine kmer coverage of contig, given knowledge of kmer and contig output:

    python assembly/coverage_from_kmers.py --csv True \
        --input Bin_005/contigs/mus-musculus.assembly-retry.contigs.fasta
  • align probes that we used to themselves to make a dupefile:

    python ~alignment/easy_lastz.py --target=probe-subset-2560-synthesized.fasta --query=probe-subset-2560-synthesized.fasta --coverage=80 --identity=80 --output=probe-subset-2560-to-self.fasta

  • match the contigs that we have to the probes that we used using lastz and store the matching results in a database. also write output (lastz files) to a folder:

    python assembly/match_contigs_to_probes.py \
        /Volumes/Data3/lsu/birds/contigs \
        probe-subset-2560-synthesized.fasta \
        /Volumes/Data3/lsu/birds/lastz \
        --dupefile probe-subset-2560-to-self.fasta
    
    anser_erythropus: 1596 (69.63%) uniques of 2292 contigs, 0 dupe probes, 0 dupe probe matches, 45 dupe node matches
    dromaius_novaehollandiae: 1518 (76.40%) uniques of 1987 contigs, 0 dupe probes, 2 dupe probe matches, 24 dupe node matches
    eudromia_elegans: 1484 (77.29%) uniques of 1920 contigs, 0 dupe probes, 1 dupe probe matches, 52 dupe node matches
    gallus_gallus: 1438 (77.65%) uniques of 1852 contigs, 0 dupe probes, 1 dupe probe matches, 30 dupe node matches
    megalaima_virens: 1174 (85.69%) uniques of 1370 contigs, 0 dupe probes, 0 dupe probe matches, 10 dupe node matches
    phalacrocorax_carbo: 1384 (86.45%) uniques of 1601 contigs, 0 dupe probes, 3 dupe probe matches, 10 dupe node matches
    pitta_guajana: 1572 (66.36%) uniques of 2369 contigs, 0 dupe probes, 2 dupe probe matches, 32 dupe node matches
    struthio_camelus: 1591 (75.19%) uniques of 2116 contigs, 0 dupe probes, 2 dupe probe matches, 37 dupe node matches
    urocolius_indicus: 1495 (67.71%) uniques of 2208 contigs, 0 dupe probes, 2 dupe probe matches, 43 dupe node matches
  • screen various groups and/or optimize individuals in groups (in terms of UCEs located) to max the number of UCEs shared across members of the group:

    python assembly/get_match_counts.py \
        /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
        ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
        'Birds - UCE paper'
  • to extend capture data with genomic/outgroup data and build probe data set (genomic outgroup data are indicated in the config file using asterisks):

    [Birds - UCE paper]
    dromaius_novaehollandiae
    struthio_camelus
    eudromia_elegans
    anser_erythropus
    gallus_gallus
    phalacrocorax_carbo
    megalaima_virens
    urocolius_indicus
    pitta_guajana
    anolis_carolinensis*
    
    python assembly/get_match_counts.py \
        /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
        ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
        --extend /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
        'Birds - UCE paper' \
        --output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out
  • turn those data into a giant fasta for alignment:

    python assembly/get_fastas_from_match_counts.py \
        /Volumes/Data3/lsu-seqcap/contig/ \
        /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
        /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out \
        --extend-db /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
        --extend-dir /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/contig/ \
        --output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta
  • align:

    python align/seqcap_align.py \
        --input /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta \
        --output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/nexus/ \
        --species 10 --trim-running --multiprocessing --processors=7
  • convert these to phylip so we can run mrAIC on them:

    python align/convert_nexus_to_phylip.py \
        --input nexus/ --output phylip/ --shorten-name --positions '-2,-1'
  • get subs models:

    python align/run_mraic.py --input phylip --output aicc
  • remove the locus name from each nexus file:

    python align/remove_locus_name_from_nexus_lines.py \
        --input nexus --output nexus-rename
  • prep the mrbayes file:

    python align/nexus_to_concat.py \
        --models output_models.txt --aligns nexus-rename/ \
        --concat mrbayes/deep-bird-each-part.nex --mr-bayes \
        --interleave --unlink
  • crank up some AWS:

    ec2-run-instances --key keypair ami-349d645d \
        --instance-type=c1.xlarge --availability-zone=us-east-1b \
        --block-device-mapping '/dev/sda2=ephemeral0' \
        --block-device-mapping '/dev/sda3=ephemeral1' 
  • login and setup mrbayes:

    ssh -i ec2-keypair ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com
  • use rsync to up/dload data:

    rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
        ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com:/mnt/data/birds/ ./
  • get variant stats for birds:

    python align/get_cons_region_align_stats.py --input nexus-rename --output bird-align-stats.csv
  • make figures:

    # this gives U-shaped figure
    birds = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/stats/bird-trimmed-align-stats.csv', header = TRUE, sep = ',')
    birdnobase = birds[birds$greaterthanonediff >0.0,]
    birdnooutlier = birdnobase[birdnobase$greaterthanonediff <= 0.10,]
    ggplot(birdnooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff))   geom_point(size=3, alpha = 5/10)   scale_colour_gradient(low = "red", high = "blue")   scale_y_continuous('Frequency of variable positions\n')   scale_x_continuous('\nDistance from center of alignment')   theme_bw()   opts(legend.position = "none")
    pdf('bird-variable-positions.pdf')
    # remove some gridlines
    p
    g = ggplotGrob(p)
    grid.ls(g)
    grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
    grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
    dev.off()

Enrichment stats (for birds)

  • extract AMOS file from velvet tarball and store in tmp:

    tar -jtvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2 
    tar -xjvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2 auto_data_73/velvet_asm.afg
  • convert AMOS afg file to AMOS bnk file:

    ~/Source/amos-3.0.0/src/Bank/bank-transact -m megalaima-virens.afg -b megalaima-virens.bnk -c
  • output coverage file, so we can get iids:

    ~/Source/amos-3.0.0/src/Utils/cvgStat -b megalaima-virens.bnk > megalaima-virens.cvg
  • get loci from query of sqlite data:

    sqlite> .output megalaima-virens.loci
    sqlite> select megalaima_virens from match_map where megalaima_virens IS NOT NULL;
    sqlite> .output stdout
  • parse that bad-boy and the cvg file to get iids:

    mv /Volumes/Data3/lsu-seqcap/lastz/megalaima-virens.loci ~/Tmp/bird-coverage/megalaima-virens
    
    python assembly/get_iids_from_cvg_stat.py 
         \ megalaima-virens.cvg megalaima-virens.loci --output megalaima-virens.iid
  • now get depth of coverage of locus-specific reads:

    ~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d -I megalaima-virens.iid
  • get depth of coverage of all reads in contigs:

    ~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d
  • in order to derive the average contig length from loci matching UCEs (there is another script for only UCE loci present in all orgs - it parses the fasta we input to alignment rather than the assembled contigs from velvet):

    python assembly/get_contig_lengths_for_all_uce_loci.py \
        /Volumes/Data3/lsu-seqcap/contig \
        /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
        ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
        "Birds - UCE paper"

Additional SNP Analyses

  • screen lastz files for dupes, starts and ends, and generate velvet-like contigs, so we can insert these data into the same pipeline we have for the seqcap data:

    python share/get_fake_velvet_contigs_from_genomes.py \
        all-probes-to-hg19.lastz \
        /Volumes/Data/Genomes/hg19/hg19.2bit \
        fake-velvet-fasta/taeGut1-all-probe-fake-velvet.fasta
        --flank 50
  • Also get BED files for flanks w/ flank dist @ 50, 100, 250:

    python share/get_fake_velvet_contigs_from_genomes.py \
        all-probes-to-hg19.lastz /Volumes/Data/Genomes/hg19/hg19.2bit \
        --bed fake-velvet-bed/fake-velvet-all-probe-hg19-50.bed \
        --flank 50
  • Or, do both at once:

    python share/get_fake_velvet_contigs_from_genomes.py  \
        data/lastz/all-probes-to-hg19.lastz \
        /Volumes/Data/Genomes/hg19/hg19.2bit \
        --fasta data/fake-velvet-fasta/fake-velvet-all-probe-hg19-500.fasta \
        --bed data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
        --flank 500
  • using UCSC browser, intersect those with dbSNP128 results (for mm9 and hg19) and bgiSNP results (for galGal3):

    results => snp/
  • associate snps w/ the correct UCEs (using coordinate overlaps, since UCSC doesn't return this):

    python snps/find_snps_in_bed_interval.py \
        data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
        data/snp/dbSNP132-intersect-all-probe-hg19-500.bed \
        --output data/snp/dbSNP132-to-hg19-uce-500.csv
  • upload SNP RSs to dbSNP to pull down heterozygosity data
  • from hg19 snps, get heterozygosity data:

    python snps/parse_dbsnp_xml.py \
        hg19-dbSNP-data > variation-data-hg19-dbSNP.csv
  • do same in mouse:

    python snps/parse_dbsnp_xml.py \
        mm9-dbSNP-data > variation-data-mm9-dbSNP.csv

Plotting additional figures

  • R code (didn't use these boxplots):

    > data = read.table('contig-lengths-by-probe.csv', header = TRUE, sep = ',')
    > ggplot(data, aes(factor(probes), contiglength))   geom_boxplot()
    > ggplot(data, aes(factor(probes), coverage))   geom_boxplot()
    > p1 = ggplot(data, aes(factor(probes), coverage))   geom_boxplot()
    > ggsave('coverage-by-probes.pdf')
    > p2 = ggplot(data, aes(factor(probes), contiglength))   geom_boxplot()
    > ggsave('contiglength-by-probes.pdf')
    > f1 = ggplot(data, aes(factor(probes), contiglength))   geom_boxplot()   facet_wrap(~ organism, ncol = 6)
    > ggsave('contiglength-by-probes-facet.pdf')
    > f2 = ggplot(data, aes(factor(probes), coverage))   geom_boxplot()   facet_wrap(~ organism, ncol = 6)
    > ggsave('coverage-by-probes-facet.pdf')
  • using intersection BED data, prep CSV file associating SNPs w/ UCE names and coords associate variation data with figure with:

    python snps/get_dbsnp_variability_for_all_uces.py \
        dbSNP132-to-hg19-uce-200.csv \
        ~/Tmp/dbsnp/hg19/hg19-dbSNP-data > dbSNP132-to-hg19-uce-200-ready-to-plot.csv
  • get the SNP variability data for SNPS within 200 bp of UCEs:

    python snps/get_dbsnp_freq_stats.py dbSNP132-to-hg19-uce-200.csv \
        ~/Tmp/dbsnp/hg19/hg19-dbSNP-data \
        --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
        --output test1.csv --output2 test2.csv
  • get that into R and:

    raw = read.table('test2.csv', sep = ',', header = TRUE)
  • convert freqs to floats:

    raw$freq = as.numeric(as.character(raw$freq))
    # add pos column
    raw$pos = raw$snp.start - (raw$uce.start   raw$uce.end)/2
  • plot out just the allele freqs for the loci:

    raw1 = subset(raw,freq > 0.0)
    raw1 = subset(raw1, X1000gvalidated == 'true') #4653 loci
    p = ggplot(raw1, aes(x=pos,y=freq,color=freq)) + geom_point(size=3, alpha = 5/10) 
        + scale_colour_gradient(low = "red", high = "blue") \
        + scale_y_continuous('Minor allele frequency (MAF)\n') \
        + scale_x_continuous('\nDistance from center of UCE locus') \
        + theme_bw() + opts(legend.position = "none")
    
    np = read.table('test1.csv', sep = ',', header = TRUE)
  • running average (window 25) of the number of SNPs per base across all UCEs:

    p = ggplot(subset(np, datatype == 'running'), aes(x=pos,y=avg,color=avg)) \
        + geom_point(size=3, alpha = 5/10) \
        + scale_colour_gradient(low = "red", high = "blue") \
        + scale_y_continuous('Average number (25 bp window) of 1000 Genomes Project validated SNPs\n') \
        + scale_x_continuous('\nDistance from center of UCE locus') \
        + theme_bw() \
        + opts(legend.position = "none")
  • running average (window 25) of SNP heterozygosity across all UCEs:

    p = ggplot(subset(np, datatype == 'running_hetero'), aes(x=pos,y=avg,color=avg)) \
        + geom_point(size=3, alpha = 5/10)
        + scale_colour_gradient(low = "red", high = "blue") \
        + scale_y_continuous('Mean number of 1000 Genomes Project validated SNPs\n') \ 
        + scale_x_continuous('\nDistance from center of UCE locus') \
        + theme_bw() \ 
        + opts(legend.position = "none")
  • running average (window 25) of SNP heterozygosity across all UCEs:

    p = ggplot(subset(np, datatype == 'mean_hetero'), aes(x=pos,y=avg,color=avg)) \
        + geom_point(size=3, alpha = 5/10) \
        + scale_colour_gradient(low = "red", high = "blue") \
        + scale_y_continuous('Mean Minor Allele Frequency (MAF) of 1000 Genomes Project validated SNPs\n') \
        + scale_x_continuous('\nDistance from center of UCE locus')   theme_bw()   opts(legend.position = "none")
    
    pdf('uce-running-average-1000G-snps-trunc-at-225.pdf')
  • remove some gridlines:

    p
    g = ggplotGrob(p)
    grid.ls(g)
    grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
    grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
    dev.off()

UCE Capture Data Figures

  • summarize data:

    python get_hits_misses_length_and_gc.py
  • import data:

    d = read.table('gc-length-species-matches.csv', sep = ',', header = TRUE)
  • order True/False correctly:

    d$detected =  factor(d$detected, c("TRUE", "FALSE"))
  • uce_gc_content_v_detection_facet_by_taxon.pdf:

    ggplot(d, aes(x = detected, y = gc)) +
    geom_jitter(aes(colour = detected), alpha = 0.4) +
    geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
    scale_colour_manual(values = c("#377EB8","#E41A1C")) +
    scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus GC Content') +
    opts(legend.position = "none")
  • uce_tm_v_detection_facet_by_taxon.pdf:

    ggplot(d, aes(x = detected, y = tm)) +
    geom_jitter(aes(colour = detected), alpha = 0.4) +
    geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
    scale_colour_manual(values = c("#377EB8","#E41A1C")) +
    scale_x_discrete('UCE Locus Detected') + scale_y_continuous('Average (by Locus) Probe Tm') +
    opts(legend.position = "none")
  • uce_length_v_detection_facet_by_taxon.pdf:

    ggplot(d, aes(x = detected, y = length)) +
    geom_jitter(aes(colour = detected), alpha = 0.4) +
    geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
    scale_colour_manual(values = c("#377EB8","#E41A1C")) +
    scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus Length') +
    opts(legend.position = "none")
  • uce_bases_added_v_detection_facet_by_taxon.pdf:

    ggplot(d, aes(x = detected, y = added)) +
    geom_jitter(aes(colour = detected), alpha = 0.4) +
    geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
    scale_colour_manual(values = c("#377EB8","#E41A1C")) +
    scale_x_discrete('UCE Locus Detected') +
    scale_y_continuous('Bases Added to Probes Targeting UCE') +
    opts(legend.position = "none")
  • uce_gc_v_target_uce_length.pdf:

    ggplot(d, aes(y = gc, x = length)) + 
    geom_point(aes(colour = detected), alpha=0.5, size = 2) + 
    facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') + 
    scale_y_continuous('Target UCE GC Content') + 
    scale_colour_manual(values = c("#377EB8","#E41A1C"))
  • uce_average_probe_tm_v_target_length.pdf:

    ggplot(d, aes(y = tm, x = length)) + 
    geom_point(aes(colour = detected), alpha=0.5, size = 2) + 
    facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') + 
    scale_y_continuous('Average (by Locus) Probe Tm') + 
    scale_colour_manual(values = c("#377EB8","#E41A1C"))
  • uce_target_length_v_probes_targeting.pdf:

    ggplot(d, aes(x = count, y = length)) +
    geom_jitter(aes(colour = detected), alpha = 0.4) +
    facet_wrap(~taxon) +
    scale_colour_manual(values = c("#377EB8","#E41A1C")) +
    scale_x_continuous('Number of Probes Targeting UCE') +
    scale_y_continuous('Target UCE Length')
  • import contig data:

    d2 = read.table('../archive/birds-contig-lengths-by-probe-filtered-birds.csv', header = TRUE, sep = ',')
  • uce_contig_length_by_probe_count.pdf:

    ggplot(d2, aes(x = factor(probes), y = contiglength)) +
    geom_jitter(aes(colour = factor(probes))) +
    geom_boxplot(alpha = 0.3) +
    scale_colour_brewer(palette="Set1") +
    opts(legend.position = "none") +
    scale_x_discrete('Number of Probes Targeting UCE') +
    scale_y_continuous('Contig Length of UCE-associated Locus)')
  • uce_sequencing_depth_by_probe_count.pdf:

    ggplot(d2, aes(x = factor(probes), y = coverage)) +
    geom_jitter(aes(colour = factor(probes))) +
    geom_boxplot(alpha = 0.3) +
    scale_colour_brewer(palette="Set1") +
    opts(legend.position = "none") +
    scale_x_discrete('Number of Probes Targeting UCE') +
    scale_y_continuous('Sequencing Depth of UCE-associated Locus')
  • uce_contig_length_by_probe_count_by_taxon.pdf:

    ggplot(d2, aes(x = factor(probes), y = contiglength)) +
    geom_jitter(aes(colour = factor(probes))) +
    geom_boxplot(alpha = 0.3) +
    facet_wrap(~organism) +
    scale_colour_brewer(palette="Set1") +
    opts(legend.position = "none") +
    scale_x_discrete('Number of Probes Targeting UCE') +
    scale_y_continuous('Contig Length of UCE-associated Locus')
  • uce_loss_of_loci_with_randomly_selected_groups.pdf:

    for i in {1..7}; do python assembly/get_match_counts.py \
        ../archive/birds-probe-matches.sqlite \
        ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run1/faircloth/match-count.config \
        --extend /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
        'Birds - UCE paper' --optimize --random --sample-size $i \
        --samples 1001 --keep-counts --output sample$i-sim1000.txt; \
        echo "\n";done
    
    for i in {1..7}; do cat sample$i-sim1000.txt >> all-sample-sim1000.txt; done
    
    ggplot(a, aes(x = factor(taxa), y = probes)) +
    geom_jitter(aes(colour = factor(taxa))) +
    geom_boxplot(alpha = 0.3) +
    scale_colour_brewer(palette="Set1") +
    scale_x_discrete('Number of Randomly Selected Group Members') +
    scale_y_continuous('Number of UCE Loci Recovered From All Group Members', limits = c(1000,2300)) +
    opts(legend.position = "none")
  • uce_loci_missed_by_taxon_count.pdf:

    new = d$uce[d$detected == FALSE]
    test = table(new)
    test2 = as.data.frame(test)
    
    ggplot(test2, aes(factor(Freq), fill = factor(Freq))) +
    geom_bar(colour="black") +
    scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A", 
    "#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6",
    "#6A3D9A")) + scale_x_discrete('Number of Taxa') +
    scale_y_continuous('Failures to Detect Targeted UCEs')
    
    
    ggsave('uce_loci_missed_by_taxon_count.pdf')
  • uce_coverage_by_taxon.pdf:

    ggplot(data = d, aes(x = uce, y = coverage, group = organism)) +
    geom_line(aes(colour = organism)) + facet_wrap(~organism) +
    scale_x_discrete(breaks=NA) +
    scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
    "#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A")) +
    scale_x_discrete(breaks=NA, 'UCE-associated Locus (Name Not Shown)') +
    scale_y_continuous('Coverage') + opts(legend.position = "none")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment