brantfaircloth/1-uces-in-many-critters.rst Secret

## 1-uces-in-many-critters.rst

      
    Raw
  

              1-uces-in-many-critters.rst
            
          
    Identifying UCEs in other organisms


get count of all UCEs identified that were non-dupes:
sqlite> select count(*) from cons, blast where blast.id = cons.id and blast.matches = 1 and duplicate = 0;
count(*)
----------
5599


get distance btw. UCE loci in set above w/:
python other-organisms/get_distances_in_chicken.py


create a new table in the probe dbase for the ids and count of all
"conservation" sureselect probes:
sqlite> create table sureselect_all_probe_counts as select seq, id, count(seq) as cnt,
sqlite> data_source from sureselect where data_source = 'conservation' group by seq order by id;

sqlite> select count(*) from sureselect_all_probe_source;
count(*)
----------
5138


get the probe sequences from the dbase for all the conserved probes:
python assembly/get_sureselect_probes_or_cons.py --output all-probes.fasta --all --probes probe.sqlite


move that into MS folder in Dropbox:
mv all-probes.fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers


[lastz] this file against several genomes (used --huge option on anoCar2
because of chromo/scaffold number).  The output files are spread btw.
snp/lastz and matches/lastz directories:
* mm9
* hg19
* anoCar2
* taeGut1
* galGal3
* melGal2
* croPor1 (really 0.1)
* monDom5
* ornAna1
* xenTro2

python share/run_lastz.py \
    --target /Volumes/Data/Genomes/taeGut1/taeGut1.2bit \
    --query all-probes.fasta --nprocs 8 \
    --output lastz/all-probes-to-taeGut1.lastz


check probes for dupes:
python share/easy_lastz.py \
    --target all-probes.fasta --query all-probes.fasta \
    --identity 85 --output all-to-all.lastz


generate probe counts for critters in matches/lastz:
python other-organisms/get_probes_matches_from_lastz.py ./ \
    all-probes.fasta test.sqlite \
    --dupefile all-to-all.lastz


data will reside in data/.  split files into appropriate directories - some
into snp/ and some into matches/


## 2-uces-in-primates.rst

      
    Raw
  

              2-uces-in-primates.rst
            
          
    Primate tree test


update genomes by dloading new build, if avaialble

run lastz of probes against genomes:
python align/un_multiple_lastzs.py \
    --output=/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates \
    --probefile=all-probes.fasta \
    --chromolist="hg19","gorGor3","ponAbe2","rheMac2","mm9" \
    --readlist="nomLeu1","panTro3","calJac3","papAnu1","otoGar3"


rename all the files to similar names used previously.  In ipython:
[shutil.copy(i,"all-probes-to-{0}".format(os.path.basename(i).split('_')[-1])) for i in glob.glob('*.lastz')]


first move SNP-related and initial db-related stuff to its own SNP folder.
Put these data in "primates" folder.

Then (ran w/ 200 AND 150), build "fake" data sets from the extant genomes
that are just like the data we collect from UCE loci:
for i in {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","calJac3","otoGar3"};
    do \
    python share/get_fake_velvet_contigs_from_genomes.py  \
    /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-$i.lastz \
    /Volumes/Data/Genomes/$i/$i.2bit \
    --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
    --fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/$i-150.fasta --flank 150;
    done;


I originally ran the above w/ panTro3 and papAnu1, but it failed for panTro3
(some chromo names have 3 parts) and papAnu1 (chromo names have many parts)
because they have crazy-ass name structures.  Updated code and ran, instead,
w/:
python share/get_fake_velvet_contigs_from_genomes.py  \
        /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-panTro3.lastz \
        /Volumes/Data/Genomes/panTro3/panTro3.2bit \
        --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
        --fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/panTro3-150.fasta \
        --flank 150 --name-components 3

python share/get_fake_velvet_contigs_from_genomes.py  \
    /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-papHam1.lastz \
    /Volumes/Data/Genomes/papHam1/papHam1.2bit \
    --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
    --fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/papHam1-150.fasta \
    --flank 150 --splitchar None


make symlinks to fastas in contigs for each of
{"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","panTro3","calJac3","papAnu1","otoGar3"}:
ln -s ../fake-velvet-fasta/mm9-200.fasta mus-musculus.fasta


re-lastz probes against "contigs" and store group matches in db:
python assembly/match_contigs_to_probes.py \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-probes.fasta \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db \
    --dupefile ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz


run get_match_counts on reptile data to build format file:
python assembly/get_match_counts.py \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-match-count.config \
    'Primates' \
    --output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out


add asterisks to primate-loci-for-fasta.out for each organism

turn those data into a giant fasta for alignment:
python assembly/get_fastas_from_match_counts.py \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
    ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out \
    --output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/primates.fasta


align those puppies:
python align/eqcap_align.py \
    --input primates.fasta --output nexus \
    --species 10 --trim-running --multiprocessing --processors 7


within
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/
convert these to phylip so we can run mrAIC on them:
python align/convert_nexus_to_phylip.py \
    --input nexus/ --output phylip/ --shorten-name --positions '-2,-1'


estimate subs models:
python align/run_mraic.py --input phylip --output aicc


remove the locus name from each nexus file:
python align/remove_locus_name_from_nexus_lines.py \
    --input nexus --output nexus-rename


prep the mrbayes file:
python align/nexus_to_concat.py \
    --models ../output_models.txt \
    --aligns ../nexus-rename/ \
    --concat primate-each-part.nex \
    --mr-bayes --interleave --unlink


crank up some ec2.  setup mrbayes per:  https://gist.github.com/950010, then:
ec2-run-instances --key ec2-keypair ami-349d645d \
    --instance-type=c1.xlarge --availability-zone=us-east-1b \
    --block-device-mapping '/dev/sda2=ephemeral0' \
    --block-device-mapping '/dev/sda3=ephemeral1'


sync up nexus file:
rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
    ./ ec2-user@ec2-67-202-26-236.compute-1.amazonaws.com:/mnt/data/primate-rate-variation/


after starting in mpi-mode, set ngen = 5,000,000 and run w/ mcmc.  run. wait.

connect to instance:
ssh -i ec2-keypair ec2-user@xx-yy-zz.compute-1.amazonaws.com


get summary data across nexus (trimmed alignments):
python align/get_cons_region_align_stats.py \
    --input nexus-rename --output primate-align-stats.csv


open up R, then:
# this gives U-shaped figure
primates = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/stats/primate-trimmed-align-stats.csv',header=TRUE, sep = ',')
nobase = primates[primates$greaterthanonediff >0.0,]
nooutlier = nobase[nobase$greaterthanonediff <= 0.10,]
ggplot(nooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff))   geom_point(size=3)   scale_colour_gradientn(colour = rainbow(5))   scale_y_continuous('Frequency of variant bases\n')   scale_x_continuous('\nDistance from center of alignment')   theme_bw()   opts(legend.position = "none")
pdf('name.pdf')
# remove some gridlines
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()


## 3-uces-in-deep-bird.rst

      
    Raw
  

              3-uces-in-deep-bird.rst
            
          
Bird deep tree from in vitro data


unzip data from sequencer

clean adapter contamination and trim using sliding window:
python assembly/process_reads.py \
    --sample-map Illumina_Data/SamplesDirectories.csv Bin_012


several tags were parsed incorrectly (not at all).  go ahead and re-parse the
unknown file to see if we can pull out necessary reads:
python assembly/reparse_tags.py --input s_5_sequence.txt --tag 'TCTGCC' --output anser-erythropus.fastq
python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GCATCC' --output zanclostomus-javanicus.fastq
python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GTAATC' --output nyctibius-grandis.fastq


put these in Bin_unk and run through trimming/assembly as above.  Then get
coverage and read lengths, etc. as above.

upload data on individual basis to ec2 (need a machine w/ shitloads of RAM) - ZONE us-east-1:
ec2-run-instances --key ec2-keypair ami-349d645d --region us-east-1\
    --instance-type=m2.2xlarge \
    --block-device-mapping '/dev/sda2=ephemeral0' \
    --block-device-mapping '/dev/sda3=ephemeral1'


built velvet (max_kmer = 96) and assoc. code (bioperl, etc.), then assemble
using VelvetOptimiser.pl (see stats folder for each Bin for actual run code):
~/src/velvet_1.1.04/contrib/VelvetOptimiser-2.1.7/VelvetOptimiser.pl \
    -s 59 -e 79 -f '-short -fastq genus-species.fastq' \
    -o '-amos_file yes' && tar cvjf ../done/genus-species.tar.bz2 auto_*/


assembled reads and zipped up assemblies and contigs as above, downloaded those.
generated checksums and checked those and uploaded to webserver

for several species, had to tweak contig generation.  See readme.md in
retry folder

determine size of contigs (unzipped):
for i in contigs/*; do echo $i; faSize $i; done


determine kmer coverage of contig, given knowledge of kmer and contig output:
python assembly/coverage_from_kmers.py --csv True \
    --input Bin_005/contigs/mus-musculus.assembly-retry.contigs.fasta


align probes that we used to themselves to make a dupefile:

python ~alignment/easy_lastz.py 
--target=probe-subset-2560-synthesized.fasta --query=probe-subset-2560-synthesized.fasta --coverage=80 --identity=80 --output=probe-subset-2560-to-self.fasta


match the contigs that we have to the probes that we used using lastz and
store the matching results in a database.  also write output (lastz files) to
a folder:
python assembly/match_contigs_to_probes.py \
    /Volumes/Data3/lsu/birds/contigs \
    probe-subset-2560-synthesized.fasta \
    /Volumes/Data3/lsu/birds/lastz \
    --dupefile probe-subset-2560-to-self.fasta

anser_erythropus: 1596 (69.63%) uniques of 2292 contigs, 0 dupe probes, 0 dupe probe matches, 45 dupe node matches
dromaius_novaehollandiae: 1518 (76.40%) uniques of 1987 contigs, 0 dupe probes, 2 dupe probe matches, 24 dupe node matches
eudromia_elegans: 1484 (77.29%) uniques of 1920 contigs, 0 dupe probes, 1 dupe probe matches, 52 dupe node matches
gallus_gallus: 1438 (77.65%) uniques of 1852 contigs, 0 dupe probes, 1 dupe probe matches, 30 dupe node matches
megalaima_virens: 1174 (85.69%) uniques of 1370 contigs, 0 dupe probes, 0 dupe probe matches, 10 dupe node matches
phalacrocorax_carbo: 1384 (86.45%) uniques of 1601 contigs, 0 dupe probes, 3 dupe probe matches, 10 dupe node matches
pitta_guajana: 1572 (66.36%) uniques of 2369 contigs, 0 dupe probes, 2 dupe probe matches, 32 dupe node matches
struthio_camelus: 1591 (75.19%) uniques of 2116 contigs, 0 dupe probes, 2 dupe probe matches, 37 dupe node matches
urocolius_indicus: 1495 (67.71%) uniques of 2208 contigs, 0 dupe probes, 2 dupe probe matches, 43 dupe node matches


screen various groups and/or optimize individuals in groups (in terms of UCEs
located) to max the number of UCEs shared across members of the group:
python assembly/get_match_counts.py \
    /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
    ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
    'Birds - UCE paper'


to extend capture data with genomic/outgroup data and build probe data set
(genomic outgroup data are indicated in the config file using asterisks):
[Birds - UCE paper]
dromaius_novaehollandiae
struthio_camelus
eudromia_elegans
anser_erythropus
gallus_gallus
phalacrocorax_carbo
megalaima_virens
urocolius_indicus
pitta_guajana
anolis_carolinensis*

python assembly/get_match_counts.py \
    /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
    ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
    --extend /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
    'Birds - UCE paper' \
    --output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out


turn those data into a giant fasta for alignment:
python assembly/get_fastas_from_match_counts.py \
    /Volumes/Data3/lsu-seqcap/contig/ \
    /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
    /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out \
    --extend-db /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
    --extend-dir /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/contig/ \
    --output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta


align:
python align/seqcap_align.py \
    --input /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta \
    --output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/nexus/ \
    --species 10 --trim-running --multiprocessing --processors=7


convert these to phylip so we can run mrAIC on them:
python align/convert_nexus_to_phylip.py \
    --input nexus/ --output phylip/ --shorten-name --positions '-2,-1'


get subs models:
python align/run_mraic.py --input phylip --output aicc


remove the locus name from each nexus file:
python align/remove_locus_name_from_nexus_lines.py \
    --input nexus --output nexus-rename


prep the mrbayes file:
python align/nexus_to_concat.py \
    --models output_models.txt --aligns nexus-rename/ \
    --concat mrbayes/deep-bird-each-part.nex --mr-bayes \
    --interleave --unlink


crank up some AWS:
ec2-run-instances --key keypair ami-349d645d \
    --instance-type=c1.xlarge --availability-zone=us-east-1b \
    --block-device-mapping '/dev/sda2=ephemeral0' \
    --block-device-mapping '/dev/sda3=ephemeral1'


login and setup mrbayes:
ssh -i ec2-keypair ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com


use rsync to up/dload data:
rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
    ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com:/mnt/data/birds/ ./


get variant stats for birds:
python align/get_cons_region_align_stats.py --input nexus-rename --output bird-align-stats.csv


make figures:
# this gives U-shaped figure
birds = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/stats/bird-trimmed-align-stats.csv', header = TRUE, sep = ',')
birdnobase = birds[birds$greaterthanonediff >0.0,]
birdnooutlier = birdnobase[birdnobase$greaterthanonediff <= 0.10,]
ggplot(birdnooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff))   geom_point(size=3, alpha = 5/10)   scale_colour_gradient(low = "red", high = "blue")   scale_y_continuous('Frequency of variable positions\n')   scale_x_continuous('\nDistance from center of alignment')   theme_bw()   opts(legend.position = "none")
pdf('bird-variable-positions.pdf')
# remove some gridlines
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()


Enrichment stats (for birds)


extract AMOS file from velvet tarball and store in tmp:
tar -jtvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2
tar -xjvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2 auto_data_73/velvet_asm.afg


convert AMOS afg file to AMOS bnk file:
~/Source/amos-3.0.0/src/Bank/bank-transact -m megalaima-virens.afg -b megalaima-virens.bnk -c


output coverage file, so we can get iids:
~/Source/amos-3.0.0/src/Utils/cvgStat -b megalaima-virens.bnk > megalaima-virens.cvg


get loci from query of sqlite data:
sqlite> .output megalaima-virens.loci
sqlite> select megalaima_virens from match_map where megalaima_virens IS NOT NULL;
sqlite> .output stdout


parse that bad-boy and the cvg file to get iids:
mv /Volumes/Data3/lsu-seqcap/lastz/megalaima-virens.loci ~/Tmp/bird-coverage/megalaima-virens

python assembly/get_iids_from_cvg_stat.py
     \ megalaima-virens.cvg megalaima-virens.loci --output megalaima-virens.iid


now get depth of coverage of locus-specific reads:
~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d -I megalaima-virens.iid


get depth of coverage of all reads in contigs:
~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d


in order to derive the average contig length from loci matching UCEs (there
is another script for only UCE loci present in all orgs - it parses the fasta
we input to alignment rather than the assembled contigs from velvet):
python assembly/get_contig_lengths_for_all_uce_loci.py \
    /Volumes/Data3/lsu-seqcap/contig \
    /Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
    ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
    "Birds - UCE paper"


## 4-additional-snp-analyses.rst

      
    Raw
  

              4-additional-snp-analyses.rst
            
          
    Additional SNP Analyses


screen lastz files for dupes, starts and ends, and generate velvet-like
contigs, so we can insert these data into the same pipeline we have for the
seqcap data:
python share/get_fake_velvet_contigs_from_genomes.py \
    all-probes-to-hg19.lastz \
    /Volumes/Data/Genomes/hg19/hg19.2bit \
    fake-velvet-fasta/taeGut1-all-probe-fake-velvet.fasta
    --flank 50


Also get BED files for flanks w/ flank dist @ 50, 100, 250:
python share/get_fake_velvet_contigs_from_genomes.py \
    all-probes-to-hg19.lastz /Volumes/Data/Genomes/hg19/hg19.2bit \
    --bed fake-velvet-bed/fake-velvet-all-probe-hg19-50.bed \
    --flank 50


Or, do both at once:
python share/get_fake_velvet_contigs_from_genomes.py  \
    data/lastz/all-probes-to-hg19.lastz \
    /Volumes/Data/Genomes/hg19/hg19.2bit \
    --fasta data/fake-velvet-fasta/fake-velvet-all-probe-hg19-500.fasta \
    --bed data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
    --flank 500


using UCSC browser, intersect those with dbSNP128 results (for mm9 and hg19)
and bgiSNP results (for galGal3):
results => snp/


associate snps w/ the correct UCEs (using coordinate overlaps, since UCSC
doesn't return this):
python snps/find_snps_in_bed_interval.py \
    data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
    data/snp/dbSNP132-intersect-all-probe-hg19-500.bed \
    --output data/snp/dbSNP132-to-hg19-uce-500.csv


upload SNP RSs to dbSNP to pull down heterozygosity data

from hg19 snps, get heterozygosity data:
python snps/parse_dbsnp_xml.py \
    hg19-dbSNP-data > variation-data-hg19-dbSNP.csv


do same in mouse:
python snps/parse_dbsnp_xml.py \
    mm9-dbSNP-data > variation-data-mm9-dbSNP.csv


## 5-additional-figure-data-manipulations.rst

      
    Raw
  

              5-additional-figure-data-manipulations.rst
            
          
    Plotting additional figures


R code (didn't use these boxplots):
> data = read.table('contig-lengths-by-probe.csv', header = TRUE, sep = ',')
> ggplot(data, aes(factor(probes), contiglength))   geom_boxplot()
> ggplot(data, aes(factor(probes), coverage))   geom_boxplot()
> p1 = ggplot(data, aes(factor(probes), coverage))   geom_boxplot()
> ggsave('coverage-by-probes.pdf')
> p2 = ggplot(data, aes(factor(probes), contiglength))   geom_boxplot()
> ggsave('contiglength-by-probes.pdf')
> f1 = ggplot(data, aes(factor(probes), contiglength))   geom_boxplot()   facet_wrap(~ organism, ncol = 6)
> ggsave('contiglength-by-probes-facet.pdf')
> f2 = ggplot(data, aes(factor(probes), coverage))   geom_boxplot()   facet_wrap(~ organism, ncol = 6)
> ggsave('coverage-by-probes-facet.pdf')


using intersection BED data, prep CSV file associating SNPs w/ UCE names and
coords associate variation data with figure with:
python snps/get_dbsnp_variability_for_all_uces.py \
    dbSNP132-to-hg19-uce-200.csv \
    ~/Tmp/dbsnp/hg19/hg19-dbSNP-data > dbSNP132-to-hg19-uce-200-ready-to-plot.csv


get the SNP variability data for SNPS within 200 bp of UCEs:
python snps/get_dbsnp_freq_stats.py dbSNP132-to-hg19-uce-200.csv \
    ~/Tmp/dbsnp/hg19/hg19-dbSNP-data \
    --dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
    --output test1.csv --output2 test2.csv


get that into R and:
raw = read.table('test2.csv', sep = ',', header = TRUE)


convert freqs to floats:
raw$freq = as.numeric(as.character(raw$freq))
# add pos column
raw$pos = raw$snp.start - (raw$uce.start   raw$uce.end)/2


plot out just the allele freqs for the loci:
raw1 = subset(raw,freq > 0.0)
raw1 = subset(raw1, X1000gvalidated == 'true') #4653 loci
p = ggplot(raw1, aes(x=pos,y=freq,color=freq)) + geom_point(size=3, alpha = 5/10)
    + scale_colour_gradient(low = "red", high = "blue") \
    + scale_y_continuous('Minor allele frequency (MAF)\n') \
    + scale_x_continuous('\nDistance from center of UCE locus') \
    + theme_bw() + opts(legend.position = "none")

np = read.table('test1.csv', sep = ',', header = TRUE)


running average (window 25) of the number of SNPs per base across all UCEs:
p = ggplot(subset(np, datatype == 'running'), aes(x=pos,y=avg,color=avg)) \
    + geom_point(size=3, alpha = 5/10) \
    + scale_colour_gradient(low = "red", high = "blue") \
    + scale_y_continuous('Average number (25 bp window) of 1000 Genomes Project validated SNPs\n') \
    + scale_x_continuous('\nDistance from center of UCE locus') \
    + theme_bw() \
    + opts(legend.position = "none")


running average (window 25) of SNP heterozygosity across all UCEs:
p = ggplot(subset(np, datatype == 'running_hetero'), aes(x=pos,y=avg,color=avg)) \
    + geom_point(size=3, alpha = 5/10)
    + scale_colour_gradient(low = "red", high = "blue") \
    + scale_y_continuous('Mean number of 1000 Genomes Project validated SNPs\n') \
    + scale_x_continuous('\nDistance from center of UCE locus') \
    + theme_bw() \
    + opts(legend.position = "none")


running average (window 25) of SNP heterozygosity across all UCEs:
p = ggplot(subset(np, datatype == 'mean_hetero'), aes(x=pos,y=avg,color=avg)) \
    + geom_point(size=3, alpha = 5/10) \
    + scale_colour_gradient(low = "red", high = "blue") \
    + scale_y_continuous('Mean Minor Allele Frequency (MAF) of 1000 Genomes Project validated SNPs\n') \
    + scale_x_continuous('\nDistance from center of UCE locus')   theme_bw()   opts(legend.position = "none")

pdf('uce-running-average-1000G-snps-trunc-at-225.pdf')


remove some gridlines:
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()


UCE Capture Data Figures


summarize data:
python get_hits_misses_length_and_gc.py


import data:
d = read.table('gc-length-species-matches.csv', sep = ',', header = TRUE)


order True/False correctly:
d$detected =  factor(d$detected, c("TRUE", "FALSE"))


uce_gc_content_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = gc)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus GC Content') +
opts(legend.position = "none")


uce_tm_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = tm)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('Average (by Locus) Probe Tm') +
opts(legend.position = "none")


uce_length_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = length)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus Length') +
opts(legend.position = "none")


uce_bases_added_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = added)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') +
scale_y_continuous('Bases Added to Probes Targeting UCE') +
opts(legend.position = "none")


uce_gc_v_target_uce_length.pdf:
ggplot(d, aes(y = gc, x = length)) +
geom_point(aes(colour = detected), alpha=0.5, size = 2) +
facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') +
scale_y_continuous('Target UCE GC Content') +
scale_colour_manual(values = c("#377EB8","#E41A1C"))


uce_average_probe_tm_v_target_length.pdf:
ggplot(d, aes(y = tm, x = length)) +
geom_point(aes(colour = detected), alpha=0.5, size = 2) +
facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') +
scale_y_continuous('Average (by Locus) Probe Tm') +
scale_colour_manual(values = c("#377EB8","#E41A1C"))


uce_target_length_v_probes_targeting.pdf:
ggplot(d, aes(x = count, y = length)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_continuous('Number of Probes Targeting UCE') +
scale_y_continuous('Target UCE Length')


import contig data:
d2 = read.table('../archive/birds-contig-lengths-by-probe-filtered-birds.csv', header = TRUE, sep = ',')


uce_contig_length_by_probe_count.pdf:
ggplot(d2, aes(x = factor(probes), y = contiglength)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Contig Length of UCE-associated Locus)')


uce_sequencing_depth_by_probe_count.pdf:
ggplot(d2, aes(x = factor(probes), y = coverage)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Sequencing Depth of UCE-associated Locus')


uce_contig_length_by_probe_count_by_taxon.pdf:
ggplot(d2, aes(x = factor(probes), y = contiglength)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
facet_wrap(~organism) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Contig Length of UCE-associated Locus')


uce_loss_of_loci_with_randomly_selected_groups.pdf:
for i in {1..7}; do python assembly/get_match_counts.py \
    ../archive/birds-probe-matches.sqlite \
    ~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run1/faircloth/match-count.config \
    --extend /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
    'Birds - UCE paper' --optimize --random --sample-size $i \
    --samples 1001 --keep-counts --output sample$i-sim1000.txt; \
    echo "\n";done

for i in {1..7}; do cat sample$i-sim1000.txt >> all-sample-sim1000.txt; done

ggplot(a, aes(x = factor(taxa), y = probes)) +
geom_jitter(aes(colour = factor(taxa))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
scale_x_discrete('Number of Randomly Selected Group Members') +
scale_y_continuous('Number of UCE Loci Recovered From All Group Members', limits = c(1000,2300)) +
opts(legend.position = "none")


uce_loci_missed_by_taxon_count.pdf:
new = d$uce[d$detected == FALSE]
test = table(new)
test2 = as.data.frame(test)

ggplot(test2, aes(factor(Freq), fill = factor(Freq))) +
geom_bar(colour="black") +
scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
"#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6",
"#6A3D9A")) + scale_x_discrete('Number of Taxa') +
scale_y_continuous('Failures to Detect Targeted UCEs')


ggsave('uce_loci_missed_by_taxon_count.pdf')


uce_coverage_by_taxon.pdf:
ggplot(data = d, aes(x = uce, y = coverage, group = organism)) +
geom_line(aes(colour = organism)) + facet_wrap(~organism) +
scale_x_discrete(breaks=NA) +
scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
"#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A")) +
scale_x_discrete(breaks=NA, 'UCE-associated Locus (Name Not Shown)') +
scale_y_continuous('Coverage') + opts(legend.position = "none")