./snpMerge.pl <bam categories> <snp table of intersnp output> <1 - maf> <tolerable missing rate>
bam_categories is a simple, tab-delimited text file that specifies the category of each bam file desired in a group. It should not need all of the bam files used for intersnp (untested). Example:
BPS1206.recal.bam H
TX_1678.recal.bam H
TX_2211.recal.bam H
BPS1152.recal.bam H
BPS1149.recal.bam H
BPS1148.recal.bam H
BPS1151.recal.bam H
GB-0536.recal.bam B
GB-0307.recal.bam B
GB-0371.recal.bam B
GB-0394.recal.bam B
GB-0594.recal.bam B
snp table is the tab-delimited output of interSNP.
1-maf is a number between 0 and 1 that represents the minimum allele frequency required within each category to include in the SNP index. For example, suppose TX_1678.recal.bam is different than all the other 'H' lines in a certain genomic region. Setting a 1-MAF greater than 1 - 0.142 (i.e. 1/7) would cause the region of the genome where TX_1678.recal.bam has unique alleles to be skipped in the index file. In other words, the SNP base would need to be uniform for all 7 lines if the 1-MAF parameter > 1-0.142. Setting this parameter to 0.85 would allow one mismatch among the 7 H lines - and if the 6 H lines different than the 5 B lines, then the SNP would be included in the index. The 1-MAF parameter is applied to both H and B categories for each SNP position and it does not scale with N. Such that 1-0.142 would allow one mismatch amoung the H lines, but all B lines would need to agree because one mismatch of a B line = 0.20.
The tolerable missing rate is the amount of missing data that is tolerable for any one SNP for each category. 1) each category must have enough data to make a confident SNP. If one of the categories of individuals has a higher proportion of missing than is tolerable, the locus will not be included in the output. I developed the script using a rate of 0.40. If 40% of the individuals in a category (H or B) had missing data at the SNP position, then I skipped that position.
the results are written to stout.
snpMerge.pl bam_categories.txt GhWGbW.SNP.2.txt 0.9 0.4 > GhWGbW.SNP.2.index
The full example of interSNP output is too big to show here.
#Chr Pos Ref BPS1206.recal.bam TX_1678.recal.bam TX_2211.recal.bam BPS1152.recal.bam BPS1149.recal.bam BPS1148.recal.bam BPS1151.recal.bam GB-0536.recal.bam GB-0307.recal.bam GB-0371.recal.bam GB-0394.recal.bam GB-0594.recal.bam
A01 2928 C N N N N N T T T T T C N
A01 2995 G N N N N G G G G G G G N
A01 3004 A N N N N A A A A A A A N
A01 3005 C N N N N C C C C C C C N
A01 4675 G N N N N G AG G AG A AG AG G
A01 4718 C C N N C C CT C CT CT CT CT C
A01 4807 T T N T T T T T T T T T T
A01 5281 C N N N N N C C C C C C C
A01 5304 A N N N N A AG A AG AG AG AG A
A01 5462 C AC N N N C C C C C C C C
A01 5470 G CG N N N G C G CG CG CG CG G
A01 5498 G G N N G G G G G G G G G
A01 5519 A A N N A A A G AG AG AG A AG
A01 5726 T T N T N T AT A AT AT AT T T
A01 5759 C AC C C N C A C AC AC AC AC C
A01 5884 C C C C N C C C C C C C N
A01 5903 T A T T N T A A A A A AT N
A01 5942 C C N C N C AC A AC AC AC C N
A01 6033 G G G G G G GT T GT GT GT G T
output of stdout
#Chr Pos H B
A01 6204 G A
A01 6957 C A
A01 7240 T G
A01 7723 A C
A01 64302 A T
A01 82161 C G
A01 83141 T C
A01 83166 T G
A01 84192 G A
Hi Joshua could you please add a bam_categories.txt to this example?