Skip to content

Instantly share code, notes, and snippets.

@jaudall
Last active March 14, 2018 20:57
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jaudall/de14e367b208ccbe3b3be1465167b39b to your computer and use it in GitHub Desktop.
Save jaudall/de14e367b208ccbe3b3be1465167b39b to your computer and use it in GitHub Desktop.

How to use snpMerge.pl

./snpMerge.pl <bam categories> <snp table of intersnp output> <1 - maf> <tolerable missing rate>

bam_categories is a simple, tab-delimited text file that specifies the category of each bam file desired in a group. It should not need all of the bam files used for intersnp (untested). Example:

BPS1206.recal.bam       H
TX_1678.recal.bam       H
TX_2211.recal.bam       H
BPS1152.recal.bam       H
BPS1149.recal.bam       H
BPS1148.recal.bam       H
BPS1151.recal.bam       H
GB-0536.recal.bam       B
GB-0307.recal.bam       B
GB-0371.recal.bam       B
GB-0394.recal.bam       B
GB-0594.recal.bam       B

snp table is the tab-delimited output of interSNP.

1-maf is a number between 0 and 1 that represents the minimum allele frequency required within each category to include in the SNP index. For example, suppose TX_1678.recal.bam is different than all the other 'H' lines in a certain genomic region. Setting a 1-MAF greater than 1 - 0.142 (i.e. 1/7) would cause the region of the genome where TX_1678.recal.bam has unique alleles to be skipped in the index file. In other words, the SNP base would need to be uniform for all 7 lines if the 1-MAF parameter > 1-0.142. Setting this parameter to 0.85 would allow one mismatch among the 7 H lines - and if the 6 H lines different than the 5 B lines, then the SNP would be included in the index. The 1-MAF parameter is applied to both H and B categories for each SNP position and it does not scale with N. Such that 1-0.142 would allow one mismatch amoung the H lines, but all B lines would need to agree because one mismatch of a B line = 0.20.

The tolerable missing rate is the amount of missing data that is tolerable for any one SNP for each category. 1) each category must have enough data to make a confident SNP. If one of the categories of individuals has a higher proportion of missing than is tolerable, the locus will not be included in the output. I developed the script using a rate of 0.40. If 40% of the individuals in a category (H or B) had missing data at the SNP position, then I skipped that position.

the results are written to stout.

Example

snpMerge.pl bam_categories.txt GhWGbW.SNP.2.txt 0.9 0.4 > GhWGbW.SNP.2.index

The full example of interSNP output is too big to show here.

#Chr    Pos     Ref    BPS1206.recal.bam TX_1678.recal.bam TX_2211.recal.bam BPS1152.recal.bam BPS1149.recal.bam BPS1148.recal.bam BPS1151.recal.bam GB-0536.recal.bam GB-0307.recal.bam GB-0371.recal.bam GB-0394.recal.bam GB-0594.recal.bam
A01	2928	C	N	N	N	N	N	T	T	T	T	T	C	N
A01	2995	G	N	N	N	N	G	G	G	G	G	G	G	N
A01	3004	A	N	N	N	N	A	A	A	A	A	A	A	N
A01	3005	C	N	N	N	N	C	C	C	C	C	C	C	N
A01	4675	G	N	N	N	N	G	AG	G	AG	A	AG	AG	G
A01	4718	C	C	N	N	C	C	CT	C	CT	CT	CT	CT	C
A01	4807	T	T	N	T	T	T	T	T	T	T	T	T	T
A01	5281	C	N	N	N	N	N	C	C	C	C	C	C	C
A01	5304	A	N	N	N	N	A	AG	A	AG	AG	AG	AG	A
A01	5462	C	AC	N	N	N	C	C	C	C	C	C	C	C
A01	5470	G	CG	N	N	N	G	C	G	CG	CG	CG	CG	G
A01	5498	G	G	N	N	G	G	G	G	G	G	G	G	G
A01	5519	A	A	N	N	A	A	A	G	AG	AG	AG	A	AG
A01	5726	T	T	N	T	N	T	AT	A	AT	AT	AT	T	T
A01	5759	C	AC	C	C	N	C	A	C	AC	AC	AC	AC	C
A01	5884	C	C	C	C	N	C	C	C	C	C	C	C	N
A01	5903	T	A	T	T	N	T	A	A	A	A	A	AT	N
A01	5942	C	C	N	C	N	C	AC	A	AC	AC	AC	C	N
A01	6033	G	G	G	G	G	G	GT	T	GT	GT	GT	G	T

output of stdout

#Chr	Pos	H	B
A01	6204	G	A
A01	6957	C	A
A01	7240	T	G
A01	7723	A	C
A01	64302	A	T
A01	82161	C	G
A01	83141	T	C
A01	83166	T	G
A01	84192	G	A
#!/usr/bin/perl
#
use strict;
use warnings;
sub get_base($$$);
my $usage = "$0 <bam categories> <snp table of intersnp output> <1 - maf> <missing_rate>\n";
my $list = shift or die $usage;
my $table = shift or die $usage;
my $maf = shift or die $usage;
my $missing_rate = shift or die $usage;
my %categories = ();
open (IN, $list) or die "Can\'t open $list: $!\n";
my $cat = "";
while (<IN>)
{
chomp;
my @line = split /\t/, $_;
push @{ $categories{ $line[1] } }, $line[0];
# print "Categories\t", $line[0], "\t", $line[1], "\t", $categories{ $line[1] }[0], "\n";
# print "$cat\n" if $cat ne $line[1];
$cat = $line[1];
}
foreach my $jj (keys %categories)
{
# print $jj, "\t", $categories{ $jj}, "\n";
}
close(IN);
my $line = 1;
open (IN, $table) or die "Can\'t open $table: $!\n";
my %names = ();
SNP: while(<IN>)
{
chomp;
my @line = split /\t/, $_;
if ($line == 1)
{
#$line[0] = Chr
#$line[1] = Pos
#$line[2] = ref_base
# print the header
print $line[0], "\t";
print $line[1];
foreach my $jj (keys %categories)
{
print "\t", $jj;
}
# Load the column names to the $i-th position
foreach my $i (3 .. $#line)
{
my @filename = split /\//, $line[$i];
$names{$i} = $filename[$#filename];
}
print "\n";
}
else
{
my @bases = ();
foreach my $cat (sort keys %categories)
{
my $base = get_base($cat, $maf, \@line);
if (defined($base))
{
# Occassionally, the $base might be something like "AC" or "CT", these are from heterzygous individuals
# hets have not been filtered at this point.
# This would mean that of the lines in the category, most of them were heterozygous. It seems unlikely
# in cotton. It could be poorly mapped homoeologs too. For now, a heterzygous locus wouldn't help us
# categorize the lines much better than if we simply skip the locus. That's what I'm doing for now.
if ( length($base) > 1 )
{
$base = "N";
}
# print $base, "\t";
# this print works well, but not with the 'if' statement below
# It occurred to me that I only want to print infromative SNPs, positions with N or monomorphic are not informative
# The rest of the script should work with N categories, but this last hack will only work with two. I made it obviously so.
push @bases, $base;
}
}
if ($bases[0] eq $bases[1] || $bases[0] eq "N" || $bases[1] eq "N")
{
}
else
{
print $line[0], "\t"; #Chr
print $line[1], "\t"; #Pos
print $bases[0], "\t", $bases[1];
print "\n";
}
# print $line[0], "\t"; #Chr for DEBUG
# print $line[1], "\t"; #Pos for DEBUG
# print "----------------\n"; # for DEBUG
# print $bases[0], "\t", $bases[1];
# print "\n";
}
$line++;
}
sub get_base($$$)
{
my ($type, $freq, $line_ref) = @_;
my @snps = @{ $line_ref };
my %base_table = ();
my $total = 0;
my $max = 0;
my $base = "";
# print "CAT=$type ;;;;;";
foreach my $i (3 .. $#snps)
{
# print $i, "\t", $names{ $i }, "\t";
# print "==$i==", "\t==", $snps[$i], "==\t";
# lookup type in the categories hash
# my $lookup_value = "";
foreach my $index (0 .. $#{ $categories{ $type } })
{
# print $index, "\t", $categories{ $type }[$index], "\t", $names{ $i }, "+<<_+_>>>\t";
if ( $categories{ $type }[$index] eq $names{ $i } )
{
# $lookup_value = $type;
# print $snps[$i], "<<SNPS>>\t";
# let's assume the 'missing' bases are all N
$base_table{ $snps[$i] }++ if $snps[$i] ne "N";
$total++;
}
}
}
# print "i<<<<<\n";
#
my $non_Ns = 0;
foreach my $nuc (keys %base_table)
{
# print "type = $type\t", $nuc, "\t", $base_table{ $nuc }, "\n"; # for DEBUG
$max = $base_table{ $nuc } if $max < $base_table{ $nuc };
$non_Ns = $non_Ns + $base_table{ $nuc } if $nuc ne "N";
}
# if the ratio of total/N is more than 50% - i.e. each category is missing 40% of its bases (>60% are N's), then return an N for that category
# print "($total - $max)/$total = RATIO", ($total - $max)/$total, "\n";
# print "total = $total, nonNs = $non_Ns, total-nonNs/total = ", ($total - $non_Ns)/$total, "< 0.4\n";
if ( ($total - $non_Ns)/$total < $missing_rate)
{
# print "total = $total, nonNs = $non_Ns, total-nonNs/total = ", ($total - $non_Ns)/$total, "< 0.9<<<<<<<<<\n";
# print "$max MAX, $total\n";
my $poly = 1;
my $other_bases_count = 0;
foreach my $nuc (keys %base_table)
{
if ($base_table{ $nuc } == $max)
{
$base = $nuc;
# print "$nuc because $max is MAX\n";
# occasionally, two bases might have the same number of individuals.
# If that happens, it shouldn't matter because I'm assuming that the desired $freq will almost always be greater than 0.5
# (i.e. 0.5 is max value for two bases be be equal value)
}
else
{
$other_bases_count = $other_bases_count + $base_table{ $nuc };
}
$poly++;
}
# print "BASE: $base\n";
if ($poly == 1)
{
#this position is NOT polymorphic
# print $base, "\t";
return $base;
}
else
{
# print "$max - $other_bases_count/$max > $freq=", (($max - $other_bases_count)/$max), "\n";
if (($max - $other_bases_count)/$max > $freq)
{
# print $base, "\t";
return $base;
}
else
{
# print $base, "\t";
return "N";
}
# print $base, "\t";
}
}
else
{
# print $base, "\t";
return "N";
}
}
@cfljam
Copy link

cfljam commented Mar 14, 2018

Hi Joshua could you please add a bam_categories.txt to this example?

@cfljam
Copy link

cfljam commented Mar 14, 2018

Working example at https://gist.github.com/cfljam/a0d4966a0a8d28b6ad17b5b733330fe3

Turned out the script doesnt like bam paths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment