jaudall/README.md Secret

## README.md

      
    Raw
  

              README.md
            
          
    How to use snpMerge.pl

./snpMerge.pl <bam categories> <snp table of intersnp output> <1 - maf> <tolerable missing rate>

bam_categories is a simple, tab-delimited text file that specifies the category of each bam file desired in a group. It should not need all of the bam files used for intersnp (untested). Example:
BPS1206.recal.bam       H
TX_1678.recal.bam       H
TX_2211.recal.bam       H
BPS1152.recal.bam       H
BPS1149.recal.bam       H
BPS1148.recal.bam       H
BPS1151.recal.bam       H
GB-0536.recal.bam       B
GB-0307.recal.bam       B
GB-0371.recal.bam       B
GB-0394.recal.bam       B
GB-0594.recal.bam       B

snp table is the tab-delimited output of interSNP.
1-maf is a number between 0 and 1 that represents the minimum allele frequency required within each category to include in the SNP index. For example, suppose TX_1678.recal.bam is different than all the other 'H' lines in a certain genomic region. Setting a 1-MAF greater than 1 - 0.142 (i.e. 1/7) would cause the region of the genome where TX_1678.recal.bam has unique alleles to be skipped in the index file. In other words, the SNP base would need to be uniform for all 7 lines if the 1-MAF parameter > 1-0.142. Setting this parameter to 0.85 would allow one mismatch among the 7 H lines - and if the 6 H lines different than the 5 B lines, then the SNP would be included in the index. The 1-MAF parameter is applied to both H and B categories for each SNP position and it does not scale with N. Such that 1-0.142 would allow one mismatch amoung the H lines, but all B lines would need to agree because one mismatch of a B line = 0.20.
The tolerable missing rate is the amount of missing data that is tolerable for any one SNP for each category. 1) each category must have enough data to make a confident SNP. If one of the categories of individuals has a higher proportion of missing than is tolerable, the locus will not be included in the output. I developed the script using a rate of 0.40. If 40% of the individuals in a category (H or B) had missing data at the SNP position, then I skipped that position.
the results are written to stout.
Example

snpMerge.pl bam_categories.txt GhWGbW.SNP.2.txt 0.9 0.4 > GhWGbW.SNP.2.index

The full example of interSNP output is too big to show here.
#Chr    Pos     Ref    BPS1206.recal.bam TX_1678.recal.bam TX_2211.recal.bam BPS1152.recal.bam BPS1149.recal.bam BPS1148.recal.bam BPS1151.recal.bam GB-0536.recal.bam GB-0307.recal.bam GB-0371.recal.bam GB-0394.recal.bam GB-0594.recal.bam
A01	2928	C	N	N	N	N	N	T	T	T	T	T	C	N
A01	2995	G	N	N	N	N	G	G	G	G	G	G	G	N
A01	3004	A	N	N	N	N	A	A	A	A	A	A	A	N
A01	3005	C	N	N	N	N	C	C	C	C	C	C	C	N
A01	4675	G	N	N	N	N	G	AG	G	AG	A	AG	AG	G
A01	4718	C	C	N	N	C	C	CT	C	CT	CT	CT	CT	C
A01	4807	T	T	N	T	T	T	T	T	T	T	T	T	T
A01	5281	C	N	N	N	N	N	C	C	C	C	C	C	C
A01	5304	A	N	N	N	N	A	AG	A	AG	AG	AG	AG	A
A01	5462	C	AC	N	N	N	C	C	C	C	C	C	C	C
A01	5470	G	CG	N	N	N	G	C	G	CG	CG	CG	CG	G
A01	5498	G	G	N	N	G	G	G	G	G	G	G	G	G
A01	5519	A	A	N	N	A	A	A	G	AG	AG	AG	A	AG
A01	5726	T	T	N	T	N	T	AT	A	AT	AT	AT	T	T
A01	5759	C	AC	C	C	N	C	A	C	AC	AC	AC	AC	C
A01	5884	C	C	C	C	N	C	C	C	C	C	C	C	N
A01	5903	T	A	T	T	N	T	A	A	A	A	A	AT	N
A01	5942	C	C	N	C	N	C	AC	A	AC	AC	AC	C	N
A01	6033	G	G	G	G	G	G	GT	T	GT	GT	GT	G	T

output of stdout
#Chr	Pos	H	B
A01	6204	G	A
A01	6957	C	A
A01	7240	T	G
A01	7723	A	C
A01	64302	A	T
A01	82161	C	G
A01	83141	T	C
A01	83166	T	G
A01	84192	G	A


## snpMerge.pl
#!/usr/bin/perl
#

use strict;
use warnings;

sub get_base($$$);

my $usage = "$0 <bam categories> <snp table of intersnp output> <1 - maf> <missing_rate>\n";
my $list = shift or die $usage;
my $table = shift or die $usage;
my $maf = shift or die $usage;
my $missing_rate = shift or die $usage;

my %categories = ();
open (IN, $list) or die "Can\'t open $list: $!\n";
my $cat = "";
while (<IN>)
{
	chomp;
	my @line = split /\t/, $_;
	push @{ $categories{ $line[1] } }, $line[0];
	# print "Categories\t", $line[0], "\t", $line[1], "\t", $categories{ $line[1] }[0], "\n";
	# print "$cat\n" if $cat ne $line[1];
	$cat = $line[1];
}

foreach my $jj (keys %categories)
{
	# print $jj, "\t", $categories{ $jj}, "\n";
}
close(IN);


my $line = 1;
open (IN, $table) or die "Can\'t open $table: $!\n";
my %names = ();
SNP: while(<IN>)
{
	chomp;
	my @line = split /\t/, $_;
	if ($line == 1)
	{
		#$line[0] = Chr
		#$line[1] = Pos
		#$line[2] = ref_base
		# print the header
		print $line[0], "\t";
		print $line[1];
		foreach my $jj (keys %categories)
		{
			print "\t", $jj;
        	}
		# Load the column names to the $i-th position
		foreach my $i (3 .. $#line)
		{
			my @filename = split /\//, $line[$i];
			$names{$i} = $filename[$#filename];
		}
		print "\n";
	}
	else
	{
		my @bases = ();
		foreach my $cat (sort keys %categories)
		{
			my $base = get_base($cat, $maf, \@line);
			if (defined($base))
			{
				# Occassionally, the $base might be something like "AC" or "CT", these are from heterzygous individuals
				# hets have not been filtered at this point.
				# This would mean that of the lines in the category, most of them were heterozygous. It seems unlikely
				# in cotton. It could be poorly mapped homoeologs too. For now, a heterzygous locus wouldn't help us
				# categorize the lines much better than if we simply skip the locus. That's what I'm doing for now.
				if ( length($base) > 1 )
				{
					$base = "N";
				}
				# print $base, "\t";
				# this print works well, but not with the 'if' statement below
				# It occurred to me that I only want to print infromative SNPs, positions with N or monomorphic are not informative
				# The rest of the script should work with N categories, but this last hack will only work with two. I made it obviously so.
				push @bases, $base;
			}
		}

		if ($bases[0] eq $bases[1] || $bases[0] eq "N" || $bases[1] eq "N")
		{
		}
		else
		{
			print $line[0], "\t"; #Chr
			print $line[1], "\t"; #Pos
			print $bases[0], "\t", $bases[1];
			print "\n";
		}
			# print $line[0], "\t"; #Chr for DEBUG
			# print $line[1], "\t"; #Pos for DEBUG
			# print "----------------\n"; # for DEBUG

		#	print $bases[0], "\t", $bases[1];
		#	print "\n";

	}
	$line++;
}


sub get_base($$$)
{
	my ($type, $freq, $line_ref) = @_;
	my @snps = @{ $line_ref };
	my %base_table = ();
	my $total = 0;
	my $max = 0;
	my $base = "";
	# print "CAT=$type ;;;;;";
	foreach my $i (3 .. $#snps)
	{
		# print $i, "\t", $names{ $i }, "\t";
		# print "==$i==", "\t==", $snps[$i], "==\t";

		# lookup type in the categories hash
		# my $lookup_value = "";
		foreach my $index (0 .. $#{ $categories{ $type } })
		{
			# print $index, "\t", $categories{ $type }[$index], "\t", $names{ $i }, "+<<_+_>>>\t";
			if ( $categories{ $type }[$index] eq $names{ $i } )
			{
				# $lookup_value = $type;
				# print $snps[$i], "<<SNPS>>\t";
				# let's assume the 'missing' bases are all N
				$base_table{ $snps[$i] }++ if $snps[$i] ne "N";
				$total++;
			}
		}
	}
	# print "i<<<<<\n";
	#
	my $non_Ns = 0;
	foreach my $nuc (keys %base_table)
	{
		# print "type = $type\t", $nuc, "\t", $base_table{ $nuc }, "\n"; # for DEBUG
		$max = $base_table{ $nuc } if $max < $base_table{ $nuc };
		$non_Ns = $non_Ns + $base_table{ $nuc } if $nuc ne "N";
	}

	# if the ratio of total/N is more than 50% - i.e. each category is missing 40% of its bases (>60% are N's), then return an N for that category
	# print "($total - $max)/$total = RATIO", ($total - $max)/$total, "\n";

	# print "total = $total, nonNs = $non_Ns, total-nonNs/total = ", ($total - $non_Ns)/$total, "< 0.4\n";
	if ( ($total - $non_Ns)/$total < $missing_rate)
	{
		# print "total = $total, nonNs = $non_Ns, total-nonNs/total = ", ($total - $non_Ns)/$total, "< 0.9<<<<<<<<<\n";
		# print "$max MAX, $total\n";
        	my $poly = 1;
		my $other_bases_count = 0;
		foreach my $nuc (keys %base_table)
        	{
                	if ($base_table{ $nuc } == $max)
			{
				$base = $nuc;
				# print "$nuc because $max is MAX\n";
				# occasionally, two bases might have the same number of individuals.
				# If that happens, it shouldn't matter because I'm assuming that the desired $freq will almost always be greater than 0.5
				# (i.e. 0.5 is max value for two bases be be equal value)
			}
			else
			{
				$other_bases_count = $other_bases_count + $base_table{ $nuc };
			}
			$poly++;
        	}

		# print "BASE: $base\n";
		if ($poly == 1)
		{
			#this position is NOT polymorphic
			# print $base, "\t";
			return $base;
		}
		else
		{
			# print "$max - $other_bases_count/$max > $freq=", (($max - $other_bases_count)/$max),  "\n";
			if (($max - $other_bases_count)/$max > $freq)
			{
				# print $base, "\t";
				return $base;
			}
			else
			{
				# print $base, "\t";
				return "N";
			}
			# print $base, "\t";
		}
	}
	else
	{
		# print $base, "\t";
		return "N";
	}
}
	#!/usr/bin/perl
	#

	use strict;
	use warnings;

	sub get_base($$$);

	my $usage = "$0 <bam categories> <snp table of intersnp output> <1 - maf> <missing_rate>\n";
	my $list = shift or die $usage;
	my $table = shift or die $usage;
	my $maf = shift or die $usage;
	my $missing_rate = shift or die $usage;

	my %categories = ();
	open (IN, $list) or die "Can\'t open $list: $!\n";
	my $cat = "";
	while (<IN>)
	{
	chomp;
	my @line = split /\t/, $_;
	push @{ $categories{ $line[1] } }, $line[0];
	# print "Categories\t", $line[0], "\t", $line[1], "\t", $categories{ $line[1] }[0], "\n";
	# print "$cat\n" if $cat ne $line[1];
	$cat = $line[1];
	}

	foreach my $jj (keys %categories)
	{
	# print $jj, "\t", $categories{ $jj}, "\n";
	}
	close(IN);


	my $line = 1;
	open (IN, $table) or die "Can\'t open $table: $!\n";
	my %names = ();
	SNP: while(<IN>)
	{
	chomp;
	my @line = split /\t/, $_;
	if ($line == 1)
	{
	#$line[0] = Chr
	#$line[1] = Pos
	#$line[2] = ref_base
	# print the header
	print $line[0], "\t";
	print $line[1];
	foreach my $jj (keys %categories)
	{
	print "\t", $jj;
	}
	# Load the column names to the $i-th position
	foreach my $i (3 .. $#line)
	{
	my @filename = split /\//, $line[$i];
	$names{$i} = $filename[$#filename];
	}
	print "\n";
	}
	else
	{
	my @bases = ();
	foreach my $cat (sort keys %categories)
	{
	my $base = get_base($cat, $maf, \@line);
	if (defined($base))
	{
	# Occassionally, the $base might be something like "AC" or "CT", these are from heterzygous individuals
	# hets have not been filtered at this point.
	# This would mean that of the lines in the category, most of them were heterozygous. It seems unlikely
	# in cotton. It could be poorly mapped homoeologs too. For now, a heterzygous locus wouldn't help us
	# categorize the lines much better than if we simply skip the locus. That's what I'm doing for now.
	if ( length($base) > 1 )
	{
	$base = "N";
	}
	# print $base, "\t";
	# this print works well, but not with the 'if' statement below
	# It occurred to me that I only want to print infromative SNPs, positions with N or monomorphic are not informative
	# The rest of the script should work with N categories, but this last hack will only work with two. I made it obviously so.
	push @bases, $base;
	}
	}

	if ($bases[0] eq $bases[1] \|\| $bases[0] eq "N" \|\| $bases[1] eq "N")
	{
	}
	else
	{
	print $line[0], "\t"; #Chr
	print $line[1], "\t"; #Pos
	print $bases[0], "\t", $bases[1];
	print "\n";
	}
	# print $line[0], "\t"; #Chr for DEBUG
	# print $line[1], "\t"; #Pos for DEBUG
	# print "----------------\n"; # for DEBUG

	# print $bases[0], "\t", $bases[1];
	# print "\n";

	}
	$line++;
	}


	sub get_base($$$)
	{
	my ($type, $freq, $line_ref) = @_;
	my @snps = @{ $line_ref };
	my %base_table = ();
	my $total = 0;
	my $max = 0;
	my $base = "";
	# print "CAT=$type ;;;;;";
	foreach my $i (3 .. $#snps)
	{
	# print $i, "\t", $names{ $i }, "\t";
	# print "==$i==", "\t==", $snps[$i], "==\t";

	# lookup type in the categories hash
	# my $lookup_value = "";
	foreach my $index (0 .. $#{ $categories{ $type } })
	{
	# print $index, "\t", $categories{ $type }[$index], "\t", $names{ $i }, "+<<_+_>>>\t";
	if ( $categories{ $type }[$index] eq $names{ $i } )
	{
	# $lookup_value = $type;
	# print $snps[$i], "<<SNPS>>\t";
	# let's assume the 'missing' bases are all N
	$base_table{ $snps[$i] }++ if $snps[$i] ne "N";
	$total++;
	}
	}
	}
	# print "i<<<<<\n";
	#
	my $non_Ns = 0;
	foreach my $nuc (keys %base_table)
	{
	# print "type = $type\t", $nuc, "\t", $base_table{ $nuc }, "\n"; # for DEBUG
	$max = $base_table{ $nuc } if $max < $base_table{ $nuc };
	$non_Ns = $non_Ns + $base_table{ $nuc } if $nuc ne "N";
	}

	# if the ratio of total/N is more than 50% - i.e. each category is missing 40% of its bases (>60% are N's), then return an N for that category
	# print "($total - $max)/$total = RATIO", ($total - $max)/$total, "\n";

	# print "total = $total, nonNs = $non_Ns, total-nonNs/total = ", ($total - $non_Ns)/$total, "< 0.4\n";
	if ( ($total - $non_Ns)/$total < $missing_rate)
	{
	# print "total = $total, nonNs = $non_Ns, total-nonNs/total = ", ($total - $non_Ns)/$total, "< 0.9<<<<<<<<<\n";
	# print "$max MAX, $total\n";
	my $poly = 1;
	my $other_bases_count = 0;
	foreach my $nuc (keys %base_table)
	{
	if ($base_table{ $nuc } == $max)
	{
	$base = $nuc;
	# print "$nuc because $max is MAX\n";
	# occasionally, two bases might have the same number of individuals.
	# If that happens, it shouldn't matter because I'm assuming that the desired $freq will almost always be greater than 0.5
	# (i.e. 0.5 is max value for two bases be be equal value)
	}
	else
	{
	$other_bases_count = $other_bases_count + $base_table{ $nuc };
	}
	$poly++;
	}

	# print "BASE: $base\n";
	if ($poly == 1)
	{
	#this position is NOT polymorphic
	# print $base, "\t";
	return $base;
	}
	else
	{
	# print "$max - $other_bases_count/$max > $freq=", (($max - $other_bases_count)/$max), "\n";
	if (($max - $other_bases_count)/$max > $freq)
	{
	# print $base, "\t";
	return $base;
	}
	else
	{
	# print $base, "\t";
	return "N";
	}
	# print $base, "\t";
	}
	}
	else
	{
	# print $base, "\t";
	return "N";
	}
	}