Using samtools
This is my own exploration of samtools behavior and snippets of code I use in my own analyses.
The data
I am using the GRCh38 human reference from December, 2013.
There are two focal datasets of reads:
- Nanomia, NA19 V3698-D2. SRR23143273
- Homo sapiens SRR13586124
I am subsetted each of these down to 100k read pairs and 1M read pears to have some test datasets. One of my goals is to identify what settings are appropriate for detecting human contaminatin in non-human datasets. So we want to find a sweet spot where human reads map well to human genome, but reads from other species do not.
Setup
Here is what I do on my system to set up for the analyses:
module load BEDTools
module load SAMtools
module load BWA
module list
HUMAN_GENOME="/gpfs/gibbs/data/genomes/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
HUMAN_100K_R1=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R1_100K.fastq
HUMAN_100K_R2=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R2_100K.fastq
HUMAN_1M_R1=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R1_1M.fastq
HUMAN_1M_R2=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R2_1M.fastq
NANOMIA_100K_R1=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R1_001_100k.fastq
NANOMIA_100K_R2=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R2_001_100k.fastq
NANOMIA_1M_R1=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R1_001_1M.fastq
NANOMIA_1M_R2=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R2_001_1M.fastq
Mapping
Here is how I generate the 4 mapping files:
bwa mem -t 4 $HUMAN_GENOME $HUMAN_100K_R1 $HUMAN_100K_R2 | samtools view -b -h - > human_100k.bam
bwa mem -t 4 $HUMAN_GENOME $HUMAN_1M_R1 $HUMAN_1M_R2 | samtools view -b -h - > human_1m.bam
bwa mem -t 4 $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -b -h - > nanomia_100k.bam
bwa mem -t 4 $HUMAN_GENOME $NANOMIA_1M_R1 $NANOMIA_1M_R2 | samtools view -b -h - > nanomia_1m.bam
BAM files
BAM files are a binary representation of SAM files.
Flags
Each read has an associated set of binary mapping flags, packed into a 16 bit integer. They are:
Decimal | Hexadecimal | Binary | Description |
---|---|---|---|
1 | 0x0001 |
0000000000000001 |
Template having multiple segments in sequencing, eg paired |
2 | 0x0002 |
0000000000000010 |
Each segment properly aligned according to the aligner |
4 | 0x0004 |
0000000000000100 |
Segment unmapped |
8 | 0x0008 |
0000000000001000 |
Next segment in the template unmapped |
16 | 0x0010 |
0000000000010000 |
SEQ being reverse complemented |
32 | 0x0020 |
0000000000100000 |
SEQ of the next segment in the template being reversed |
64 | 0x0040 |
0000000001000000 |
The first segment in the template, ie read1 |
128 | 0x0080 |
0000000010000000 |
The last segment in the template, ie read2 |
256 | 0x0100 |
0000000100000000 |
Secondary alignment |
512 | 0x0200 |
0000001000000000 |
Not passing filters, such as platform/vendor quality controls |
1024 | 0x0400 |
0000010000000000 |
PCR or optical duplicate |
2048 | 0x0800 |
0000100000000000 |
Supplementary alignment |
The binary representation is easiest to understand, but decimal and hex are more compact. When multiple flags are set, you just sum them.
This is a nice tool for viewing flags.
samtools flagstat
samtools flagstat outputs the number of reads with flags that satisfy certain criteria. The flags are not explicitly noted in the output, but they are in the .
Below is annotated output, based on the documentation. Note that the first column is for reads with the 0x200
flag is 0
and the second for reads where the 0x200
flag is 1
.
$ samtools flagstat nanomia_100k.bam
203467 + 0 in total (QC-passed reads + QC-failed reads) # Total number of reads
200000 + 0 primary # Reads with neither 0x100 nor 0x800 bit set
0 + 0 secondary # Reads with 0x100 bit set
3467 + 0 supplementary # Reads with 0x800 bit set
0 + 0 duplicates # Reads with 0x400 bit set
0 + 0 primary duplicates # Reads with 0x400 bit set and neither 0x100 nor 0x800 bit set
96208 + 0 mapped (47.28% : N/A) # Reads with 0x4 bit not set
92741 + 0 primary mapped (46.37% : N/A) # Reads with 0x4, 0x100, and 0x800 bits not set
200000 + 0 paired in sequencing # Reads with 0x1 bit set
100000 + 0 read1 # Reads with both 0x1 and 0x40 bits set
100000 + 0 read2 # Reads with both 0x1 and 0x80 bits set
86384 + 0 properly paired (43.19% : N/A) # Reads with both 0x1 and 0x2 bits set and 0x4 bit not set
90994 + 0 with itself and mate mapped # Reads with 0x1 bit set and neither 0x4 nor 0x8 bits set
1747 + 0 singletons (0.87% : N/A) # Reads with both 0x1 and 0x8 bits set and 0x4 bit not set
3008 + 0 with mate mapped to a different chr # Reads with 0x1 bit set and neither 0x4 nor 0x8 bits set and mate (MRNM) not on the same chromosome (RNAME)
617 + 0 with mate mapped to a different chr (mapQ>=5) # Reads with 0x1 bit set and neither 0x4 nor 0x8 bits set and mate (MRNM) not on the same chromosome (RNAME) and mapping quality (MAPQ) >= 5
A few things to note in the above example:
- The counts are for reads, not read pairs. There were 100k read pairs in this file, so
100000 read1
and100000 read2
,200000 primary
and200000 paired in sequencing
. - The total number of reads in the BAM exceeds the number of reads in the input, because the total number includes primary, secondary, and supplementary reads.
samtools view -c
can be used to report just the first number from the samtool flagstat
output:
$ samtools view nanomia_100k.bam -c
203467
samtools view
samtools view can be used to format and filter mapped reads.
There are three ways to filter reads:
-q
excludes reads with MAPQ (mapping quality) below the specified threshold-F
excludes reads that have the specified flags set-f
includes only reads that have the specified flags set
Examples
Remove non-primary alignments
We saw above that bam files can contain primary and supplementary read mappings, so the number of reads in the file can be greater than the number of reads that were mapped:
$ samtools view nanomia_100k.bam -c
203467
But if we filter out the supplementary alignments, then we get only the primary occurances:
$ samtools view nanomia_100k.bam -F 0x800 -c
200000
This file happend to only have supplementary reads. If we wanted to exclude secondary reads as well, we would want to set the 0x100
flag also:
$ samtools view nanomia_100k.bam -F 0x900 -c
200000
Read pairing
Note that the 0x0001
does not indicate that a read pair mapped as a pair, it just indicates that a read is part of a pair. to see if a read pair is mapped as a pair, we need to check that both 0x0001
and 0x0002
bits set and 0x0004
is not:
$ samtools view nanomia_100k.bam -f 0x0003 -F 0x0004 0x900 -c
86397
This corresponds to the properly paired
flagstat line:
$ samtools flagstat nanomia_100k.bam
20346 7 + 0 in total (QC-passed reads + QC-failed reads)
200000 + 0 primary
0 + 0 secondary
3467 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
96208 + 0 mapped (47.28% : N/A)
92741 + 0 primary mapped (46.37% : N/A)
200000 + 0 paired in sequencing
100000 + 0 read1
100000 + 0 read2
86384 + 0 properly paired (43.19% : N/A)
90994 + 0 with itself and mate mapped
1747 + 0 singletons (0.87% : N/A)
3008 + 0 with mate mapped to a different chr
617 + 0 with mate mapped to a different chr (mapQ>=5)
This is a relatively permisive definition of paired mapping:
-f 0x0001
requires that the read have a pair-f 0x0002
requires that each segment (member of the pair) is properly aligned according to the aligner-F 0x0004
excludes unmapped segments (members of the pair)
So establishing that a read is part of a properly mapped pair is really up to the aligner, not samtools.
For bwa mem
, reads are assumed to be paired if there is a second fastq file, and:
In the paired-end mode, the mem command will infer the read orientation and the insert size distribution from a batch of reads.
When mapping genomes across species, the insert size distribution estimated from the sample may be way off. This could lead to a high false positive rate of paired mappings.
There is a great look at insert sizes here.
We can get read1 from mapped pairs, get the absolute value of the insert size, and take the average:
$ samtools view -f 0x0042 nanomia_100k.bam | cut -f9 | awk '{sum += sqrt($0^2)} END {print sum/NR}'
20.6552
This shows that when mapping across species, the properly paired alignents have much smaller than expected insert sizes. What is probably happening is that these are reads from low complexity/ repeat regions that are shared between species, and because the insert is so small the reads are basically reverse complements of each other and mapping successfully to the same spot.
If we look at the median instead:
$ samtools view -f 0x0042 nanomia_100k.bam | cut -f9 | awk '{a[NR] = sqrt($0^2)} END {asort(a); if (NR%2 == 1) print a[(NR+1)/2]; else print (a[NR/2] + a[NR/2+1]) / 2}'
20
And for humans:
$ samtools view -f 0x0042 human_100k.bam | cut -f9 | awk '{a[NR] = sqrt($0^2)} END {asort(a); if (NR%2 == 1) print a[(NR+1)/2]; else print (a[NR/2] + a[NR/2+1]) / 2}'
446
We see that when mapping human reads to a human dataset we get insert sizes much closer to what is expected for illumina data.
We can filter primary mappings based on insert size to accept only those in a reasonable 200-1000 range for Illumina:
$ samtools view -F 0x900 -h human_100k.bam | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c
173184
$ samtools view -F 0x900 -h nanomia_100k.bam | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c
78
New we can see that there are is a huge difference between intraspecific and interspecific mappings.
And if we apply a quality threshold the difference is even more aparent:
$ samtools view -q 30 -F 0x900 -h human_100k.bam | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c
163496
$ samtools view -q 30 -F 0x900 -h nanomia_100k.bam | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c
0
Synthetic contamination
I made some files wit mostly nanomia data, but human data mixed in. Pairing was maintained, and the order of the reads randomized.
# All human
$ bwa mem -t 4 $HUMAN_GENOME $HUMAN_100K_R1 $HUMAN_100K_R2 | samtools view -q 30 -F 0x900 -f 0x0002 -h - | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
163490 # 163490/200000= 81.8%
# All nanomia
$ bwa mem -t 4 $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -q 30 -F 0x900 -f 0x0002 -h - | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
0 # 0%
# Nanomia with 1% human
bwa mem -t 4 $HUMAN_GENOME /home/cwd7/pi/sequences/human/nano_with_01percent_human_R1.fastq /home/cwd7/pi/sequences/human/nano_with_01percent_human_R2.fastq | samtools view -q 30 -F 0x900 -f 0x0002 -h - | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
1712 # 1712/200000 = 0.856%
# Nanomia with 10% human
bwa mem -t 4 $HUMAN_GENOME /home/cwd7/pi/sequences/human/nano_with_10percent_human_R1.fastq /home/cwd7/pi/sequences/human/nano_with_10percent_human_R2.fastq | samtools view -q 30 -F 0x900 -f 0x0002 -h - | awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
16478 # 16478/200000 = 8.239%
So this approach of filtering reads to those with reasonable insert sizes does an excellent job distincguishing between human and other reads.
Read filtering and mapping setting impacts on interspecific paired mapping
Here is a simple way to map and count primary reads with propper pairing (according to mapper):
$ bwa mem -t 4 $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -q 30 -f 0x0002 -F 0x0900 - -c
23979
The -M
flag, which marks split hits (where different parts of the read map to different parts of the genome) as secondary, has no impact on this context:
$ bwa mem -t 4 -M $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -q 30 -f 0x0002 -F 0x0900 - -c
23979
Check impact of read trimming. Residual adapters could have something to do with the very small insert sizes seen above:
ml Trimmomatic
echo -e ">PrefixPE/1\nTACACTCTTTCCCTACACGACGCTCTTCCGATCT\n>PrefixPE/2\nGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT" > TruSeq3-PE.fa
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE -threads 4 -phred33 $NANOMIA_100K_R1 $NANOMIA_100K_R2 nanomia_R1_paired.fastq nanomia_R1_unpaired.fastq nanomia_R2_paired.fastq nanomia_R2_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:15 TRAILING:15 MINLEN:50
Now map these:
$ bwa mem -t 4 $HUMAN_GENOME nanomia_R1_paired.fastq nanomia_R2_paired.fastq | samtools view -q 30 -f 0x0002 -F 0x0900 - -c