Skip to content

Instantly share code, notes, and snippets.

@caseywdunn
Last active August 8, 2023 19:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save caseywdunn/e401f94809984bb7e505450b24b7a9be to your computer and use it in GitHub Desktop.
Save caseywdunn/e401f94809984bb7e505450b24b7a9be to your computer and use it in GitHub Desktop.

Using samtools

This is my own exploration of samtools behavior and snippets of code I use in my own analyses.

The data

I am using the GRCh38 human reference from December, 2013.

There are two focal datasets of reads:

I am subsetted each of these down to 100k read pairs and 1M read pears to have some test datasets. One of my goals is to identify what settings are appropriate for detecting human contaminatin in non-human datasets. So we want to find a sweet spot where human reads map well to human genome, but reads from other species do not.

Setup

Here is what I do on my system to set up for the analyses:

module load BEDTools
module load SAMtools
module load BWA
module list

HUMAN_GENOME="/gpfs/gibbs/data/genomes/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"

HUMAN_100K_R1=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R1_100K.fastq
HUMAN_100K_R2=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R2_100K.fastq
HUMAN_1M_R1=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R1_1M.fastq
HUMAN_1M_R2=/gpfs/gibbs/pi/dunn/sequences/human/SRR13586124_R2_1M.fastq

NANOMIA_100K_R1=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R1_001_100k.fastq
NANOMIA_100K_R2=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R2_001_100k.fastq
NANOMIA_1M_R1=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R1_001_1M.fastq
NANOMIA_1M_R2=/gpfs/gibbs/pi/dunn/sequences/NA19_152_041_S6_L002_R2_001_1M.fastq

Mapping

Here is how I generate the 4 mapping files:

bwa mem -t 4 $HUMAN_GENOME $HUMAN_100K_R1 $HUMAN_100K_R2 | samtools view -b -h - > human_100k.bam
bwa mem -t 4 $HUMAN_GENOME $HUMAN_1M_R1 $HUMAN_1M_R2 | samtools view -b -h - > human_1m.bam

bwa mem -t 4 $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -b -h - > nanomia_100k.bam
bwa mem -t 4 $HUMAN_GENOME $NANOMIA_1M_R1 $NANOMIA_1M_R2 | samtools view -b -h - > nanomia_1m.bam

BAM files

BAM files are a binary representation of SAM files.

Flags

Each read has an associated set of binary mapping flags, packed into a 16 bit integer. They are:

Decimal Hexadecimal Binary Description
1 0x0001 0000000000000001 Template having multiple segments in sequencing, eg paired
2 0x0002 0000000000000010 Each segment properly aligned according to the aligner
4 0x0004 0000000000000100 Segment unmapped
8 0x0008 0000000000001000 Next segment in the template unmapped
16 0x0010 0000000000010000 SEQ being reverse complemented
32 0x0020 0000000000100000 SEQ of the next segment in the template being reversed
64 0x0040 0000000001000000 The first segment in the template, ie read1
128 0x0080 0000000010000000 The last segment in the template, ie read2
256 0x0100 0000000100000000 Secondary alignment
512 0x0200 0000001000000000 Not passing filters, such as platform/vendor quality controls
1024 0x0400 0000010000000000 PCR or optical duplicate
2048 0x0800 0000100000000000 Supplementary alignment

The binary representation is easiest to understand, but decimal and hex are more compact. When multiple flags are set, you just sum them.

This is a nice tool for viewing flags.

samtools flagstat

samtools flagstat outputs the number of reads with flags that satisfy certain criteria. The flags are not explicitly noted in the output, but they are in the .

Below is annotated output, based on the documentation. Note that the first column is for reads with the 0x200 flag is 0 and the second for reads where the 0x200 flag is 1.

$ samtools flagstat nanomia_100k.bam
203467 + 0 in total (QC-passed reads + QC-failed reads)      # Total number of reads
200000 + 0 primary                                           # Reads with neither 0x100 nor 0x800 bit set
0 + 0 secondary                                              # Reads with 0x100 bit set
3467 + 0 supplementary                                       # Reads with 0x800 bit set
0 + 0 duplicates                                             # Reads with 0x400 bit set
0 + 0 primary duplicates                                     # Reads with 0x400 bit set and neither 0x100 nor 0x800 bit set
96208 + 0 mapped (47.28% : N/A)                              # Reads with 0x4 bit not set
92741 + 0 primary mapped (46.37% : N/A)                      # Reads with 0x4, 0x100, and 0x800 bits not set
200000 + 0 paired in sequencing                              # Reads with 0x1 bit set
100000 + 0 read1                                             # Reads with both 0x1 and 0x40 bits set
100000 + 0 read2                                             # Reads with both 0x1 and 0x80 bits set
86384 + 0 properly paired (43.19% : N/A)                     # Reads with both 0x1 and 0x2 bits set and 0x4 bit not set
90994 + 0 with itself and mate mapped                        # Reads with 0x1 bit set and neither 0x4 nor 0x8 bits set
1747 + 0 singletons (0.87% : N/A)                            # Reads with both 0x1 and 0x8 bits set and 0x4 bit not set
3008 + 0 with mate mapped to a different chr                 # Reads with 0x1 bit set and neither 0x4 nor 0x8 bits set and mate (MRNM) not on the same chromosome (RNAME)
617 + 0 with mate mapped to a different chr (mapQ>=5)        # Reads with 0x1 bit set and neither 0x4 nor 0x8 bits set and mate (MRNM) not on the same chromosome (RNAME) and mapping quality (MAPQ) >= 5

A few things to note in the above example:

  • The counts are for reads, not read pairs. There were 100k read pairs in this file, so 100000 read1 and 100000 read2, 200000 primary and 200000 paired in sequencing.
  • The total number of reads in the BAM exceeds the number of reads in the input, because the total number includes primary, secondary, and supplementary reads.

samtools view -c can be used to report just the first number from the samtool flagstat output:

$ samtools view nanomia_100k.bam -c
203467

samtools view

samtools view can be used to format and filter mapped reads.

There are three ways to filter reads:

  • -q excludes reads with MAPQ (mapping quality) below the specified threshold
  • -F excludes reads that have the specified flags set
  • -f includes only reads that have the specified flags set

Examples

Remove non-primary alignments

We saw above that bam files can contain primary and supplementary read mappings, so the number of reads in the file can be greater than the number of reads that were mapped:

$ samtools view nanomia_100k.bam -c
203467

But if we filter out the supplementary alignments, then we get only the primary occurances:

$ samtools view nanomia_100k.bam -F 0x800 -c
200000

This file happend to only have supplementary reads. If we wanted to exclude secondary reads as well, we would want to set the 0x100 flag also:

$ samtools view nanomia_100k.bam -F 0x900 -c
200000

Read pairing

Note that the 0x0001 does not indicate that a read pair mapped as a pair, it just indicates that a read is part of a pair. to see if a read pair is mapped as a pair, we need to check that both 0x0001 and 0x0002 bits set and 0x0004 is not:

$ samtools view nanomia_100k.bam -f 0x0003 -F 0x0004 0x900 -c
86397

This corresponds to the properly paired flagstat line:

$ samtools flagstat nanomia_100k.bam
20346    7 + 0 in total (QC-passed reads + QC-failed reads)
200000 + 0 primary
0 + 0 secondary
3467 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
96208 + 0 mapped (47.28% : N/A)
92741 + 0 primary mapped (46.37% : N/A)
200000 + 0 paired in sequencing
100000 + 0 read1
100000 + 0 read2
86384 + 0 properly paired (43.19% : N/A)
90994 + 0 with itself and mate mapped
1747 + 0 singletons (0.87% : N/A)
3008 + 0 with mate mapped to a different chr
617 + 0 with mate mapped to a different chr (mapQ>=5)

This is a relatively permisive definition of paired mapping:

  • -f 0x0001 requires that the read have a pair
  • -f 0x0002 requires that each segment (member of the pair) is properly aligned according to the aligner
  • -F 0x0004 excludes unmapped segments (members of the pair)

So establishing that a read is part of a properly mapped pair is really up to the aligner, not samtools.

For bwa mem, reads are assumed to be paired if there is a second fastq file, and:

In the paired-end mode, the mem command will infer the read orientation and the insert size distribution from a batch of reads.

When mapping genomes across species, the insert size distribution estimated from the sample may be way off. This could lead to a high false positive rate of paired mappings.

There is a great look at insert sizes here.

We can get read1 from mapped pairs, get the absolute value of the insert size, and take the average:

$ samtools view -f 0x0042 nanomia_100k.bam | cut -f9 | awk '{sum += sqrt($0^2)} END {print sum/NR}'
20.6552

This shows that when mapping across species, the properly paired alignents have much smaller than expected insert sizes. What is probably happening is that these are reads from low complexity/ repeat regions that are shared between species, and because the insert is so small the reads are basically reverse complements of each other and mapping successfully to the same spot.

If we look at the median instead:

$ samtools view -f 0x0042 nanomia_100k.bam | cut -f9 | awk '{a[NR] = sqrt($0^2)} END {asort(a); if (NR%2 == 1) print a[(NR+1)/2]; else print (a[NR/2] + a[NR/2+1]) / 2}'
20

And for humans:

$ samtools view -f 0x0042 human_100k.bam | cut -f9 | awk '{a[NR] = sqrt($0^2)} END {asort(a); if (NR%2 == 1) print a[(NR+1)/2]; else print (a[NR/2] + a[NR/2+1]) / 2}'
446

We see that when mapping human reads to a human dataset we get insert sizes much closer to what is expected for illumina data.

We can filter primary mappings based on insert size to accept only those in a reasonable 200-1000 range for Illumina:

$ samtools view -F 0x900 -h human_100k.bam |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' |   samtools view -c
173184
$ samtools view -F 0x900 -h nanomia_100k.bam |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' |   samtools view -c
78

New we can see that there are is a huge difference between intraspecific and interspecific mappings.

And if we apply a quality threshold the difference is even more aparent:

$ samtools view -q 30 -F 0x900 -h human_100k.bam |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' |   samtools view -c
163496
$ samtools view -q 30 -F 0x900 -h nanomia_100k.bam |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' |   samtools view -c
0

Synthetic contamination

I made some files wit mostly nanomia data, but human data mixed in. Pairing was maintained, and the order of the reads randomized.

# All human
$ bwa mem -t 4 $HUMAN_GENOME $HUMAN_100K_R1 $HUMAN_100K_R2 | samtools view -q 30 -F 0x900 -f 0x0002 -h - |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
163490      # 163490/200000= 81.8%

# All nanomia
$ bwa mem -t 4 $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -q 30 -F 0x900 -f 0x0002 -h - |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
0           # 0%

# Nanomia with 1% human
bwa mem -t 4 $HUMAN_GENOME /home/cwd7/pi/sequences/human/nano_with_01percent_human_R1.fastq /home/cwd7/pi/sequences/human/nano_with_01percent_human_R2.fastq | samtools view -q 30 -F 0x900 -f 0x0002 -h - |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
1712        # 1712/200000 = 0.856%

# Nanomia with 10% human
bwa mem -t 4 $HUMAN_GENOME /home/cwd7/pi/sequences/human/nano_with_10percent_human_R1.fastq /home/cwd7/pi/sequences/human/nano_with_10percent_human_R2.fastq | samtools view -q 30 -F 0x900 -f 0x0002 -h - |   awk 'substr($0,1,1)=="@" || ($9>= 200 && $9<=1000) || ($9<=-200 && $9>=-1000)' | samtools view -c -
16478       # 16478/200000 = 8.239%

So this approach of filtering reads to those with reasonable insert sizes does an excellent job distincguishing between human and other reads.

Read filtering and mapping setting impacts on interspecific paired mapping

Here is a simple way to map and count primary reads with propper pairing (according to mapper):

$ bwa mem -t 4 $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -q 30 -f 0x0002 -F 0x0900 - -c
23979

The -M flag, which marks split hits (where different parts of the read map to different parts of the genome) as secondary, has no impact on this context:

$ bwa mem -t 4 -M $HUMAN_GENOME $NANOMIA_100K_R1 $NANOMIA_100K_R2 | samtools view -q 30 -f 0x0002 -F 0x0900 - -c
23979

Check impact of read trimming. Residual adapters could have something to do with the very small insert sizes seen above:

ml Trimmomatic
echo -e ">PrefixPE/1\nTACACTCTTTCCCTACACGACGCTCTTCCGATCT\n>PrefixPE/2\nGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT" > TruSeq3-PE.fa
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE -threads 4 -phred33 $NANOMIA_100K_R1 $NANOMIA_100K_R2 nanomia_R1_paired.fastq nanomia_R1_unpaired.fastq nanomia_R2_paired.fastq nanomia_R2_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:15 TRAILING:15 MINLEN:50

Now map these:

$ bwa mem -t 4 $HUMAN_GENOME nanomia_R1_paired.fastq nanomia_R2_paired.fastq | samtools view -q 30 -f 0x0002 -F 0x0900 - -c

samtools source code

samtools flagstat

https://github.com/lh3/samtools/blob/master/bam_stat.c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment