Skip to content

Instantly share code, notes, and snippets.

@drtconway
Created November 20, 2023 04:37
Show Gist options
  • Save drtconway/962697804acd1d89c753b1d9a0378fc6 to your computer and use it in GitHub Desktop.
Save drtconway/962697804acd1d89c753b1d9a0378fc6 to your computer and use it in GitHub Desktop.

Testing alignments with strobealign

Evaluating a new aligner is hard. There are many circumstances where there isn't a single "correct" answer, which makes it difficult to compare alignments from two tools. Or where alignments are nondeterministic, it is not easy to determine if the alignments from two runs are as good as each other.

To compare bwa mem with strobealign, we have taken a well known sample NA24385, with good quality, publicly available read data SRR11321731, sub-sampled it in a reproducible way, then aligned it with both tools.

Preparing the data

To download the data we use the SRA-toolkit.

fastq-dump --split-files --gzip SRR11321731

This results in large files, which we don't really want to process all of.

$ ls -lh SRR*
-rw-r--r--  1 tom.conway  448689569    14G 18 Nov 00:56 SRR11321731_1.fastq.gz
-rw-r--r--  1 tom.conway  448689569    16G 18 Nov 00:56 SRR11321731_2.fastq.gz

So we use deterministic hashing to sample the reads:

import sys
import hashlib

def sig(s):
    x = hashlib.sha256(s.encode()).digest()
    h = 0.0
    for i in range(32):
        b = x[i]
        h += b
        h /= 256.0
    return h

rec = []
for l in sys.stdin:
    rec.append(l)
    if len(rec) == 4:
        read_id = rec[0].split()[0]
        p = sig(read_id)
        if p < 0.0001:
            for m in rec:
                sys.stdout.write(m)
        rec = []

which is used in the following manner:

gzcat SRR11321731_1.fastq.gz | python3 sample-fastq.py | gzip -9 - > SRR11321731_subset_1.fastq.gz
gzcat SRR11321731_2.fastq.gz | python3 sample-fastq.py | gzip -9 - > SRR11321731_subset_2.fastq.gz

The function sig computes a good hash, and converts it into a uniformly distributed number on the interval [0,1). Because it is based on the read ID, it selects the same reads from both ends of the pair, so we can consider the two files independently.

This gives us a much more manageable set of reads to deal with. Note that we don't want to take just the first N reads, or something simple like that, because there may be biases that relate to the position of the read in the file. Two examples are where the position in the file has some correlation to the position on the flow cell in the sequencing machine, which could mean we only pick reads from one part of the flow cell, which may not be representative in terms of quality or sequence content; or the reads (particularly coming from SRA) may be stored in an order that relates to the position they mapped in reference processing, in which case reads that are nearby in the file will map to nearby locations, which again is not representative of the data as a whole.

The resulting data is much more managable.

ls -lh SRR11321731_subset_*
-rw-r--r--  1 tom.conway  448689569   2.9M 20 Nov 08:57 SRR11321731_subset_1.fastq.gz
-rw-r--r--  1 tom.conway  448689569   3.1M 20 Nov 09:22 SRR11321731_subset_2.fastq.gz

To build the BWA index, we are using the current release available on Brew:

$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

...
bwa index GRCh38.primary_assembly.genome.fa

Generating alignments

All pretty straigt forward.

For BWA:

bwa mem GRCh38.primary_assembly.genome.fa SRR11321731_subset_{1,2}.fastq.gz | samtools sort -o SRR11321731_subset_bwa.bam -

For Strobealign:

strobealign GRCh38.primary_assembly.genome.fa SRR11321731_subset_{1,2}.fastq.gz | samtools sort -o SRR11321731_subset_str.bam -

To make it easier to interpret the differences, we pull out just the first few fields from the SAM output, and sort by read ID:

samtools view SRR11321731_subset_bwa.bam | cut -f 1-9 | sort > SRR11321731_subset_bwa.txt
samtools view SRR11321731_subset_str.bam | cut -f 1-9 | sort > SRR11321731_subset_str.txt

Comparing the alignments

The easiest way to get a look at the differences, which are modest, is to use vimdiff:

vimdiff SRR11321731_subset_bwa.txt SRR11321731_subset_str.txt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment