drtconway/comparing-alignments.md

## comparing-alignments.md

      
    Raw
  

              comparing-alignments.md
            
          
    Testing alignments with strobealign

Evaluating a new aligner is hard. There are many circumstances where there isn't a single "correct" answer,
which makes it difficult to compare alignments from two tools. Or where alignments are nondeterministic,
it is not easy to determine if the alignments from two runs are as good as each other.
To compare bwa mem with strobealign, we have taken a well known sample NA24385, with good quality,
publicly available read data SRR11321731,
sub-sampled it in a reproducible way, then aligned it with both tools.
Preparing the data

To download the data we use the SRA-toolkit.
fastq-dump --split-files --gzip SRR11321731
This results in large files, which we don't really want to process all of.
$ ls -lh SRR*
-rw-r--r--  1 tom.conway  448689569    14G 18 Nov 00:56 SRR11321731_1.fastq.gz
-rw-r--r--  1 tom.conway  448689569    16G 18 Nov 00:56 SRR11321731_2.fastq.gz

So we use deterministic hashing to sample the reads:
import sys
import hashlib

def sig(s):
    x = hashlib.sha256(s.encode()).digest()
    h = 0.0
    for i in range(32):
        b = x[i]
        h += b
        h /= 256.0
    return h

rec = []
for l in sys.stdin:
    rec.append(l)
    if len(rec) == 4:
        read_id = rec[0].split()[0]
        p = sig(read_id)
        if p < 0.0001:
            for m in rec:
                sys.stdout.write(m)
        rec = []
which is used in the following manner:
gzcat SRR11321731_1.fastq.gz | python3 sample-fastq.py | gzip -9 - > SRR11321731_subset_1.fastq.gz
gzcat SRR11321731_2.fastq.gz | python3 sample-fastq.py | gzip -9 - > SRR11321731_subset_2.fastq.gz
The function sig computes a good hash, and converts it into a uniformly distributed number on the interval [0,1).
Because it is based on the read ID, it selects the same reads from both ends of the pair, so we can consider the
two files independently.
This gives us a much more manageable set of reads to deal with. Note that we don't want to take just the first N
reads, or something simple like that, because there may be biases that relate to the position of the read in the
file. Two examples are where the position in the file has some correlation to the position on the flow cell in the
sequencing machine, which could mean we only pick reads from one part of the flow cell, which may not be representative
in terms of quality or sequence content; or the reads (particularly coming from SRA) may be stored in an order that
relates to the position they mapped in reference processing, in which case reads that are nearby in the file will
map to nearby locations, which again is not representative of the data as a whole.
The resulting data is much more managable.
ls -lh SRR11321731_subset_*
-rw-r--r--  1 tom.conway  448689569   2.9M 20 Nov 08:57 SRR11321731_subset_1.fastq.gz
-rw-r--r--  1 tom.conway  448689569   3.1M 20 Nov 09:22 SRR11321731_subset_2.fastq.gz

To build the BWA index, we are using the current release available on Brew:
$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

...

bwa index GRCh38.primary_assembly.genome.fa
Generating alignments

All pretty straigt forward.
For BWA:
bwa mem GRCh38.primary_assembly.genome.fa SRR11321731_subset_{1,2}.fastq.gz | samtools sort -o SRR11321731_subset_bwa.bam -
For Strobealign:
strobealign GRCh38.primary_assembly.genome.fa SRR11321731_subset_{1,2}.fastq.gz | samtools sort -o SRR11321731_subset_str.bam -
To make it easier to interpret the differences, we pull out just the first few fields from the SAM output, and sort by read ID:
samtools view SRR11321731_subset_bwa.bam | cut -f 1-9 | sort > SRR11321731_subset_bwa.txt
samtools view SRR11321731_subset_str.bam | cut -f 1-9 | sort > SRR11321731_subset_str.txt
Comparing the alignments

The easiest way to get a look at the differences, which are modest, is to use vimdiff:
vimdiff SRR11321731_subset_bwa.txt SRR11321731_subset_str.txt