Skip to content

Instantly share code, notes, and snippets.

@crazyhottommy
Last active November 29, 2021 17:10
Show Gist options
  • Save crazyhottommy/ed73c7e2daee8383dccb35f224f99714 to your computer and use it in GitHub Desktop.
Save crazyhottommy/ed73c7e2daee8383dccb35f224f99714 to your computer and use it in GitHub Desktop.

An error

I was using TEQC to do quality control of my WES bam files aligned by bwa-mem. My data are paired end, so a function reads2pairs is called to make the paired-end reads to be a single fragment. I then get this error:

> readpairs <- reads2pairs(reads)
Error in reads2pairs(reads) : read pair IDs do not seem to be unique

I asked in the bioconductor support site and went to the source code of that function. It turns out that TEQC was complaining about the not uniquely mapped reads.

How does BWA-MEM determine and mark the not uniquely mapped reads?

google directed me to biostars again with no surprising. Please read this three posts:
https://www.biostars.org/p/76893/
https://www.biostars.org/p/167892/

one extract multi-mapped reads by looking for mapq < 23 and/or the XA flag on the reads

samtools view -q 40 file.bam
samtools view -q 1 file.bam
samtools view -q 23 file.bam

BWA added tags (Tags starting with ‘X’ are specific to BWA):

Tag Meaning
NM Edit distance
MD Mismatching positions/bases
AS Alignment score
BC Barcode sequence
X0 Number of best hits
X1 Number of suboptimal hits found by BWA
XN Number of ambiguous bases in the referenece
XM Number of mismatches in the alignment
XO Number of gap opens
XG Number of gap extentions
XT Type: Unique/Repeat/N/Mate-sw
XA Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS Suboptimal alignment score
XF Support from forward/reverse alignment
XE Number of supporting seeds

Checking my own bam files

samtools view my.bam | grep "XA:" | head -2

E00514:124:H2FC7CCXY:1:1221:16741:50076	97	1	69019	0	151M	=	69298	430	CTCCTTCTCTTCTTCAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTAT	@@CC?=D@E@=F@>F>?FG==?FAGG?F?FFGA<=?>GFAGF?@=G>?=F?>E@>F=G>>>E?=>>;>D@E@;<EE<E=CAE<>;<D@<<<==D@ED@<D@D@E>E<<=D@E@E>=>C@DD@E=CD@<DD@<>=><>G>>G>>>=;;?;;>	XA:Z:15,-102463184,151M,0;19,+110607,151M,1;	MC:Z:151M	BD:Z:NNNNNNNONONNONNONNOONNNOOONONOONONNNGNNNOONNNOONNOONNNOOOMONFNNONNNNNONONNOOOMOOOOONNNNOONGGGNOOOLOOONONOOOONNONOOONNNONNOONOONNNNNNNNGNNOMNNMNGGGGNMNN	MD:Z:151	BI:Z:QQPQQQQPOPQQQQQQQQQQQQQQRQQRQQQQRQQQMQQQRRQQQRQQQRQQQQQQRQQQKQQQQQQQQQPQQQRRQQQQQRQQQQQQQQMMMQQRRPQRQPQPQRQRQQQPQRQQQQQPQQRQQQQQQQQQQQMQQRQQQQQMMMMQQQQ	NM:i:0	MQ:i:0	AS:i:151	XS:i:151	RG:Z:F17042956987-KY290-2-WES
E00514:124:H2FC7CCXY:1:2219:12855:17641	81	1	69039	0	150M	13	83828251	0	AACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTT	=;C?FEAFAFGF<>>>=FF@DD=<@D=<;E<<?D@C@E=<<?D;<;<<E<E;=@DC@D<E@D=<==E<>>=><E@EDD<E=E=E@D<>>F>G@CF>=F>FGAF=GF@?FG===><=@E??E??????F==??F==?FF@EE=<=>D@B@@	XA:Z:15,+102463165,150M,0;19,-110627,150M,1;	BD:Z:NOONOOONNNOONNFNNOOONNNOONNOONNNNNOMONGNNNNNNNNOONONOOOMOOOOONNNNOONFFFNNOOLOONONONOOONNNOOONNNNOONOOOOOONNNOONNFNNOMNNMNFFFFNMNNNNNONOONNNNNNOONMMNNN	MD:Z:150	BI:Z:QRRQQRQPQQRQQQKQQQRQQQQQQQQRQQQQQPQPQQMQQQQQQQQRQQQQQQQPQRRRQQQQQQRQKKKQQRQPQQQQQQQRQRQQQQQRQQQQRQQRRQRQQQQQQQQQKQQQOQQOQKKKKQOQQQQQQQQQQPQQPPQQQPOQQQ	NM:i:0	AS:i:150	XS:i:150	RG:Z:F17042956987-KY290-2-WES

Indeed, most of the reads with XA: tag has a quality score of 0 (fifth column).

samtools view -q 1 my.bam | wc -l
7787794
samtools view my.bam | grep -v "XA:" | wc -l
7972777

## not exactly the same number, what's wrong?
samtools view my.bam | grep  "XA:" | awk '{print $5}' | sort | uniq -c 
201878 0
    463 1
    688 10
    666 11
    677 12
    693 13
    271 14
    777 15
    281 16
    414 17
    564 18
   1429 19
    192 2
    327 20
   3772 21
   1674 22
    742 23
   1543 24
   3106 25
    368 26
   6223 27
    498 28
    514 29
    760 3
    830 30
   1526 31
    954 32
   3726 33
    367 34
    343 35
    379 36
    641 37
    150 38
   1082 39
    442 4
   3058 40
    673 41
    866 42
   2570 43
   1285 44
   6374 45
   6885 46
   7669 47
  22571 48
  17666 49
    611 5
  16128 50
  13824 51
   9864 52
   6277 53
   4328 54
   2568 55
   1440 56
   4547 57
   1462 58
   1888 59
   1171 6
 169162 60
      2 61
      3 62
      1 64
      5 65
      1 67
      2 69
   1611 7
     86 70
   2234 8
   4013 9

so, many reads with XA tag with mapping quality scores > 0 !!

retain only uniquely mapped reads

samtools view -h my.bam | awk '$17 ~ /XA:/' || $1 ~ /^@/' | samtools view -bS - > my_unique.bam
@ntm
Copy link

ntm commented May 10, 2021

In my hands bwa doesn't produce any alignments with flag 256: secondary alignments are either listed in the XA tag of the primary alignment (if there aren't too many, max 5 by default), or they are not listed at all (if too many).
@xiucz : are you actually seeing bwa alignments with flag 256?

@xiucz
Copy link

xiucz commented May 21, 2021

21061621503640_ pic_hd

@ntm , I see many SAM flags > 256 above picture, after decoding, https://broadinstitute.github.io/picard/explain-flags.html, I take thenot primary alignment (0x100) reads as multialigned reads.

@ntm
Copy link

ntm commented May 21, 2021

@xiucz :
Indeed flag 0x100==256 is set in these alignments. However I notice that their CIGARs all look like split alignments, eg 57S79M, rather than multi-mappings. Are you perhaps running bwa mem with option -M ?
-M Mark shorter split hits as secondary

If not, would you mind sharing the bwa command-line you use?
For reference, I run essentially
bwa mem -p -t12 -K 100000000
and I get no alignments with flag bit 256 set, ie "samtools view -f 256 file.sam" returns nothing.

@xiucz
Copy link

xiucz commented May 22, 2021

@ntm,
My bwa command-line: bwa mem -M -p -t4 -K 100000000 with the option -M. I think I should not use samtools view -f 256...
Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment