Skip to content

Instantly share code, notes, and snippets.

@darencard
Last active July 14, 2023 06:19
Show Gist options
  • Star 27 You must be signed in to star a gist
  • Fork 9 You must be signed in to fork a gist
  • Save darencard/72ddd9e6c08aaff5ff64ca512a04a6dd to your computer and use it in GitHub Desktop.
Save darencard/72ddd9e6c08aaff5ff64ca512a04a6dd to your computer and use it in GitHub Desktop.
Extract paired FASTQ reads from a BAM mapping file

Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.

Extracting paired FASTQ read data from a BAM mapping file

Sometimes FASTQ data is aligned to a reference and stored as a BAM file, instead of the normal FASTQ read files. This is okay, because it is possible to recreate raw FASTQ files based on the BAM file. The following outlines this process. The useful software samtools and bedtools are both required.

From each bam, we need to extract:

  1. reads that mapped properly as pairs
  2. reads that didn’t map properly as pairs (both didn’t map, or one didn’t map)

For #1, the following command will work. This was taken from this webpage.

samtools view -u -f 1 -F 12 lib_002.sorted.md.bam > lib_002_map_map.bam

The -f and -F filter using flags in column 2 of the BAM file. These aren't always intuitive, and I won't describe them more here, but you can use this handy tool to better understand. Also note that the -u flag creates uncompressed BAM output rather than default compressed BAM output, so the files will be larger. This helps with quicker reading in later steps, but it isn't necessary to include this if you want to save disk space. samtools is super fast either way.

Resolving #2 is more complicated, as there are three ways a read might not have mapped as a proper pair. A. The first read mapped but the paired read did not. B. The first read did not map but the paired read did. C. Neither paired read mapped at all. Again, flags will be used to filter the original BAM file. This information was found at this webpage.

# R1 unmapped, R2 mapped
samtools view -u -f 4 -F 264 lib_002.sorted.md.bam > lib_002_unmap_map.bam
# R1 mapped, R2 unmapped
samtools view -u -f 8 -F 260 lib_002.sorted.md.bam > lib_002_map_unmap.bam
# R1 & R2 unmapped
samtools view -u -f 12 -F 256 lib_002.sorted.md.bam > lib_002_unmap_unmap.bam

As you might expect, you have to then merge the three files that contain at least one unmapped pair.

samtools merge -u lib_002_unmapped.bam lib_002_unmap_map.bam lib_002_map_unmap.bam lib_002_unmap_unmap.bam

Next, these BAM files must be resorted so that they are ordered by read ID instead of location in the reference.

samtools sort -n lib_002_map_map.bam lib_002_mapped.sort
samtools sort -n lib_002_unmapped.bam lib_002_unmapped.sort

At this time, it is a good idea to check that you have the correct number of reads and no redundancy. You can summarize the original BAM file to get an idea of where you started.

samtools flagstat lib_002.sorted.md.bam
## output
# 160673944 + 0 in total (QC-passed reads + QC-failed reads)
# 44648692 + 0 duplicates
# 136861465 + 0 mapped (85.18%:-nan%)
# 160673944 + 0 paired in sequencing
# 80336972 + 0 read1
# 80336972 + 0 read2
# 123173914 + 0 properly paired (76.66%:-nan%)
# 123173914 + 0 with itself and mate mapped
# 13687551 + 0 singletons (8.52%:-nan%)
# 37581548 + 0 with mate mapped to a different chr
# 28422566 + 0 with mate mapped to a different chr (mapQ>=5)

Notice the toal number of input reads that is found on the first line. You want to be sure that the number of unmapped and mapped reads total this number. It is easy to check using the following commands.

samtools view -c lib_002_mapped.sort.bam
## output
# 123173914
samtools view -c lib_002_unmapped.sort.bam
## output
# 37500030

Note that one paired read is counted as two reads here. If you sum these two numbers, they should equal the number you noted above, as they do here.

If all is good, you can now extract the FASTQ reads into two paired read files, as follows.

bamToFastq -i lib_002_mapped.sort.bam -fq lib_002_mapped.1.fastq -fq2 lib_002_mapped.2.fastq 
bamToFastq -i lib_002_unmapped.sort.bam -fq lib_002_unmapped.1.fastq -fq2 lib_002_unmapped.2.fastq 

And then it also makes sense to combine both the first and paired reads together from the mapped and unmapped files.

cat lib_002_mapped.1.fastq lib_002_unmapped.1.fastq > lib_002.1.fastq
cat lib_002_mapped.2.fastq lib_002_unmapped.2.fastq > lib_002.2.fastq

These two files should now have the same number of reads that are exactly as you would have received them if they had come directly from the sequencer as FASTQ.

Please also note that all of the commands above can be piped together in bash using |, which will save on disk space and time. So it is best to combine commands where possible.

@splaisan
Copy link

splaisan commented Apr 29, 2018

Thanks for this nice tutorial, but does samtools view fastq not do most of it already?

What about doing it from a full genome BAM but only extract read pairs for which at least one read of a pair maps in a region?
I am preparing this for a training where I want students to map quickly so I thought of keeping a nice depth of coverage but a width for only a region or a single shorter chromosome.

From what I find, I need to pass one first time and collect the ID's (read names) of read that map in the region for at least one of the pair, then pass a second time through the BAM/SAM with tools like grep to collect all reads having these IDs/names and save to file. After which I could reformat the read properly in fastq pairs.
I fear I will have multimapping reads that will complicate the story and introduce duplicates (using uniq applied to fastq after concatenation may help but will cost extra time!).

Any suggestions or corrections on my plan?

Thanks

@ipstone
Copy link

ipstone commented Aug 21, 2018

Here is an useful link for a related topic: https://www.biostars.org/p/68358/#68362

The approach listed above, won't extract the matching pair sequence to the unmapped sequence read; for that purpose, I first collect the unmapped sequence, and then filter from the orginal bam files - all the Ids matching the 'unmapped', to obtain all the paired reads (map-unmap, unmap-map, unmap-unmap all together).

@cfc424
Copy link

cfc424 commented Jul 20, 2019

this is awsome!
thanks!!

@jelber2
Copy link

jelber2 commented Jul 25, 2019

Should have -o for newer versions of samtools (at least 1.9)

samtools sort -n lib_002_map_map.bam lib_002_mapped.sort -> samtools sort -n lib_002_map_map.bam -o lib_002_mapped.sort.bam
samtools sort -n lib_002_unmapped.bam lib_002_unmapped.sort -> samtools sort -n lib_002_unmapped.bam -o lib_002_unmapped.sort.bam

@jelber2
Copy link

jelber2 commented Jul 25, 2019

Should also mention that you might need to run [BBTools](https://sourceforge.net/projects/bbmap/) [Repair.sh](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/) to repair the read1 and read2 pairing

repair.sh in=lib_002.1.fastq in2=lib_002.2.fastq out=lib_002.1.fastq.gz out2=lib_002.2.fastq.gz repair

@hauduc
Copy link

hauduc commented Sep 20, 2020

I believe you need to add -h to the samtools view commands to include headers so samtools merge reads in the bams correctly, as well as -b if you want to save on storage space, not just removing the -u tag.

@arumds
Copy link

arumds commented Feb 3, 2021

bam2fastq also outputs unpaired mapped and unpaired unmapped reads. How do we include or concat them to generate paired-end read files?

@JoseAMontero
Copy link

good tutorial

@azurillandfriend
Copy link

I'm trying to follow these instructions, but for my bam file the mapped.sort.bam and the unmapped.sort.bam do not add up to the total number of reads in the original bam file. They add up to slightly less. Where have the missing reads gone?

@azurillandfriend
Copy link

Update to last comment. On further examination using samtools flagstat rather than just samtools view -c, the number of reads in the original bam which were "paired in sequencing" is the same as the sum of the reads "paired in sequencing" in the unmapped.sort.bam and mapped.sort.bam files. My original bam file had some reads which were "secondary" mapped, which added to the total reads. But these reads only need to go once to the fastq, not once for each mapping. So the numbers that should add up are the "paired in sequencing", not to grand total.

samtools flagstat for starting bam:
11920739 + 0 in total (QC-passed reads + QC-failed reads)
1391581 + 0 secondary
0 + 0 supplementary
6685265 + 0 duplicates
10630086 + 0 mapped (89.17% : N/A)
10529158 + 0 paired in sequencing
5264579 + 0 read1
5264579 + 0 read2
8575518 + 0 properly paired (81.45% : N/A)
8714398 + 0 with itself and mate mapped
524107 + 0 singletons (4.98% : N/A)
17162 + 0 with mate mapped to a different chr
13772 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat for mapped.sort.bam
10060695 + 0 in total (QC-passed reads + QC-failed reads)
1346297 + 0 secondary
0 + 0 supplementary
5940561 + 0 duplicates
10060695 + 0 mapped (100.00% : N/A)
8714398 + 0 paired in sequencing
4357199 + 0 read1
4357199 + 0 read2
8575518 + 0 properly paired (98.41% : N/A)
8714398 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
17162 + 0 with mate mapped to a different chr
13772 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat for unmapped.sort.bam
1814760 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
716006 + 0 duplicates
524107 + 0 mapped (28.88% : N/A)
1814760 + 0 paired in sequencing
907380 + 0 read1
907380 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
524107 + 0 singletons (28.88% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

the fastq files produced had 5264579 reads each - so a total of 10529158

I replicated the conversion from the starting bam uisng PicardCommandLine and got the same size fastq files (with a lot fewer commands!)

@soumyarao86
Copy link

Hi, I am trying to extract reads in one chromosome but samtools view fails with the msg:

[main_samview] region "chr11" specifies an unknown reference name. Continue anyway.

Can you help me with it?

@splaisan
Copy link

Did you try calling it 11 instead. Some references use chrN and others N alone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment