@opplatek
Last active July 23, 2022 15:08
Extract read names from SAM/BAM
#!/bin/bash
#
# Quickly extract unique read names from SAM/BAM file
# Source: https://www.biostars.org/p/371705/#371748
#
# This extracts the unique read names of all unmapped reads (-f 4) from a BAM file.
# awk is much faster than `sort --parallel=$threads | uniq` because it deduplicates
# in a single pass and never has to sort; output keeps the first-seen order.
# Timings on a BAM file with 149,909,118 input reads:
#   sort | uniq (with 12 threads):
#     real 1m8.240s
#     user 2m57.097s
#     sys  0m18.001s
#   awk '!x[$0]++' (with 1 thread):
#     real 0m42.857s
#     user 2m30.028s
#     sys  0m18.179s
#
threads=12
inbam="in.bam"
# sort | uniq version
#samtools view -@ "$threads" -f 4 "$inbam" \
# | cut -f 1 | sort --parallel="$threads" -T "$(dirname "$inbam")" | uniq \
# > out.unmapped-names.txt
# awk version
samtools view -@ "$threads" -f 4 "$inbam" \
  | cut -f 1 | awk '!x[$0]++' \
  > out.unmapped-names.txt
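
To see what the `awk '!x[$0]++'` idiom does without a BAM file at hand, here is a minimal sketch on toy input. The read names are made up and stand in for the output of `samtools view -f 4 in.bam | cut -f 1`; the variable names `names`, `awk_out`, and `sort_out` are hypothetical.

```shell
#!/bin/bash
# Hypothetical read names standing in for `samtools view -f 4 in.bam | cut -f 1`
names=$(printf 'readB\nreadA\nreadB\nreadC\nreadA\n')

# x[$0]++ evaluates to 0 the first time a line is seen, so !x[$0]++ is true
# only for first occurrences; awk prints those and skips later duplicates.
awk_out=$(printf '%s\n' "$names" | awk '!x[$0]++')

# sort | uniq yields the same set of names, but in sorted order.
sort_out=$(printf '%s\n' "$names" | sort | uniq)

printf 'awk:\n%s\n' "$awk_out"
printf 'sort | uniq:\n%s\n' "$sort_out"
```

Both approaches should return the same set of names; only the order differs. Note that awk holds every unique name in memory, so on inputs with a very large number of distinct names the sort-based version, which can spill to the temporary directory given by `-T`, may be the safer choice.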