Extract read names from SAM/BAM
#!/bin/bash
#
# Quickly extract unique read names from a SAM/BAM file
# Source: https://www.biostars.org/p/371705/#371748
#
# This extracts the unique read names of all unmapped reads (-f 4) from a BAM file.
# awk is faster than `sort --parallel=$threads | uniq` because it doesn't have to sort;
# it keeps the first occurrence of each name, in input order.
# On a BAM file with 149,909,118 input reads:
#   sort | uniq takes (with 12 threads):
#     real 1m8.240s
#     user 2m57.097s
#     sys 0m18.001s
#   awk '!x[$0]++' takes (with 1 thread):
#     real 0m42.857s
#     user 2m30.028s
#     sys 0m18.179s
#
threads=12
inbam="in.bam"
# sort | uniq version
#samtools view -@ "$threads" -f 4 "$inbam" \
#  | cut -f 1 | sort --parallel="$threads" -T "$(dirname "$inbam")" | uniq \
#  > out.unmapped-names.txt
# awk version
samtools view -@ "$threads" -f 4 "$inbam" \
  | cut -f 1 | awk '!x[$0]++' \
  > out.unmapped-names.txt
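
The `awk '!x[$0]++'` idiom used in the script can be tried in isolation on a few toy read names. The sketch below is only an illustration: the `dedupe_lines` helper and the `readA`/`readB`/`readC` names are hypothetical and not part of the original script, and the `printf` input stands in for the output of `samtools view ... | cut -f 1`.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print each distinct
# input line once, keeping the order in which lines were first seen.
dedupe_lines() {
    # x[$0]++ returns the count *before* incrementing, so it is 0 (false)
    # the first time a line appears; !x[$0]++ is therefore true only for
    # the first occurrence, and awk's default action prints that line.
    awk '!x[$0]++'
}

# Toy read names standing in for `samtools view ... | cut -f 1` output.
printf '%s\n' readB readA readB readC readA | dedupe_lines
```

Unlike `sort | uniq`, this does not sort the names, and awk holds every distinct name in memory, so memory use grows with the number of unique read names.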