Skip to content

Instantly share code, notes, and snippets.

@standage
Last active November 26, 2019 17:55
Show Gist options
  • Save standage/407b9c89c20491be43b1 to your computer and use it in GitHub Desktop.
Save standage/407b9c89c20491be43b1 to your computer and use it in GitHub Desktop.
Procedure for removing contaminants from paired-end sequence data. The bwa-mem algorithm is used to map reads against a database of contaminants, a small Perl one-liner is used to filter out reads that map to the contaminants, the SAM data is converted to BAM format, which is then fed (via process substitution) to Tophat's bam2fastx to convert b…
# -q: output in Fastq format
# -Q: ignore BAM quality flags
# -P: paired-end data
bam2fastx -qQP -o clean.fq <(bwa mem contaminants.fasta reads.1.fq reads.2.fq | \
perl -ne '@v = split(/\t/); print if(m/^@/ or ($v[1] & 4 and $v[1] & 8))' | \
samtools view -bhS -)
@lorenapr92
Copy link

Hi. I stumbled to your script and I had a few questions: the perl line what kind of output file does it generate?. Also, the samtools view line does it just create a .bam file that will then be use by Tophat's bam2fastx?

@standage
Copy link
Author

Please note that I haven't used this in over 5 years, and I don't remember how well I validate it. But the perl one-liner is a SAM filter (SAM in, SAM out). The samtools command creates BAM output from SAM input. This is fed to the bam2fastx command via process substitution. Hope this helps.

@lorenapr92
Copy link

Great, one more question. I am assuming you use the SAM out file to create the BAM file not the original SAM file from bwa mem output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment