Skip to content

Instantly share code, notes, and snippets.

@JamesKane
Last active November 20, 2024 11:35
Show Gist options
  • Save JamesKane/1f6fa0c37dedf36cea8760d1d3787d95 to your computer and use it in GitHub Desktop.
Save JamesKane/1f6fa0c37dedf36cea8760d1d3787d95 to your computer and use it in GitHub Desktop.
Realign a BAM to the Telomere-to-telomere reference
#!/bin/bash
# USAGE: realign.sh [Source BAM/CRAM]
# ./realign.sh source.GRCh38.bam
#
# The script produces a new BAM aligned on the reference specified in the variable. Once complete it will apply CallableLoci
# for some quick QC. The script assumes that the working directory name matches the Sample e.g.
# /mnt/md0/B6564/source.GRCh38.bam
#
# There's a generation of Big Y 500 which do not have the pairs marked correctly. This results in treating the reads as SE.
# bbmap's repair will fix them.
# https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/
# Future versions of this script should detect the problem and run repair first.
reference=/mnt/genomics/Reference/CHM13/chm13v2.0.fa.gz
contig=chrY
sample=${PWD##*/}
source=${1}
target=${sample}.chm13.cram
# Exit when any command fails
set -e
# Collate the read pairs from the source BAM. The prefix causes the intermediate BAMs to be output into the work directory
# instead of /tmp, which can be overrun with very large WGS sources depending on system configuration. The results are converted
# into a FASTQ stream retaining the RG and BC tags. BWA-MEM2 is used to align the stream onto the reference with the RG lines from
# the original BAM. Fixmate is applied to ensure they are setup for marking duplicates. Finally, the stream is sorted and duplicates
# marked into a CRAM file for space reduction.
#
# WARNING! When using experimental reference sequences you must retain a copy to get the original reads back with this method
samtools collate -Oun128 ${source} rnd | samtools fastq -OT RG,BC - \
| bwa-mem2 mem -pt$(nproc) -CH <(samtools view -H ${1}|grep ^@RG) ${reference} - \
| samtools fixmate -u -m - - \
| samtools sort -@2 -m4g - \
| samtools markdup -@8 --reference ${reference} - ${target}
samtools index ${target}
# Create the callable loci report with GATK3. Used for the creation of coverage histograms.
java -jar /mnt/genomics/Applications/GenomeAnalysisTK.jar -T CallableLoci \
-R $reference \
-I ${target} -summary table.txt -o callable_status.bed \
-L ${contig}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment