Skip to content

Instantly share code, notes, and snippets.

@edawson
Last active February 5, 2022 15:15
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save edawson/a2417064ca5440f97277075c6bc65704 to your computer and use it in GitHub Desktop.
Save edawson/a2417064ca5440f97277075c6bc65704 to your computer and use it in GitHub Desktop.
#############
## Download the 30X hg19-aligned bam from Google's public sequencing of HG002
## and the respective BAI file.
#############
wget https://storage.googleapis.com/brain-genomics-public/research/sequencing/grch37/bam/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.dedup.grch37.bam
wget https://storage.googleapis.com/brain-genomics-public/research/sequencing/grch37/bam/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.dedup.grch37.bam.bai
#############
## Prepare the references so we can realign reads
#############
## Download the original hg19 / hsd37d5 reference
## and create and FAI index
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
samtools faidx hs37d5.fa
## Download GRCh38
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
## Make a .fai index using samtools faidx
samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
## Create the BWA indices
bwa index GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
## Download the Gold Standard indels from 1kg to use as your known-sites file.
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
## Also grab the tabix index for the file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
############
## Run the bam2fq tool to extract reads from the BAM file
## Adjust the --num-threads argument to reflect the number of cores on your system.
## With 8 GPUs and 64 vCPUs this should take ~45 minutes.
############
time pbrun bam2fq \
--ref hs37d5.fa \
--in-bam HG002.hiseqx.pcr-free.30x.dedup.grch37.bam \
--out-prefix HG002.hiseqx.pcr-free.30x.dedup.grch37.bam2fq \
--num-threads 64
##############
## Run the fq2bam tool to align reads to GRCh38
##############
time pbrun fq2bam \
--in-fq HG002.hiseqx.pcr-free.30x.dedup.grch37.bam2fq_1.fastq.gz HG002.hiseqx.pcr-free.30x.dedup.grch37.bam2fq_1.fastq.gz \
--ref Homo_sapiens_assembly38.fasta \
--knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--out-bam HG002.hiseqx.pcr-free.30x.dedup.grch37.bam2fq.hg38.bam \
--out-recal-file HG002.hiseqx.pcr-free.30x.dedup.grch37.bam2fq.hg38.BQSR-REPORT.txt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment