Instructions for analyzing BRCA1 mutations

In this exercise we are going to download human genomic data generated by the New York Genome Center (NYGC). Information about the data can be found here and details about their pipeline can be found here.

Initializing Project

We will start by setting up a git repository on the cluster to hold our project data.

$ mkdir brca1-project
$ cd brca1-project
$ git init

$ mkdir scripts
$ mkdir data
$ mkdir results
$ nano .gitignore
$ cat .gitignore
*.cram
*.crai
*.bcf
$ git add .gitignore
Downloading Data

Next we will download metadata for the project.

$ cd data
$ curl -OLJ http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_698_related_high_coverage.sequence.index
$ git add 1000G_698_related_high_coverage.sequence.index
$ cd ..
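
The sequence index is a tab-separated table; the generator script below relies on column 1 (the URL of each CRAM file) and column 10 (the sample name). To peek at those two fields for the first few entries:

$ grep -v '^#' data/1000G_698_related_high_coverage.sequence.index | cut -f 1,10 | head -n 3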

We will not be downloading all the data. Instead, we will download only the portions of the CRAM files that contain data aligned to the BRCA1 locus, which is located on chromosome 17 between positions 43044295 and 43125364. To download the data we will use a script that generates another script, which we can then submit as a cluster job.

$ nano scripts/pbatch_header.txt
$ cat scripts/pbatch_header.txt
#!/bin/bash
 
#SBATCH -N 1  # number of nodes
#SBATCH -n 1  # number of "tasks" (default: allocates 1 core per task)
#SBATCH -t 0-10:00:00   # time in d-hh:mm:ss
#SBATCH -p serial       # partition 
#SBATCH -q normal       # QOS
#SBATCH -o slurm.%j.out # file to save job's STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err # file to save job's STDERR (%j = JobId)
#SBATCH --mail-type=ALL # Send an e-mail when a job starts, stops, or fails
#SBATCH --mail-user=USERNAME@asu.edu # Mail-to address
#SBATCH --export=NONE   # Purge the job-submitting shell environment
$ nano scripts/make_download.bash
$ cat scripts/make_download.bash
cat scripts/pbatch_header.txt

echo "module purge"
echo "module add samtools/1.12.0"
echo ""

echo "mkdir -p data/cram"
echo "cd data/cram"

cat data/1000G_698_related_high_coverage.sequence.index |
  grep -v '^#' | head -n 30 |
  awk -F'\t' '{print "samtools view --write-index -o " $10 ".brca1.cram " $1 " chr17:43044295-43125364"}'
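
Each download line the generator emits should look roughly like the following, where SAMPLE and CRAM_URL stand in for column 10 and column 1 of the index (both are placeholders, not real values):

samtools view --write-index -o SAMPLE.brca1.cram CRAM_URL chr17:43044295-43125364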

Now let's run the generator to create our download script and submit it to the cluster.

$ bash scripts/make_download.bash > scripts/download.bash
$ sbatch scripts/download.bash

After the job has finished we can clean up the index files that samtools downloaded for the remote CRAMs, which we no longer need.

$ rm data/cram/*.final.cram.crai
$ git add scripts/pbatch_header.txt scripts/make_download.bash scripts/download.bash
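
It is also worth confirming that all 30 region-restricted CRAM files arrived (30 because of the head -n 30 in the generator script). Assuming the samtools module is loaded in your shell, a quick sanity check:

$ ls data/cram/*.cram | wc -l    # expect: 30
$ samtools quickcheck data/cram/*.cram && echo "all CRAM files look intact"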

Alternative Download

If you are having trouble downloading the genomic data because of network issues, you can try running the following command instead.

curl -sL 'https://www.dropbox.com/s/ifg2s6b0c34lz2w/brca1-project-data.tar.gz?dl=1' | tar xvf -

Merge Data

After acquiring our data, we will start an interactive session and merge our CRAM files into a single file. This will make variant calling a bit easier.

$ interactive
$ nano scripts/merge_crams.bash
$ cat scripts/merge_crams.bash
module purge
module add samtools/1.12.0

mkdir -p data/cram_joined 

samtools merge --write-index data/cram_joined/brca1_data.cram data/cram/*.cram
$ bash scripts/merge_crams.bash
$ git add scripts/merge_crams.bash
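
To confirm that the merge worked, you can count the read groups in the header of the joined file; there should be at least one @RG line per input CRAM. Assuming samtools is still loaded:

$ samtools view -H data/cram_joined/brca1_data.cram | grep -c '^@RG'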

Calling Variants with BCFTools

We will call variants using two different methods; the first method we will use is bcftools.

$ nano scripts/bcftools_call.bash
$ cat scripts/bcftools_call.bash
ref=/data/datasets/community/genome/human/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

module purge
module add bcftools/1.12.0

mkdir -p results/vcf

bcftools mpileup -O b -a AD,DP -o results/vcf/bcftools.raw.bcf \
  -d 10000 -f $ref data/cram_joined/brca1_data.cram

bcftools call --ploidy 2 -m -v -o results/vcf/bcftools.calls.vcf \
  results/vcf/bcftools.raw.bcf
$ bash scripts/bcftools_call.bash
$ git add scripts/bcftools_call.bash
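
Once the script finishes, you can get a quick feel for the calls by counting records or summarizing them with bcftools stats (the module is re-added here because the script's module changes do not persist in your shell):

$ module add bcftools/1.12.0
$ grep -vc '^#' results/vcf/bcftools.calls.vcf    # number of variant records
$ bcftools stats results/vcf/bcftools.calls.vcf | grep '^SN'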

Calling Variants with GATK

The second method that we will use to call variants is GATK's HaplotypeCaller.

$ nano scripts/gatk_call.bash
$ cat scripts/gatk_call.bash
ref=/data/datasets/community/genome/human/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

module purge
module add gatk/4.2.5.0

gatk HaplotypeCaller \
  -A DepthPerSampleHC \
  -A DepthPerAlleleBySample \
  -A Coverage \
  -A ChromosomeCounts \
  -R $ref \
  -L chr17 \
  -I data/cram_joined/brca1_data.cram \
  -O results/vcf/gatk.calls.vcf
$ bash scripts/gatk_call.bash
$ git add scripts/gatk_call.bash
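
A quick, rough comparison of how many records each caller produced:

$ grep -vc '^#' results/vcf/bcftools.calls.vcf results/vcf/gatk.calls.vcf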

Normalizing Variants

In order to compare the variant calls we will normalize them. Normalization shifts indels to the left and breaks up complex variant calls into bi-allelic variants.
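
As a hypothetical illustration (positions and alleles invented for clarity, assuming a reference context of C followed by a run of As), normalization turns records like these:

chr17  43100002  .  AA  A      (deletion inside the A-run, not left-aligned)
chr17  43100010  .  C   G,T    (multi-allelic site)

into these:

chr17  43100000  .  CA  C      (the same deletion, shifted to its leftmost position)
chr17  43100010  .  C   G      (split into two bi-allelic records)
chr17  43100010  .  C   T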

$ nano scripts/filter_and_norm.bash
$ cat scripts/filter_and_norm.bash
input="$1"
ref=/data/datasets/community/genome/human/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

base=$(basename "$input" .calls.vcf)
dir=$(dirname "$input")

module purge
module add bcftools/1.12.0

bcftools view -o "${dir}/${base}.goodcalls.vcf" -e 'QUAL<50' "$input"

bcftools norm -o "${dir}/${base}.norm.vcf" -m -both -f $ref "${dir}/${base}.goodcalls.vcf"
$ bash scripts/filter_and_norm.bash results/vcf/bcftools.calls.vcf
$ bash scripts/filter_and_norm.bash results/vcf/gatk.calls.vcf
$ git add scripts/filter_and_norm.bash
$ git add results/vcf/*.vcf results/vcf/*.idx
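
With both call sets normalized, one way to compare them is bcftools isec, which requires bgzip-compressed and indexed input. A sketch, assuming bgzip is available alongside the bcftools module:

$ bgzip -c results/vcf/bcftools.norm.vcf > results/vcf/bcftools.norm.vcf.gz
$ bcftools index results/vcf/bcftools.norm.vcf.gz
$ bgzip -c results/vcf/gatk.norm.vcf > results/vcf/gatk.norm.vcf.gz
$ bcftools index results/vcf/gatk.norm.vcf.gz
$ bcftools isec -p results/isec results/vcf/bcftools.norm.vcf.gz results/vcf/gatk.norm.vcf.gz

results/isec/0000.vcf will then hold calls private to bcftools, 0001.vcf calls private to GATK, and 0002.vcf/0003.vcf the shared calls as reported by each caller.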