@reedacartwright
Last active March 31, 2022 01:07
Instructions for creating a genomic analysis pipeline using GitHub and a reproducible directory structure

After creating your repository on GitHub, you need to clone it to Agave. Don't forget to use the correct GitHub URL for your repository.

$ cd ~/dc_workshop
$ git clone git@github.com:USERNAME/ga-pipeline-1.git ga-pipeline-1
$ cd ga-pipeline-1

Getting Raw Data

Next we are going to set up our raw data folder and create a metadata file that will hold information about our files and their remote locations.

$ mkdir -p data/untrimmed_fastq
$ nano data/untrimmed_fastq/data_urls.txt
$ cat data/untrimmed_fastq/data_urls.txt
SRR2589044_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
SRR2589044_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
SRR2584863_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
SRR2584863_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
SRR2584866_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
SRR2584866_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz

Next we will create a script to download our raw data files from the URLs stored in the metadata file. This script will use the Bash builtin, read, to read data_urls.txt line by line and assign the filename and URL to variables. It will download any files that don't exist and validate all files at the end.

$ mkdir -p scripts
$ nano scripts/download_fastq.bash
$ cat scripts/download_fastq.bash
cd data/untrimmed_fastq

# Read the meta-data file line by line and download the file if it doesn't exist.
# The while loop will loop over each line, and populate a `filename` and a `url`
# variable from each line.
while read filename url; do
  if [ ! -f "$filename" ]; then
    echo Downloading "$filename" ...
    curl -o "$filename" "$url"
  fi
done < data_urls.txt

echo Validating Files ...

md5sum -c < CHECKSUMS.MD5
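The `while read filename url` idiom above can be tried in isolation; a minimal sketch with a hypothetical one-line metadata file:

```shell
#!/bin/sh
# `read filename url` splits each whitespace-separated line: the first field
# goes to $filename and the rest of the line to $url.
printf 'a.fastq.gz ftp://example.org/a.fastq.gz\n' |
while read filename url; do
  echo "file=$filename url=$url"
done
# prints: file=a.fastq.gz url=ftp://example.org/a.fastq.gz
```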

Our data already exists on the server, so we will copy it into our data directory instead of downloading it again. The download script can be used by anyone who wants to reproduce our work and needs to download our data files.

$ cp ~/dc_workshop/data/untrimmed_fastq/*.fastq.gz data/untrimmed_fastq
$ cd data/untrimmed_fastq
$ md5sum *.fastq.gz > CHECKSUMS.MD5
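The checksum round-trip works the same way on any files; a throwaway sketch in a temp directory (hypothetical file names and contents):

```shell
#!/bin/sh
# Record checksums for some throwaway files, then verify them, mirroring the
# CHECKSUMS.MD5 workflow above.
tmp=$(mktemp -d)
cd "$tmp"
echo "read one" > sample_1.fastq.gz   # not real gzip data; content is arbitrary
echo "read two" > sample_2.fastq.gz
md5sum *.fastq.gz > CHECKSUMS.MD5
md5sum -c < CHECKSUMS.MD5             # prints "sample_1.fastq.gz: OK" etc.
```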

Next we will run our download script to verify our files.

$ cd ../..
$ bash scripts/download_fastq.bash

Now is a good time to update our .gitignore to ignore our data files and make a commit to our repository.

$ nano .gitignore
$ cat .gitignore
*.fastq.gz
$ git add .gitignore scripts/download_fastq.bash
$ git add data/untrimmed_fastq/CHECKSUMS.MD5 data/untrimmed_fastq/data_urls.txt
$ git commit -m "Add data download script and metadata"
$ git push origin main

Quality Control

As we did earlier this semester, we will use fastqc to do quality control on our data. We will write a script that loads the server's fastqc module and runs it on our raw reads.

$ nano scripts/run_fastqc.bash
$ cat scripts/run_fastqc.bash
# Load fastqc module
module add fastqc/0.11.7

# Set input and output variables
OUTDIR=results/fastqc_untrimmed_reads
INPUT=data/untrimmed_fastq/*.fastq.gz

# Create output directory if necessary
mkdir -p $OUTDIR

# Run fastqc
fastqc -o $OUTDIR $INPUT
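Note that the glob stored in INPUT is not expanded when it is assigned; it expands where the unquoted variable is later used. A quick sketch with throwaway files:

```shell
#!/bin/sh
# A glob stored in an unquoted assignment stays literal; it expands when the
# variable is used unquoted.
tmp=$(mktemp -d)
cd "$tmp"
touch a.fastq.gz b.fastq.gz
INPUT=*.fastq.gz        # no expansion happens here
echo $INPUT             # expands now: a.fastq.gz b.fastq.gz
```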

Now run the fastqc script and commit the work to the repository.

$ bash scripts/run_fastqc.bash
$ git add scripts/run_fastqc.bash results/fastqc_untrimmed_reads/
$ git commit -m "Add fastqc script and results"
$ git push origin main

Read Filtering

We will use trimmomatic to filter and trim our sequencing reads to remove low-quality data and artifacts. We will start by creating a wrapper script that will call trimmomatic on paired-end sequencing reads. Only the forward read file needs to be passed as an argument; the reverse read file name is derived from it.

$ cp /packages/7x/trimmomatic/0.33/adapters/NexteraPE-PE.fa scripts
$ nano scripts/trimmomatic.bash
$ cat scripts/trimmomatic.bash
#!/bin/bash

# Wrapper Script for Trimmomatic
#   USAGE: bash trimmomatic.bash PATH/TO/INPUT_1.fastq.gz PATH/TO/OUTPUT/DIR

# Load Module
module purge
module add trimmomatic/0.33

# Store script arguments to variables
infile="$1"
outdir="$2"

# Create output directory if necessary
mkdir -p "$outdir"

# Trimmomatic's path and arguments
TRIM_BIN="java -jar /packages/7x/trimmomatic/0.33/trimmomatic.jar"
TRIM_ARGS="SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:scripts/NexteraPE-PE.fa:2:40:15"

# Find the basename and input directory of our input file
base=$(basename "${infile}" _1.fastq.gz)
indir=$(dirname "${infile}")

# Run Trimmomatic
$TRIM_BIN PE "${indir}/${base}_1.fastq.gz" "${indir}/${base}_2.fastq.gz" \
  "${outdir}/${base}_1.trim.fastq.gz" "${outdir}/${base}_1un.trim.fastq.gz" \
  "${outdir}/${base}_2.trim.fastq.gz" "${outdir}/${base}_2un.trim.fastq.gz" \
  $TRIM_ARGS
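The basename/dirname step is easy to check by hand; a sketch using one of our read files as the example path:

```shell
#!/bin/sh
# basename with a suffix argument strips both the directory and the suffix;
# dirname keeps only the directory part.
infile=data/untrimmed_fastq/SRR2584863_1.fastq.gz
base=$(basename "$infile" _1.fastq.gz)
indir=$(dirname "$infile")
echo "$base"    # SRR2584863
echo "$indir"   # data/untrimmed_fastq
```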

To verify that our script works, let's run it in an interactive session on one of the compute nodes.

$ interactive
$ bash scripts/trimmomatic.bash data/untrimmed_fastq/SRR2584863_1.fastq.gz data/trimmed_fastq

Once the script finishes successfully, let's leave our interactive session and go back to the head node to update our repository.

$ exit
$ git add scripts/trimmomatic.bash scripts/NexteraPE-PE.fa
$ git commit -m "Add a wrapper script for trimmomatic"
$ git push origin main

Submitting Jobs on Agave

Using interactive sessions doesn't scale well when you have lots of jobs to run. The advantage of using Agave is that its job scheduler can manage running your jobs for you. Jobs are submitted via the sbatch command, and we need to modify our script to take advantage of it.

$ nano scripts/trimmomatic.bash
$ cat scripts/trimmomatic.bash
#!/bin/bash

# Wrapper Script for Trimmomatic
#   USAGE: bash trimmomatic.bash PATH/TO/INPUT_1.fastq.gz PATH/TO/OUTPUT/DIR

#SBATCH -N 1  # number of nodes
#SBATCH -n 1  # number of "tasks" (default: allocates 1 core per task)
#SBATCH -t 0-00:10:00   # time in d-hh:mm:ss
#SBATCH -p serial       # partition 
#SBATCH -q normal       # QOS
#SBATCH -o slurm.%j.out # file to save job's STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err # file to save job's STDERR (%j = JobId)
#SBATCH --mail-type=ALL # Send an e-mail when a job starts, stops, or fails
#SBATCH --mail-user=YOUR-EMAIL-ADDRESS # Mail-to address
#SBATCH --export=NONE   # Purge the job-submitting shell environment

# Load Module
module purge
module add trimmomatic/0.33

# Store script arguments to variables
infile="$1"
outdir="$2"

# Create output directory if necessary
mkdir -p "$outdir"

# Trimmomatic's path and arguments
TRIM_BIN="java -jar /packages/7x/trimmomatic/0.33/trimmomatic.jar"
TRIM_ARGS="SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:scripts/NexteraPE-PE.fa:2:40:15"

# Find the basename and input directory of our input file
base=$(basename "${infile}" _1.fastq.gz)
indir=$(dirname "${infile}")

# Run Trimmomatic
$TRIM_BIN PE "${indir}/${base}_1.fastq.gz" "${indir}/${base}_2.fastq.gz" \
  "${outdir}/${base}_1.trim.fastq.gz" "${outdir}/${base}_1un.trim.fastq.gz" \
  "${outdir}/${base}_2.trim.fastq.gz" "${outdir}/${base}_2un.trim.fastq.gz" \
  $TRIM_ARGS

Next we will submit a job to the cluster using sbatch, and we can check the status of our jobs with the squeue command.

$ sbatch scripts/trimmomatic.bash data/untrimmed_fastq/SRR2584866_1.fastq.gz data/trimmed_fastq
$ squeue -u USERNAME

Let's update our .gitignore to ignore the slurm output files.

$ nano .gitignore
$ cat .gitignore
*.fastq.gz
slurm.*.out
slurm.*.err

When there are multiple jobs that need to be submitted, it's best to write a script that runs multiple sbatch commands for you.

$ nano scripts/submit_trim_jobs.bash
$ cat scripts/submit_trim_jobs.bash
# Trimmomatic job submission script
# Setup variables
metadata=data/untrimmed_fastq/data_urls.txt
indir=data/untrimmed_fastq
infiles=$(cut -d' ' -f 1 $metadata | grep _1.fastq.gz)
outdir=data/trimmed_fastq

# Submit one job per input file
for filename in $infiles; do
  sbatch scripts/trimmomatic.bash $indir/$filename $outdir
done
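The cut | grep pipeline above selects the forward-read filenames (first column, names containing _1.fastq.gz) from the metadata file; a self-contained sketch with a hypothetical three-line file:

```shell
#!/bin/sh
# cut -d' ' -f1 keeps the first space-separated column (the filename), and
# grep keeps only the forward reads.
tmp=$(mktemp -d)
cat > "$tmp/data_urls.txt" <<'EOF'
SRR2589044_1.fastq.gz ftp://example.org/SRR2589044_1.fastq.gz
SRR2589044_2.fastq.gz ftp://example.org/SRR2589044_2.fastq.gz
SRR2584863_1.fastq.gz ftp://example.org/SRR2584863_1.fastq.gz
EOF
cut -d' ' -f1 "$tmp/data_urls.txt" | grep _1.fastq.gz
# prints the two forward-read names, one per line
```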

Let's run the script to submit our jobs and update our repository.

$ bash scripts/submit_trim_jobs.bash
$ git add scripts/trimmomatic.bash scripts/submit_trim_jobs.bash .gitignore
$ git commit -m "Add support for sbatch in our trimmomatic script"
$ git push origin main

Reference Alignment

After we have finished trimming our sequencing reads, we need to align our data to a reference genome. We will be using bwa mem for that. First we will download and setup our reference genome, using a script to record our work.

$ nano scripts/download_ref_genome.bash
$ cat scripts/download_ref_genome.bash
# Setup variables
ref_dir=data/ref_genome
ref_file=$ref_dir/ecoli_rel606.fasta   
url="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz"

# Download and decompress file
mkdir -p $ref_dir
curl -L $url | gunzip > $ref_file

# Load Modules
module purge
module add bwa/0.7.17

# Index the reference
bwa index $ref_file
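The curl | gunzip pipeline decompresses the reference on the fly. The same pattern can be demonstrated without a network by piping locally compressed data through gunzip (hypothetical FASTA content):

```shell
#!/bin/sh
# gzip some stand-in FASTA data, then decompress it through a pipe to a plain
# file, mirroring `curl -L $url | gunzip > $ref_file`.
tmp=$(mktemp -d)
printf '>chr1\nACGT\n' | gzip > "$tmp/ref.fna.gz"
gunzip < "$tmp/ref.fna.gz" > "$tmp/ref.fasta"
head -n 1 "$tmp/ref.fasta"    # >chr1
```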

Next we will create a wrapper script to run bwa mem on our data.

$ nano scripts/bwamem.bash
$ cat scripts/bwamem.bash
#!/bin/bash

# Wrapper Script for bwa mem
#   USAGE: bash bwamem.bash PATH/TO/INPUT_1.trim.fastq.gz PATH/TO/OUTPUT/DIR

#SBATCH -N 1  # number of nodes
#SBATCH -n 1  # number of "tasks" (default: allocates 1 core per task)
#SBATCH -t 0-00:20:00   # time in d-hh:mm:ss
#SBATCH -p serial       # partition 
#SBATCH -q normal       # QOS
#SBATCH -o slurm.%j.out # file to save job's STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err # file to save job's STDERR (%j = JobId)
#SBATCH --mail-type=ALL # Send an e-mail when a job starts, stops, or fails
#SBATCH --mail-user=YOUR-EMAIL-ADDRESS # Mail-to address
#SBATCH --export=NONE   # Purge the job-submitting shell environment

# Load Modules
module purge
module add bwa/0.7.17
module add samtools/1.12.0

# Store script arguments as variables
infile="$1"
outdir="$2"
reference="data/ref_genome/ecoli_rel606.fasta"

# Create output directory if necessary
mkdir -p "$outdir"

# Find the basename and input directory of our input file
base=$(basename "${infile}" _1.trim.fastq.gz)
indir=$(dirname "${infile}")

# Call bwa mem and create a bam file
bwa mem "$reference" -R "@RG\tID:${base}\tSM:${base}" \
  "${indir}/${base}_1.trim.fastq.gz" \
  "${indir}/${base}_2.trim.fastq.gz" |
  samtools view -bo "${outdir}/${base}.aligned.bam"

To verify that our script works, we are going to submit a single job with the script and look for errors after it completes.

$ sbatch scripts/bwamem.bash data/trimmed_fastq/SRR2584863_1.trim.fastq.gz data/bam

While that is running we will update our gitignore and the repository.

$ nano .gitignore
$ cat .gitignore
*.fastq.gz
slurm.*.out
slurm.*.err
data/ref_genome/*
data/bam/*
$ git add scripts/download_ref_genome.bash scripts/bwamem.bash .gitignore
$ git commit -m "Add a wrapper script for aligning sequences"
$ git push origin main