
UCGD-Pipeline VAR project setup.

This is a walkthrough of the UCGD Pipeline simplified project setup.

Step One:

Complete project setup.

  1. Create a new project in UCGD_DB

  2. Create a new UCGD Data Processing/Analysis Project

Step Two:

Build processing project

$> sudo /bin/su - ucgd-pepipeline
$> ml ucgd_modules
$> UCGDProject BuildProject -p [% Project %]

By default, UCGDProject uses the build specified in the UCGD database.
If, however, you would like to override the default, you can add the following option:

optional arguments:
  --build BUILD, -b BUILD
    Which reference version to use when building the project. Overrides the database assembly, i.e. GRCh37/GRCh38.
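
For example, to force a GRCh38 build regardless of the assembly recorded in the database:

$> UCGDProject BuildProject -p [% Project %] --build GRCh38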

Step Three:

Download customer primary data files.

Link or copy (preferred) the primary data and md5sum (or sha256) files to the Project_Setup directory.

Modify the examples below to fit the location and type of data.

Consider using a PE data transfer node to increase transfer rate from outside PE:

$> ssh pe-dtn.chpc.utah.edu
$> cd /scratch/ucgd/lustre/UCGD_Processing/[% Project %]/UCGD/GRCh38/Project_Setup
Copy data from HCI

Download FDT jar file from http://monalisa.cern.ch/FDT/download.html

Log in to HCI-Gnomex; you must be granted access to the project by the investigator and/or Brian Dalley.

Search Proj # > Files tab > Download > FDT CL download

Paste that command line into redwood, as shown below:

$> nohup [paste CL from HCI-Gnomex] 2> fdt.error &

# For example:
$> nohup java -jar fdt.jar -noupdates -pull -r -c hci-bio-app.hci.utah.edu -d ./ /scratch/fdtswap/fdt_sandbox_gnomex/f0e318c3-3fd8-4ea5-b24c-dc949a3d5580/15101R 2> fdt.error &

Move the fastq.gz files from the HCI download subdirectory into Project_Setup, and perform all subsequent processing in that directory. The Snakemake workflow and the other steps in this README expect the fastq files and the processing steps to live in Project_Setup; if you process in another directory, things will break and you will have to edit the Snakefile and some of the command lines below. An example move is sketched below.
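
For example, if the FDT pull above created a subdirectory named after the GNomEx request (15101R in the example command; the directory name is illustrative), the following moves the fastq files up into Project_Setup:

$> mv 15101R/*.fastq.gz ./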

Copy data from FTP
$> wget -m ftp://User_Name:Password@Data_Link
Copy data from elsewhere on CHPC

Note: If you are copying from one server to another, be sure to use scp instead of cp.

$> ls /path/to/data/*.fastq.gz | nohup parallel cp {} . 2> cp.fastq.error &

$> ls /path/to/data/*.md5 | parallel cp {} . 2> cp.md5.error
# Or
$> ls /path/to/data/*.sha256 | parallel cp {} . 2> cp.sha256.error

# If no md5sums (or sha256) provided:
$> cd /path/to/data/
$> ls *.fastq.gz | nohup parallel 'md5sum {} > {}.md5' 2> md5sum.err &
$> cd -
$> cp /path/to/data/*.md5 ./
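
Optionally, verify the checksums once the files are in Project_Setup (a suggested sketch, not a required pipeline step; assumes each .md5 file references its fastq file by basename, as generated above):

$> ls *.md5 | nohup parallel 'md5sum -c {}' > md5check.out 2> md5check.err &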
Copy data using rclone
# Use rclone config to set up a new endpoint for the transfer.
$> nohup rclone sync janssen_coon_s3:sys_bio/PROJ-00161 PROJ-00161 &> rclone-sync-sys_bio-PROJ-00161.txt &
Copy data from Amazon S3
# Set up a profile in ~/.aws/credentials that has the format
# [profile_name]
# aws_access_key_id = #############
# aws_secret_access_key = ###################

$> aws s3 sync --profile profile_name s3://bucket_name/ .

Step Four:

Create the source_files_ids.txt and source_file_manifest.txt files.

NOTE: THIS MUST BE DONE VERY CAREFULLY AND MAY BE DIFFERENT FOR EACH PROJECT! MISTAKES AT THIS STEP COULD RESULT IN DATA SWAPS OR DATA MIXING. SANITY CHECK THE RESULTING FILE MANIFEST FOR ACCURACY.

If necessary, change fastq file names or BAM file SM tags so that the Sample_ID column in the sample manifest matches either A) the fastq file name prior to the first underscore, or B) the SM tag in the BAM file. A hypothetical rename is sketched below.
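
For example (file and sample names are illustrative), if the manifest Sample_ID is 12345-01, the fastq file name must begin with that ID before the first underscore:

$> mv sampleA_S1_L001_R1_001.fastq.gz 12345-01_S1_L001_R1_001.fastq.gz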

## Fastq files
$> ls *gz | perl -F'_' -lane 'use Cwd qw(abs_path); print join "\t", $F[0], abs_path($_)' > source_files_ids.txt
$> data_prep.pl -list source_files_ids.txt > source_file_manifest.txt

## BAM files
$> ml samtools
$> bam_sample_file_names.pl *.bam > source_files_ids.txt
$> data_prep.pl -list source_files_ids.txt > source_file_manifest.txt

NOTE:

All source_files_ids.txt must be of the form (tab delimited):

sample_id   full/path/to/file

The Project_Setup directory must contain both source_files_ids.txt and source_file_manifest.txt for the workflows to process correctly. A quick sanity check is sketched below.
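
As a sanity check (a suggested sketch, not an official pipeline step), confirm that each sample ID appears the expected number of times and that every listed path exists:

$> cut -f1 source_files_ids.txt | sort | uniq -c
# Missing files will be reported on stderr:
$> cut -f2 source_files_ids.txt | xargs ls -l > /dev/null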

Step Five:

Update UBox manifest and PED (Optional)

$> mkdir ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]
$> ucgd_db --report manifest --Project [% Project %] > ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]/[% Project %]-Samples.txt
$> ucgd_db --report ped --Project [% Project %] > ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]/[% Project %].ped

Update Jira task

Check UCGD_DB.db > People for the PI's First_Name, Last_Name, and Email. These should exactly match the manifest entries in Projects.

$> ucgd_db --report project_wiki --Project [% Project %]

Paste the resulting markdown into the Jira task description.

Write manifest to UCGD_Processing Reports directory

$> cd /scratch/ucgd/lustre/UCGD_Processing/[% Project %]/UCGD/GRCh38/Reports
$> ucgd_db --report manifest --Project [% Project %] > ./[% Project %]-Samples.txt

Change the Workflow Status to "Pipeline Processing" in Jira and assign the task to Shawn.
