
UCGD-Pipeline VAR project setup.

This is a walkthrough of the UCGD Pipeline simplified project setup.

Step One:

Complete project setup.

  1. Create a new project in UCGD_DB

  2. Create a new UCGD Data Processing/Analysis Project

Step Two:

Build processing project

$> sudo /bin/su - ucgd-pepipeline
$> ml ucgd_modules
$> UCGDProject BuildProject -p [% Project %]

By default, UCGDProject uses the build specified in the UCGD database.
If, however, you would like to override the default, you can add the following option:

optional arguments:
  --build BUILD, -b BUILD
    Which reference version to use when building the project. Overrides the database assembly, i.e. GRCh37/GRCh38.
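
For example, to force a GRCh38 build regardless of the assembly recorded in the database:

$> UCGDProject BuildProject -p [% Project %] --build GRCh38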

Step Three:

Download customer primary data files.

Link or copy (preferred) the primary data and md5sum (or sha256) files to the Project_Setup directory.

Modify the examples below to fit the location and type of data.

Consider using a PE data transfer node to increase transfer rate from outside PE:

$> ssh pe-dtn.chpc.utah.edu
$> cd /scratch/ucgd/lustre/UCGD_Processing/[% Project %]/UCGD/GRCh38/Project_Setup
Copy data from HCI

Download FDT jar file from http://monalisa.cern.ch/FDT/download.html

Log in to HCI-Gnomex; you must be granted access to the project by the investigator and/or Brian Dalley.

Search Proj # > Files tab > Download > FDT CL download

Paste that command line into redwood, as shown below:

$> nohup [paste CL from HCI-Gnomex] 2> fdt.error &

# For example:
$> nohup java -jar fdt.jar -noupdates -pull -r -c hci-bio-app.hci.utah.edu -d ./ /scratch/fdtswap/fdt_sandbox_gnomex/f0e318c3-3fd8-4ea5-b24c-dc949a3d5580/15101R 2> fdt.error &

Move the fastq.gz files from the HCI download subdirectory into Project_Setup, and perform all subsequent processing in that directory. The Snakemake workflow and the other steps in this README expect the fastq files and the processing steps to live in Project_Setup; if you process in another directory, things will break and you will have to edit the Snakefile and some of the command lines below. An example move is sketched below.
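
For example, if the FDT pull above created a subdirectory named after the GNomEx request (15101R in the example command; the directory name is illustrative), the following moves the fastq files up into Project_Setup:

$> mv 15101R/*.fastq.gz ./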

Copy data from FTP
$> wget -m ftp://User_Name:Password@Data_Link
Copy data from elsewhere on CHPC

Note: If you are copying from one server to another, be sure to use scp instead of cp.

$> ls /path/to/data/*.fastq.gz | nohup parallel cp {} . 2> cp.fastq.error &

$> ls /path/to/data/*.md5 | parallel cp {} . 2> cp.md5.error
# Or
$> ls /path/to/data/*.sha256 | parallel cp {} . 2> cp.sha256.error

# If no md5sums (or sha256) provided:
$> cd /path/to/data/
$> ls *.fastq.gz | nohup parallel 'md5sum {} > {}.md5' 2> md5sum.err &
$> cd -
$> cp /path/to/data/*.md5 ./
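
Optionally, verify the checksums once the files are in Project_Setup (a suggested sketch, not a required pipeline step; assumes each .md5 file references its fastq file by basename, as generated above):

$> ls *.md5 | nohup parallel 'md5sum -c {}' > md5check.out 2> md5check.err &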
Copy data using rclone
# Use rclone config to set up a new endpoint for the transfer.
$> nohup rclone sync janssen_coon_s3:sys_bio/PROJ-00161 PROJ-00161 &> rclone-sync-sys_bio-PROJ-00161.txt &
Copy data from Amazon S3
# Set up a profile in ~/.aws/credentials that has the format
# [profile_name]
# aws_access_key_id = #############
# aws_secret_access_key = ###################

$> aws s3 sync --profile profile_name s3://bucket_name/ .

Step Four:

Create the source_files_ids.txt and source_file_manifest.txt files.

NOTE: THIS MUST BE DONE VERY CAREFULLY AND MAY BE DIFFERENT FOR EACH PROJECT! MISTAKES AT THIS STEP COULD RESULT IN DATA SWAPS OR DATA MIXING. SANITY CHECK THE RESULTING FILE MANIFEST FOR ACCURACY.

If necessary, change fastq file names or BAM file SM tags so that the Sample_ID column in the sample manifest matches either A) the fastq file name prior to the first underscore, or B) the SM tag in the BAM file. A hypothetical rename is sketched below.
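
For example (file and sample names are illustrative), if the manifest Sample_ID is 12345-01, the fastq file name must begin with that ID before the first underscore:

$> mv sampleA_S1_L001_R1_001.fastq.gz 12345-01_S1_L001_R1_001.fastq.gz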

## Fastq files
$> ls *gz | perl -F'_' -lane 'use Cwd qw(abs_path); print join "\t", $F[0], abs_path($_)' > source_files_ids.txt
$> data_prep.pl -list source_files_ids.txt > source_file_manifest.txt

## BAM files
$> ml samtools
$> bam_sample_file_names.pl *.bam > source_files_ids.txt
$> data_prep.pl -list source_files_ids.txt > source_file_manifest.txt

NOTE:

All source_files_ids.txt must be of the form (tab delimited):

sample_id   full/path/to/file

The Project_Setup directory must contain both source_files_ids.txt and source_file_manifest.txt for the workflows to process correctly. A quick sanity check is sketched below.
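
As a sanity check (a suggested sketch, not an official pipeline step), confirm that each sample ID appears the expected number of times and that every listed path exists:

$> cut -f1 source_files_ids.txt | sort | uniq -c
# Missing files will be reported on stderr:
$> cut -f2 source_files_ids.txt | xargs ls -l > /dev/null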

Step Five:

Update UBox manifest and PED (Optional)

$> mkdir ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]
$> ucgd_db --report manifest --Project [% Project %] > ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]/[% Project %]-Samples.txt
$> ucgd_db --report ped --Project [% Project %] > ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]/[% Project %].ped

Update Jira task

Check UCGD_DB.db > People for the PI's First_Name, Last_Name, and Email. These should exactly match the manifest entries in Projects.

$> ucgd_db --report project_wiki --Project [% Project %]

Paste the resulting markdown into the Jira task description.

Write manifest to UCGD_Processing Reports directory

$> cd /scratch/ucgd/lustre/UCGD_Processing/[% Project %]/UCGD/GRCh38/Reports
$> ucgd_db --report manifest --Project [% Project %] > ./[% Project %]-Samples.txt

Change the Workflow Status to "Pipeline Processing" in Jira and assign the task to Shawn.
