naumenko-sa/2018-06-13.GenMedWorkshop.Montreal.Module3.Annotation

Created June 7, 2018 17:56

Star 0 You must be signed in to star a gist
Fork 0 You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/naumenko-sa/82df1cb7d9b5f64691bf437f0eb455f0.js"></script>
Save naumenko-sa/82df1cb7d9b5f64691bf437f0eb455f0 to your computer and use it in GitHub Desktop.

Download ZIP

Module 3 - Annotation. Tutorial

Raw

2018-06-13.GenMedWorkshop.Montreal.Module3.Annotation

Module 3 - annotation

Author

naumenko-sa commented Jun 7, 2018 •

edited

Module3. Annotation

Checklist

open lecture pdf
open gist (browser)
ssh studXX@workshop2018a.vhost37.genap.ca (terminal)
attach your home directory on the server to the laptop (sshfs in linux, winscp in windows)
open google drive folder (browser)
open IGV browser (laptop)
open SQLite DB Browser (laptop)

Download Ashkenazim/Ashkenazim.no_c4r.xls from the google drive and open it in excel

explore annotation columns
look at variant pathogenic in clinvar
filter only de-novo variants (set Zygosity to Het,-,-) - how many variants would you expect here?
try recessive inheritance mode
try dominant inheritance mode

Set up the environment
Check where you are in the file system

cd
pwd

Copy wrapper scripts from the github

git clone https://github.com/naumenko-sa/cre

Set up environment varianbles
. /home/stud40/module3_annotation/env_setup.sh

Check:

which vep
vep
which gemini
gemini
samtools
bcftools
bedtools
gatk-launch --list
freebayes
platypus
which cre.sh

create a project (family)

cd
pwd
mkdir -p nist/input
cp /home/stud40/nist/input/* nist/input
ls nist/input

You should see:

#nist_NA12878.bam - is our input bam file (the alignment of the sequencing reads)
#region.bed - a small region we investigate (to save time, in reality the analysis is exome or genome wide)
cat nist/input/region.bed
22	20500000	21500000

Prepare pipeline run

cd 
pwd
cre.prepare_bcbio_run.sh nist validation /home/stud40/nist/input/region.bed

(Use bed file in your home folder)

Look at the project structure

tree nist

Result

nist
├── config
│   ├── nist.csv
│   ├── nist-template.yaml
│   └── nist.yaml
├── input
│   ├── nist_NA12878.bam
│   └── region.bed
├── samples.txt
└── work

Explore configuration file

cat nist/config/nist.yaml

details:
- algorithm:
    aligner: false
    effects: vep
    effects_transcripts: all
    ensemble:
      numpass: 2
    mark_duplicates: true
    realign: false
    recalibrate: false
    save_diskspace: true
    tools_on:
    - qualimap
    - vep_splicesite_annotations
    - noalt_calling
    validate: giab-NA12878/truth_small_variants.vcf.gz
    validate_regions: giab-NA12878/truth_regions.bed
    variant_regions: /home/stud40/nist/input/region.bed
    variantcaller:
    - gatk-haplotype
    - samtools
    - freebayes
    - platypus
  analysis: variant2
  description: nist_NA12878
  files:
  - /home/stud40/nist/input/nist_NA12878.bam
  genome_build: GRCh37
  metadata:
    batch: nist
fc_name: gatk4
resources:
  default:
    cores: 1
    jvm_opts:
    - -Xms75m
    - -Xmx700m
    memory: 1G
upload:
  dir: ../final

Run the pipeline (first the instructor is running, after that everyone is running with 1 thread) ~ 10 min

export threads=1;bash ~/cre/bcbio.pbs nist

Explore results

tree nist/final

Check out validation: nist_NA12878-validate.png
Check out multiqc report
Check out commands (bcbio-nextgen-commands.log), tools versions (programs.txt), data versions (data_versions.csv).
Create excel report

cd
pwd
export family=nist; export cleanup=1;cre.sh

if you are running for the second time (no cleanup)

cd
pwd
export family=nist; export cleanup=0;cre.sh

Explore excel report: nist.csv
Check if variants are real in IGV. Which of 2 is of better quality?
Explore annotated vcf: Ashkenazim.vcf.gz

cd 
pwd
cp -r /home/stud40/module3_annotation/data/Ashkenazim .
cd Ashkenazim
ls 

# look at  Ashkenazim.vcf.gz
# How big is the file? ls -lh Ashkenazim.vcf.gz
# Header: 
gunzip -c Ashkenazim.vcf.gz | grep ‘^#’ > header.txt
bcftools view -h Ashkenazim.vcf.gz > header.txt
cat header.txt

# What we can learn from the header?
# How many variants?
gunzip -c Ashkenazim.vcf.gz | grep -vc ‘^#’
bcftools stats Ashkenazim.vcf.gz | less
# View variants: 
gunzip -c Ashkenazim.vcf.gz | grep -v "^#" | head | less

Become bioinf pro with 4 tools

#figure out how many reads are aligned
cd
pwd
cd Ashkenzim
samtools
samtools index nist_NA12878.bam 
samtools idxstats nist_NA12878.bam

#extract all variants in a region chr1:1mln-2mln
bcftools
bcftools view Ashkenazim.vcf.gz 1:1000000-2000000 | wc -l

#covert vcf to a table
gatk-launch --list
gatk-launch VariantsToTable
gatk-launch VariantsToTable -V Ashkenazim.vcf.gz -O report.table.txt -F REF -F ALT -F DP -F max_aaf_all
head report.table.txt

# extract variants in muscular genes
bedtools
bedtools intersect -a Ashkenazim.vcf.gz -b Ashkenazim.muscular_genes.bed -header > Ashkenazim.variants_in_muscular_genes.vcf
#how many variants?
cat Ashkenazim.variants_in_muscular_genes.vcf | grep -vc "^#"

Annotate one variant with VEP

VEP web interface
Assembly GRCh37
Chr6 26093141 G A

Annotation of many variants with VEP

cd
pwd
mkdir vep
cd vep
cp /home/stud40/module3_annotation/data/vep/naked.vcf .
tail naked.vcf
# compress and index vcf file
bgzip naked.vcf
tabix naked.vcf

# extract variants on chr1
bcftools view naked.vcf.gz 1 > tiny.vcf
# compress and index vcf file
bgzip tiny.vcf ; tabix tiny.vcf.gz 

#annotate variants
cre.vep.sh tiny.vcf.gz
which cre.vep.sh
ls
#look at the annotated variant
gunzip -c tiny.vepeffects.vcf.gz | tail -n1

#create gemini database
cre.gemini_load.sh tiny.vepeffects.vcf.gz 1
ls
which cre.gemini_load.sh

Explore gemini database in DB Browser.

copy tiny.vepeffects.db to your laptop
Launch DB Browser
open tiny.vepeffects.db.

Querying gemini database
#which genes have variants?
select distinct gene from variants

#select all variants in PKN2 gene
select * from variants where gene='PKN2'

#select samples
select * from samples

#select only rare coding variants (MAF<1%)
select * from variants where max_aaf_all<0.01

#Select impacts of rare coding variants.
select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01

#run the same query in unix command line
gemini query -q "select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01" --header tiny.vepeffects.db

#Generate excel reports:
cre.gemini2txt.sh tiny.vepeffects.db 10
cat tiny.vepeffects.db.txt

cre.gemini_variant_impacts.sh tiny.vepeffects.db 10 ALL
cat tiny.vepeffects.db.impacts.txt

#Example report: Ashkenazim.no_c4r.xls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment