Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save naumenko-sa/82df1cb7d9b5f64691bf437f0eb455f0 to your computer and use it in GitHub Desktop.
Save naumenko-sa/82df1cb7d9b5f64691bf437f0eb455f0 to your computer and use it in GitHub Desktop.
Module 3 - Annotation. Tutorial
@naumenko-sa
Copy link
Author

naumenko-sa commented Jun 7, 2018

Module3. Annotation

  1. Checklist
  1. Download Ashkenazim/Ashkenazim.no_c4r.xls from the google drive and open it in excel
  • explore annotation columns
  • look at variant pathogenic in clinvar
  • filter only de-novo variants (set Zygosity to Het,-,-) - how many variants would you expect here?
  • try recessive inheritance mode
  • try dominant inheritance mode
  1. Set up the environment
    Check where you are in the file system
cd
pwd

Copy wrapper scripts from the github

git clone https://github.com/naumenko-sa/cre

Set up environment varianbles
. /home/stud40/module3_annotation/env_setup.sh

Check:

which vep
vep
which gemini
gemini
samtools
bcftools
bedtools
gatk-launch --list
freebayes
platypus
which cre.sh
  1. create a project (family)
cd
pwd
mkdir -p nist/input
cp /home/stud40/nist/input/* nist/input
ls nist/input

You should see:

#nist_NA12878.bam - is our input bam file (the alignment of the sequencing reads)
#region.bed - a small region we investigate (to save time, in reality the analysis is exome or genome wide)
cat nist/input/region.bed
22	20500000	21500000
  1. Prepare pipeline run
cd 
pwd
cre.prepare_bcbio_run.sh nist validation /home/stud40/nist/input/region.bed

(Use bed file in your home folder)

  1. Look at the project structure
tree nist

Result

nist
├── config
│   ├── nist.csv
│   ├── nist-template.yaml
│   └── nist.yaml
├── input
│   ├── nist_NA12878.bam
│   └── region.bed
├── samples.txt
└── work
  1. Explore configuration file
cat nist/config/nist.yaml
details:
- algorithm:
    aligner: false
    effects: vep
    effects_transcripts: all
    ensemble:
      numpass: 2
    mark_duplicates: true
    realign: false
    recalibrate: false
    save_diskspace: true
    tools_on:
    - qualimap
    - vep_splicesite_annotations
    - noalt_calling
    validate: giab-NA12878/truth_small_variants.vcf.gz
    validate_regions: giab-NA12878/truth_regions.bed
    variant_regions: /home/stud40/nist/input/region.bed
    variantcaller:
    - gatk-haplotype
    - samtools
    - freebayes
    - platypus
  analysis: variant2
  description: nist_NA12878
  files:
  - /home/stud40/nist/input/nist_NA12878.bam
  genome_build: GRCh37
  metadata:
    batch: nist
fc_name: gatk4
resources:
  default:
    cores: 1
    jvm_opts:
    - -Xms75m
    - -Xmx700m
    memory: 1G
upload:
  dir: ../final
  1. Run the pipeline (first the instructor is running, after that everyone is running with 1 thread) ~ 10 min
export threads=1;bash ~/cre/bcbio.pbs nist
  1. Explore results
tree nist/final 
  1. Check out validation: nist_NA12878-validate.png

  2. Check out multiqc report

  3. Check out commands (bcbio-nextgen-commands.log), tools versions (programs.txt), data versions (data_versions.csv).

  4. Create excel report

cd
pwd
export family=nist; export cleanup=1;cre.sh

if you are running for the second time (no cleanup)

cd
pwd
export family=nist; export cleanup=0;cre.sh
  1. Explore excel report: nist.csv

  2. Check if variants are real in IGV. Which of 2 is of better quality?

  3. Explore annotated vcf: Ashkenazim.vcf.gz

cd 
pwd
cp -r /home/stud40/module3_annotation/data/Ashkenazim .
cd Ashkenazim
ls 

# look at  Ashkenazim.vcf.gz
# How big is the file? ls -lh Ashkenazim.vcf.gz
# Header: 
gunzip -c Ashkenazim.vcf.gz | grep ‘^#’ > header.txt
bcftools view -h Ashkenazim.vcf.gz > header.txt
cat header.txt

# What we can learn from the header?
# How many variants?
gunzip -c Ashkenazim.vcf.gz | grep -vc ‘^#’
bcftools stats Ashkenazim.vcf.gz | less
# View variants: 
gunzip -c Ashkenazim.vcf.gz | grep -v "^#" | head | less

  1. Become bioinf pro with 4 tools
#figure out how many reads are aligned
cd
pwd
cd Ashkenzim
samtools
samtools index nist_NA12878.bam 
samtools idxstats nist_NA12878.bam

#extract all variants in a region chr1:1mln-2mln
bcftools
bcftools view Ashkenazim.vcf.gz 1:1000000-2000000 | wc -l

#covert vcf to a table
gatk-launch --list
gatk-launch VariantsToTable
gatk-launch VariantsToTable -V Ashkenazim.vcf.gz -O report.table.txt -F REF -F ALT -F DP -F max_aaf_all
head report.table.txt

# extract variants in muscular genes
bedtools
bedtools intersect -a Ashkenazim.vcf.gz -b Ashkenazim.muscular_genes.bed -header > Ashkenazim.variants_in_muscular_genes.vcf
#how many variants?
cat Ashkenazim.variants_in_muscular_genes.vcf | grep -vc "^#"

  1. Annotate one variant with VEP
  1. Annotation of many variants with VEP
cd
pwd
mkdir vep
cd vep
cp /home/stud40/module3_annotation/data/vep/naked.vcf .
tail naked.vcf
# compress and index vcf file
bgzip naked.vcf
tabix naked.vcf

# extract variants on chr1
bcftools view naked.vcf.gz 1 > tiny.vcf
# compress and index vcf file
bgzip tiny.vcf ; tabix tiny.vcf.gz 

#annotate variants
cre.vep.sh tiny.vcf.gz
which cre.vep.sh
ls
#look at the annotated variant
gunzip -c tiny.vepeffects.vcf.gz | tail -n1

#create gemini database
cre.gemini_load.sh tiny.vepeffects.vcf.gz 1
ls
which cre.gemini_load.sh
  1. Explore gemini database in DB Browser.
  • copy tiny.vepeffects.db to your laptop
  • Launch DB Browser
  • open tiny.vepeffects.db.
  1. Querying gemini database
    #which genes have variants?
    select distinct gene from variants

#select all variants in PKN2 gene
select * from variants where gene='PKN2'

#select samples
select * from samples

#select only rare coding variants (MAF<1%)
select * from variants where max_aaf_all<0.01

#Select impacts of rare coding variants.
select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01

#run the same query in unix command line
gemini query -q "select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01" --header tiny.vepeffects.db

#Generate excel reports:
cre.gemini2txt.sh tiny.vepeffects.db 10
cat tiny.vepeffects.db.txt

cre.gemini_variant_impacts.sh tiny.vepeffects.db 10 ALL
cat tiny.vepeffects.db.impacts.txt

#Example report: Ashkenazim.no_c4r.xls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment