Download Ashkenazim/Ashkenazim.no_c4r.xls from the google drive and open it in excel
explore annotation columns
look at variant pathogenic in clinvar
filter only de-novo variants (set Zygosity to Het,-,-) - how many variants would you expect here?
try recessive inheritance mode
try dominant inheritance mode
Set up the environment
Check where you are in the file system
cd
pwd
Copy wrapper scripts from the github
git clone https://github.com/naumenko-sa/cre
Set up environment varianbles . /home/stud40/module3_annotation/env_setup.sh
Check:
which vep
vep
which gemini
gemini
samtools
bcftools
bedtools
gatk-launch --list
freebayes
platypus
which cre.sh
create a project (family)
cd
pwd
mkdir -p nist/input
cp /home/stud40/nist/input/* nist/input
ls nist/input
You should see:
#nist_NA12878.bam - is our input bam file (the alignment of the sequencing reads)
#region.bed - a small region we investigate (to save time, in reality the analysis is exome or genome wide)
cat nist/input/region.bed
22 20500000 21500000
Prepare pipeline run
cd
pwd
cre.prepare_bcbio_run.sh nist validation /home/stud40/nist/input/region.bed
Run the pipeline (first the instructor is running, after that everyone is running with 1 thread) ~ 10 min
export threads=1;bash ~/cre/bcbio.pbs nist
Explore results
tree nist/final
Check out validation: nist_NA12878-validate.png
Check out multiqc report
Check out commands (bcbio-nextgen-commands.log), tools versions (programs.txt), data versions (data_versions.csv).
Create excel report
cd
pwd
export family=nist; export cleanup=1;cre.sh
if you are running for the second time (no cleanup)
cd
pwd
export family=nist; export cleanup=0;cre.sh
Explore excel report: nist.csv
Check if variants are real in IGV. Which of 2 is of better quality?
Explore annotated vcf: Ashkenazim.vcf.gz
cd
pwd
cp -r /home/stud40/module3_annotation/data/Ashkenazim .
cd Ashkenazim
ls
# look at Ashkenazim.vcf.gz
# How big is the file? ls -lh Ashkenazim.vcf.gz
# Header:
gunzip -c Ashkenazim.vcf.gz | grep ‘^#’ > header.txt
bcftools view -h Ashkenazim.vcf.gz > header.txt
cat header.txt
# What we can learn from the header?
# How many variants?
gunzip -c Ashkenazim.vcf.gz | grep -vc ‘^#’
bcftools stats Ashkenazim.vcf.gz | less
# View variants:
gunzip -c Ashkenazim.vcf.gz | grep -v "^#" | head | less
Become bioinf pro with 4 tools
#figure out how many reads are aligned
cd
pwd
cd Ashkenzim
samtools
samtools index nist_NA12878.bam
samtools idxstats nist_NA12878.bam
#extract all variants in a region chr1:1mln-2mln
bcftools
bcftools view Ashkenazim.vcf.gz 1:1000000-2000000 | wc -l
#covert vcf to a table
gatk-launch --list
gatk-launch VariantsToTable
gatk-launch VariantsToTable -V Ashkenazim.vcf.gz -O report.table.txt -F REF -F ALT -F DP -F max_aaf_all
head report.table.txt
# extract variants in muscular genes
bedtools
bedtools intersect -a Ashkenazim.vcf.gz -b Ashkenazim.muscular_genes.bed -header > Ashkenazim.variants_in_muscular_genes.vcf
#how many variants?
cat Ashkenazim.variants_in_muscular_genes.vcf | grep -vc "^#"
cd
pwd
mkdir vep
cd vep
cp /home/stud40/module3_annotation/data/vep/naked.vcf .
tail naked.vcf
# compress and index vcf file
bgzip naked.vcf
tabix naked.vcf
# extract variants on chr1
bcftools view naked.vcf.gz 1 > tiny.vcf
# compress and index vcf file
bgzip tiny.vcf ; tabix tiny.vcf.gz
#annotate variants
cre.vep.sh tiny.vcf.gz
which cre.vep.sh
ls
#look at the annotated variant
gunzip -c tiny.vepeffects.vcf.gz | tail -n1
#create gemini database
cre.gemini_load.sh tiny.vepeffects.vcf.gz 1
ls
which cre.gemini_load.sh
Explore gemini database in DB Browser.
copy tiny.vepeffects.db to your laptop
Launch DB Browser
open tiny.vepeffects.db.
Querying gemini database
#which genes have variants?
select distinct gene from variants
#select all variants in PKN2 gene
select * from variants where gene='PKN2'
#select samples
select * from samples
#select only rare coding variants (MAF<1%)
select * from variants where max_aaf_all<0.01
#Select impacts of rare coding variants.
select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01
#run the same query in unix command line
gemini query -q "select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01" --header tiny.vepeffects.db
This comment has been minimized.
Module3. Annotation
Check where you are in the file system
Copy wrapper scripts from the github
Set up environment varianbles
. /home/stud40/module3_annotation/env_setup.sh
Check:
You should see:
(Use bed file in your home folder)
Result
Check out validation: nist_NA12878-validate.png
Check out multiqc report
Check out commands (bcbio-nextgen-commands.log), tools versions (programs.txt), data versions (data_versions.csv).
Create excel report
if you are running for the second time (no cleanup)
Explore excel report: nist.csv
Check if variants are real in IGV. Which of 2 is of better quality?
Explore annotated vcf: Ashkenazim.vcf.gz
#which genes have variants?
select distinct gene from variants
#select all variants in PKN2 gene
select * from variants where gene='PKN2'
#select samples
select * from samples
#select only rare coding variants (MAF<1%)
select * from variants where max_aaf_all<0.01
#Select impacts of rare coding variants.
select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01
#run the same query in unix command line
gemini query -q "select chrom,start+1,ref,alt,impact from variants where max_aaf_all<0.01" --header tiny.vepeffects.db
#Generate excel reports:
cre.gemini2txt.sh tiny.vepeffects.db 10
cat tiny.vepeffects.db.txt
cre.gemini_variant_impacts.sh tiny.vepeffects.db 10 ALL
cat tiny.vepeffects.db.impacts.txt
#Example report: Ashkenazim.no_c4r.xls