Skip to content

Instantly share code, notes, and snippets.

@sirselim
Last active November 1, 2023 17:08
Show Gist options
  • Save sirselim/dcaad07523c90b46c1c0685efbc5d04e to your computer and use it in GitHub Desktop.
Save sirselim/dcaad07523c90b46c1c0685efbc5d04e to your computer and use it in GitHub Desktop.
small bash script to download dbNSFP 'database' and wrangle into format for pipeline annotation process
#!/bin/bash
## small bash script to download and reformat dbNSFP for pipeline
## Miles Benton
## created: 2018-01-13
## modified: 2019-08-21
# Set to dbNSFP version to download and build
version="4.0a"
#TODO: add an option to 'scrape' this from the url to always return latest version
# define thread number for parallel processing where able
THREADS=6
# Download dbNSFP database
wget -O dbNSFP${version}.zip "ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP${version}.zip"
# Uncompress
unzip dbNSFP${version}.zip
# grab header
zcat dbNSFP${version}_variant.chr1.gz | head -n 1 | bgzip > header.gz
### this section will produce data for hg38 capable pipelines
## hg38 version
# Create a single file version
# NOTE: bgzip parameter -@ X represents number of threads
cat dbNSFP${version}_variant.chr{1..22}.gz dbNSFP${version}_variant.chrX.gz dbNSFP${version}_variant.chrY.gz dbNSFP${version}_variant.chrM.gz | zgrep -v '#chr' | bgzip -@ ${THREADS} > dbNSFPv${version}_custom.gz
#TODO: there must be a 'nicer' way of ordering the input into the cat (to include the X,Y and M chrs without explicitly stating them)
# add header back into file
cat header.gz dbNSFPv${version}_custom.gz > dbNSFPv${version}_custombuild.gz
# Create tabix index
tabix -s 1 -b 2 -e 2 dbNSFPv${version}_custombuild.gz
# test annotation
# java -jar ~/install/snpEff/SnpSift.jar dbnsfp -v -db /mnt/dbNSFP/hg19/dbNSFPv${version}_custombuild.gz test/chr1_test.vcf > test/chr1_test_anno.vcf
#TODO: provide actual unit test files for testing purposes, i.e. a section of public data with known annotation rates.
#TODO: the above is currently a placeholder but it had it's intended purpose in terms of identifying incorrect genome build.
# clean up
#TODO: add clean up step to rm all intermediate files after testing confirmed working (i.e. correct annotation 'rates')
#/END hg38
###
### this section will produce data for hg19 capable pipelines
## hg19 version
# for hg19 (coordinate data is located in columns 8 [chr] and 9 [position])
# this takes output from above, filters out any variants with no hg19 coords and then sorts on hg19 chr and position, and then bgzips output
# NOTE: bgzip parameter -@ X represents number of threads
# create a file with ordered chrosome names
echo {1..22} X Y M | tr ' ' '\n' > chromosomeList.txt
# set awk params as variable for parallel
awkBody1='$8 != "."'
awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = $1; print;}'
# parallel implementation of the hg19 format and sort
parallel -j 4 \
"zgrep -v '#chr' dbNSFP${version}_variant.chr{}.gz | \
awk '${awkBody1}' | \
awk '${awkBody2}' | \
LC_ALL=C sort --parallel=2 -n -S 1G -T . -k 1,1 -k 2,2 --compress-program=gzip | \
bgzip -@ 2 > dbNSFPv${version}.chr{}.hg19.custombuild.gz" :::: chromosomeList.txt
#TODO: need to implement a defined approach to GNU parallel, can't use $THREADS
# cat all files back together (should retain sort order)
cat header.gz dbNSFPv${version}.chr{1..22}.hg19.custombuild.gz dbNSFPv${version}.chrX.hg19.custombuild.gz dbNSFPv${version}.chrY.hg19.custombuild.gz dbNSFPv${version}.chrM.hg19.custombuild.gz > dbNSFPv${version}.hg19.custombuild.gz
# Create tabix index
tabix -s 1 -b 2 -e 2 dbNSFPv${version}.hg19.custombuild.gz
# test hg19 annotation
# java -jar ~/install/snpEff/SnpSift.jar dbnsfp -v -db /mnt/dbNSFP/hg19/dbNSFPv${version}.hg19.custombuild.gz test/chr1_test.vcf > test/chr1_test_anno.vcf
# clean up
#TODO: add clean up step to rm all intermediate files after testing confirmed working (i.e. correct annotation 'rates')
#/END hg38
###
@sirselim
Copy link
Author

@dariogf

The problem that @robinpalvadeau is having is because there are some "bad" lines in chr6, chr17 and chr22 that when they are filtered by awk to convert to hg19 format, seems to belong to another chr. For example, this line from chr6 that is in chr6:

I mentioned these variants in my response above, they are not 'bad' variants and I think it is dangerous to assume such, thus I don't remove them. That's why the creation of dbNSFP hg19 is a little trickier, because you need to identify the variants that map to different chromsomes between the builds, move them to the correct location and then apply the sorting. I have this all written up, but it unfortunately takes what i consider to be too much RAM to do it efficiently. So my current solution is to build the most recent version of dbNSFP once (for both hg19 and hg38) on our cluster and then distribute across other machines/locations as required.

What I'm doing in the snippets you have highlighted is moving the columns into the correct location (the awk script you mention). By default dbNSFP uses hg38 positions in columns 1 and 2, and then has hg19 positions in columns 8 and 9. If you haven't (and are interested) I suggest you familiarise yourself with the dbNSFP structure (most recent readme: https://drive.google.com/file/d/1HPcIEE1USiwCPzRFFwZhTZagka1uyUV1/view):

...
Columns of dbNSFP_variant:
1	chr: chromosome number
2	pos(1-based): physical position on the chromosome as to hg38 (1-based coordinate).
		For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920). 
3	ref: reference nucleotide allele (as on the + strand)
4	alt: alternative nucleotide allele (as on the + strand)
5	aaref: reference amino acid
		"." if the variant is a splicing site SNP (2bp on each end of an intron)
6	aaalt: alternative amino acid
		"." if the variant is a splicing site SNP (2bp on each end of an intron)
7	rs_dbSNP151: rs number from dbSNP 151
8	hg19_chr: chromosome as to hg19, "." means missing
9	hg19_pos(1-based): physical position on the chromosome as to hg19 (1-based coordinate).
		For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
10	hg18_chr: chromosome as to hg18, "." means missing
11	hg18_pos(1-based): physical position on the chromosome as to hg18 (1-based coordinate)
		For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
...

I need to revisit this little side project at some stage, but as it's currently implemented it is working for our purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment