Skip to content

Instantly share code, notes, and snippets.

@reedacartwright
Created December 14, 2017 23:37
Show Gist options
  • Save reedacartwright/334fc1b1621f6bd2abe699ffc3736bc5 to your computer and use it in GitHub Desktop.
Save reedacartwright/334fc1b1621f6bd2abe699ffc3736bc5 to your computer and use it in GitHub Desktop.
Script to calculate the entropy of a mutation specturm from a vcf
# This script calculates the average entropy and effective number of
# possible mutants for the human mutation spectrum.
# Preparing to run this script
# > wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/common_all_20170710.vcf.gz
# > wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/common_all_20170710.vcf.gz.tbi
# > bcftools query -f '%REF\t%ALT\n' common_all_20170710.vcf.gz | zstd > altref.txt.zst
library(data.table)
library(stringr)
aa = fread("zstdcat altref.txt.zst", header=FALSE)
names(aa) = c("REF", "ALT")
a = aa[!str_detect(aa$ALT,","),]
a[,Nuc := str_sub(REF,1,1)]
a[,Masked := str_replace_all(REF,".","N")]
# Due to the way VCF records indels, we will split the input based on the first
# character of the reference, and mask the reference string with Ns.
b = split(str_c(a$Masked,a$ALT,sep="/"), a$Nuc)
# estimate the entropy of each subset
x = lapply(b, table)
p = lapply(x, function(y) y/sum(y))
h = sapply(p, function(q) -sum(q*log2(q)))
# take the average over each subset
n = sapply(b,length)
hEntropy = weighted.mean(h,n) # 1.998921
kEffective = 2**hEntropy # 3.998008
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment