Skip to content

Instantly share code, notes, and snippets.

@josephhughes
Last active December 20, 2015 01:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save josephhughes/6053031 to your computer and use it in GitHub Desktop.
Save josephhughes/6053031 to your computer and use it in GitHub Desktop.
Generate a set of unique sequences from a fasta file and return the frequency of each unique sequence and a pairwise distance of the sequences using the dist.dna function from ape. The default distance model used in "raw" but all models available in dist.dna can be specified. Sequences that have ? or - will be considered different.
require(ape)
dna2FreqAndDistMat<-function(dna,model=NULL){
if(is.null(model)){ model <- c("raw")}
#model must be one , "raw" is the default model
allowed_models<-c("raw", "N", "TS", "TV", "JC69", "K80", "F81", "K81", "F84", "BH87", "T92", "TN93", "GG95", "logdet", "paralin", "indel", "indelblock")
if(!any(allowed_models==model)){
warning("You need to provide the correct model: raw, N, TS, TV, JC69, K80, F81, K81, F84, BH87, T92, TN93, GG95, logdet, paralin, indel, indelblock")
return(NULL)
}
if (!is.null(dna) && inherits(dna,"DNAbin")){
seqmatrix <- as.character(dna)
xx <-table(apply(seqmatrix, 1, paste, collapse=","))
dd<-as.data.frame(xx)
row.names(dd)<-paste("uniqseqID",seq(1:length(xx)),sep="")
seq<-strsplit(as.character(dd$Var1),",")
uniqmat <- matrix(unlist(seq), ncol = ncol(seqmatrix), byrow = TRUE)
rownames(uniqmat) <- rownames(dd)
uniqdna<-as.DNAbin(uniqmat)
dist <- dist.dna(uniqdna,model=model, as.matrix=TRUE)
freq<-as.data.frame(dd$Freq)
row.names(freq)<-rownames(dd)
}
return(list("freq"=freq,"dist"=dist))
}
data <- read.dna('input.fa', format="fasta")
info=dna2FreqAndDistMat(data)
info$freq
info$dist
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment