We downloaded the source code of deltaBS from GitHub (https://github.com/Gardner-BinfLab/deltaBS) and then followed the online instructions to get started:
We used conda to install HMMER3 (http://hmmer.janelia.org/) and BioPerl (http://bioperl.org/INSTALL.html).
conda install -c bioconda hmmer
conda install -c bioconda perl-bioperl
We used CPAN to install the following modules: Statistics::Distributions, Data::Dumper, and Getopt::Long.
cpan> install Statistics::Distributions
cpan> install Data::Dumper
cpan> install Getopt::Long
We downloaded the Pfam HMM collection from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release and used the Pfam-A.hmm file. When running deltaBS, the option -hp specifies the path to HMMER3, and -hd specifies the directory to which the Pfam HMMs were downloaded. After downloading, we renamed the Pfam-A.hmm file to deltaBS.hmmlib and indexed it for use as the HMM database:
hmmpress deltaBS.hmmlib
We used the DFAST online annotation server (https://dfast.nig.ac.jp/dfc/) to obtain the genome annotations; the resulting GenBank (.gbk) files served as the input for the delta-bitscore calculation:
./src/deltaBS.pl -f genbank -f1 UNISED/E4742annotation.gbk -f2 UNISED/E4385annotation.gbk --verbose -o UNISED/E4385_E4742 -hp /usr/local/bin -hd /home/shenzy/software/deltaBS/ -t UNISED/ -c 15
Options:
-f [embl, genbank, fasta] file format for reference/query files
-f1 [filename] your reference genome/proteome file
-f2 [filename] your query genome/proteome file
-o [directory] directory you would like the results to go in
-ol [filename] (optional) ortholog list
-pa1 [filename] (optional) hmmscan results for genome 1
-pa2 [filename] (optional) hmmscan results for genome 2
-hp [directory] directory where HMMER3 has been installed (e.g. /usr/local/bin/)
-c [number] number of cpus to be used by HMMER
-hd [directory] directory where the HMM library has been downloaded
-v (optional) use this if you would like updates on what the script is doing while it is running
We also downloaded the profile HMMs of Gammaproteobacterial proteins (gproNOG.hmm) from the eggNOG database (http://eggnogdb.embl.de/), and then ran all pairwise delta-bitscore comparisons between strains with the deltaBS.pl script described above. We combined all the resulting tables and extracted the core orthologous genes of each strain to generate the DBS matrix (bitscores.tsv), which was used as the input file for constructing and training the random forest classifier below.
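As a rough illustration, the merging step might look like the following R sketch; the directory name (dbs_tables/), file names, and two-column layout (ortholog ID, delta-bitscore) are assumptions to be adapted to the actual deltaBS output:
# A minimal sketch of the table-merging step. Assumptions: each pairwise
# deltaBS run produced a tab-delimited table in dbs_tables/ whose first
# column is the ortholog ID and whose second column is the delta-bitscore.
files <- list.files("dbs_tables", pattern = "\\.tsv$", full.names = TRUE)
tables <- lapply(files, read.delim)
# keep only orthologues present in every comparison (the core set)
core <- Reduce(intersect, lapply(tables, function(x) x[, 1]))
dbs <- sapply(tables, function(x) x[match(core, x[, 1]), 2])
rownames(dbs) <- core
colnames(dbs) <- sub("\\.tsv$", "", basename(files))
write.table(dbs, "bitscores.tsv", sep = "\t", quote = FALSE)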
We used conda to install R, gplots, caret, and randomForest.
conda install r-base=3.5.1
conda install -c r r-randomforest
conda install -c conda-forge r-gplots
conda install -c conda-forge r-caret
An R script was written to do the following:
#R version 3.5.1
#randomForest 4.6-14
library(gplots)
library(caret)
library(randomForest)
set.seed(1)
# set the directory you are working from
directory <- "/home/shenzy/work_directory"
# read the DBS matrix (genes x strains) and transpose it to strains x genes
traindata <- read.delim(paste(directory, "/bitscores.tsv", sep=""))
traindata <- t(traindata)
# read the phenotype labels and attach them to the matching strains
phenotype <- read.delim(paste(directory, "/phenotype.tsv", sep=""), header=F)
phenotype[,1] <- make.names(phenotype[,1])
traindata <- cbind.data.frame(traindata, phenotype=phenotype[match(row.names(traindata), phenotype[,1]),2])
# replace missing delta-bitscores with 0 (alternatively, use na.roughfix)
traindata[is.na(traindata)] <- 0
# traindata <- na.roughfix(traindata)
# drop near-zero-variance genes (caret) and sanitise the column names
traindata <- traindata[,-nearZeroVar(traindata)]
names(traindata) <- make.names(names(traindata))
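Before tuning, a quick sanity check (our addition, not part of the original script) confirms that every strain in the DBS matrix received a phenotype label:
# Our addition: strains whose names did not match phenotype.tsv end up with
# an NA phenotype, which would silently distort the training below.
stopifnot(!any(is.na(traindata$phenotype)))
table(traindata$phenotype)  # inspect the class balance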
The following section picks out the best parameters for building the model.
set.seed(1)
# varying ntree
error <- vector()
sparsity <- vector()
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
model <- randomForest(phenotype ~ ., data=traindata, ntree=i)
error <- c(error, model$err.rate[nrow(model$err.rate), "OOB"])  # final OOB error rate
sparsity <- c(sparsity, (sum(model$importance[,1]<=0))/(ncol(traindata)-1))
}
# varying mtry
error2 <- vector()
sparsity2 <- vector()
param <- ncol(traindata)-1
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
model <- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=i)
error2 <- c(error2, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity2 <- c(sparsity2, (sum(model$importance[,1]<=0))/(ncol(traindata)-1))
}
model <- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m1_error_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m1_sparsity_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m1_error_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m1_sparsity_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
train2 <- traindata[,match(names(model$importance[model$importance[,1]>0,]), colnames(traindata))]
train2 <- cbind(train2, phenotype=traindata$phenotype)
error <- vector()
sparsity <- vector()
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
model <- randomForest(phenotype ~ ., data=train2, ntree=i, na.action=na.roughfix)
error <- c(error, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity <- c(sparsity, (sum(model$importance[,1]<=0))/(ncol(train2)-1))
}
error2 <- vector()
sparsity2 <- vector()
param <- ncol(train2)-1
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
model <- randomForest(phenotype ~ ., data=train2, ntree=10000, mtry=i, na.action=na.roughfix)
error2 <- c(error2, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity2 <- c(sparsity2, (sum(model$importance[,1]<=0))/(ncol(train2)-1))
}
model <- randomForest(phenotype ~ ., data=train2, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m2_error_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m2_sparsity_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m2_error_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m2_sparsity_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
train3 <- train2[,match(names(model$importance[model$importance[,1]>quantile(model$importance[,1], 0.5),]), colnames(train2))]
train3 <- cbind(train3, phenotype=train2$phenotype)
error <- vector()
sparsity <- vector()
param <- ncol(train3)-1
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
model <- randomForest(phenotype ~ ., data=train3, ntree=i, na.action=na.roughfix)
error <- c(error, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity <- c(sparsity, (sum(model$importance[,1]<=0))/(ncol(train3)-1))
}
error2 <- vector()
sparsity2 <- vector()
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
model <- randomForest(phenotype ~ ., data=train3, ntree=10000, mtry=i, na.action=na.roughfix)
error2 <- c(error2, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity2 <- c(sparsity2, (sum(model$importance[,1]<=0))/(ncol(train3)-1))
}
model <- randomForest(phenotype ~ ., data=train3, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m3_error_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m3_sparsity_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m3_error_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m3_sparsity_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
train4 <- train3[,match(names(model$importance[model$importance[,1]>quantile(model$importance[,1], 0.5),]), colnames(train3))]
train4 <- cbind(train4, phenotype=train3$phenotype)
error <- vector()
sparsity <- vector()
for(i in c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000)) {
model <- randomForest(phenotype ~ ., data=train4, ntree=i, na.action=na.roughfix)
error <- c(error, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity <- c(sparsity, (sum(model$importance[,1]<=0))/(ncol(train4)-1))
}
error2 <- vector()
sparsity2 <- vector()
param <- ncol(train4)-1
for(i in c(1, round(param/10), round(param/5), round(param/3), round(param/2), param)) {
model <- randomForest(phenotype ~ ., data=train4, ntree=10000, mtry=i, na.action=na.roughfix)
error2 <- c(error2, model$err.rate[nrow(model$err.rate), "OOB"])
sparsity2 <- c(sparsity2, (sum(model$importance[,1]<=0))/(ncol(train4)-1))
}
model <- randomForest(phenotype ~ ., data=train4, ntree=10000, mtry=param/10, na.action=na.roughfix)
png(paste(directory, "/model_training/m4_error_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=error, xlab="Number of trees", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m4_sparsity_vs_ntree.png", sep=""), width=350, height = 350)
plot(x=c(1, 10, 50, 250, 500, 1000, 1500, 2000, 5000, 10000), y=sparsity, xlab="Number of trees", ylab="% genes uninformative", pch=16)
dev.off()
png(paste(directory, "/model_training/m4_error_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=error2, xlab="Number of genes sampled per tree", ylab="OOB error rate", pch=16)
dev.off()
png(paste(directory, "/model_training/m4_sparsity_vs_mtry.png", sep=""), width=350, height = 350)
plot(x=c(1, round(param/10), round(param/5), round(param/3), round(param/2), param), y=sparsity2, xlab="Number of genes sampled per tree", ylab="% genes uninformative", pch=16)
dev.off()
model$predicted
names(model$importance[order(model$importance, decreasing=T),][1:10])
save(model, train2, train3, train4, file=paste(directory, "/traindata.Rdata", sep=""))
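With caret already loaded, the out-of-bag predictions of the tuned model can be summarised in a confusion matrix (a quick check we add here; it is not part of the original script):
# Our addition: OOB predictions vs. true labels for the final tuned model
confusionMatrix(model$predicted, train4$phenotype)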
The following section builds the model through iterative feature selection.
set.seed(1)
param <- ncol(traindata)-1
model1 <- randomForest(phenotype ~ ., data=traindata, ntree=10000, mtry=param/10, na.action=na.roughfix)
model1
png(paste(directory, "/VI_full.png", sep=""), width=400, height=350)
plot(1:param, model1$importance[order(model1$importance, decreasing=T)], xlim=c(1,1000), ylab="Variable importance", xlab="Top genes")
dev.off()
train2 <- traindata[,match(names(model1$importance[model1$importance[,1]>0,]), colnames(traindata))]
train2 <- cbind(train2, phenotype=traindata$phenotype)
param <- ncol(train2)-1
model2 <- randomForest(phenotype ~ ., data=train2, ntree=10000, mtry=param/10, na.action=na.roughfix)
model2
train3 <- train2[,match(names(model2$importance[model2$importance[,1]>quantile(model2$importance[,1], 0.5),]), colnames(train2))]
train3 <- cbind(train3, phenotype=train2$phenotype)
param <- ncol(train3)-1
model3 <- randomForest(phenotype ~ ., data=train3, ntree=10000, mtry=param/10, na.action=na.roughfix)
model3
train4 <- train3[,match(names(model3$importance[model3$importance[,1]>quantile(model3$importance[,1], 0.5),]), colnames(train3))]
train4 <- cbind(train4, phenotype=train3$phenotype)
param <- ncol(train4)-1
model4 <- randomForest(phenotype ~ ., data=train4, ntree=10000, mtry=param/10, na.action=na.roughfix)
model4
train5 <- train4[,match(names(model4$importance[model4$importance[,1]>quantile(model4$importance[,1], 0.5),]), colnames(train4))]
train5 <- cbind(train5, phenotype=train4$phenotype)
param <- ncol(train5)-1
model5 <- randomForest(phenotype ~ ., data=train5, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
model5
train6 <- train5[,match(names(model5$importance[model5$importance[,1]>quantile(model5$importance[,1], 0.5),]), colnames(train5))]
train6 <- cbind(train6, phenotype=train5$phenotype)
param <- ncol(train6)-1
model6 <- randomForest(phenotype ~ ., data=train6, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
model6
train7 <- train6[,match(names(model6$importance[model6$importance[,1]>quantile(model6$importance[,1], 0.5),]), colnames(train6))]
train7 <- cbind(train7, phenotype=train6$phenotype)
param <- ncol(train7)-1
model7 <- randomForest(phenotype ~ ., data=train7, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
model7
model7$predicted
names(model7$importance[order(model7$importance[,1], decreasing=T),])[1:10]
png(paste(directory, "final_model.png", sep=""), width=400, height=350)
plot(1:param, model7$importance[order(model7$importance, decreasing=T),], xlab="", ylab="Variable importance")
dev.off()
save(model1, model2, model3, model4, model5, model6, model7, traindata, train2, train3, train4, train5, train6, train7, file=paste(directory, "/finalmodel.Rdata", sep=""))
The following section shows how the performance of the model improved through successive iterations of feature selection.
library(ggplot2)
# collect the out-of-bag voting proportions from each of the seven model iterations
votedata <- do.call(rbind, lapply(1:7, function(i) {
m <- get(paste("model", i, sep=""))
cbind.data.frame(model=as.character(i), Serovar=row.names(model1$votes), Invasive=m$votes[,2])
}))
votedata$Phenotype <- phenotype[match(votedata$Serovar, phenotype[,1]),2]
ggplot(votedata, aes(x=model, y=Invasive, col=Serovar)) + geom_jitter(width=0.1) + theme_classic(8) + ylab("Proportion of votes for gastrointestinal origin") + xlab("Model iteration") + geom_hline(yintercept=0.5, lty=2, col="grey") + theme(legend.key.size = unit(0.3, "cm"))
ggsave("votes.png", width=7, height=2.5)
The following section picks out the top predictor genes from the chosen model.
# predictor genes retained in the chosen training set (train6), excluding the
# phenotype column
topgenes <- setdiff(colnames(train6), "phenotype")
table(topgenes)
write(topgenes, "detected_topgenes_list.csv")
The following section assigns the top predictor genes to COG categories, and compares them with the COG categories in the original training set to check whether any categories are over-represented.
annotations <- read.delim(paste(directory, "/gproNOG.annotations.tsv", sep=""), header=F)
nogs <- read.delim(paste(directory, "/models_used.tsv", sep=""), header=F)
nogs[,2] <- sub(".*NOG\\.", "", nogs[,2])
nogs[,2] <- sub(".meta_raw", "", nogs[,2])
info <- annotations[match(nogs[,2], annotations[,2]),]
info$V5 <- as.character(info$V5)
# proportion of each COG category that comes up in the top indicators
background <- nogs[match(colnames(traindata), make.names(nogs[,1])),2]
background_COGs <- annotations[match(background, annotations[,2]),5]
# split multi-letter COG category strings (e.g. "EG") into single letters
background_COGs2 <- unlist(strsplit(as.character(background_COGs), ""))
predictors <- nogs[match(colnames(train6), make.names(nogs[,1])),2]
predictor_COGs <- annotations[match(predictors, annotations[,2]),5]
# split the predictors' COG category strings in the same way
predictor_COGs2 <- unlist(strsplit(as.character(predictor_COGs), ""))
barplot(rbind(table(background_COGs2), table(predictor_COGs2)[match(names(table(background_COGs2)), names(table(predictor_COGs2)))]))
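The barplot gives a visual impression only; to test over-representation formally, one could run a per-category Fisher's exact test, as in the following sketch (our suggestion; it was not part of the original analysis):
# Our addition: Fisher's exact test per COG category, comparing the count of
# top predictor genes in a category against the background training set.
bg_tab <- table(background_COGs2)
pr_tab <- table(predictor_COGs2)
pvals <- sapply(names(bg_tab), function(cog) {
in_pred  <- if (cog %in% names(pr_tab)) pr_tab[[cog]] else 0
out_pred <- sum(pr_tab) - in_pred
in_bg    <- bg_tab[[cog]]
out_bg   <- sum(bg_tab) - in_bg
fisher.test(matrix(c(in_pred, out_pred, in_bg, out_bg), nrow = 2))$p.value
})
sort(p.adjust(pvals, method = "BH"))  # Benjamini-Hochberg corrected p-values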
The following section compares the performance of the model trained on the true phenotypes with that of a control model trained on randomly permuted phenotypes.
set.seed(2)
control <- traindata
control$phenotype <- sample(traindata$phenotype)
param <- ncol(control)-1
cmodel1 <- randomForest(phenotype ~ ., data=control, ntree=10000, mtry=param/10, na.action=na.roughfix)
cmodel1
png(paste(directory, "/VI_full_control.png", sep=""), width=400, height=350)
plot(1:param, cmodel1$importance[order(cmodel1$importance, decreasing=T)], xlim=c(1,1000), ylab="Variable importance", xlab="Top genes")
dev.off()
ctrain2 <- control[,match(names(cmodel1$importance[cmodel1$importance[,1]>0,]), colnames(control))]
ctrain2 <- cbind(ctrain2, phenotype=control$phenotype)
param <- ncol(ctrain2)-1
cmodel2 <- randomForest(phenotype ~ ., data=ctrain2, ntree=10000, mtry=param/10, na.action=na.roughfix)
cmodel2
ctrain3 <- ctrain2[,match(names(cmodel2$importance[cmodel2$importance[,1]>quantile(cmodel2$importance[,1], 0.5),]), colnames(ctrain2))]
ctrain3 <- cbind(ctrain3, phenotype=ctrain2$phenotype)
param <- ncol(ctrain3)-1
cmodel3 <- randomForest(phenotype ~ ., data=ctrain3, ntree=10000, mtry=param/10, na.action=na.roughfix)
cmodel3
ctrain4 <- ctrain3[,match(names(cmodel3$importance[cmodel3$importance[,1]>quantile(cmodel3$importance[,1], 0.5),]), colnames(ctrain3))]
ctrain4 <- cbind(ctrain4, phenotype=ctrain3$phenotype)
param <- ncol(ctrain4)-1
cmodel4 <- randomForest(phenotype ~ ., data=ctrain4, ntree=10000, mtry=param/10, na.action=na.roughfix)
cmodel4
ctrain5 <- ctrain4[,match(names(cmodel4$importance[cmodel4$importance[,1]>quantile(cmodel4$importance[,1], 0.5),]), colnames(ctrain4))]
ctrain5 <- cbind(ctrain5, phenotype=ctrain4$phenotype)
param <- ncol(ctrain5)-1
cmodel5 <- randomForest(phenotype ~ ., data=ctrain5, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
cmodel5
ctrain6 <- ctrain5[,match(names(cmodel5$importance[cmodel5$importance[,1]>quantile(cmodel5$importance[,1], 0.5),]), colnames(ctrain5))]
ctrain6 <- cbind(ctrain6, phenotype=ctrain5$phenotype)
param <- ncol(ctrain6)-1
cmodel6 <- randomForest(phenotype ~ ., data=ctrain6, ntree=10000, mtry=param/10, na.action=na.roughfix, proximity=T)
cmodel6
cmodel6$predicted
names(cmodel6$importance[order(cmodel6$importance[,1], decreasing=T),])[1:10]
model7$predicted
# compare votes and phenotypes for the control and real datasets
cbind(cmodel6$votes, control$phenotype, model6$votes, train6$phenotype)
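To make the contrast explicit, the out-of-bag predictions of the control and real models can also be tabulated side by side (our addition):
# Our addition: OOB confusion tables; the permuted-label control should do no
# better than chance, while the real model should recover the phenotypes.
table(predicted = cmodel6$predicted, truth = ctrain6$phenotype)
table(predicted = model6$predicted, truth = train6$phenotype)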