@multidis
multidis / ggplot_legend_tweaks.md
Created December 13, 2013 21:00
Tweaking legends in ggplot2.
multidis / ggplot_pie_chart.r
Created December 13, 2013 21:37
Pie chart with ggplot2. From the grammar-of-graphics perspective, a pie chart is a stacked bar chart transformed into polar coordinates. Keep in mind: although widely used, pie charts are not an information-dense visualization technique, and Tufte recommends against them.
gp <- ggplot(dfnum, aes(x=factor("education"), fill=factor(education)))
gp <- gp + geom_bar(width=1) + coord_polar(theta="y")
## tweak legend
gp <- gp + scale_fill_discrete(name="",
breaks=as.character(seq(0,1,len=6)),
labels=ans.educ[,2])
print(gp)
## tweak legend with colormap (color palette) modification
pie <- pie + scale_fill_brewer(palette="Set1",
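The preview above is cut off mid-call. As a self-contained illustration of the same idea (a single stacked bar mapped to polar coordinates, with a relabeled legend and a Brewer palette), here is a minimal sketch. The toy `dfnum` values and the `educ.labels` vector are assumptions made up for this example; they are not the original data or the original `ans.educ` table.

```r
library(ggplot2)

## Toy data (illustrative only): one row per respondent
dfnum <- data.frame(education = c(0, 0, 0, 0.5, 0.5, 1))
## Hypothetical legend labels, one per distinct education value
educ.labels <- c("Primary", "Secondary", "University")

gp <- ggplot(dfnum, aes(x = factor(1), fill = factor(education))) +
  geom_bar(width = 1) +        # a single stacked bar of counts
  coord_polar(theta = "y") +   # map bar height to angle: the bar becomes a pie
  scale_fill_brewer(name = "", palette = "Set1", labels = educ.labels)
print(gp)
```

Setting `name = ""` hides the legend title, and `labels` must supply one entry per legend key.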
multidis / caret_install.r
Created December 16, 2013 21:15
Installing the caret package with all dependencies and suggested packages, to avoid errors later.
install.packages("caret", dependencies = c("Depends", "Suggests"))
multidis / bioconductor_install_update.r
Created December 16, 2013 21:21
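Bioconductor installation and updates. Note: biocLite() applies to Bioconductor releases before 3.8; later releases use BiocManager::install() instead.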
source("http://bioconductor.org/biocLite.R")
## install core packages or get list of updates
biocLite()
## install specific packages by name
biocLite(c("pkg1", "pkg2"))
## if more than one Bioconductor release version coexists: upgrade
biocLite("BiocUpgrade")
multidis / 10_caret_multi_methods_class.r
Created December 16, 2013 21:52
R functions for comparing various statistical learning methods through caret's unified calls, and for comparing different variable (feature) subsets with those methods. Binary classification case (originally written for binary decisions in medical diagnosis from high-dimensional genomic datasets).
library(caret)
## Evaluating variable subset with various classification models (caret unified func. call).
## Repeated CV-resampling is used for parameter tuning
## unless some heuristics are used by caret (specific to each learning method).
varsubs.miscclass.eval <- function(vset, Xtra, Yfac, Xtst, Ytst, methods.caret,
tune.grid.Npts=5, cv.rep=5) {
if (!all(vset %in% colnames(Xtra))) {
stop("varsubs.miscclass.eval error: Some supplied variable names are not present among the covariate matrix column names.")
}
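The function body is truncated in this preview, so the original implementation is not shown. The general pattern it describes (fit several caret methods on one variable subset, tune by repeated CV, score on held-out data) might look roughly like the sketch below. The helper `compare_methods`, the use of the iris data, and test-set accuracy as the score are all assumptions for illustration; they are not taken from the gist.

```r
library(caret)

## Hypothetical helper (not the gist's function): fit each caret method on the
## selected variables and return test-set accuracy per method.
compare_methods <- function(vset, Xtra, Yfac, Xtst, Ytst, methods.caret,
                            tune.grid.Npts = 5, cv.rep = 5) {
  stopifnot(all(vset %in% colnames(Xtra)))
  ## repeated 5-fold CV for parameter tuning
  ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = cv.rep)
  sapply(methods.caret, function(m) {
    fit <- train(x = Xtra[, vset, drop = FALSE], y = Yfac, method = m,
                 trControl = ctrl, tuneLength = tune.grid.Npts)
    mean(predict(fit, newdata = Xtst[, vset, drop = FALSE]) == Ytst)
  })
}

## Toy binary problem (illustrative only): two iris species
set.seed(1)
d <- droplevels(subset(iris, Species != "setosa"))
in.train <- createDataPartition(d$Species, p = 0.8, list = FALSE)
Xtra <- d[in.train, 1:4];  Yfac <- d$Species[in.train]
Xtst <- d[-in.train, 1:4]; Ytst <- d$Species[-in.train]

acc <- compare_methods(c("Petal.Length", "Petal.Width"), Xtra, Yfac, Xtst, Ytst,
                       methods.caret = c("knn", "lda"),
                       tune.grid.Npts = 3, cv.rep = 2)
print(acc)
```

Because `train()` takes the method as a string, adding another learner only means extending `methods.caret`, provided the package backing that method is installed.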
multidis / list_as_fun_args.r
Created December 16, 2013 22:21
Passing lists as function arguments in R. This often reduces code repetition (e.g. if/else calls of different functions with mostly the same arguments). NOTE: always consider a closure as a functional-programming (FP) alternative for dealing with repetitive code elements.
## regular case
foo <- function(a, b, c) a + b - c ## does something
foo2 <- function(b, c) b + c ## also some function
foo(a=1, b=2, c=5)
foo2(b=2, c=5) ## repeating list of multiple arguments
## passing a list
arg.list <- list(b=2, c=5)
do.call(foo, c(list(a=1), arg.list))
do.call(foo2, arg.list)
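The closure alternative mentioned in the description is not shown in the gist. One way it could look is sketched below; `make_caller` and `call.shared` are hypothetical names introduced for this example only. The closure captures the shared arguments once and then applies them to whichever function is passed in.

```r
## foo and foo2 as in the snippet above
foo  <- function(a, b, c) a + b - c
foo2 <- function(b, c) b + c

## Hypothetical helper: returns a function that calls f with the fixed
## arguments appended to whatever extra arguments are supplied
make_caller <- function(...) {
  fixed <- list(...)
  function(f, ...) do.call(f, c(list(...), fixed))
}

call.shared <- make_caller(b = 2, c = 5)
r1 <- call.shared(foo, a = 1)   # same call as foo(a=1, b=2, c=5)
r2 <- call.shared(foo2)         # same call as foo2(b=2, c=5)
```

The shared arguments then live in one place, and no argument list needs to be passed around explicitly.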
multidis / multicore_detect_cores.r
Created December 16, 2013 22:34
Detecting the number of CPU cores in R.
#library(multicore) ## no longer needed as of R 3.0
library(parallel)
ncores <- detectCores()
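`detectCores()` can return `NA` when the count cannot be determined, so it is worth guarding before using the result. The sketch below shows one possible way to use the detected count. The fallback to one core, the convention of leaving one core free, and the toy squaring task are assumptions for illustration.

```r
library(parallel)

ncores <- detectCores()            # may be NA if the count is unknown
if (is.na(ncores)) ncores <- 1L    # assumed fallback: run serially
nworkers <- max(1L, ncores - 1L)   # assumed convention: leave one core free

## mclapply relies on forking, which is unavailable on Windows,
## so fall back to a single worker there
mc <- if (.Platform$OS.type == "windows") 1L else nworkers
squares <- unlist(mclapply(1:8, function(i) i^2, mc.cores = mc))
print(squares)
```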
multidis / sensitivity_specificity_ROC.md
Created December 18, 2013 21:04
Sensitivity, specificity, and ROC. Simple, but I end up looking them up every time to verify, hence this gist.

Sensitivity and specificity in binary classification

Sensitivity: given that a result is truly an event, probability of predicting that event correctly.

Medical: fraction of correctly identified disease cases among all disease cases. With a highly sensitive test, a negative result tends to rule out disease, so high sensitivity helps avoid missing patients who need intervention.

Specificity: given that a result is truly NOT an event, probability of predicting a negative.

Medical: fraction of correctly identified healthy cases among all healthy cases. With a highly specific test, a positive result tends to indicate disease, so high specificity helps avoid interventions on healthy patients.
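
As a quick check of the two definitions, the quantities can be computed from a confusion table in base R. The label vectors below are toy values invented for this sketch, with "pos" standing for the event (e.g. disease present). The predictive values are included only to contrast them with sensitivity and specificity.

```r
## Toy labels (illustrative only): "pos" is the event
truth <- factor(c("pos","pos","pos","pos","neg","neg","neg","neg","neg","neg"),
                levels = c("pos", "neg"))
pred  <- factor(c("pos","pos","pos","neg","neg","neg","neg","neg","pos","pos"),
                levels = c("pos", "neg"))

tab <- table(predicted = pred, actual = truth)
TP <- tab["pos", "pos"]; FN <- tab["neg", "pos"]
TN <- tab["neg", "neg"]; FP <- tab["pos", "neg"]

sens <- TP / (TP + FN)   # P(predict event | truly an event)
spec <- TN / (TN + FP)   # P(predict negative | truly not an event)
ppv  <- TP / (TP + FP)   # P(truly an event | predicted event)
npv  <- TN / (TN + FN)   # P(truly not an event | predicted negative)
```

Sensitivity and specificity condition on the true class, while the predictive values condition on the test result and therefore also depend on how common the event is.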

multidis / split_strat_scale.r
Created December 23, 2013 08:17
Stratified sampling: training / test data split preserving the class distribution (caret functions), and scaling (standardizing) the data. Stratified folds for CV.
library(caret)
## select training indices preserving class distribution
in.train <- createDataPartition(yclass, p=0.8, list=FALSE)
summary(factor(yclass))
ytra <- yclass[in.train]; summary(factor(ytra))
ytst <- yclass[-in.train]; summary(factor(ytst))
## standardize features: reuse the training-set scaling parameters for the test part
Xtra <- scale(X[in.train,])
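The preview ends before the test part is scaled. One common way to reuse the training parameters is to read the centers and scales that `scale()` stores as attributes and pass them to the test-set call. The sketch below assumes toy data and, to stay dependency-free, uses a base-R stand-in for `createDataPartition`; neither is from the original gist.

```r
set.seed(1)
## Toy data (illustrative only)
X <- matrix(rnorm(100 * 3, mean = 5, sd = 2), ncol = 3,
            dimnames = list(NULL, c("v1", "v2", "v3")))
yclass <- factor(rep(c("a", "b"), times = c(70, 30)))

## Base-R stand-in for createDataPartition: sample 80% within each class
in.train <- unlist(lapply(split(seq_along(yclass), yclass),
                          function(idx) sample(idx, round(0.8 * length(idx)))),
                   use.names = FALSE)

## Fit the scaling on the training part only
Xtra <- scale(X[in.train, ])
## Apply the same centers and scales to the test part
Xtst <- scale(X[-in.train, ],
              center = attr(Xtra, "scaled:center"),
              scale  = attr(Xtra, "scaled:scale"))
```

The test columns will generally not have mean exactly 0 and standard deviation exactly 1, which is expected: no information from the test part goes into the scaling.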
multidis / dataframe_sel_cols_factor.r
Created December 25, 2013 00:19
Building an R data frame while converting selected (by name) columns to factors (or to any other class, using the appropriate conversion function).
## simply convert selected columns to factors
a.fac <- lapply(a[,3:4], factor) ## avoid naming the result `c`, which masks base::c
## build dataframe with some columns as factors
vset.fac <- c("name1","name2") ## names from some original dataframe d0
vset.num <- c("num1","num2")
df <- data.frame(d0[,vset.num], lapply(d0[,vset.fac],factor))
## check structure
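The preview stops at the structure check. A self-contained version might look like the sketch below; the contents of `d0` are toy values invented for illustration, and `dfm` is a hypothetical name chosen to avoid masking `stats::df`. Indexing with `d0[vset]` (single bracket, no comma) is used here because it returns a data frame even when only one column is selected, whereas `d0[, vset]` would drop to a vector.

```r
## Toy data frame (illustrative only); column names match the snippet above
d0 <- data.frame(name1 = c("a", "b", "a"), name2 = c("x", "x", "y"),
                 num1 = c(1.5, 2.5, 3.5), num2 = c(10, 20, 30),
                 stringsAsFactors = FALSE)
vset.fac <- c("name1", "name2")
vset.num <- c("num1", "num2")

## numeric columns kept as-is, selected columns converted to factors
dfm <- data.frame(d0[vset.num], lapply(d0[vset.fac], factor))

## check structure
str(dfm)
```

Replacing `factor` with another conversion function (for example `as.character`) changes the target class without altering the rest of the call.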