@datalove
Last active August 29, 2015 14:08
Find multivariate outliers using Mahalanobis Distances
########################################################
# Takes any number of input columns, supplied as
# variables named input1, input2, ..., and returns a
# logical vector indicating whether or not each row is
# a multivariate outlier. Rows with missing values get NA.
#
# A row is flagged when its squared Mahalanobis distance
# reaches the critical value of an upper-tailed
# chi-squared distribution with p = 0.001.
########################################################
# names of the input variables (input1, input2, ...)
inputs <- grep("^input[0-9]+$", ls(), value = TRUE)
# bind the input columns into a matrix, one column per input
x <- sapply(inputs, function(y) get(y))
# find complete cases
cc <- complete.cases(x)
# rows with no missing values (drop = FALSE keeps a matrix
# even when there is only one input column)
xcc <- x[cc, , drop = FALSE]
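# column of squared Mahalanobis distances (NA for incomplete rows);
# mahalanobis() returns the squared distance, not its square root
dists <- rep(NA_real_, nrow(x))
dists[cc] <- mahalanobis(xcc, colMeans(xcc), cov(xcc))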
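# column of critical values; the squared distance is compared with a
# chi-squared distribution whose degrees of freedom equal the number
# of input columns
critical <- rep(qchisq(0.001, df = ncol(x), lower.tail = FALSE), nrow(x))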
# column of outliers
outlier <- dists >= critical
# capture the output
output <- outlier
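
########################################################
# Usage sketch (not part of the original script).
#
# A minimal, self-contained run of the same steps on
# simulated data, wrapped in a hypothetical helper,
# flag_outliers(). The helper name, the simulated
# columns and the planted extreme row are assumptions
# made for illustration only. The sketch takes the
# degrees of freedom to be the number of columns.
########################################################
flag_outliers <- function(x, p = 0.001) {
  x <- as.matrix(x)
  # rows with no missing values
  cc <- complete.cases(x)
  xcc <- x[cc, , drop = FALSE]
  # squared Mahalanobis distances, NA for incomplete rows
  d2 <- rep(NA_real_, nrow(x))
  d2[cc] <- mahalanobis(xcc, colMeans(xcc), cov(xcc))
  # compare against the upper-tail chi-squared critical value
  d2 >= qchisq(p, df = ncol(x), lower.tail = FALSE)
}

set.seed(1)
demo <- cbind(input1 = rnorm(200),
              input2 = rnorm(200),
              input3 = rnorm(200))
demo[1, ] <- c(10, -10, 10)  # planted extreme row, should be flagged
demo[2, 1] <- NA             # incomplete row, should come back as NA
flags <- flag_outliers(demo)
# row numbers flagged as outliers
which(flags)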