@deckerego
Created December 6, 2013 14:16
Find aberrant traffic hit rates, where an hourly hit count for a URI counts as aberrant if it is two or more standard deviations above that URI's mean
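# Requires plyr; for .parallel = TRUE to take effect, register a foreach backend first (e.g. with doParallel)
library(plyr)
# Massage your data into a data frame that provides access by Hour and URI
traffic.df <- parse.log("access.log") # parse.log is left as an exercise for the reader
# Aggregate hits per Hour and URI
uri.hits <- ddply(traffic.df, .(Hour, URI), summarise, Hits = length(URI), .parallel = TRUE)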
# Per-URI mean, standard deviation, and total, joined back onto the hourly counts
uri.stats <- ddply(uri.hits, .(URI), summarise, Mean = mean(Hits), StdDev = sd(Hits), Total = sum(Hits), .parallel = TRUE)
uri.stats <- join(uri.hits, uri.stats, by = "URI")
# Find hours two or more std dev above the mean (sd() of a single hour is NA, which subset() drops)
uri.bad <- subset(uri.stats, StdDev > 0)
uri.bad$Deviations <- (uri.bad$Hits - uri.bad$Mean) / uri.bad$StdDev
uri.bad <- subset(uri.bad, Deviations >= 2)
# Sort worst offenders first
uri.bad <- uri.bad[order(-uri.bad$Deviations), ]
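
To see the thresholding on something small, here is a minimal sketch that applies the same idea to made-up numbers using only base R, so it runs without plyr or a parsed log. The data frame `hits`, its three URIs, and the hit counts are all invented for illustration; `StdDev` holds the per-URI sample standard deviation.

```r
# Toy hourly hit counts (made-up numbers, not from a real access log)
hits <- data.frame(
  Hour = rep(0:9, times = 3),
  URI  = rep(c("/login", "/home", "/static"), each = 10),
  Hits = c(10, 12, 11, 9, 10, 11, 10, 12, 9, 60,   # spike in hour 9
           20, 21, 19, 20, 21, 20, 19, 21, 20, 20, # ordinary variation
           rep(5, 10))                             # constant, so sd is 0
)

# Per-URI mean and sample standard deviation, repeated on every row
hits$Mean   <- ave(hits$Hits, hits$URI, FUN = mean)
hits$StdDev <- ave(hits$Hits, hits$URI, FUN = sd)

# Drop URIs with no variation, then score each hour in standard deviations
scored <- subset(hits, StdDev > 0)
scored$Deviations <- (scored$Hits - scored$Mean) / scored$StdDev

# Keep hours at least two standard deviations above the mean, worst first
flagged <- subset(scored, Deviations >= 2)
flagged <- flagged[order(-flagged$Deviations), ]
print(flagged)
```

Only the `/login` spike in hour 9 should be flagged here. One thing the toy data makes visible: with the sample standard deviation, a single observation can sit at most (n - 1) / sqrt(n) standard deviations from the mean of n observations, so a URI needs at least six hourly counts before any one hour can reach the threshold of 2.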