Last active October 13, 2015 17:28
  • Save mattbaggott/4230765 to your computer and use it in GitHub Desktop.
short comparison of methods of finding breakpoints in a specific dataset
# short investigation of id'ing breakpoints in data
# should investigate newer library("cpm")
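# load packages
library(bcp)         # bcp()
library(changepoint) # cpt.var(), cpt.meanvar()
library(reshape)     # melt()
library(ggplot2)     # ggplot()
# load data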
data = c(746,743,735,735,728,717,709,691,746,738,732,728,722,782,781,778,776,
plot(data) # take a look
data1 <- data.frame(vals=data) # make a data frame
data1$time <- seq_len(nrow(data1)) # make a 'time' var (plain row index, not a matrix)
# many change point approaches look for changes in mean or variance,
# which we don't have here, and I think that is why the approach is failing
# with the increased data in the newer example, it is estimating the mean better.
# One way around that is to use sequential differences
# or some other derived value; here we use diff()
data1$diffs <- c(0,diff(data1$vals)) # padding the start with a zero since there is no diff there
# and we want equal lengths
# a simple answer might be something like this
data1$bigdiffs <- data1$diffs > 30 # define a break as a jump of more than 30 points (upward only)
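# A minimal base-R sketch of the same threshold rule on a small made-up series.
# toy_vals and the cutoff of 30 are illustrative assumptions, not the data above.
# Using abs() also flags large drops, which the one-sided test above would miss.
toy_vals   <- c(700, 695, 690, 740, 735, 730, 690, 688)
toy_diffs  <- c(0, diff(toy_vals))        # pad the start so lengths match
toy_breaks <- which(abs(toy_diffs) > 30)  # indices where a new segment starts: 4 7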
# or we make classifiers using the differences between points
# BCP method on diffs
data1$bcpfit <- bcp(data1$diffs)$posterior.prob #from bcp
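# Sketch: turning posterior probabilities like those above into 0/1 flags so they
# line up with the other indicator columns. toy_prob and the 0.5 cutoff are
# assumptions for illustration; the NA shows how a missing value would be treated.
toy_prob  <- c(0.02, 0.10, 0.95, 0.05, NA)
toy_flags <- as.numeric(!is.na(toy_prob) & toy_prob > 0.5) # NA counts as "no break"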
# PELT method on diffs
cpts1 <- cpt.var(data1$diffs, method="PELT")@cpts #from changepoint
cpts2 <- cpt.meanvar(data1$diffs, method="PELT")@cpts #from changepoint
data1$cptfit1 <- 0 # populate the variable
data1$cptfit1[cpts1] <- 1 # assign based on cpts1
data1$cptfit2 <- 0 # populate the variable
data1$cptfit2[cpts2] <- 1 # assign based on cpts2
# reorder columns for convenience: time should be first
data1 <- data1[,c("time","vals","diffs","bigdiffs","cptfit1","cptfit2","bcpfit")]
# plot results
data.m <- melt(data1, id.vars="time") #from reshape
ggplot(data.m,aes(x=time, y=value, colour=variable))+geom_line()+facet_grid(variable~., scales="free")
# RESULTS: it looks like the simple bigdiffs approach outperforms the others
# maybe simpler is better
# HOWEVER, if we wanted to hand label a bunch of data, we could use it to train a classifier
# that could combine these approaches
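# A rough sketch of that idea with base R's glm(). Everything below is hypothetical:
# toy_lab simulates hand-labelled points, and the two detector columns are made-up
# stand-ins for outputs such as bigdiffs and bcpfit, not results from this dataset.
set.seed(1)
toy_n   <- 200
toy_lab <- data.frame(is_break = rbinom(toy_n, 1, 0.2)) # simulated hand labels
toy_lab$bigdiff  <- rbinom(toy_n, 1, ifelse(toy_lab$is_break == 1, 0.8, 0.05)) # noisy 0/1 detector
toy_lab$bcp_prob <- plogis(rnorm(toy_n, ifelse(toy_lab$is_break == 1, 1, -1))) # noisy probability
toy_fit  <- glm(is_break ~ bigdiff + bcp_prob, data = toy_lab, family = binomial)
toy_pred <- predict(toy_fit, type = "response") # combined probability of a break per point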