@dsparks
Created September 11, 2012 01:55
Random, equally-sized partitions
# Randomly allocating observations into groups, e.g., for cross-validation
kk <- 10 # Number of partitions, as in "kk-fold cross-validation."
# Here is a data.frame full of good data:
nn <- 1003
myData <- data.frame(matrix(rnorm(nn * 3), ncol = 3))
colnames(myData) <- LETTERS[1:3]
# This does not work, because independent draws give unequal group sizes:
whichK <- sample(LETTERS[1:kk], nrow(myData), replace = TRUE)
table(whichK) # The partitions are not equally sized
# This does work:
randomDraw <- rnorm(nrow(myData))
kQuantiles <- quantile(randomDraw, 0:kk/kk)
whichK <- cut(randomDraw, kQuantiles, include.lowest = TRUE) # Divide randomDraw into kk equally-sized groups
levels(whichK) <- LETTERS[1:kk] # (Optionally) Give the levels handier names
# Check partition counts:
table(whichK) # As equal as possible.
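# Illustrating an lapply() over the training sets:
plot.new()
plot.window(xlim = c(-4, 4), ylim = c(0, 1/2))
# Draw the density of A for each training set (all observations outside fold k);
# invisible() suppresses the list of NULLs that lapply() would otherwise print.
invisible(lapply(levels(whichK), function(k) {
  lines(density(myData$A[whichK != k]))
}))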
@tong-wang

Why not simply generate a random permutation of the row indices and split it into a fixed number of parts?

# generate folds as a list of row indices (data is a data.frame, K is the number of folds)
folds <- split(sample(nrow(data), nrow(data), replace = FALSE), as.factor(1:K))
# take the first part
data[folds[[1]], ]
# take all but the first part
data[-folds[[1]], ]
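
As a rough sketch of how these folds might be used in a full cross-validation loop: the example below fits a linear model on all but one fold and scores it on the held-out fold. The data frame `dat`, its columns `x` and `y`, and the choice of `lm()` are assumptions made purely for illustration; they are not part of the gist.

```r
# Hypothetical example data; `dat` and `K` stand in for `data` and `K` above.
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)
K <- 5

# 100 rows divide evenly into 5 folds of 20 row indices each.
folds <- split(sample(nrow(dat)), as.factor(1:K))

# For each fold: fit on the other K - 1 folds, score on the held-out fold.
cvMSE <- sapply(folds, function(idx) {
  fit <- lm(y ~ x, data = dat[-idx, ])
  mean((dat$y[idx] - predict(fit, newdata = dat[idx, ]))^2)
})
cvMSE        # one mean squared error per fold
mean(cvMSE)  # cross-validated estimate of the error
```

Because each row index appears in exactly one fold, every observation is used for testing exactly once.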

@gwangjinkim

@tong-wang thanks! Nice idea! I made a function using your idea. It splits a data frame into k random parts and returns a list of sub-data-frames.

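k_split <- function(df, k) {
  # Shuffle the row indices and deal them into k groups
  folds <- split(sample(nrow(df), nrow(df), replace = FALSE), as.factor(1:k))
  # drop = FALSE keeps each part a data frame even when df has a single column
  lapply(folds, function(idxs) df[idxs, , drop = FALSE])
}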

@AaronCreighton

@gwangjinkim, I like the function. However, it only works cleanly with even splits, that is, when nrow(df) is a multiple of k.
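
One possible way to handle row counts that are not a multiple of k is to build a fold label of exactly the right length with `rep(..., length.out = n)` instead of relying on a length-k factor. The sketch below is only an illustration; the name `k_split_uneven` is hypothetical and does not come from the gist or the comments above.

```r
# k_split_uneven is a hypothetical name, not part of the gist or its comments.
k_split_uneven <- function(df, k) {
  n <- nrow(df)
  # Fold labels 1..k repeated to length n, so fold sizes differ by at most one.
  fold_id <- rep(seq_len(k), length.out = n)
  # Shuffle the row indices, then group them by fold label.
  folds <- split(sample(n), fold_id)
  lapply(folds, function(idxs) df[idxs, , drop = FALSE])
}

# Usage with 1003 rows, as in the gist: three folds get 101 rows, seven get 100.
parts <- k_split_uneven(data.frame(A = rnorm(1003)), 10)
sapply(parts, nrow)
```

Shuffling the indices rather than the labels keeps the fold sizes deterministic while the membership of each fold stays random.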
