@mrdwab
Last active April 27, 2024 19:57
Stratified random sampling from a `data.frame` in R
stratified <- function(df, group, size, select = NULL,
                       replace = FALSE, bothSets = FALSE) {
  # Optionally keep only the rows that match every entry of the named list 'select'
  if (!is.null(select)) {
    if (is.null(names(select))) stop("'select' must be a named list")
    if (!all(names(select) %in% names(df)))
      stop("Please verify your 'select' argument")
    # cbind() always yields a matrix, so rowSums() also works on a one-row df
    temp <- do.call(cbind, lapply(names(select),
                                  function(x) df[[x]] %in% select[[x]]))
    df <- df[rowSums(temp) == length(select), ]
  }
  # One stratum per observed combination of the 'group' columns
  df.interaction <- interaction(df[group], drop = TRUE)
  df.table <- table(df.interaction)
  df.split <- split(df, df.interaction)
  # Work out 'n', the number of rows to draw from each stratum
  if (length(size) > 1) {
    # A vector of sizes: one entry per stratum
    if (length(size) != length(df.split))
      stop("Number of groups is ", length(df.split),
           " but number of sizes supplied is ", length(size))
    if (is.null(names(size))) {
      n <- setNames(size, names(df.split))
      message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
              paste(n, collapse = ", "), "),\n.Names = c(",
              paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
    } else if (all(names(size) %in% names(df.split))) {
      n <- size[names(df.split)]
    } else {
      stop("Named vector supplied with names ",
           paste(names(size), collapse = ", "),
           "\n but the names for the group levels are ",
           paste(names(df.split), collapse = ", "))
    }
  } else if (size < 1) {
    # A single proportion applied to every stratum
    n <- round(df.table * size, digits = 0)
  } else if (size >= 1) {
    # A single count applied to every stratum
    if (all(df.table >= size) || isTRUE(replace)) {
      n <- setNames(rep(size, length.out = length(df.split)),
                    names(df.split))
    } else {
      message(
        "Some groups\n---",
        paste(names(df.table[df.table < size]), collapse = ", "),
        "---\ncontain fewer observations",
        " than desired number of samples.\n",
        "All observations have been returned from those groups.")
      # Cap the request at the number of rows available in each stratum
      n <- setNames(pmin(as.vector(df.table), size), names(df.table))
    }
  }
  # Draw the requested number of rows from each stratum
  temp <- lapply(
    names(df.split),
    function(x) df.split[[x]][sample(df.table[x],
                                     n[x], replace = replace), ])
  set1 <- do.call("rbind", temp)
  if (isTRUE(bothSets)) {
    # SET2 holds the rows that were not sampled
    set2 <- df[!rownames(df) %in% rownames(set1), ]
    list(SET1 = set1, SET2 = set2)
  } else {
    set1
  }
}
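
The way `size` is turned into per-stratum counts can be seen in isolation with a small sketch. It is not part of the gist: the data frame `dat`, the column `grp`, and the helper `draw()` are hypothetical names used only for illustration, and the sketch mirrors just the proportion (`size < 1`) and capped-count (`size >= 1`, `replace = FALSE`) branches.

```r
set.seed(1)

# Hypothetical data: three strata with 10, 4 and 30 rows
dat <- data.frame(ID = 1:44,
                  grp = rep(c("AA", "BB", "CC"), times = c(10, 4, 30)))

# Rows available in each stratum, as a named integer vector
sizes <- vapply(split(dat, dat$grp), nrow, integer(1))

# size < 1: a proportion of each stratum (small strata can round down to 0)
n_prop <- round(sizes * 0.1)

# size >= 1 without replacement: capped at the rows available
n_fixed <- pmin(sizes, 5)

# Draw n[[g]] rows from each stratum g and stack the results
draw <- function(n) {
  do.call(rbind, lapply(names(n), function(g) {
    rows <- dat[dat$grp == g, , drop = FALSE]
    rows[sample(nrow(rows), n[[g]]), , drop = FALSE]
  }))
}

prop_sample  <- draw(n_prop)
fixed_sample <- draw(n_fixed)
```

With a 10% proportion the four-row stratum contributes nothing, which is worth checking for when strata are small.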
@Kamel20

Kamel20 commented May 24, 2015

Dear mrdwab,
I hope this finds you well.
This is really great code.
I would like to ask a question about
stacked regression, as described in Breiman's article:
http://statistics.berkeley.edu/sites/default/files/tech-reports/367.pdf

Do you have any code for that, please?

@NycteaBubo

This is genius. Thanks!

@kransom14

Thank you for posting this! It's just what I needed! I am using this function to prepare some data for my research. Do you have a preferred citation so I can cite your function in my paper?

@danahaggarty

Hi there! Great function! I am using it to randomly select a subset of species observations so that we can verify the species identification of a randomly selected subset. So, I can easily stratify by species. However, if I want the observations to also be stratified by transect so that if possible, the species checked are from different transects, it becomes more complicated. I tried: size=c("TransectName"=1,"SpeciesName"=5)
to choose 5 observations of each species, each from a different transect, but this didn't work.
Error message: (Error in stratified(df, group = c("SpeciesName", "TransectName"), size = c(TransectName = 1, : Number of groups is 508 but number of sizes supplied is 2)
It gets more complicated, because if it isn't possible to get 5 observations from different transects, they can come from the same transect.
Any ideas on how I would accomplish this?
If not, I'll drop the stratification by transect.
Thanks!!
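
One possible way to approach the question above, offered only as a sketch and not as a feature of `stratified()`: shuffle the rows of each species, rank each row within its transect, and take the lowest-ranked rows first. Rows from distinct transects are then preferred, and repeats from the same transect are used only when there are not enough transects. The data frame `obs` and the helper `pick_spread()` are hypothetical.

```r
set.seed(42)

# Hypothetical observations: sp1 has 12 rows over 3 transects, sp2 has 3 rows
obs <- data.frame(
  SpeciesName  = rep(c("sp1", "sp2"), times = c(12, 3)),
  TransectName = c(rep(c("T1", "T2", "T3"), each = 4), "T1", "T1", "T2"),
  stringsAsFactors = FALSE
)

pick_spread <- function(d, n = 5) {
  # Shuffle so that the choice within each transect is random
  d <- d[sample(nrow(d)), , drop = FALSE]
  # 1 = first shuffled row of its transect, 2 = second, and so on
  rank_in_transect <- ave(seq_len(nrow(d)), d$TransectName, FUN = seq_along)
  # Rows ranked 1 (one per transect) come first, then rows ranked 2, ...
  d <- d[order(rank_in_transect), , drop = FALSE]
  # Returns every row when the species has fewer than n observations
  head(d, n)
}

picked <- do.call(rbind,
                  lapply(split(obs, obs$SpeciesName), pick_spread, n = 5))
```

Here `sp1` yields five rows covering all three of its transects, while `sp2` yields its three available rows.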

@Tamaramaria

Great function! I have a case in which I am not sure it could be used.

Let's say I have a dataset x and a dataset y. Dataset x contains N observations and dataset y contains M observations, where N>M. Both datasets contain the same k variables. I want to draw from dataset x a sample that is representative of dataset y.
Is it possible to use the function and specify the proportions found in dataset y for the categorical variables (1) Size and (2) Sector?

stratified(Dataset x, c("Size", "Sector"), ...)

Thanks very much!
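
A rough sketch of one way this could be done in base R, under the assumption that every Size/Sector combination found in y also occurs in x with enough rows. The data, the target sample size `target_n`, and all variable names are hypothetical.

```r
set.seed(1)

# Hypothetical data: x is the large pool, y defines the target proportions
x <- data.frame(Size   = sample(c("S", "L"), 1000, replace = TRUE),
                Sector = sample(c("A", "B"), 1000, replace = TRUE))
y <- data.frame(Size   = sample(c("S", "L"), 200, replace = TRUE,
                                prob = c(0.8, 0.2)),
                Sector = sample(c("A", "B"), 200, replace = TRUE,
                                prob = c(0.3, 0.7)))

target_n <- 100  # assumed number of rows to draw from x

# Share of each Size/Sector stratum in y, scaled to the target size
y_share <- prop.table(table(interaction(y[c("Size", "Sector")], drop = TRUE)))
n_per   <- setNames(as.vector(round(y_share * target_n)), names(y_share))

# Split x by the same strata; level names take the form "Size.Sector"
x_split <- split(x, interaction(x[c("Size", "Sector")], drop = TRUE))
stopifnot(all(names(n_per) %in% names(x_split)))

# Draw the required number of rows from each stratum of x
sampled <- do.call(rbind, lapply(names(n_per), function(s) {
  rows <- x_split[[s]]
  rows[sample(nrow(rows), n_per[[s]]), , drop = FALSE]
}))
```

Because each stratum count is rounded separately, the total can differ from `target_n` by a row or two. The named vector `n_per` has the same shape as the named `size` vector that `stratified()` accepts.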

@karagawa

Great function! @mrdwab, could you please provide an official citation guide for citing your function/package? Thank you!

@robsalasco

thank you!

@jcms2665

Great Function !!!!!......Thanks a lot !!

@Kriseye

Kriseye commented Jul 5, 2017

Great. Thank you very much for that.

@Opadera

Opadera commented Oct 23, 2017

awesome!

@ShivaniMahendra

Hi, thank you for the amazing code. I have a question about using multiple columns to create strata.
Here you have shown one example, "stratified(dat1, c("E", "D"), size = 0.15)", where both "E" and "D" are categorical columns. I was wondering whether multiple numerical columns can be used as well. Could you please advise?
Basically, I would expect your code stratified(dat1, c("B", "C"), size = 0.15) to return some output.
Thanks in advance.
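
Since the function builds its strata with `interaction()`, every distinct value of a numeric column becomes its own group, and continuous columns then tend to produce groups of one row each. A common workaround, sketched here with hypothetical data and arbitrarily chosen bins, is to bin the numeric columns with `cut()` and stratify on the bins instead.

```r
set.seed(1)

# Hypothetical data with two continuous columns
dat <- data.frame(B = rnorm(200), C = runif(200))

# Bin B into quartiles and C into two halves (the bin choices are assumptions)
dat$B_bin <- cut(dat$B, breaks = quantile(dat$B, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE)
dat$C_bin <- cut(dat$C, breaks = c(0, 0.5, 1), include.lowest = TRUE)

# Stratify on the binned columns and take about 15% of each stratum
strata  <- interaction(dat[c("B_bin", "C_bin")], drop = TRUE)
sampled <- do.call(rbind, lapply(split(dat, strata), function(g) {
  g[sample(nrow(g), round(nrow(g) * 0.15)), , drop = FALSE]
}))
```

With the binned columns in place, a call such as `stratified(dat, c("B_bin", "C_bin"), size = 0.15)` would be the equivalent use of the function above.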

@svknair

svknair commented Jul 25, 2019

Hi, I tried to load the function using the following commands:

library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")

But, I got the following error:

Error in r_files[[which]] : invalid subscript type 'closure'

I would really appreciate your help fixing this. This is exactly the function I have been looking for, and I desperately need to use it.

@blechturm

Wow this is exactly what I need! Thank you so much!

By the way, is there a way to apply population weights for the sampling?
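
The function as posted has no weights argument. If "population weights" here means unequal selection probabilities within each stratum, base R's `sample()` accepts a `prob` vector, so a minimal sketch could look like the following. The data frame `dat` and its columns `g` (stratum) and `w` (weight) are hypothetical.

```r
set.seed(1)

# Hypothetical data: two strata and a positive weight per row
dat <- data.frame(g = rep(c("a", "b"), each = 50),
                  w = runif(100))

# Draw 10 rows per stratum, with rows of larger weight more likely to be picked
sampled <- do.call(rbind, lapply(split(dat, dat$g), function(d) {
  d[sample(nrow(d), 10, prob = d$w), , drop = FALSE]
}))
```

This covers only weighted selection; weighting the resulting estimates afterwards is a separate step that this sketch does not address.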

@edwardriveros

Thanks so much for this code, it works perfectly.

@clonyem

clonyem commented Apr 27, 2024

Hi there Ananda,

How should I attribute your article, for example when citing the material?
This is top stuff, indeed.
