@ramhiser
Last active April 1, 2021 14:22
Stratified Sampling in R with dplyr
# Uses a subset of the iris data set with different proportions of the Species factor
library(dplyr)

set.seed(42)
iris_subset <- iris[c(1:50, 51:80, 101:120), ]
stratified_sample <- iris_subset %>%
  group_by(Species) %>%
  mutate(num_rows = n()) %>%
  sample_frac(0.4, weight = num_rows) %>%
  ungroup()

# These results should be equal
table(iris_subset$Species) / nrow(iris_subset)
table(stratified_sample$Species) / nrow(stratified_sample)

# Success!
#     setosa versicolor  virginica
#        0.5        0.3        0.2
@mikeyEcology

Thank you for posting this. What if you wanted to stratify sampling based on two conditions? For example, if you wanted equal proportions of both species and condition, using these data:

    iris_subset$condition <- rep(seq(1,5,by=1), 20)
    # next line does not run, but I'm wondering how it could. 
    stratified_sample <- iris_subset %>%
      group_by(c(Species,condition)) %>%
      mutate(num_rows=n()) %>%
      sample_frac(0.4, weight=num_rows) %>%
      ungroup
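
One possible way to make this run, sketched under the assumption that the goal is to sample the same fraction within every Species/condition cell: `group_by()` takes the grouping columns as separate arguments rather than wrapped in `c()`. The `weight` argument is left out here, which is a simplification and not part of the original gist.

```r
library(dplyr)

set.seed(42)
iris_subset <- iris[c(1:50, 51:80, 101:120), ]
iris_subset$condition <- rep(1:5, 20)

# Group on both columns, then draw 40% of the rows from each cell
stratified_sample <- iris_subset %>%
  group_by(Species, condition) %>%
  sample_frac(0.4) %>%
  ungroup()

# Compare the cell proportions before and after sampling
prop.table(table(iris_subset$Species, iris_subset$condition))
prop.table(table(stratified_sample$Species, stratified_sample$condition))
```

Because each cell's sample size has to be a whole number of rows, the proportions in small cells will only approximately match the original ones.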

@matiarno

matiarno commented Aug 6, 2018

Thanks for the post. A question about an earlier stage: how can I calculate the optimal sample size?

@louisaslett

This is nice, and it introduced sample_frac to me, thanks!

However, having now read the docs, I might not have entirely understood: why do we need num_rows, and why weight by it? The docs say sample_frac honours any grouping, so I would have thought this also achieves a stratified sample:

stratified_sample <- iris_subset %>%
  group_by(Species) %>%
  sample_frac(0.4) %>%
  ungroup

Indeed, I think sample_frac will by definition see the same weight for every row within a group, so the weight has no effect after grouping?
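
A quick way to check this is to run both pipelines side by side and compare the per-species counts. This is only a sketch; the names `weighted` and `unweighted` are introduced here for illustration and do not appear in the gist.

```r
library(dplyr)

iris_subset <- iris[c(1:50, 51:80, 101:120), ]

# Pipeline from the gist: weight each row by the size of its group
set.seed(42)
weighted <- iris_subset %>%
  group_by(Species) %>%
  mutate(num_rows = n()) %>%
  sample_frac(0.4, weight = num_rows) %>%
  ungroup()

# Simpler pipeline: rely on the grouping alone
set.seed(42)
unweighted <- iris_subset %>%
  group_by(Species) %>%
  sample_frac(0.4) %>%
  ungroup()

# Per-species counts under each approach
table(weighted$Species)
table(unweighted$Species)
```

Since num_rows is constant within each group, every row in a group carries the same sampling weight, so both tables should show the same per-species counts, although the particular rows drawn may differ.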

@Martins6

Big thanks! I can't tell you how much this helped me on my projects! :D

@Martins6

Thank you for posting this. What if you wanted to stratify sampling based on two conditions? For example, if you wanted equal proportions of both species and condition, using these data:

    iris_subset$condition <- rep(seq(1,5,by=1), 20)
    # next line does not run, but I'm wondering how it could. 
    stratified_sample <- iris_subset %>%
      group_by(c(Species,condition)) %>%
      mutate(num_rows=n()) %>%
      sample_frac(0.4, weight=num_rows) %>%
      ungroup

Maybe you could filter after grouping? Something like this:

  stratified_sample <- iris_subset %>%
   group_by(Species) %>%
   filter(condition == TRUE) %>%
   mutate(num_rows=n()) %>% ...

@prabhatsgautam

This is nice, and it introduced sample_frac to me, thanks!

However, having now read the docs, I might not have entirely understood: why do we need num_rows, and why weight by it? The docs say sample_frac honours any grouping, so I would have thought this also achieves a stratified sample:

stratified_sample <- iris_subset %>%
  group_by(Species) %>%
  sample_frac(0.4) %>%
  ungroup

Indeed, I think sample_frac will by definition see the same weight for every row within a group, so the weight has no effect after grouping?

I have the same question as above. Any responses on this?
