@ramhiser
Last active April 1, 2021 14:22
Stratified Sampling in R with dplyr
# Uses a subset of the Iris data set with different proportions of the Species factor
library(dplyr)

set.seed(42)
iris_subset <- iris[c(1:50, 51:80, 101:120), ]

# Sample 40% of the rows within each Species group
stratified_sample <- iris_subset %>%
  group_by(Species) %>%
  mutate(num_rows = n()) %>%
  sample_frac(0.4, weight = num_rows) %>%
  ungroup()

# These results should be equal
table(iris_subset$Species) / nrow(iris_subset)
table(stratified_sample$Species) / nrow(stratified_sample)

# Success!
#     setosa versicolor  virginica
#        0.5        0.3        0.2
@prabhatsgautam

This is nice, introduced sample_frac to me thanks!

However, having now read the docs, I'm not sure I fully understand: why do we need to add num_rows and weight by it? The docs say sample_frac honours any grouping, so I would have thought this also achieves a stratified sample:

stratified_sample <- iris_subset %>%
  group_by(Species) %>%
  sample_frac(0.4) %>%
  ungroup

Indeed, I think that after grouping, sample_frac sees the same weight for every row within a group (num_rows is constant per group), so the weight has no effect?

I have the same question as above. Any responses on this?
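
One way to check the question above is to run both pipelines on the same subset and compare the per-species counts. This is only a sketch, assuming a dplyr version in which sample_frac is still available (it has been superseded by slice_sample since dplyr 1.0.0); the names unweighted and weighted are made up for illustration.

```r
library(dplyr)

set.seed(42)
iris_subset <- iris[c(1:50, 51:80, 101:120), ]

# Grouped sampling without weights: 40% of the rows within each Species
unweighted <- iris_subset %>%
  group_by(Species) %>%
  sample_frac(0.4) %>%
  ungroup()

# Grouped sampling with a weight that is constant within each group
weighted <- iris_subset %>%
  group_by(Species) %>%
  mutate(num_rows = n()) %>%
  sample_frac(0.4, weight = num_rows) %>%
  ungroup()

# Each stratum contributes 40% of its rows either way (20, 12 and 8 rows),
# so both tables show the same counts per species
table(unweighted$Species)
table(weighted$Species)
```

Because sampling happens within each group and num_rows is identical for every row of a group, the weights only rescale equal selection probabilities. The specific rows drawn may differ between the two runs, but the stratum sizes, and therefore the proportions, should not.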
