Created
April 17, 2021 20:32
-
-
Save b-rodrigues/d9efe80f879f95d305cd661b63e2dee9 to your computer and use it in GitHub Desktop.
source code to my blog post about "dealing with non-representative samples with post-stratification"
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
date: 2021-04-17 | |
title: Dealing with non-representative samples with post-stratification | |
tags: [R] | |
menu: | |
main: | |
parent: Blog | |
identifier: /blog/post_strat | |
weight: 1 | |
--- | |
<div style="text-align:center;"> | |
<a href="https://www.youtube.com/watch?v=eOBIIB690yE"> | |
<img src="/img/bingo.png" title = "It could have been worse"></a> | |
</div> | |
Let's go back to stats 101: what do you do if you want to know how many people like to play bingo | |
in a certain population? The answer, of course, is to ask a sample of people if they enjoy | |
playing bingo, compute the proportion and then... we're done! Right? | |
Well not exactly. This works if your sample is representative, which in practice, is not often | |
the case. | |
I am not an expert of survey methods, very far from it, but I was recently confronted to a similar | |
issue at work. So in this blog post I want to talk about estimating a proportion using a sample | |
that is not representative of the population using a method called "post-stratification". | |
By the way, before continuing, I also made a video about this topic if you're interested, | |
watch it [here](https://www.youtube.com/watch?v=eOBIIB690yE). | |
The data I use in this sample is simulated; so I know the "truth", since I made the data, and can | |
thus compare the results from post-stratification to the truth. At the end of the blog post, I | |
will post the complete source code, but for now, let's suppose that this is my sample: | |
```{r, include = FALSE} | |
library(tidyverse) | |
library(survey) | |
library(janitor) | |
library(brotools) | |
``` | |
```{r, eval = FALSE} | |
library(tidyverse) | |
library(survey) | |
library(janitor) | |
library(brotools) | |
``` | |
```{r, include = FALSE} | |
data("eusilcP", package = "simPop") | |
eusilcP <- eusilcP %>% | |
mutate(age_group = case_when(age < 20 ~ "19-", | |
between(age, 20, 49.999) ~ "20-49", | |
between(age, 50, 79.999) ~ "50-79", | |
age >= 80 ~ "80+")) | |
proba_vac <- function(gender, age, list_probas_female, list_probas_male){ | |
if(gender == "female"){ | |
case_when(age > 70 ~ sample(x = c(1, 0), 1, prob = list_probas_female$prob_70), | |
between(age, 40, 70) ~ sample(x = c(1, 0), 1, prob = list_probas_female$prob_middle_age), | |
between(age, 20, 39.999) ~ sample(x = c(1, 0), 1, prob = list_probas_female$prob_child), | |
age < 20 ~ sample(x = c(1, 0), 1, prob = list_probas_female$prob_child)) | |
} else { | |
case_when(age > 70 ~ sample(x = c(1, 0), 1, prob = list_probas_male$prob_70), | |
between(age, 40, 70) ~ sample(x = c(1, 0), 1, prob = list_probas_male$prob_middle_age), | |
between(age, 20, 39.999) ~ sample(x = c(1, 0), 1, prob = list_probas_male$prob_child), | |
age < 20 ~ sample(x = c(1, 0), 1, prob = list_probas_male$prob_child)) | |
} | |
} | |
list_probas_female_1 <- list(prob_70 = c(.8, .2), | |
prob_middle_age = c(.25, .65), | |
prob_young = c(.05, .95), | |
prob_child = c(.01, .99)) | |
list_probas_male_1 <- list(prob_70 = c(.75, .25), | |
prob_middle_age = c(.2, .8), | |
prob_young = c(.05, .95), | |
prob_child = c(.01, .99)) | |
list_probas_female_2 <- list(prob_70 = c(.6, .4), | |
prob_middle_age = c(.45, .55), | |
prob_young = c(.1, .9), | |
prob_child = c(.05, .95)) | |
list_probas_male_2 <- list(prob_70 = c(.55, .45), | |
prob_middle_age = c(.4, .6), | |
prob_young = c(.15, .85), | |
prob_child = c(.05, .95)) | |
eusilcP <- eusilcP %>% | |
mutate(likes_bingo_1 = pmap_dbl(list(gender, age), ~proba_vac(list_probas_female = list_probas_female_1, | |
list_probas_male = list_probas_male_1, | |
...))) %>% | |
mutate(likes_bingo_2 = pmap_dbl(list(gender, age), ~proba_vac(list_probas_female = list_probas_female_2, | |
list_probas_male = list_probas_male_2, | |
...))) %>% | |
mutate(likes_bingo_3 = pmap_dbl(list(gender, age), ~proba_vac(list_probas_female = list_probas_female_2, | |
list_probas_male = list_probas_male_2, | |
...))) %>% | |
mutate(likes_bingo_4 = pmap_dbl(list(gender, age), ~proba_vac(list_probas_female = list_probas_female_2, | |
list_probas_male = list_probas_male_2, | |
...))) %>% | |
mutate(likes_bingo_5 = pmap_dbl(list(gender, age), ~proba_vac(list_probas_female = list_probas_female_2, | |
list_probas_male = list_probas_male_2, | |
...))) %>% | |
mutate(likes_bingo_2 = ifelse(likes_bingo_1 + likes_bingo_2 >= 1, 1, 0), | |
likes_bingo_3 = ifelse(likes_bingo_2 + likes_bingo_3 >= 1, 1, 0), | |
likes_bingo_4 = ifelse(likes_bingo_3 + likes_bingo_4 >= 1, 1, 0), | |
likes_bingo_5 = ifelse(likes_bingo_4 + likes_bingo_5 >= 1, 1, 0)) | |
age_distribution_population <- eusilcP %>% | |
tabyl(age_group) | |
my_sample_1 <- eusilcP %>% | |
group_by(age_group) %>% | |
nest() %>% | |
#left_join(population_freq) %>% | |
bind_cols(n_to_sample = c(174, 540, 40, 150)) %>% | |
mutate(sample_in_group = map2(.x = data, | |
.y = n_to_sample, | |
~sample_n(tbl = .x, size = .y))) %>% | |
select(age_group, sample_in_group) %>% | |
unnest(cols = sample_in_group) %>% | |
ungroup() | |
pop_marginal_age <- eusilcP %>% | |
count(age_group, name = "Freq") | |
unweighted_data <- svydesign(ids = ~1, data = my_sample_1, control = list(maxit = 100)) | |
weighted_data <- rake(design = unweighted_data, | |
sample.margins = list(~age_group), | |
population.margins = list(pop_marginal_age)) | |
svymean(~likes_bingo_1, weighted_data, na.rm = TRUE) | |
# every week | |
likes_bingo_through_time <- eusilcP %>% | |
pivot_longer(cols = starts_with("likes_bingo"), | |
names_to = "week", | |
values_to = "freq") %>% | |
group_by(week) %>% | |
summarise(freq = mean(freq)) | |
ggplot(likes_bingo_through_time) + | |
geom_line(aes(y = freq, x = week), group = 1) | |
my_sample_2 <- eusilcP %>% | |
filter(!(id %in% my_sample_1$id)) %>% | |
group_by(age_group) %>% | |
nest() %>% | |
#left_join(population_freq) %>% | |
bind_cols(n_to_sample = c(194, 740, 60, 35)) %>% | |
mutate(sample_in_group = map2(.x = data, | |
.y = n_to_sample, | |
~sample_n(tbl = .x, size = .y))) %>% | |
select(age_group, sample_in_group) %>% | |
unnest(cols = sample_in_group) %>% | |
ungroup() | |
my_sample_2 %>% | |
summarise(mean(likes_bingo_2)) | |
my_sample_3 <- eusilcP %>% | |
filter(!(id %in% my_sample_1$id)) %>% | |
filter(!(id %in% my_sample_2$id)) %>% | |
group_by(age_group) %>% | |
nest() %>% | |
#left_join(population_freq) %>% | |
bind_cols(n_to_sample = c(1224, 120, 72, 20)) %>% | |
mutate(sample_in_group = map2(.x = data, | |
.y = n_to_sample, | |
~sample_n(tbl = .x, size = .y))) %>% | |
select(age_group, sample_in_group) %>% | |
unnest(cols = sample_in_group) %>% | |
ungroup() | |
my_sample_3 %>% | |
summarise(mean(likes_bingo_3)) | |
my_sample_4 <- eusilcP %>% | |
filter(!(id %in% my_sample_1$id)) %>% | |
filter(!(id %in% my_sample_2$id)) %>% | |
filter(!(id %in% my_sample_3$id)) %>% | |
group_by(age_group) %>% | |
nest() %>% | |
#left_join(population_freq) %>% | |
bind_cols(n_to_sample = c(440, 200, 320, 420)) %>% | |
mutate(sample_in_group = map2(.x = data, | |
.y = n_to_sample, | |
~sample_n(tbl = .x, size = .y))) %>% | |
select(age_group, sample_in_group) %>% | |
unnest(cols = sample_in_group) %>% | |
ungroup() | |
my_sample_4 %>% | |
summarise(mean(likes_bingo_4)) | |
my_sample_5 <- eusilcP %>% | |
filter(!(id %in% my_sample_1$id)) %>% | |
filter(!(id %in% my_sample_2$id)) %>% | |
filter(!(id %in% my_sample_3$id)) %>% | |
filter(!(id %in% my_sample_4$id)) %>% | |
group_by(age_group) %>% | |
nest() %>% | |
#left_join(population_freq) %>% | |
bind_cols(n_to_sample = c(504, 325, 720, 20)) %>% | |
mutate(sample_in_group = map2(.x = data, | |
.y = n_to_sample, | |
~sample_n(tbl = .x, size = .y))) %>% | |
select(age_group, sample_in_group) %>% | |
unnest(cols = sample_in_group) %>% | |
ungroup() | |
my_sample_5 %>% | |
summarise(mean(likes_bingo_5)) | |
samples <- bind_rows(my_sample_1, | |
my_sample_2, | |
my_sample_3, | |
my_sample_4, | |
my_sample_5) %>% | |
pivot_longer(cols = starts_with("likes_bingo"), | |
names_to = "week", | |
values_to = "yes") %>% | |
select(age_group, week, yes) | |
samples_likes_bingo_through_time <- samples %>% | |
group_by(week) %>% | |
summarise(freq = mean(yes)) | |
``` | |
```{r} | |
my_sample_1 | |
``` | |
Let's suppose that we have asked people two questions: their age, and whether or not they like | |
bingo. Using this sample, I obtain the following result: | |
```{r} | |
result <- mean(my_sample_1$likes_bingo_1) | |
``` | |
So according to this sample, `r round(result*100, 2)`% of people in my population like bingo. | |
But is that right? Let's use the other piece of information we have: the interviewee's ages. This | |
is the distribution of the age group in my sample: | |
```{r} | |
my_sample_1 %>% | |
tabyl(age_group) | |
``` | |
We want to compare this to the distribution of the same age groups in the population. Thankfully, | |
this is something that is readily available in most (all?) countries. National statistical | |
institutes publish such data on a yearly basis. This is the distribution in the population: | |
```{r} | |
age_distribution_population | |
``` | |
As we can see, our sample is completely off! Elderly people are over-represented while younger | |
people are under-represented. Perhaps this happened because elderly people love bingo more | |
than younger people and, when given the opportunity to confess their love for bingo, are more | |
willing to answer to a survey. Whatever the reason, it would be unreasonable to assume that the | |
proportion given by our sample is a good, unbiased, estimate of the true proportion in the population. | |
What we would like to do here, is to compute weights for each individual in the sample, such that | |
individuals from over-represented groups contribute less to the computation of the proportion than | |
individuals from under-represented groups. This is where post-stratification and raking come | |
into play. As already said, I'm not an expert of these methods. So don't believe that this blog | |
post is a tutorial. However, what I'm going to show you might come in handy. | |
We're going to use the `{survey}` package to compute the weights using raking, by post-stratifying | |
the sample on age group. This can be done with two commands: | |
```{r} | |
unweighted_data <- svydesign(ids = ~1, data = my_sample_1) | |
weighted_data <- rake(design = unweighted_data, | |
sample.margins = list(~age_group), | |
population.margins = list(pop_marginal_age)) | |
``` | |
The first function, `svydesign()` allows you to create a new object based on your data, which | |
specifies the [design](https://stats.oecd.org/glossary/detail.asp?ID=3852) of your study. In this | |
case, I have used `ids = ~1` to say "I don't have any weights, nor anything specific to tell you". | |
Next, using the `rake()` function, I can compute the weights. For this, I need the object I created | |
before, the variable I want to post-stratify on, and then give a table that contains the distribution | |
of said variable in the population. This table looks a bit different from the one I already | |
showed you: it doesn't contain the categories' frequencies, and the variable containing the counts | |
is called `Freq` (`rake()` looks for this variable so it must be named like this): | |
```{r} | |
pop_marginal_age | |
``` | |
We can now take a look at the weights: | |
```{r} | |
summary(weights(weighted_data)) | |
``` | |
In cases where you have very high or very low weights, the literature recommends trimming them. | |
However, I have not seen anything very definitive on this, and it seems that practitioners rely | |
on rules of thumb and gut feeling to know when to trim weights. In my example here, I don't think | |
it is needed, but as I said, I have no intuition for this. Anyways, we are now ready to compute the | |
new proportion: | |
```{r} | |
svymean(~likes_bingo_1, weighted_data) | |
``` | |
The result is quite different from before (it was `r round(result*100, 2)`% in the "raw" sample)! | |
Because I have simulated the data, I can now compare to the "true" value: | |
```{r} | |
eusilcP %>% | |
summarise(mean(likes_bingo_1)) | |
``` | |
And we're quite close! | |
Now let's continue a little bit, with a more complicated example. Imagine that I collected five | |
samples, one per week. Each sample contains totally different people (no person gets asked twice). | |
Also, imagine that while I'm collecting my samples and analyzing them, bingo fever is running amok | |
in my country, always infecting more and more people. As time passes, the proportion of people who | |
love bingo keeps increasing. So my population's parameter keeps changing, and each week, when I get | |
a new sample, the proportion in my sample will also grow on a weekly basis. | |
Because of this, I have to compute weights each week. Thankfully, the distribution of age groups in | |
my population can be assumed to stay constant, so I don't need to think about that. | |
Let's take a look at my sample which contains 5 weeks of data: | |
```{r} | |
samples | |
``` | |
Each row is one person, and this person gets sample exactly once. The `yes` variable collects the | |
answer to the question "do you like bingo?". Let's see how my proportion evolves through time: | |
```{r} | |
(samples_likes_bingo_through_time <- samples %>% | |
group_by(week) %>% | |
summarise(freq = mean(yes))) | |
``` | |
We see that it keeps increasing: this is a good sign, since we know that this is also the case | |
in the population. We just don't know by how much. Let's compute weights for each week, and then | |
recompute estimated proportions using these weights. In order to do this, I will write a function | |
that will make it easy to do just that: | |
```{r} | |
compute_weekly_weights <- function(sample_df){ | |
unweighted_data <- svydesign(ids = ~1, data = sample_df) | |
rake(design = unweighted_data, | |
sample.margins = list(~age_group), | |
population.margins = list(pop_marginal_age)) | |
} | |
``` | |
This function does the exact same thing as before. But it will now make it easy to apply to each | |
week using the `group_by`-`nest`-`map` approach: | |
```{r} | |
weighted_samples <- samples %>% | |
group_nest(week) %>% | |
mutate(weights = map(data, compute_weekly_weights)) %>% | |
mutate(svymeans = map(weights, ~svymean(~yes, .))) | |
``` | |
Let's take a look at this object: | |
```{r} | |
weighted_samples | |
``` | |
So for each week, I have now a `svydesign` object and also a new, hopefully unbiased, proportion | |
of people who like bingo. The following lines simply but this into a nice tibble: | |
```{r} | |
weighted_samples <- weighted_samples %>% | |
mutate(svymeans = map(svymeans, as_tibble)) %>% | |
select(week, svymeans) %>% | |
unnest(cols = svymeans) %>% | |
rename(freq = mean, | |
SE = yes) %>% | |
mutate(is = "corrected_sample") | |
``` | |
To conclude, let's create a plot that compares the proportions computed without using weights to the | |
proportions computed with weights to the true values that I simulated myself. I put everything | |
in a data frame and the create the plot: | |
```{r} | |
all_data <- bind_rows(weighted_samples, # my corrected data | |
mutate(samples_likes_bingo_through_time, is = "raw_sample"), # the raw samples | |
mutate(likes_bingo_through_time, is = "true_value")) %>% # the true, simulated, values | |
mutate(SE = ifelse(is.na(SE), 0, SE)) | |
ggplot(all_data) + | |
geom_ribbon(aes(y = freq, x = week, | |
ymin = freq - 2*SE, | |
ymax = freq + 2*SE, | |
group = is), | |
fill = "pink", | |
alpha = .3) + | |
geom_line(aes(y = freq, x = week, colour = is, group = is)) + | |
theme_blog() | |
``` | |
We can see that the proportions computed without weights were clearly over-estimating the true | |
share of bingo enthusiasts in the population. The weighted proportions are very close to the true | |
values and are acceptable estimates of the true proportions! | |
Hope you enjoyed! If you found this blog post useful, you might want to follow | |
me on [twitter](https://www.twitter.com/brodriguesco) for blog post updates and | |
[buy me an espresso](https://www.buymeacoffee.com/brodriguesco) or [paypal.me](https://www.paypal.me/brodriguesco), or buy my ebook on [Leanpub](https://leanpub.com/modern_tidyverse). | |
You can also watch my videos on [youtube](https://www.youtube.com/c/BrunoRodrigues1988/). | |
So much content for you to consoom! | |
<style>.bmc-button img{width: 27px !important;margin-bottom: 1px !important;box-shadow: none !important;border: none !important;vertical-align: middle !important;}.bmc-button{line-height: 36px !important;height:37px !important;text-decoration: none !important;display:inline-flex !important;color:#ffffff !important;background-color:#272b30 !important;border-radius: 3px !important;border: 1px solid transparent !important;padding: 1px 9px !important;font-size: 22px !important;letter-spacing:0.6px !important;box-shadow: 0px 1px 2px rgba(190, 190, 190, 0.5) !important;-webkit-box-shadow: 0px 1px 2px 2px rgba(190, 190, 190, 0.5) !important;margin: 0 auto !important;font-family:'Cookie', cursive !important;-webkit-box-sizing: border-box !important;box-sizing: border-box !important;-o-transition: 0.3s all linear !important;-webkit-transition: 0.3s all linear !important;-moz-transition: 0.3s all linear !important;-ms-transition: 0.3s all linear !important;transition: 0.3s all linear !important;}.bmc-button:hover, .bmc-button:active, .bmc-button:focus {-webkit-box-shadow: 0px 1px 2px 2px rgba(190, 190, 190, 0.5) !important;text-decoration: none !important;box-shadow: 0px 1px 2px 2px rgba(190, 190, 190, 0.5) !important;opacity: 0.85 !important;color:#82518c !important;}</style><link href="https://fonts.googleapis.com/css?family=Cookie" rel="stylesheet"><a class="bmc-button" target="_blank" href="https://www.buymeacoffee.com/brodriguesco"><img src="https://www.buymeacoffee.com/assets/img/BMC-btn-logo.svg" alt="Buy me an Espresso"><span style="margin-left:5px">Buy me an Espresso</span></a> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment