@GuiMarthe
Created July 20, 2019 20:36
A simple procedure for resampling one distribution so that it looks like another. One method estimates the densities by binning, the other by kernel density estimation (KDE). The binning idea came from this Stats Stack Exchange question, and the KDE method came from other studies of mine.
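Both methods can be read as a form of importance resampling. As a sketch, using notation introduced here for illustration (it does not appear in the original script): let $\hat p_1$ and $\hat p_2$ be the estimated densities of group1 and group2, obtained from bin proportions or from a KDE. Each group1 observation $x_i$ is then drawn, with replacement, with probability proportional to the density ratio:

```latex
\[
  w_i = \frac{\hat p_2(x_i)}{\hat p_1(x_i)},
  \qquad
  \Pr(\text{draw } x_i) = \frac{w_i}{\sum_j w_j}
\]
```

Observations that fall where group2 has no estimated density get weight zero and are never drawn.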
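library(tidyverse)
library(broom)
set.seed(2019) # fixed seed so the simulated data and the draws below are reproducible
# group1: 80,000 draws from one wide normal
# group2: 10,000 draws from a mixture of two narrow normals
df <-
  tibble(
    label = factor(c(rep("group1", 8E4), rep("group2", 1E4))),
    var = c(
      rnorm(n = 8E4, mean = 2, sd = 5),
      rnorm(n = 5E3, mean = -2, sd = 0.5),
      rnorm(n = 5E3, mean = 1, sd = 0.5)
    )
  )
# overlaid histograms of the two groups
df %>%
  ggplot(aes(var)) +
  geom_histogram(aes(fill = label), bins = 100, position = 'identity', alpha = 0.8)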
# densities by binning: cut var into 100 equal-width bins shared by both groups
df <-
  df %>%
  mutate(bins = cut(var, breaks = 100))
# share of each group's observations in each bin, one column per group
densities <-
  df %>%
  group_by(label, bins) %>%
  summarise(total = n()) %>%
  mutate(prop = total / sum(total)) %>% # still grouped by label, so prop sums to 1 per group
  select(-total) %>%
  spread(label, prop, fill = 0)
# resample group1 with weights equal to the bin-wise ratio group2 / group1
df %>%
  filter(label == 'group1') %>%
  left_join(densities, by = 'bins') %>%
  mutate(weight = group2 / group1) %>%
  sample_n(20000, replace = TRUE, weight = weight) %>%
  mutate(label = 'group1 (sampled)') %>%
  bind_rows(df) %>%
  ggplot(aes(var)) +
  geom_histogram(aes(fill = label), bins = 100, position = 'identity', alpha = 0.8)
# density by kde: approxfun() turns each density() estimate into a function of var
kde_function_group1 <-
  df %>% filter(label == 'group1') %>% pull(var) %>% density(.) %>% approxfun(.)
kde_function_group2 <-
  df %>% filter(label == 'group2') %>% pull(var) %>% density(.) %>% approxfun(.)
df %>%
  filter(label == 'group1') %>%
  mutate(
    g2d = kde_function_group2(var),
    g1d = kde_function_group1(var),
    # sampling importance; approxfun() gives NA outside the group2 KDE range, so those rows get weight 0
    weight = coalesce(g2d / g1d, 0)
  ) %>%
  sample_n(20000, replace = TRUE, weight = weight) %>%
  mutate(label = 'group1 (sampled)') %>%
  bind_rows(df) %>%
  ggplot(aes(var)) +
  geom_histogram(aes(fill = label), bins = 100, position = 'identity', alpha = 0.8)
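The histograms give a visual check only. As a rough numerical check, the sketch below repeats the KDE approach in base R (no tidyverse needed) and compares a few quantiles of the resampled values with those of the target. The names `source_x`, `target_x`, `w` and `resampled`, the seed, and the choice of quantiles are all assumptions made for this illustration; none of them is part of the original script.

```r
# Minimal base-R sketch of KDE-weighted resampling (illustrative names, not from the gist).
set.seed(42) # arbitrary seed, chosen only for reproducibility

# Same simulated setup as above: a wide source and a bimodal target.
source_x <- rnorm(8e4, mean = 2, sd = 5)
target_x <- c(rnorm(5e3, mean = -2, sd = 0.5), rnorm(5e3, mean = 1, sd = 0.5))

# Interpolated KDEs; approxfun() returns NA outside each estimate's range.
kde_source <- approxfun(density(source_x))
kde_target <- approxfun(density(target_x))

# Importance weights: target density over source density, 0 where undefined.
w <- kde_target(source_x) / kde_source(source_x)
w[is.na(w)] <- 0

# Weighted resampling with replacement; sample() normalises prob internally.
resampled <- sample(source_x, size = 2e4, replace = TRUE, prob = w)

# Side-by-side quantiles of the target and the resampled source.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(
  target    = quantile(target_x, probs),
  resampled = quantile(resampled, probs)
)
print(round(comparison, 2))
```

If the two rows of `comparison` are close, the resampled group1 values follow the group2 distribution reasonably well. Large gaps would suggest that group1 has too few observations in some region where group2 has mass.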