@bgall
Last active May 19, 2022 08:02
Benchmarking fastDummies::dummy_cols() against modeldb::add_dummy_variables()
# Compare the performance of two purportedly fast ways of generating dummy variables from a character vector.
# NOTE: fastDummies retains the original variable by default, while modeldb does not.
# Dependencies
library(microbenchmark)
library(fastDummies)
library(modeldb)
library(dplyr)
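# Illustration of the NOTE above on a toy data frame. This is a minimal sketch, not part of
# the benchmark: `toy`, `fd_names`, and `mdb_names` are hypothetical names introduced here.
# Calls are namespace-qualified so the snippet runs on its own.
toy <- data.frame(x = c("a", "b", "c", "a"))
# fastDummies: per the NOTE, the original column x should remain alongside the dummies
fd_names <- names(fastDummies::dummy_cols(toy, remove_first_dummy = TRUE))
# modeldb: per the NOTE, the original column x should be dropped, leaving only the dummies
mdb_names <- names(modeldb::add_dummy_variables(toy, x = x, auto_values = TRUE))
print(fd_names)
print(mdb_names)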
# Simulate data: 1 million rows, 1 variable with 26 unique values
set.seed(123)
df <- data.frame(x = sample(LETTERS, size = 1000000, replace = TRUE))
# Benchmarks
n_times <- 1000  # number of benchmark iterations (named n_times to avoid masking base::t)
result_modeldb <- microbenchmark::microbenchmark(
  z <- df %>% modeldb::add_dummy_variables(x = x, auto_values = TRUE),
  times = n_times)
result_fastdums <- microbenchmark::microbenchmark(
  z <- df %>% fastDummies::dummy_cols(remove_first_dummy = TRUE),
  times = n_times)
# Compare
result_modeldb %>% bind_rows(result_fastdums)