@dantonnoriega
Created February 10, 2022 23:09
A simple script that tests out different ways of implementing the `if_any` logic in native data.table
library(tidyverse)
dat <- as_tibble(mtcars) %>%
  mutate(vs = as.character(vs),
         am = as.character(am)) # just to make some columns non-numeric
dd0 <- dat %>%
  select(where(is.numeric)) %>% # base is.numeric; purrr::is_numeric is no longer available
  filter(if_any(disp:wt, ~ .x > 100))
dd0
library(data.table)
dd = data.table::as.data.table(dat)
rdx = dd[, .SD, .SDcols = is.numeric]
# Reduce lists using vectorized "or" ('|')
ii = rdx[, Reduce('|', lapply(.SD, '>', 100)), .SDcols = disp:wt]
## keep where any true
dd1 = rdx[ii]
identical(setDT(dd0), dd1)
# all at once
rdx[rdx[, Reduce('|', lapply(.SD, '>', 100)), .SDcols = disp:wt]]
# benchmark
library(data.table)
set.seed(1000)
n_m = expand.grid(n = c(3,12), m = c(2.5,100)*1e4)
#
results = mapply(function(n, m) {
  my.df <- sample(1:80, m*n, replace=TRUE)
  dim(my.df) <- c(m,n)
  my.df <- as.data.frame(my.df)
  names(my.df) <- c(LETTERS,letters)[1:n]
  my.dt <- as.data.table(my.df)
  bench::mark(
    # using Reduce with lapply()
    tm1 = my.dt[my.dt[, Reduce('|', lapply(.SD, '>', 75))]],
    # using rowSums
    tm2 = my.dt[rowSums(my.dt[, lapply(.SD, '>', 75)]) > 0],
    # using apply with any()
    tm3 = my.dt[apply(my.dt[, lapply(.SD, '>', 75)], 1, any)],
    # dtplyr
    tm4 = my.dt %>% dplyr::filter(if_any(.fns = ~ .x > 75)),
    iterations=30L,
    time_unit = 's'
  ) %>%
    dplyr::mutate(n = n, m = m)
}, n = n_m$n, m = n_m$m, SIMPLIFY = FALSE)
dplyr::bind_rows(results) %>%
  dplyr::select(n, m, expression, median,
                total_time, `itr/sec`, mem_alloc,
                n_itr, n_gc)
@dantonnoriega

The goal here was to extract all rows where a condition is TRUE for at least one column in a chosen subset of columns, using only base R and pure data.table.
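
As a rough sketch of that goal, the `Reduce` pattern from the script can be wrapped in a small function. The name `filter_if_any` and its arguments are hypothetical, made up for illustration; this is not a data.table function, and it assumes the table has no columns named `pred` or `cols`.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the rows of `dt` where
# `pred` returns TRUE for at least one of the columns named in `cols`.
filter_if_any <- function(dt, cols, pred) {
  # One logical vector per selected column, folded together with `|`
  keep <- dt[, Reduce(`|`, lapply(.SD, pred)), .SDcols = cols]
  dt[keep]
}

cars <- as.data.table(mtcars)
# In mtcars, disp:wt spans the columns disp, hp, drat and wt
big <- filter_if_any(cars, c("disp", "hp", "drat", "wt"), function(x) x > 100)
big
```

Passing the predicate as a function keeps the helper independent of the `> 100` threshold used in the script.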

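Basically, every approach except apply() with any() (tm3) performs well at low cost. Reduce() with a logical operator like | appears to be the most efficient overall: it has the lowest median time in the 3-column cases and the lowest memory allocation in every case, although rowSums (tm2) has a slightly lower median time with 12 columns.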

# A tibble: 16 × 9
       n       m expression   median total_time `itr/sec` mem_alloc n_itr  n_gc
   <dbl>   <dbl> <bch:expr>    <dbl>      <dbl>     <dbl> <bch:byt> <int> <dbl>
 1     3   25000 tm1        0.000473     0.0146   2051.     719.8KB    30     0
 2     3   25000 tm2        0.000707     0.0217   1383.      1.39MB    30     0
 3     3   25000 tm3        0.0153       0.245      65.2     1.86MB    16    14
 4     3   25000 tm4        0.00164      0.0479    606.    858.86KB    29     1
 5    12   25000 tm1        0.00203      0.0624    481.      3.02MB    30     0
 6    12   25000 tm2        0.00187      0.0538    539.      4.57MB    29     1
 7    12   25000 tm3        0.0185       0.315      54.0      5.9MB    17    13
 8    12   25000 tm4        0.00363      0.105     277.      3.16MB    29     1
 9     3 1000000 tm1        0.0128       0.400      75.0    25.64MB    30     2
10     3 1000000 tm2        0.0174       0.561      53.5    52.36MB    30     6
11     3 1000000 tm3        0.640       19.4         1.55   71.44MB    30   540
12     3 1000000 tm4        0.0185       0.579      51.8    33.22MB    30     3
13    12 1000000 tm1        0.0653       2.04       14.7   118.31MB    30    14
14    12 1000000 tm2        0.0621       1.93       15.5   179.37MB    30    22
15    12 1000000 tm3        0.826       24.9         1.21  232.77MB    30   544
16    12 1000000 tm4        0.0855       2.58       11.6   125.89MB    30    19
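
To see that the three data.table-side approaches select the same rows, here is a minimal sketch on toy data. The values are invented for illustration and are not taken from the benchmark. One plausible reason for the tm3 gap, which is not measured here, is that `apply()` calls `any()` once per row from R, while `Reduce()` with `|` only performs one vectorized operation per column.

```r
library(data.table)

# Toy data (hypothetical values, not from the benchmark)
toy <- data.table(a = c(1, 80, 3, 4),
                  b = c(76, 2, 3, 90),
                  c = c(5, 6, 7, 8))

# One logical vector per column: TRUE where the value exceeds 75
hits <- lapply(toy, `>`, 75)

# Reduce folds the list pairwise: (a > 75 | b > 75) | c > 75
keep_reduce <- Reduce(`|`, hits)
# rowSums counts the TRUE values in each row
keep_rowsums <- unname(rowSums(as.data.frame(hits)) > 0)
# apply calls any() once per row
keep_apply <- unname(apply(as.data.frame(hits), 1, any))

# All three masks agree on this toy input
stopifnot(identical(keep_reduce, keep_rowsums),
          identical(keep_reduce, keep_apply))

toy[keep_reduce]
```

The masks only differ in how they are computed, so the choice between them comes down to time and memory, as in the table above.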
