Key takeaways:
- everything inside `vars(...)` is exactly the same as the stuff inside `select(...)`!!!
- `vars()` is used for all scoped variants of dplyr verbs (I assume bc the variables need to "fit" into a single argument, `.vars`. In `select(...)`, the ellipsis takes everything)
- `vars_select()` is probably more of a developer-facing function (seen in `select_helpers` documentation)
Some "gotchas":
- Using `*_if` when I mean `*_at`
- Forgetting why `select(starts_with("a"), starts_with("b"))` works but `select_at(starts_with("a"), starts_with("b"))` errors
- Trying to use `vars()` when I can't (and vice versa)
  - example: `pivot_longer()` only takes a single select helper
- Trying to use "and" for two sets
  - example: `starts_with("total") & ends_with("count")` instead of using `matches("^total.*count$")`
- Trying to mix predicates with tidyselect (`starts_with("x")` AND `is.character()`) -- not sure if this is possible except maybe in two steps
- Trying to use regex in `starts_with()` (e.g. `starts_with("a|b")`) -- `matches()` takes regular expressions but `starts_with()` does not!
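To make the last gotcha concrete, here is a minimal sketch on a made-up tibble (`toy` is hypothetical, not the `df` from the reprex below). It assumes only that `starts_with()` matches a literal prefix while `matches()` takes a regular expression, as described above.

```r
library(dplyr)

# Hypothetical toy data, just to contrast the two helpers
toy <- tibble(id = 1:3, a_count = 1:3, b_count = 4:6, c_count = 7:9)

# starts_with() treats "a|b" as a literal prefix, so nothing matches
literal <- toy %>% select(starts_with("a|b"))

# matches() takes a regular expression, so the alternation works
regex <- toy %>% select(matches("^(a|b)"))

names(literal) # character(0)
names(regex)   # "a_count" "b_count"
```

Anchoring the pattern with `^` keeps the "starts with" meaning; without it, `matches("a|b")` would also pick up any column that merely contains an "a" or a "b".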
library(tidyverse) # w/ tidyr 0.8.3.9000
df <- tibble(id = 1:10,
             a_count = sample(10),
             a_mean = sample(10),
             b_count = sample(10),
             b_mean = sample(10),
             c_count = sample(10),
             c_mean = sample(10),
             total_ab_count = a_count + b_count,
             total_ab_mean = (a_mean + b_mean)/2,
             total_bc_count = b_count + c_count,
             # NB: this repeats total_ab_mean; (b_mean + c_mean)/2 was probably intended,
             # but the output below reflects the code as it was run
             total_bc_mean = (a_mean + b_mean)/2)
# business as usual: using select()
df %>% select(id, starts_with("a"), starts_with("b"))
#> # A tibble: 10 x 5
#> id a_count a_mean b_count b_mean
#> <int> <int> <int> <int> <int>
#> 1 1 10 4 2 4
#> 2 2 7 7 10 1
#> 3 3 1 3 7 9
#> 4 4 9 8 6 7
#> 5 5 8 9 4 8
#> 6 6 5 6 8 6
#> 7 7 3 1 1 3
#> 8 8 2 10 3 5
#> 9 9 6 5 5 10
#> 10 10 4 2 9 2
# select_* works the same but we need vars() - presumably bc the column selection needs to "fit" into the first argument
df %>% select_at(vars(starts_with("a"), starts_with("b")))
#> # A tibble: 10 x 4
#> a_count a_mean b_count b_mean
#> <int> <int> <int> <int>
#> 1 10 4 2 4
#> 2 7 7 10 1
#> 3 1 3 7 9
#> 4 9 8 6 7
#> 5 8 9 4 8
#> 6 5 6 8 6
#> 7 3 1 1 3
#> 8 2 10 3 5
#> 9 6 5 5 10
#> 10 4 2 9 2
# not using vars()--even for just one selector--does not work
df %>% select_at(starts_with("a"))
#> No tidyselect variables were registered
# doesn't work but I feel like it should:
df %>% select_at(vars(starts_with("a|b")))
#> # A tibble: 10 x 0
# the stuff that goes into vars() is the stuff that goes into select()!
df %>% select_at(vars(-starts_with("a")))
#> # A tibble: 10 x 9
#> id b_count b_mean c_count c_mean total_ab_count total_ab_mean
#> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 1 2 4 2 4 12 4
#> 2 2 10 1 3 3 17 4
#> 3 3 7 9 4 6 8 6
#> 4 4 6 7 6 9 15 7.5
#> 5 5 4 8 1 5 12 8.5
#> 6 6 8 6 9 10 13 6
#> 7 7 1 3 7 2 4 2
#> 8 8 3 5 10 1 5 7.5
#> 9 9 5 10 8 8 11 7.5
#> 10 10 9 2 5 7 13 2
#> # … with 2 more variables: total_bc_count <int>, total_bc_mean <dbl>
# ...which is why this doesn't work: df %>% select_at(-vars(starts_with("a")))
# using "set-thinking":
df %>% select_at(vars(starts_with("total"), -ends_with("mean")))
#> # A tibble: 10 x 2
#> total_ab_count total_bc_count
#> <int> <int>
#> 1 12 4
#> 2 17 13
#> 3 8 11
#> 4 15 12
#> 5 12 5
#> 6 13 17
#> 7 4 8
#> 8 5 13
#> 9 11 13
#> 10 13 14
# but if I want to make the same selection "positively" (starts_with("total") AND ends_with("count"))
# I end up with extra columns (bc it's "or")
df %>% select_at(vars(starts_with("total"), ends_with("count")))
#> # A tibble: 10 x 7
#> total_ab_count total_ab_mean total_bc_count total_bc_mean a_count
#> <int> <dbl> <int> <dbl> <int>
#> 1 12 4 4 4 10
#> 2 17 4 13 4 7
#> 3 8 6 11 6 1
#> 4 15 7.5 12 7.5 9
#> 5 12 8.5 5 8.5 8
#> 6 13 6 17 6 5
#> 7 4 2 8 2 3
#> 8 5 7.5 13 7.5 2
#> 9 11 7.5 13 7.5 6
#> 10 13 2 14 2 4
#> # … with 2 more variables: b_count <int>, c_count <int>
# or if I try "and", it doesn't work:
df %>% select_at(vars(starts_with("total") & ends_with("count")))
#> Warning in starts_with("total") & ends_with("count"): longer object length
#> is not a multiple of shorter object length
#> `starts_with("total") & ends_with("count")` must evaluate to column
#> positions or names, not a logical vector
# can be solved with matches()
df %>% select_at(vars(matches("^total.*count$")))
#> # A tibble: 10 x 2
#> total_ab_count total_bc_count
#> <int> <int>
#> 1 12 4
#> 2 17 13
#> 3 8 11
#> 4 15 12
#> 5 12 5
#> 6 13 17
#> 7 4 8
#> 8 5 13
#> 9 11 13
#> 10 13 14
# pivot_longer takes a SINGLE selector
df %>% pivot_longer(ends_with("count"))
#> # A tibble: 50 x 8
#> id a_mean b_mean c_mean total_ab_mean total_bc_mean name value
#> <int> <int> <int> <int> <dbl> <dbl> <chr> <int>
#> 1 1 4 4 4 4 4 a_count 10
#> 2 1 4 4 4 4 4 b_count 2
#> 3 1 4 4 4 4 4 c_count 2
#> 4 1 4 4 4 4 4 total_ab_c… 12
#> 5 1 4 4 4 4 4 total_bc_c… 4
#> 6 2 7 1 3 4 4 a_count 7
#> 7 2 7 1 3 4 4 b_count 10
#> 8 2 7 1 3 4 4 c_count 3
#> 9 2 7 1 3 4 4 total_ab_c… 17
#> 10 2 7 1 3 4 4 total_bc_c… 13
#> # … with 40 more rows
# doesn't work:
df %>% pivot_longer(vars(starts_with("total"), ends_with("count")))
#> `vars(starts_with("total"), ends_with("count"))` must evaluate to column
#> positions or names, not a list
# use matches to match for multiple
df %>% pivot_longer(matches("^total.*count$"))
#> # A tibble: 20 x 11
#> id a_count a_mean b_count b_mean c_count c_mean total_ab_mean
#> <int> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 4 2 4 2 4 4
#> 2 1 10 4 2 4 2 4 4
#> 3 2 7 7 10 1 3 3 4
#> 4 2 7 7 10 1 3 3 4
#> 5 3 1 3 7 9 4 6 6
#> 6 3 1 3 7 9 4 6 6
#> 7 4 9 8 6 7 6 9 7.5
#> 8 4 9 8 6 7 6 9 7.5
#> 9 5 8 9 4 8 1 5 8.5
#> 10 5 8 9 4 8 1 5 8.5
#> 11 6 5 6 8 6 9 10 6
#> 12 6 5 6 8 6 9 10 6
#> 13 7 3 1 1 3 7 2 2
#> 14 7 3 1 1 3 7 2 2
#> 15 8 2 10 3 5 10 1 7.5
#> 16 8 2 10 3 5 10 1 7.5
#> 17 9 6 5 5 10 8 8 7.5
#> 18 9 6 5 5 10 8 8 7.5
#> 19 10 4 2 9 2 5 7 2
#> 20 10 4 2 9 2 5 7 2
#> # … with 3 more variables: total_bc_mean <dbl>, name <chr>, value <int>
Created on 2019-06-21 by the reprex package (v0.2.1)
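A side note on the `pivot_longer()` case: as far as I know, its `cols` argument accepts a tidyselect expression, so several helpers can be combined with `c()` rather than `vars()`. The sketch below uses a hypothetical `toy` tibble and assumes a released tidyr version; I have not checked it against the 0.8.3.9000 development version used in the reprex.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same naming scheme as the reprex
toy <- tibble(
  id = 1:2,
  a_count = 1:2,
  a_mean = c(0.5, 1.5),
  total_count = 3:4,
  total_mean = c(2.5, 3.5)
)

# c() unions the two selections (an "or"), like listing them inside vars()
long <- toy %>%
  pivot_longer(c(starts_with("total"), ends_with("count")))

unique(long$name)
```

This pivots `total_count`, `total_mean`, and `a_count`, leaving `id` and `a_mean` as identifier columns.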
Thanks for putting this together! This really helped, and I was encouraged to dig a bit deeper into how the different scoped versions work using either `vars()` or predicates.

I found some workarounds to some of your questions raised in the "gotchas":
- The tidyselect select helpers (`?tidyselect::select_helpers`) return integer vectors with the positions of the matched variables, so you can apply boolean logic using set operations.
- Combining the `_at` and `_if` variants was a bit more challenging, since it seems like the `_if` version applies the predicate function to the column, whereas the `_at` version applies the select helpers to the variable names. You can combine them by computing the `_if` output manually.

[I didn't figure out a way to do this in the opposite direction, since I don't think you can get the column name into whatever calculation you do inside `select_if`??]
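The two workarounds described above might look roughly like the sketch below. It is my reconstruction of the idea, not the code from this reply: `toy` and `chr_pos` are hypothetical names, and I use plain `select()` on the assumption that the same expressions can also go inside `vars()`.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  id = 1:3,
  total_ab_count = 4:6,
  total_ab_mean = c(1.5, 2.5, 3.5),
  a_count = 7:9,
  x_label = c("u", "v", "w"),
  x_value = c(0.1, 0.2, 0.3)
)

# "and" for two helpers: intersect the integer positions they return
both <- toy %>%
  select(intersect(starts_with("total"), ends_with("count")))

# predicate + helper: compute the predicate result by hand as positions,
# then intersect it with the helper's positions
chr_pos <- unname(which(vapply(toy, is.character, logical(1))))
chr_x <- toy %>%
  select(intersect(starts_with("x"), chr_pos))

names(both)  # "total_ab_count"
names(chr_x) # "x_label"
```

Newer tidyselect releases (1.0.0 and later, as far as I know) also accept `starts_with("total") & ends_with("count")` directly, and `where(is.character)` for predicates, which would make both workarounds unnecessary there.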