Key takeaways:
- everything inside
vars(...)is exactly the same as the stuff insideselect(...)!!! vars()is used for all scoped variants of dplyr verbs (I assume bc the variables need to “fit” into a single argument,.vars. Inselect(...), the ellipses take everything)vars_select()is probably more of a developer-facing function (seen inselect_helpersdocumentation)
Some "gotchas":
- Using
*_ifwhen I mean*_at - Forgetting why
select(starts_with("a"), starts_with("b"))works butselect_at(starts_with("a"), starts_with("b"))errors - Trying to use
vars()when I can't (and vice versa)- example:
pivot_longer()only takes a single select helper
- example:
- Trying to use “and” for two sets
- example:
starts_with(“total) & ends_with(“count”)instead of usingmatches("^total.*count$")
- example:
- Trying to mix predicates with tidyselect (
starts_with(“x”)ANDis.character()) -- not sure if this is possible except maybe in two steps - Trying to use regex in
starts_with()(e.g.starts_with(“a|b”)) --matches()takes regular expressions butstarts_with()does not!
library(tidyverse) #w/ tidyr 0.8.3.9000
df <- tibble(id = 1:10,
a_count = sample(10),
a_mean = sample(10),
b_count = sample(10),
b_mean = sample(10),
c_count = sample(10),
c_mean = sample(10),
total_ab_count = a_count + b_count,
total_ab_mean = (a_mean + b_mean)/2,
total_bc_count = b_count + c_count,
total_bc_mean = (a_mean + b_mean)/2)
# business as usual: using select()
df %>% select(id, starts_with("a"), starts_with("b"))
#> # A tibble: 10 x 5
#> id a_count a_mean b_count b_mean
#> <int> <int> <int> <int> <int>
#> 1 1 10 4 2 4
#> 2 2 7 7 10 1
#> 3 3 1 3 7 9
#> 4 4 9 8 6 7
#> 5 5 8 9 4 8
#> 6 6 5 6 8 6
#> 7 7 3 1 1 3
#> 8 8 2 10 3 5
#> 9 9 6 5 5 10
#> 10 10 4 2 9 2
# select_* works the same but we need vars() - presumably bc the column selection needs to "fit" into the first argument
df %>% select_at(vars(starts_with("a"), starts_with("b")))
#> # A tibble: 10 x 4
#> a_count a_mean b_count b_mean
#> <int> <int> <int> <int>
#> 1 10 4 2 4
#> 2 7 7 10 1
#> 3 1 3 7 9
#> 4 9 8 6 7
#> 5 8 9 4 8
#> 6 5 6 8 6
#> 7 3 1 1 3
#> 8 2 10 3 5
#> 9 6 5 5 10
#> 10 4 2 9 2
# not using vars()--even for just one selector--does not work
df %>% select_at(starts_with("a"))
#> No tidyselect variables were registered
# doesn't work but I feel like it should:
df %>% select_at(vars(starts_with("a|b")))
#> # A tibble: 10 x 0
# the stuff that goes into vars() is the stuff that goes into select()!
df %>% select_at(vars(-starts_with("a")))
#> # A tibble: 10 x 9
#> id b_count b_mean c_count c_mean total_ab_count total_ab_mean
#> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 1 2 4 2 4 12 4
#> 2 2 10 1 3 3 17 4
#> 3 3 7 9 4 6 8 6
#> 4 4 6 7 6 9 15 7.5
#> 5 5 4 8 1 5 12 8.5
#> 6 6 8 6 9 10 13 6
#> 7 7 1 3 7 2 4 2
#> 8 8 3 5 10 1 5 7.5
#> 9 9 5 10 8 8 11 7.5
#> 10 10 9 2 5 7 13 2
#> # … with 2 more variables: total_bc_count <int>, total_bc_mean <dbl>
# ...which is why this doesn't work: df %>% select_at(-vars(starts_with("a")))
# using "set-thinking":
df %>% select_at(vars(starts_with("total"), -ends_with("mean")))
#> # A tibble: 10 x 2
#> total_ab_count total_bc_count
#> <int> <int>
#> 1 12 4
#> 2 17 13
#> 3 8 11
#> 4 15 12
#> 5 12 5
#> 6 13 17
#> 7 4 8
#> 8 5 13
#> 9 11 13
#> 10 13 14
# but if I want to make the same selection "positively" (starts_with("total) AND ends_with("count"))
# I end up with extra columns (bc it's "or")
df %>% select_at(vars(starts_with("total"), ends_with("count")))
#> # A tibble: 10 x 7
#> total_ab_count total_ab_mean total_bc_count total_bc_mean a_count
#> <int> <dbl> <int> <dbl> <int>
#> 1 12 4 4 4 10
#> 2 17 4 13 4 7
#> 3 8 6 11 6 1
#> 4 15 7.5 12 7.5 9
#> 5 12 8.5 5 8.5 8
#> 6 13 6 17 6 5
#> 7 4 2 8 2 3
#> 8 5 7.5 13 7.5 2
#> 9 11 7.5 13 7.5 6
#> 10 13 2 14 2 4
#> # … with 2 more variables: b_count <int>, c_count <int>
# or if I try "and", it doesn't work:
df %>% select_at(vars(starts_with("total") & ends_with("count")))
#> Warning in starts_with("total") & ends_with("count"): longer object length
#> is not a multiple of shorter object length
#> `starts_with("total") & ends_with("count")` must evaluate to column
#> positions or names, not a logical vector
# can be solved with matches()
df %>% select_at(vars(matches("^total.*count$")))
#> # A tibble: 10 x 2
#> total_ab_count total_bc_count
#> <int> <int>
#> 1 12 4
#> 2 17 13
#> 3 8 11
#> 4 15 12
#> 5 12 5
#> 6 13 17
#> 7 4 8
#> 8 5 13
#> 9 11 13
#> 10 13 14
# pivot_longer takes a SINGLE selector
df %>% pivot_longer(ends_with("count"))
#> # A tibble: 50 x 8
#> id a_mean b_mean c_mean total_ab_mean total_bc_mean name value
#> <int> <int> <int> <int> <dbl> <dbl> <chr> <int>
#> 1 1 4 4 4 4 4 a_count 10
#> 2 1 4 4 4 4 4 b_count 2
#> 3 1 4 4 4 4 4 c_count 2
#> 4 1 4 4 4 4 4 total_ab_c… 12
#> 5 1 4 4 4 4 4 total_bc_c… 4
#> 6 2 7 1 3 4 4 a_count 7
#> 7 2 7 1 3 4 4 b_count 10
#> 8 2 7 1 3 4 4 c_count 3
#> 9 2 7 1 3 4 4 total_ab_c… 17
#> 10 2 7 1 3 4 4 total_bc_c… 13
#> # … with 40 more rows
# doesn't work:
df %>% pivot_longer(vars(starts_with("total"), ends_with("count")))
#> `vars(starts_with("total"), ends_with("count"))` must evaluate to column
#> positions or names, not a list
# use matches to match for multiple
df %>% pivot_longer(matches("^total.*count$"))
#> # A tibble: 20 x 11
#> id a_count a_mean b_count b_mean c_count c_mean total_ab_mean
#> <int> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 4 2 4 2 4 4
#> 2 1 10 4 2 4 2 4 4
#> 3 2 7 7 10 1 3 3 4
#> 4 2 7 7 10 1 3 3 4
#> 5 3 1 3 7 9 4 6 6
#> 6 3 1 3 7 9 4 6 6
#> 7 4 9 8 6 7 6 9 7.5
#> 8 4 9 8 6 7 6 9 7.5
#> 9 5 8 9 4 8 1 5 8.5
#> 10 5 8 9 4 8 1 5 8.5
#> 11 6 5 6 8 6 9 10 6
#> 12 6 5 6 8 6 9 10 6
#> 13 7 3 1 1 3 7 2 2
#> 14 7 3 1 1 3 7 2 2
#> 15 8 2 10 3 5 10 1 7.5
#> 16 8 2 10 3 5 10 1 7.5
#> 17 9 6 5 5 10 8 8 7.5
#> 18 9 6 5 5 10 8 8 7.5
#> 19 10 4 2 9 2 5 7 2
#> 20 10 4 2 9 2 5 7 2
#> # … with 3 more variables: total_bc_mean <dbl>, name <chr>, value <int>Created on 2019-06-21 by the reprex package (v0.2.1)
Thanks for putting this together! This really helped, and I was encouraged to dig a bit deeper into how both the different scoped versions work using either
vars()or predicates.I found some workarounds to some of your questions raised in the "gotchas":
tidyselect::select_helpers()return integer vectors with the position of the matched variables, you can apply boolean logic using set operations:_atand_ifvariants was a bit more challenging, since it seems like_ifversion applies the predicate function to the column, whereas the_atversion applies the select helpers on the variable names. You can combine them by computing the_ifoutput manually:[I didn't figure out a way to do this in the opposite direction, since I don't think you can get the column name into whatever calculation you do inside
select_if??]