Skip to content

Instantly share code, notes, and snippets.

@isteves
Last active May 4, 2020 15:28
Show Gist options
  • Save isteves/afb7ac5a3b185f600d7f130d99142174 to your computer and use it in GitHub Desktop.
Save isteves/afb7ac5a3b185f600d7f130d99142174 to your computer and use it in GitHub Desktop.
Trying to grok tidyselect

Trying to grok tidyselect

Key takeaways:

  • everything inside vars(...) is exactly the same as the stuff inside select(...)!!!
  • vars() is used for all scoped variants of dplyr verbs (I assume bc the variables need to “fit” into a single argument, .vars. In select(...), the ellipses take everything)
  • vars_select() is probably more of a developer-facing function (seen in select_helpers documentation)

Some "gotchas":

  • Using *_if when I mean *_at
  • Forgetting why select(starts_with("a"), starts_with("b")) works but select_at(starts_with("a"), starts_with("b")) errors
  • Trying to use vars() when I can't (and vice versa)
    • example: pivot_longer() only takes a single select helper
  • Trying to use “and” for two sets
    • example: starts_with(“total) & ends_with(“count”) instead of using matches("^total.*count$")
  • Trying to mix predicates with tidyselect (starts_with(“x”) AND is.character()) -- not sure if this is possible except maybe in two steps
  • Trying to use regex in starts_with() (e.g. starts_with(“a|b”)) -- matches() takes regular expressions but starts_with() does not!
library(tidyverse) #w/ tidyr 0.8.3.9000

df <- tibble(id = 1:10,
             a_count = sample(10),
             a_mean = sample(10),
             b_count = sample(10),
             b_mean = sample(10),
             c_count = sample(10),
             c_mean = sample(10),
             total_ab_count = a_count + b_count,
             total_ab_mean = (a_mean + b_mean)/2,
             total_bc_count = b_count + c_count,
             total_bc_mean = (a_mean + b_mean)/2)

# business as usual: using select()
df %>% select(id, starts_with("a"), starts_with("b"))
#> # A tibble: 10 x 5
#>       id a_count a_mean b_count b_mean
#>    <int>   <int>  <int>   <int>  <int>
#>  1     1      10      4       2      4
#>  2     2       7      7      10      1
#>  3     3       1      3       7      9
#>  4     4       9      8       6      7
#>  5     5       8      9       4      8
#>  6     6       5      6       8      6
#>  7     7       3      1       1      3
#>  8     8       2     10       3      5
#>  9     9       6      5       5     10
#> 10    10       4      2       9      2

# select_* works the same but we need vars() - presumably bc the column selection needs to "fit" into the first argument
df %>% select_at(vars(starts_with("a"), starts_with("b")))
#> # A tibble: 10 x 4
#>    a_count a_mean b_count b_mean
#>      <int>  <int>   <int>  <int>
#>  1      10      4       2      4
#>  2       7      7      10      1
#>  3       1      3       7      9
#>  4       9      8       6      7
#>  5       8      9       4      8
#>  6       5      6       8      6
#>  7       3      1       1      3
#>  8       2     10       3      5
#>  9       6      5       5     10
#> 10       4      2       9      2

# not using vars()--even for just one selector--does not work
df %>% select_at(starts_with("a"))
#> No tidyselect variables were registered

# doesn't work but I feel like it should:
df %>% select_at(vars(starts_with("a|b")))
#> # A tibble: 10 x 0

# the stuff that goes into vars() is the stuff that goes into select()!
df %>% select_at(vars(-starts_with("a")))
#> # A tibble: 10 x 9
#>       id b_count b_mean c_count c_mean total_ab_count total_ab_mean
#>    <int>   <int>  <int>   <int>  <int>          <int>         <dbl>
#>  1     1       2      4       2      4             12           4  
#>  2     2      10      1       3      3             17           4  
#>  3     3       7      9       4      6              8           6  
#>  4     4       6      7       6      9             15           7.5
#>  5     5       4      8       1      5             12           8.5
#>  6     6       8      6       9     10             13           6  
#>  7     7       1      3       7      2              4           2  
#>  8     8       3      5      10      1              5           7.5
#>  9     9       5     10       8      8             11           7.5
#> 10    10       9      2       5      7             13           2  
#> # … with 2 more variables: total_bc_count <int>, total_bc_mean <dbl>
# ...which is why this doesn't work: df %>% select_at(-vars(starts_with("a")))

# using "set-thinking":
df %>% select_at(vars(starts_with("total"), -ends_with("mean")))
#> # A tibble: 10 x 2
#>    total_ab_count total_bc_count
#>             <int>          <int>
#>  1             12              4
#>  2             17             13
#>  3              8             11
#>  4             15             12
#>  5             12              5
#>  6             13             17
#>  7              4              8
#>  8              5             13
#>  9             11             13
#> 10             13             14

# but if I want to make the same selection "positively" (starts_with("total) AND ends_with("count"))
# I end up with extra columns (bc it's "or")
df %>% select_at(vars(starts_with("total"), ends_with("count")))
#> # A tibble: 10 x 7
#>    total_ab_count total_ab_mean total_bc_count total_bc_mean a_count
#>             <int>         <dbl>          <int>         <dbl>   <int>
#>  1             12           4                4           4        10
#>  2             17           4               13           4         7
#>  3              8           6               11           6         1
#>  4             15           7.5             12           7.5       9
#>  5             12           8.5              5           8.5       8
#>  6             13           6               17           6         5
#>  7              4           2                8           2         3
#>  8              5           7.5             13           7.5       2
#>  9             11           7.5             13           7.5       6
#> 10             13           2               14           2         4
#> # … with 2 more variables: b_count <int>, c_count <int>
# or if I try "and", it doesn't work:
df %>% select_at(vars(starts_with("total") & ends_with("count")))
#> Warning in starts_with("total") & ends_with("count"): longer object length
#> is not a multiple of shorter object length
#> `starts_with("total") & ends_with("count")` must evaluate to column
#> positions or names, not a logical vector
# can be solved with matches()
df %>% select_at(vars(matches("^total.*count$")))
#> # A tibble: 10 x 2
#>    total_ab_count total_bc_count
#>             <int>          <int>
#>  1             12              4
#>  2             17             13
#>  3              8             11
#>  4             15             12
#>  5             12              5
#>  6             13             17
#>  7              4              8
#>  8              5             13
#>  9             11             13
#> 10             13             14

# pivot_longer takes a SINGLE selector
df %>% pivot_longer(ends_with("count"))
#> # A tibble: 50 x 8
#>       id a_mean b_mean c_mean total_ab_mean total_bc_mean name        value
#>    <int>  <int>  <int>  <int>         <dbl>         <dbl> <chr>       <int>
#>  1     1      4      4      4             4             4 a_count        10
#>  2     1      4      4      4             4             4 b_count         2
#>  3     1      4      4      4             4             4 c_count         2
#>  4     1      4      4      4             4             4 total_ab_c…    12
#>  5     1      4      4      4             4             4 total_bc_c…     4
#>  6     2      7      1      3             4             4 a_count         7
#>  7     2      7      1      3             4             4 b_count        10
#>  8     2      7      1      3             4             4 c_count         3
#>  9     2      7      1      3             4             4 total_ab_c…    17
#> 10     2      7      1      3             4             4 total_bc_c…    13
#> # … with 40 more rows
# doesn't work:
df %>% pivot_longer(vars(starts_with("total"), ends_with("count")))
#> `vars(starts_with("total"), ends_with("count"))` must evaluate to column
#> positions or names, not a list
# use matches to match for multiple
df %>% pivot_longer(matches("^total.*count$"))
#> # A tibble: 20 x 11
#>       id a_count a_mean b_count b_mean c_count c_mean total_ab_mean
#>    <int>   <int>  <int>   <int>  <int>   <int>  <int>         <dbl>
#>  1     1      10      4       2      4       2      4           4  
#>  2     1      10      4       2      4       2      4           4  
#>  3     2       7      7      10      1       3      3           4  
#>  4     2       7      7      10      1       3      3           4  
#>  5     3       1      3       7      9       4      6           6  
#>  6     3       1      3       7      9       4      6           6  
#>  7     4       9      8       6      7       6      9           7.5
#>  8     4       9      8       6      7       6      9           7.5
#>  9     5       8      9       4      8       1      5           8.5
#> 10     5       8      9       4      8       1      5           8.5
#> 11     6       5      6       8      6       9     10           6  
#> 12     6       5      6       8      6       9     10           6  
#> 13     7       3      1       1      3       7      2           2  
#> 14     7       3      1       1      3       7      2           2  
#> 15     8       2     10       3      5      10      1           7.5
#> 16     8       2     10       3      5      10      1           7.5
#> 17     9       6      5       5     10       8      8           7.5
#> 18     9       6      5       5     10       8      8           7.5
#> 19    10       4      2       9      2       5      7           2  
#> 20    10       4      2       9      2       5      7           2  
#> # … with 3 more variables: total_bc_mean <dbl>, name <chr>, value <int>

Created on 2019-06-21 by the reprex package (v0.2.1)

@ha0ye
Copy link

ha0ye commented Jun 21, 2019

Thanks for putting this together! This really helped, and I was encouraged to dig a bit deeper into how both the different scoped versions work using either vars() or predicates.

I found some workarounds to some of your questions raised in the "gotchas":

  • Since the tidyselect::select_helpers() return integer vectors with the position of the matched variables, you can apply boolean logic using set operations:
iris %>%
    select_at(vars(intersect(ends_with("th"), starts_with("s"))))
  • Mixing the _at and _if variants was a bit more challenging, since it seems like _if version applies the predicate function to the column, whereas the _at version applies the select helpers on the variable names. You can combine them by computing the _if output manually:
iris %>%
    select_at(vars(starts_with("p"), which(vapply(., is.factor, TRUE))))

[I didn't figure out a way to do this in the opposite direction, since I don't think you can get the column name into whatever calculation you do inside select_if ??]

@irenetlv
Copy link

Nice! 💯 Hadley also suggested intersect() -- I hadn't thought of using the more general set operators. I don't love the _at/_if solution, but it's cool you found something that it works! I'll have to keep that hack in mind.

I also suspect that you can't get at column names with select_if() but I haven't fully investigated it.

@ha0ye
Copy link

ha0ye commented Jun 23, 2019

Haha, I didn't see Hadley's tweet until after I had played around with things myself. I think the note on set operators is going to be really helpful!

Yep, totally agree that the _at / _if solution is a kludge on the syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment