isteves/tidyselect.md

## tidyselect.md

      
Trying to grok tidyselect

Key takeaways:

- Everything inside vars(...) is exactly the same as the stuff inside select(...)!
- vars() is used for all scoped variants of dplyr verbs (I assume because the variables need to "fit" into a single argument, .vars; in select(...), the ... takes everything)
- vars_select() is probably more of a developer-facing function (seen in the select_helpers documentation)
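
To make the first two takeaways concrete, here is a minimal sketch on a small made-up tibble (toy is a hypothetical name, not part of the reprex further down). It assumes a dplyr version that still ships the scoped verbs, and it only illustrates that vars() bundles the same selection expressions into one list-like object that fits into .vars.

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(id = 1:3, a_count = 1:3, b_count = 4:6)

# vars() captures the selection expressions without evaluating them
sel <- vars(starts_with("a"), starts_with("b"))
n_sel <- length(sel)  # one element per captured expression

# the same expressions, once passed through .vars and once through ...
via_at     <- toy %>% select_at(sel) %>% names()
via_select <- toy %>% select(starts_with("a"), starts_with("b")) %>% names()

same_columns <- identical(via_at, via_select)
```

Both routes should pick the same columns, so the only real difference is where the selection has to "fit".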

Some "gotchas":

- Using *_if when I mean *_at
- Forgetting why select(starts_with("a"), starts_with("b")) works but select_at(starts_with("a"), starts_with("b")) errors (the scoped variant needs the selection wrapped in vars())
- Trying to use vars() when I can't (and vice versa)
  - example: pivot_longer() only takes a single selection, and it can't be wrapped in vars()
- Trying to use "and" for two sets
  - example: starts_with("total") & ends_with("count") instead of using matches("^total.*count$")
- Trying to mix predicates with tidyselect (starts_with("x") AND is.character()) -- not sure if this is possible except maybe in two steps
- Trying to use regex in starts_with() (e.g. starts_with("a|b")) -- matches() takes regular expressions but starts_with() does not!
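
Several of these gotchas were addressed in later releases. The sketch below is not from the original reprex: it assumes dplyr >= 1.0.0 with tidyselect >= 1.1.0 (newer than the versions used further down), and toy is a hypothetical tibble made up for illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, tidyselect >= 1.1.0

# hypothetical toy data, only for illustration
toy <- tibble(
  total_ab_count = 1:3,
  total_ab_mean  = c(1.5, 2.5, 3.5),
  a_count        = 4:6,
  b_label        = c("x", "y", "z")
)

# "and" of two helpers: set intersection with &
and_cols <- toy %>%
  select(starts_with("total") & ends_with("count")) %>%
  names()

# mixing a helper with a predicate via where()
chr_cols <- toy %>%
  select(starts_with("b") & where(is.character)) %>%
  names()

# several literal prefixes: pass a character vector instead of a regex
ab_cols <- toy %>% select(starts_with(c("a", "b"))) %>% names()

# or keep the regex and switch to matches()
ab_regex_cols <- toy %>% select(matches("^(a|b)_")) %>% names()
```

On older versions, like the ones in the reprex, the matches() workaround remains the safer bet.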

library(tidyverse) # w/ tidyr 0.8.3.9000 (dev version, for pivot_longer())

df <- tibble(id = 1:10,
             a_count = sample(10),
             a_mean = sample(10),
             b_count = sample(10),
             b_mean = sample(10),
             c_count = sample(10),
             c_mean = sample(10),
             total_ab_count = a_count + b_count,
             total_ab_mean = (a_mean + b_mean)/2,
             total_bc_count = b_count + c_count,
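             # NOTE: this repeats the total_ab_mean formula (probably meant (b_mean + c_mean)/2);
             # left as run so it matches the output below, where total_bc_mean equals total_ab_mean
             total_bc_mean = (a_mean + b_mean)/2)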

# business as usual: using select()
df %>% select(id, starts_with("a"), starts_with("b"))
#> # A tibble: 10 x 5
#>       id a_count a_mean b_count b_mean
#>    <int>   <int>  <int>   <int>  <int>
#>  1     1      10      4       2      4
#>  2     2       7      7      10      1
#>  3     3       1      3       7      9
#>  4     4       9      8       6      7
#>  5     5       8      9       4      8
#>  6     6       5      6       8      6
#>  7     7       3      1       1      3
#>  8     8       2     10       3      5
#>  9     9       6      5       5     10
#> 10    10       4      2       9      2

# select_* works the same but we need vars() - presumably bc the column selection needs to "fit" into the first argument
df %>% select_at(vars(starts_with("a"), starts_with("b")))
#> # A tibble: 10 x 4
#>    a_count a_mean b_count b_mean
#>      <int>  <int>   <int>  <int>
#>  1      10      4       2      4
#>  2       7      7      10      1
#>  3       1      3       7      9
#>  4       9      8       6      7
#>  5       8      9       4      8
#>  6       5      6       8      6
#>  7       3      1       1      3
#>  8       2     10       3      5
#>  9       6      5       5     10
#> 10       4      2       9      2

# not using vars()--even for just one selector--does not work
df %>% select_at(starts_with("a"))
#> No tidyselect variables were registered

# doesn't work but I feel like it should (starts_with() matches a literal prefix, not a regex, so nothing is selected):
df %>% select_at(vars(starts_with("a|b")))
#> # A tibble: 10 x 0

# the stuff that goes into vars() is the stuff that goes into select()!
df %>% select_at(vars(-starts_with("a")))
#> # A tibble: 10 x 9
#>       id b_count b_mean c_count c_mean total_ab_count total_ab_mean
#>    <int>   <int>  <int>   <int>  <int>          <int>         <dbl>
#>  1     1       2      4       2      4             12           4  
#>  2     2      10      1       3      3             17           4  
#>  3     3       7      9       4      6              8           6  
#>  4     4       6      7       6      9             15           7.5
#>  5     5       4      8       1      5             12           8.5
#>  6     6       8      6       9     10             13           6  
#>  7     7       1      3       7      2              4           2  
#>  8     8       3      5      10      1              5           7.5
#>  9     9       5     10       8      8             11           7.5
#> 10    10       9      2       5      7             13           2  
#> # … with 2 more variables: total_bc_count <int>, total_bc_mean <dbl>
# ...which is why this doesn't work: df %>% select_at(-vars(starts_with("a")))

# using "set-thinking":
df %>% select_at(vars(starts_with("total"), -ends_with("mean")))
#> # A tibble: 10 x 2
#>    total_ab_count total_bc_count
#>             <int>          <int>
#>  1             12              4
#>  2             17             13
#>  3              8             11
#>  4             15             12
#>  5             12              5
#>  6             13             17
#>  7              4              8
#>  8              5             13
#>  9             11             13
#> 10             13             14

# but if I want to make the same selection "positively" (starts_with("total") AND ends_with("count"))
# I end up with extra columns (bc it's "or")
df %>% select_at(vars(starts_with("total"), ends_with("count")))
#> # A tibble: 10 x 7
#>    total_ab_count total_ab_mean total_bc_count total_bc_mean a_count
#>             <int>         <dbl>          <int>         <dbl>   <int>
#>  1             12           4                4           4        10
#>  2             17           4               13           4         7
#>  3              8           6               11           6         1
#>  4             15           7.5             12           7.5       9
#>  5             12           8.5              5           8.5       8
#>  6             13           6               17           6         5
#>  7              4           2                8           2         3
#>  8              5           7.5             13           7.5       2
#>  9             11           7.5             13           7.5       6
#> 10             13           2               14           2         4
#> # … with 2 more variables: b_count <int>, c_count <int>
# or if I try "and", it doesn't work:
df %>% select_at(vars(starts_with("total") & ends_with("count")))
#> Warning in starts_with("total") & ends_with("count"): longer object length
#> is not a multiple of shorter object length
#> `starts_with("total") & ends_with("count")` must evaluate to column
#> positions or names, not a logical vector
# can be solved with matches()
df %>% select_at(vars(matches("^total.*count$")))
#> # A tibble: 10 x 2
#>    total_ab_count total_bc_count
#>             <int>          <int>
#>  1             12              4
#>  2             17             13
#>  3              8             11
#>  4             15             12
#>  5             12              5
#>  6             13             17
#>  7              4              8
#>  8              5             13
#>  9             11             13
#> 10             13             14

# pivot_longer takes its columns as a SINGLE selection (no vars())
df %>% pivot_longer(ends_with("count"))
#> # A tibble: 50 x 8
#>       id a_mean b_mean c_mean total_ab_mean total_bc_mean name        value
#>    <int>  <int>  <int>  <int>         <dbl>         <dbl> <chr>       <int>
#>  1     1      4      4      4             4             4 a_count        10
#>  2     1      4      4      4             4             4 b_count         2
#>  3     1      4      4      4             4             4 c_count         2
#>  4     1      4      4      4             4             4 total_ab_c…    12
#>  5     1      4      4      4             4             4 total_bc_c…     4
#>  6     2      7      1      3             4             4 a_count         7
#>  7     2      7      1      3             4             4 b_count        10
#>  8     2      7      1      3             4             4 c_count         3
#>  9     2      7      1      3             4             4 total_ab_c…    17
#> 10     2      7      1      3             4             4 total_bc_c…    13
#> # … with 40 more rows
# doesn't work:
df %>% pivot_longer(vars(starts_with("total"), ends_with("count")))
#> `vars(starts_with("total"), ends_with("count"))` must evaluate to column
#> positions or names, not a list
# use matches() to express both conditions in one regex
df %>% pivot_longer(matches("^total.*count$"))
#> # A tibble: 20 x 11
#>       id a_count a_mean b_count b_mean c_count c_mean total_ab_mean
#>    <int>   <int>  <int>   <int>  <int>   <int>  <int>         <dbl>
#>  1     1      10      4       2      4       2      4           4  
#>  2     1      10      4       2      4       2      4           4  
#>  3     2       7      7      10      1       3      3           4  
#>  4     2       7      7      10      1       3      3           4  
#>  5     3       1      3       7      9       4      6           6  
#>  6     3       1      3       7      9       4      6           6  
#>  7     4       9      8       6      7       6      9           7.5
#>  8     4       9      8       6      7       6      9           7.5
#>  9     5       8      9       4      8       1      5           8.5
#> 10     5       8      9       4      8       1      5           8.5
#> 11     6       5      6       8      6       9     10           6  
#> 12     6       5      6       8      6       9     10           6  
#> 13     7       3      1       1      3       7      2           2  
#> 14     7       3      1       1      3       7      2           2  
#> 15     8       2     10       3      5      10      1           7.5
#> 16     8       2     10       3      5      10      1           7.5
#> 17     9       6      5       5     10       8      8           7.5
#> 18     9       6      5       5     10       8      8           7.5
#> 19    10       4      2       9      2       5      7           2  
#> 20    10       4      2       9      2       5      7           2  
#> # … with 3 more variables: total_bc_mean <dbl>, name <chr>, value <int>
Created on 2019-06-21 by the reprex package (v0.2.1)
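
A follow-up on the pivot_longer() examples: the selection argument accepts what select() accepts, so several helpers can be combined with c() rather than vars(). This is a minimal sketch, not part of the reprex above. It assumes tidyr >= 1.0.0, and toy is a hypothetical tibble made up for illustration.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr >= 1.0.0

# hypothetical toy data, only for illustration
toy <- tibble(
  id             = 1:2,
  a_count        = c(10L, 7L),
  a_mean         = c(4L, 7L),
  total_ab_count = c(12L, 17L)
)

# c() gives the union ("or") of the helpers inside a single selection
long <- toy %>%
  pivot_longer(c(starts_with("total"), ends_with("count")))

long_names  <- names(long)
pivoted_set <- unique(long$name)
```

As with vars() in select_at(), c() here means "or", so the "and" case still needs matches() or, on newer versions, the & operator.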