@jennybc
Created May 29, 2016 06:43
dplyr::summarise_all() and friends

2016-05_summarise-all-test-drive.R

jenny Sat May 28 23:41:28 2016

Worth reminding people that summarise_*() needs an n-to-1 function, while mutate_*() needs an n-to-n function?
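A minimal base R sketch of that distinction, assuming a toy vector; the helper names `n_to_1` and `n_to_n` are made up for illustration and are not part of dplyr. An n-to-1 function collapses a vector to a single value, while an n-to-n function returns one value per input element.

```r
x <- c(5.1, 4.9, 4.7)  # toy input, for illustration only

# n-to-1: the shape summarise_*() expects; returns a single value
n_to_1 <- function(x) mean(x)

# n-to-n: the shape mutate_*() expects; returns one value per input element
n_to_n <- function(x) x - mean(x)

length(n_to_1(x))
length(n_to_n(x))
```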

I can use variable selection helpers from select(). Yay! I just didn't expect I would need to put them inside of vars().

Re: naming conventions for the resulting variables. The docs do cover this. But this consequence seems just as important: the names determine whether you are effectively adding new variables vs. redefining existing variables.

What about programming? Will there be "underscore versions" or are they clearly not needed?

No links yet from mutate() or summarise() to summarise_all(). I didn't do this myself because I'm not sure whether it should be done in an ad hoc way or via a family.

Use dplyr from the PR (hidden chunk).

Reveal dplyr version

devtools::session_info("dplyr")$packages %>%
  filter(package == "dplyr")
#>   package *    version       date                         source
#> 1   dplyr * 0.4.3.9001 2016-05-28 Github (lionel-/dplyr@34927b8)

prep iris

i2 <- as.tbl(iris) %>%
  mutate_at(vars(contains("Petal")), as.integer)
by_species <- i2 %>% group_by(Species)

Has grouping always spawned all these attributes? I guess so?

str(by_species)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Petal.Width : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "vars")=List of 1
#>   ..$ : symbol Species
#>  - attr(*, "drop")= logi TRUE
#>  - attr(*, "indices")=List of 3
#>   ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
#>   ..$ : int  50 51 52 53 54 55 56 57 58 59 ...
#>   ..$ : int  100 101 102 103 104 105 106 107 108 109 ...
#>  - attr(*, "group_sizes")= int  50 50 50
#>  - attr(*, "biggest_group_size")= int 50
#>  - attr(*, "labels")='data.frame':   3 obs. of  1 variable:
#>   ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
#>   ..- attr(*, "vars")=List of 1
#>   .. ..$ : symbol Species
#>   ..- attr(*, "drop")= logi TRUE

With 20/20 hindsight: create a version of scale() that's "vector in, vector out", instead of "matrix(like object) in, matrix out". And another n-to-n function that works on any numeric variable, negify(). Got burned by hadley/tibble#84.

vscale <- function(x, center = TRUE, scale = TRUE) {
  stopifnot(is.null(dim(x)))
  scale(x, center = center, scale = scale)[ ,1, drop = TRUE]
}
negify <- function(x) {
  stopifnot(is.null(dim(x)))
  x * -1
}
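A quick base R check of why the wrapper is needed, sketched with a toy vector that is not from the original script: `scale()` hands back a one-column matrix carrying `scaled:center` and `scaled:scale` attributes, whereas `vscale()` (copied from above) returns a bare vector.

```r
vscale <- function(x, center = TRUE, scale = TRUE) {
  stopifnot(is.null(dim(x)))
  scale(x, center = center, scale = scale)[ , 1, drop = TRUE]
}

x <- c(1, 2, 3)        # toy input, chosen only for illustration

m <- scale(x)
dim(m)                 # a 3 x 1 matrix, not a vector
names(attributes(m))   # includes "scaled:center" and "scaled:scale"

v <- vscale(x)
dim(v)                 # NULL: a plain numeric vector
v
```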

When I apply vscale() to all variables, I get an error, because Species is not numeric. When I apply negify(), I just get a warning and a vector of NAs. What's the difference?

i2 %>% mutate_all(vscale)
#> Error in eval(expr, envir, enclos): 'x' must be numeric
i2 %>% mutate_all(negify) %>% head(3)
#> Warning in Ops.factor(x, -1): '*' not meaningful for factors
#> Source: local data frame [3 x 5]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl>   <lgl>
#> 1         -5.1        -3.5           -1           0      NA
#> 2         -4.9        -3.0           -1           0      NA
#> 3         -4.7        -3.2           -1           0      NA
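One plausible explanation, based on base R behaviour rather than anything in the dplyr docs: `scale()` converts its input to a matrix and computes column means, which is a hard error for non-numeric data, while `*` on a factor dispatches to `Ops.factor()` (named in the warning above), which only warns and returns `NA`s. The sketch below reproduces both outside of dplyr, using a small stand-in factor that is not from the original script.

```r
f <- factor(c("setosa", "versicolor", "setosa"))  # stand-in for Species

# scale() on a factor is a hard error; capture the message instead of stopping
scale_msg <- tryCatch(
  scale(f),
  error = function(e) conditionMessage(e)
)
scale_msg

# `*` on a factor only warns, then returns logical NAs of the same length
negified <- suppressWarnings(f * -1)
negified
class(negified)
```

That would also account for `Species` coming back as `<lgl>` in the `negify()` output.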

I understand why the number of variables and their names are so different here. Will it surprise people? It's an example of how a seemingly small difference in input leads to quite different output.

# in situ edit --> 5 original variables, 2 have been mutated
by_species %>%
  mutate_at(vars(matches("Sepal")), vscale) %>% head(3)
#> Source: local data frame [3 x 5]
#> Groups: Species [1]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <int>       <int>  <fctr>
#> 1    0.2666745   0.1899414            1           0  setosa
#> 2   -0.3007180  -1.1290958            1           0  setosa
#> 3   -0.8681105  -0.6014810            1           0  setosa

## new variables --> 5 original variables (untouched) + 2 * 2 new = 9 variables
by_species %>%
  mutate_at(vars(matches("Sepal")), funs(vscale, negify)) %>% head(3)
#> Source: local data frame [3 x 9]
#> Groups: Species [1]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <int>       <int>  <fctr>
#> 1          5.1         3.5            1           0  setosa
#> 2          4.9         3.0            1           0  setosa
#> 3          4.7         3.2            1           0  setosa
#> Variables not shown: Sepal.Length_vscale <dbl>, Sepal.Width_vscale <dbl>,
#>   Sepal.Length_negify <dbl>, Sepal.Width_negify <dbl>.

I understand how to force inclusion of the function name in the output variable name. How do I force inclusion of the variable name? Is it the second approach here?

by_species %>%
  mutate_at(vars(Sepal.Length), funs(vscale, negify)) %>% head(3)
#> Source: local data frame [3 x 7]
#> Groups: Species [1]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     vscale
#>          <dbl>       <dbl>        <int>       <int>  <fctr>      <dbl>
#> 1          5.1         3.5            1           0  setosa  0.2666745
#> 2          4.9         3.0            1           0  setosa -0.3007180
#> 3          4.7         3.2            1           0  setosa -0.8681105
#> Variables not shown: negify <dbl>.
by_species %>%
  mutate_at(vars(Sepal.Length = Sepal.Length), funs(vscale, negify)) %>% head(3)
#> Source: local data frame [3 x 7]
#> Groups: Species [1]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <int>       <int>  <fctr>
#> 1          5.1         3.5            1           0  setosa
#> 2          4.9         3.0            1           0  setosa
#> 3          4.7         3.2            1           0  setosa
#> Variables not shown: Sepal.Length_vscale <dbl>, Sepal.Length_negify <dbl>.
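The naming pattern seen in the unnamed-variable outputs so far can be summarised with a small base R sketch. `sketch_output_names()` is a hypothetical helper written only to mirror what the printed results show for this development version; it is not how dplyr computes names internally, and it does not cover the named-variable case (`vars(Sepal.Length = Sepal.Length)`).

```r
# Hypothetical helper: mimics the naming observed in the output above.
# It is NOT dplyr's implementation.
sketch_output_names <- function(vars, funs) {
  if (length(funs) == 1) return(vars)   # one function: columns are overwritten
  if (length(vars) == 1) return(funs)   # one unnamed variable: function names only
  # several variables and functions: variable_function, grouped by function
  as.vector(outer(vars, funs, paste, sep = "_"))
}

one_fun  <- sketch_output_names(c("Sepal.Length", "Sepal.Width"), "vscale")
one_var  <- sketch_output_names("Sepal.Length", c("vscale", "negify"))
many     <- sketch_output_names(c("Sepal.Length", "Sepal.Width"),
                                c("vscale", "negify"))
one_fun
one_var
many
```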

A tbl_df that creates problems for str() and View(), a.k.a. how I was reminded that scale() returns a matrix. See hadley/tibble#84.

baby_iris <- iris[c(1, 20, 48, 51, 89, 92, 101, 102, 103), ]
#baby_iris %>% View() ## works
baby_iris %>% group_by(Species) %>% mutate_all(scale) ## print looks ok
#> Source: local data frame [9 x 5]
#> Groups: Species [3]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
#> 1    0.5773503   0.0000000   -0.5773503  -0.5773503     setosa
#> 2    0.5773503   1.0000000    1.1547005   1.1547005     setosa
#> 3   -1.1547005  -1.0000000   -0.5773503  -0.5773503     setosa
#> 4    1.0806343   1.1547005    0.7258662   0.5773503 versicolor
#> 5   -0.8926979  -0.5773503   -1.1406469  -1.1547005 versicolor
#> 6   -0.1879364  -0.5773503    0.4147807   0.5773503 versicolor
#> 7   -0.1524986   1.0000000    0.6757374   1.0910895  virginica
#> 8   -0.9149914  -1.0000000   -1.1487535  -0.8728716  virginica
#> 9    1.0674900   0.0000000    0.4730162  -0.2182179  virginica
## but all is not well
baby_iris %>% group_by(Species) %>% mutate_all(scale) %>% str()
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  9 obs. of  5 variables:
#>  $ Sepal.Length:
#> Error in str.default(obj, ...): dims [product 3] do not match the length of object [9]
baby_iris %>% group_by(Species) %>% mutate_all(scale) %>% View() ## also nope
#> Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : dims [product 3] do not match the length of object [9]
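For comparison, a base R sketch of the same per-group scaling that returns a plain vector. This is an illustration using `ave()`, not part of the original script; `as.vector()` drops the `dim` and `scaled:*` attributes that `scale()` attaches. My guess, not confirmed here, is that those leftover per-group 3 x 1 dims are what `str()` and `View()` trip over.

```r
baby_iris <- iris[c(1, 20, 48, 51, 89, 92, 101, 102, 103), ]

# Scale Sepal.Length within each Species, returning a bare numeric vector
z <- ave(
  baby_iris$Sepal.Length,
  baby_iris$Species,
  FUN = function(v) as.vector(scale(v))
)

dim(z)       # NULL: no matrix structure left behind
round(z, 7)  # should agree with the Sepal.Length column printed above
```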
#' ---
#' output: github_document
#' ---
#' Worth reminding people that `summarise_*()` needs an n-to-1 function, while
#' `mutate_*()` needs an n-to-n function?
#'
#' I can use variable selection helpers from `select()`. Yay! I just didn't
#' expect I would need to put them inside of `vars()`.
#'
#' Re: naming conventions for the resulting variables. The docs do cover this.
#' But this consequence seems just as important: the names determine whether
#' you are effectively adding new variables vs. redefining existing variables.
#'
#' What about programming? Will there be "underscore versions" or are they
#' clearly not needed?
#'
#' No links yet from `mutate()` or `summarise()` to `summarise_all()`. I didn't
#' do this myself because I'm not sure whether it should be done in an ad hoc
#' way or via a family.
#+ setup, include = FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE
)
#' Use `dplyr` from the PR (hidden chunk).
#+ install, include = FALSE
tmp_lib <- "~/tmp/tmp_library"
if (!dir.exists(tmp_lib)) dir.create(tmp_lib)
withr::with_libpaths(tmp_lib,
                     devtools::install_github("hadley/dplyr#1853"),
                     "prefix")
library("dplyr", lib.loc = tmp_lib)
#' Reveal `dplyr` version
devtools::session_info("dplyr")$packages %>%
  filter(package == "dplyr")
#' prep iris
i2 <- as.tbl(iris) %>%
  mutate_at(vars(contains("Petal")), as.integer)
by_species <- i2 %>% group_by(Species)
#' Has grouping always spawned all these attributes? I guess so?
str(by_species)
#' With 20/20 hindsight: create a version of `scale()` that's "vector in,
#' vector out", instead of "matrix(like object) in, matrix out". And another
#' n-to-n function that works on any numeric variable, `negify()`. Got burned by
#' [`hadley/tibble#84`](https://github.com/hadley/tibble/issues/84).
vscale <- function(x, center = TRUE, scale = TRUE) {
  stopifnot(is.null(dim(x)))
  scale(x, center = center, scale = scale)[ ,1, drop = TRUE]
}
negify <- function(x) {
  stopifnot(is.null(dim(x)))
  x * -1
}
#' When I apply `vscale()` to all variables, I get an error, because `Species`
#' is not numeric. When I apply `negify()`, I just get a warning and a vector of
#' `NA`s. What's the difference?
i2 %>% mutate_all(vscale)
i2 %>% mutate_all(negify) %>% head(3)
#' I understand why the number of variables and their names are so different
#' here. Will it surprise people? It's an example of how a seemingly small
#' difference in input leads to quite different output.
# in situ edit --> 5 original variables, 2 have been mutated
by_species %>%
  mutate_at(vars(matches("Sepal")), vscale) %>% head(3)
## new variables --> 5 original variables (untouched) + 2 * 2 new = 9 variables
by_species %>%
  mutate_at(vars(matches("Sepal")), funs(vscale, negify)) %>% head(3)
#' I understand how to force inclusion of the function name in the output
#' variable name. How do I force inclusion of the variable name? Is it the
#' second approach here?
by_species %>%
  mutate_at(vars(Sepal.Length), funs(vscale, negify)) %>% head(3)
by_species %>%
  mutate_at(vars(Sepal.Length = Sepal.Length), funs(vscale, negify)) %>% head(3)
#' A `tbl_df` that creates problems for `str()` and `View()`, a.k.a. how I was
#' reminded that `scale()` returns a matrix. See
#' [`hadley/tibble#84`](https://github.com/hadley/tibble/issues/84).
baby_iris <- iris[c(1, 20, 48, 51, 89, 92, 101, 102, 103), ]
#baby_iris %>% View() ## works
baby_iris %>% group_by(Species) %>% mutate_all(scale) ## print looks ok
## but all is not well
baby_iris %>% group_by(Species) %>% mutate_all(scale) %>% str()
baby_iris %>% group_by(Species) %>% mutate_all(scale) %>% View() ## also nope