jenny Sat May 28 23:41:28 2016
Worth reminding users that summarise_*() needs an n-to-1 function and mutate_*() needs an n-to-n function?
I can use variable selection helpers from select(). Yay! I just didn't expect I would need to put them inside of vars().
Re: naming conventions for the resulting variables. The docs do cover this, but this consequence seems just as important: the names determine whether you are effectively adding new variables or redefining existing ones.
What about programming? Will there be "underscore versions" or are they clearly not needed?
No links yet from mutate() or summarise() to summarise_all(). I didn't add them myself because I'm not sure whether it should be done in an ad hoc way or via a family.
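To make the first point above concrete, here is a minimal base R sketch of the n-to-1 vs. n-to-n distinction. It does not use dplyr at all, and center() is a hypothetical helper invented just for this illustration.

```r
x <- c(2, 4, 6, 8)

# n-to-1: four values in, one value out (the shape summarise_*() needs)
mean(x)

# n-to-n: four values in, four values out (the shape mutate_*() needs)
# center() is a hypothetical helper, not part of dplyr or base R
center <- function(x) x - mean(x)
center(x)

# the output lengths are what distinguish the two kinds of function
stopifnot(length(mean(x)) == 1, length(center(x)) == length(x))
```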
Use dplyr from the PR (hidden chunk).
Reveal dplyr version:
devtools::session_info("dplyr")$packages %>%
  filter(package == "dplyr")
#> package * version date source
#> 1 dplyr * 0.4.3.9001 2016-05-28 Github (lionel-/dplyr@34927b8)
Prep iris:
i2 <- as.tbl(iris) %>%
  mutate_at(vars(contains("Petal")), as.integer)
by_species <- i2 %>% group_by(Species)
Has grouping always spawned all these attributes? I guess so?
str(by_species)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Petal.Width : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "vars")=List of 1
#> ..$ : symbol Species
#> - attr(*, "drop")= logi TRUE
#> - attr(*, "indices")=List of 3
#> ..$ : int 0 1 2 3 4 5 6 7 8 9 ...
#> ..$ : int 50 51 52 53 54 55 56 57 58 59 ...
#> ..$ : int 100 101 102 103 104 105 106 107 108 109 ...
#> - attr(*, "group_sizes")= int 50 50 50
#> - attr(*, "biggest_group_size")= int 50
#> - attr(*, "labels")='data.frame': 3 obs. of 1 variable:
#> ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
#> ..- attr(*, "vars")=List of 1
#> .. ..$ : symbol Species
#> ..- attr(*, "drop")= logi TRUE
With 20/20 hindsight: create a version of scale() that's "vector in, vector out", instead of "matrix(like object) in, matrix out". And another n-to-n function that works on any numeric variable, negify(). Got burned by hadley/tibble#84.
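# vector in, vector out wrapper around scale()
vscale <- function(x, center = TRUE, scale = TRUE) {
  stopifnot(is.null(dim(x)))
  # scale() returns a one-column matrix; keep just that column as a vector
  scale(x, center = center, scale = scale)[, 1, drop = TRUE]
}
# n-to-n function that flips the sign of a vector
negify <- function(x) {
  stopifnot(is.null(dim(x)))
  x * -1
}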
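A quick base R check of what the wrapper changes, as a sketch: scale() hands back a one-column matrix carrying extra attributes, while vscale() hands back a plain vector. The definition of vscale() is repeated here only so the snippet runs on its own; the input vector is made up for illustration.

```r
vscale <- function(x, center = TRUE, scale = TRUE) {
  stopifnot(is.null(dim(x)))
  scale(x, center = center, scale = scale)[, 1, drop = TRUE]
}

x <- c(5.1, 4.9, 4.7)  # illustrative values only

m <- scale(x)
is.matrix(m)           # scale() returns a matrix
dim(m)                 # 3 rows, 1 column
names(attributes(m))   # dim plus the centering and scaling attributes

v <- vscale(x)
is.null(dim(v))        # vscale() returns a vector without dims
all.equal(v, as.vector(m))
```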
When I apply vscale() to all variables, I get an error, because Species is not numeric. When I apply negify(), I just get a warning and a vector of NAs. What's the difference?
i2 %>% mutate_all(vscale)
#> Error in eval(expr, envir, enclos): 'x' must be numeric
i2 %>% mutate_all(negify) %>% head(3)
#> Warning in Ops.factor(x, -1): '*' not meaningful for factors
#> Source: local data frame [3 x 5]
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 -5.1 -3.5 -1 0 NA
#> 2 -4.9 -3.0 -1 0 NA
#> 3 -4.7 -3.2 -1 0 NA
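A base R sketch that may explain the difference, without dplyr in the picture. It appears the two functions simply fail in different ways on a factor: scale() stops with an error when it hits non-numeric input, whereas `*` on a factor goes through Ops.factor() (named in the warning above), which only warns and returns NA. The factor f below is made up for illustration.

```r
f <- factor(c("setosa", "versicolor", "virginica"))

# scale() on a factor: stops with an error; capture the message
scale_result <- tryCatch(scale(f), error = function(e) conditionMessage(e))
scale_result

# `*` on a factor: a warning is signalled, but a value is still returned
neg_result <- withCallingHandlers(
  f * -1,
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
neg_result       # all NA, one per element of f
class(neg_result)
```

If that reading is right, the NA column of type lgl in the negify() output is just what arithmetic on a factor returns, and mutate_all() passes it along.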
I understand why the number of variables and their names are so different here. Will it surprise people? This is an example of how a seemingly small difference in input leads to pretty different output.
# in situ edit --> 5 original variables, 2 have been mutated
by_species %>%
  mutate_at(vars(matches("Sepal")), vscale) %>% head(3)
#> Source: local data frame [3 x 5]
#> Groups: Species [1]
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <int> <int> <fctr>
#> 1 0.2666745 0.1899414 1 0 setosa
#> 2 -0.3007180 -1.1290958 1 0 setosa
#> 3 -0.8681105 -0.6014810 1 0 setosa
# new variables --> 5 original variables (untouched) + 2 * 2 new = 9 variables
by_species %>%
  mutate_at(vars(matches("Sepal")), funs(vscale, negify)) %>% head(3)
#> Source: local data frame [3 x 9]
#> Groups: Species [1]
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <int> <int> <fctr>
#> 1 5.1 3.5 1 0 setosa
#> 2 4.9 3.0 1 0 setosa
#> 3 4.7 3.2 1 0 setosa
#> Variables not shown: Sepal.Length_vscale <dbl>, Sepal.Width_vscale <dbl>,
#> Sepal.Length_negify <dbl>, Sepal.Width_negify <dbl>.
I understand how to force inclusion of the function name in the output variable name. How do I force inclusion of the variable name? Is the second call below the way to do it?
by_species %>%
  mutate_at(vars(Sepal.Length), funs(vscale, negify)) %>% head(3)
#> Source: local data frame [3 x 7]
#> Groups: Species [1]
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species vscale
#> <dbl> <dbl> <int> <int> <fctr> <dbl>
#> 1 5.1 3.5 1 0 setosa 0.2666745
#> 2 4.9 3.0 1 0 setosa -0.3007180
#> 3 4.7 3.2 1 0 setosa -0.8681105
#> Variables not shown: negify <dbl>.
by_species %>%
  mutate_at(vars(Sepal.Length = Sepal.Length), funs(vscale, negify)) %>% head(3)
#> Source: local data frame [3 x 7]
#> Groups: Species [1]
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <int> <int> <fctr>
#> 1 5.1 3.5 1 0 setosa
#> 2 4.9 3.0 1 0 setosa
#> 3 4.7 3.2 1 0 setosa
#> Variables not shown: Sepal.Length_vscale <dbl>, Sepal.Length_negify <dbl>.
A tbl_df that creates problems for str() and View(), a.k.a. how I was reminded that scale() returns a matrix. See hadley/tibble#84.
baby_iris <- iris[c(1, 20, 48, 51, 89, 92, 101, 102, 103), ]
#baby_iris %>% View() ## works
baby_iris %>% group_by(Species) %>% mutate_all(scale) ## print looks ok
#> Source: local data frame [9 x 5]
#> Groups: Species [3]
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fctr>
#> 1 0.5773503 0.0000000 -0.5773503 -0.5773503 setosa
#> 2 0.5773503 1.0000000 1.1547005 1.1547005 setosa
#> 3 -1.1547005 -1.0000000 -0.5773503 -0.5773503 setosa
#> 4 1.0806343 1.1547005 0.7258662 0.5773503 versicolor
#> 5 -0.8926979 -0.5773503 -1.1406469 -1.1547005 versicolor
#> 6 -0.1879364 -0.5773503 0.4147807 0.5773503 versicolor
#> 7 -0.1524986 1.0000000 0.6757374 1.0910895 virginica
#> 8 -0.9149914 -1.0000000 -1.1487535 -0.8728716 virginica
#> 9 1.0674900 0.0000000 0.4730162 -0.2182179 virginica
## but all is not well
baby_iris %>% group_by(Species) %>% mutate_all(scale) %>% str()
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 9 obs. of 5 variables:
#> $ Sepal.Length:
#> Error in str.default(obj, ...): dims [product 3] do not match the length of object [9]
baby_iris %>% group_by(Species) %>% mutate_all(scale) %>% View() ## also nope
#> Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : dims [product 3] do not match the length of object [9]
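A possible workaround, sketched in base R rather than with the PR's dplyr: strip the matrix dimensions before the scaled values go back into the data frame. Here ave() stands in for the grouped mutate, and only Sepal.Length is handled. The use of as.vector() is an assumption about what avoids the problem, not something confirmed in hadley/tibble#84.

```r
baby_iris <- iris[c(1, 20, 48, 51, 89, 92, 101, 102, 103), ]

# scale within each Species, dropping the matrix dims with as.vector()
scaled_len <- ave(
  baby_iris$Sepal.Length,
  baby_iris$Species,
  FUN = function(x) as.vector(scale(x))
)

is.null(dim(scaled_len))  # a plain vector with no dim attribute

# put the plain vector back and inspect the result
baby_iris$Sepal.Length <- scaled_len
str(baby_iris)
```

The vscale() helper defined earlier does the same dim-stripping, so it could be dropped in for the anonymous function.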