-
-
Save hadley/c430501804349d382ce90754936ab8ec to your computer and use it in GitHub Desktop.
# What's the most natural way to express this code in base R? | |
library(dplyr, warn.conflicts = FALSE) | |
mtcars %>% | |
group_by(cyl) %>% | |
summarise(mean = mean(disp), n = n()) | |
#> # A tibble: 3 x 3 | |
#> cyl mean n | |
#> <dbl> <dbl> <int> | |
#> 1 4 105. 11 | |
#> 2 6 183. 7 | |
#> 3 8 353. 14 | |
# tapply() ---------------------------------------------------------------- | |
data.frame( | |
cyl = sort(unique(mtcars$cyl)), | |
mean = tapply(mtcars$disp, mtcars$cyl, mean), | |
n = tapply(mtcars$disp, mtcars$cyl, length) | |
) | |
#> cyl mean n | |
#> 4 4 105.1364 11 | |
#> 6 6 183.3143 7 | |
#> 8 8 353.1000 14 | |
# - hard to generalise to more than one group because tapply() will | |
# return an array | |
# - is `sort(unique(mtcars$cyl))` guaranteed to be in the same order as | |
# the tapply() output? | |
# aggregate() ------------------------------------------------------------- | |
df_mean <- aggregate(mtcars["disp"], mtcars["cyl"], mean) | |
df_length <- aggregate(mtcars["disp"], mtcars["cyl"], length) | |
names(df_mean)[2] <- "mean" | |
names(df_length)[2] <- "n" | |
merge(df_mean, df_length, by = "cyl") | |
#> cyl mean n | |
#> 1 4 105.1364 11 | |
#> 2 6 183.3143 7 | |
#> 3 8 353.1000 14 | |
# + generalises in stratightforward to multiple grouping variables and | |
# multiple summary variables | |
# - need to manually rename summary variables | |
# Could also use formula interface | |
# https://twitter.com/tjmahr/status/1231255000766005248 | |
df_mean <- aggregate(disp ~ cyl, mtcars, mean) | |
df_length <- aggregate(disp ~ cyl, mtcars, length) | |
# by() -------------------------------------------------------------------- | |
mtcars_by <- by(mtcars, mtcars$cyl, function(df) { | |
data.frame(cyl = df$cyl[[1]], mean = mean(df$disp), n = nrow(df)) | |
}) | |
do.call(rbind, mtcars_by) | |
#> cyl mean n | |
#> 4 4 105.1364 11 | |
#> 6 6 183.3143 7 | |
#> 8 8 353.1000 14 | |
# + generalises easily to more/different summaries | |
# - need to know about anonymous functions + do.call + rbind | |
# by() = split() + lapply() | |
mtcars_by <- lapply(split(mtcars, mtcars$cyl), function(df) { | |
data.frame(cyl = df$cyl[[1]], mean = mean(df$disp), n = nrow(df)) | |
}) | |
do.call(rbind, mtcars_by) | |
#> cyl mean n | |
#> 4 4 105.1364 11 | |
#> 6 6 183.3143 7 | |
#> 8 8 353.1000 14 | |
# Manual indexing approahes ------------------------------------------------- | |
# from https://twitter.com/fartmiasma/status/1231258479865647105 | |
cyl_counts <- sort(unique(mtcars$cyl)) | |
tabl <- sapply(cyl_counts, function(ct) { | |
with(mtcars, c(cyl = ct, mean = mean(disp[cyl == ct]), n = sum(cyl == ct))) | |
}) | |
as.data.frame(t(tabl)) | |
#> cyl mean n | |
#> 1 4 105.1364 11 | |
#> 2 6 183.3143 7 | |
#> 3 8 353.1000 14 | |
# - coerces all results (and grouping var) to common type | |
# Similar approach from | |
# https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec#gistcomment-3185680 | |
s <- lapply(cyl_counts, function(cyl) { | |
indx <- mtcars$cyl == cyl | |
data.frame(cyl = cyl, mean = mean(mtcars$disp[indx]), n = sum(indx)) | |
}) | |
do.call(rbind, s) | |
#> cyl mean n | |
#> 1 4 105.1364 11 | |
#> 2 6 183.3143 7 | |
#> 3 8 353.1000 14 | |
# - harder to generalise to multiple grouping vars (need to use Map()) |
@gwangjinkim Reduce works pair wise, which is potential slower than providing all data frames to rbind at once (although I don't know if rbind actually takes advantage of that). And nice spotting with the order problem, thanks!
@hadley Thank you, too! I often used Reduce
(because I was coming from Lisp languages, where you have reduce
). do.call
is actually Lisp's apply
- while the apply
functions in R are more like a mapcar
in LIsp ... - so to use do.call
in such situations is then better! Good to know, thank you!
Thank you for the interesting question.
To me the most natural way to express the counts in a group variable in base R is to use table()
, but it coerces the grouped variable into a factor. Since I didn't see its use in the previous answers, here is my trial:
dat <- setNames(as.data.frame(table(mtcars$cyl)), c("cyl", "n"))
dat$cyl <- as.numeric(as.character(dat$cyl))
dat$mean <- sapply(dat$cyl, function(x) mean(with(mtcars, disp[cyl == x])))
dat <- dat[, c(1,3,2)]
dat
# cyl mean n
# 1 4 105.1364 11
# 2 6 183.3143 7
# 3 8 353.1000 14
Note a downside of the formula based aggregate syntax — it drops missing values like a modelling function:
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, length)
#> Month Ozone Temp
#> 1 5 26 26
#> 2 6 9 9
#> 3 7 26 26
#> 4 8 26 26
#> 5 9 29 29
aggregate(airquality[c("Ozone", "Temp")], airquality["Month"], length)
#> Month Ozone Temp
#> 1 5 31 31
#> 2 6 30 30
#> 3 7 31 31
#> 4 8 31 31
#> 5 9 30 30
Created on 2022-10-24 with reprex v2.0.2
@hadley To your question in your
tapply()
example - whether the order oftapply
's output is guaranteed to be the same likesort(unique(mtcars$cyl))
:Initially, I thought "yes!", because
tapply
usessplit
from which it overtakes the output order.split
in turn usesfactor
to sort the names of the groups, which in turn usesorder()
to define the order of the levels.sort.default
uses in its code alsoorder
for sorting. Thus, the output order of the groups must be identical, sincetapply()
as well assort(unique())
both use the same functionorder()
to order their output.In case the names are integers,
sort.int()
should always lead to the same output order ...But then I realized after all this - that if the
cyl
column inmtcars
would not benumeric
but afactor
with customizedlevels
, the ordering methods might differ, and indeed it does: