# hadley/dplyr-summarise.R

Last active Jun 8, 2021
    # What's the most natural way to express this code in base R?
    library(dplyr, warn.conflicts = FALSE)
    mtcars %>%
      group_by(cyl) %>%
      summarise(mean = mean(disp), n = n())
    #> # A tibble: 3 x 3
    #>     cyl  mean     n
    #>   <dbl> <dbl> <int>
    #> 1     4  105.    11
    #> 2     6  183.     7
    #> 3     8  353.    14

    # tapply() ----------------------------------------------------------------

    data.frame(
      cyl = sort(unique(mtcars$cyl)),
      mean = tapply(mtcars$disp, mtcars$cyl, mean),
      n = tapply(mtcars$disp, mtcars$cyl, length)
    )
    #>   cyl     mean  n
    #> 4   4 105.1364 11
    #> 6   6 183.3143  7
    #> 8   8 353.1000 14

    # - hard to generalise to more than one group because tapply() will
    #   return an array
    # - is `sort(unique(mtcars$cyl))` guaranteed to be in the same order as
    #   the tapply() output?

    # aggregate() -------------------------------------------------------------

    df_mean <- aggregate(mtcars["disp"], mtcars["cyl"], mean)
    df_length <- aggregate(mtcars["disp"], mtcars["cyl"], length)
    # Rename only the summary column; the first column is the grouping variable
    names(df_mean)[2] <- "mean"
    names(df_length)[2] <- "n"
    merge(df_mean, df_length, by = "cyl")
    #>   cyl     mean  n
    #> 1   4 105.1364 11
    #> 2   6 183.3143  7
    #> 3   8 353.1000 14

    # + generalises in a straightforward way to multiple grouping variables
    #   and multiple summary variables
    # - need to manually rename summary variables

    # Could also use formula interface
    # https://twitter.com/tjmahr/status/1231255000766005248
    df_mean <- aggregate(disp ~ cyl, mtcars, mean)
    df_length <- aggregate(disp ~ cyl, mtcars, length)

    # by() --------------------------------------------------------------------

    mtcars_by <- by(mtcars, mtcars$cyl, function(df) {
      # cyl is constant within a group, so take its first value
      data.frame(cyl = df$cyl[[1]], mean = mean(df$disp), n = nrow(df))
    })
    do.call(rbind, mtcars_by)
    #>   cyl     mean  n
    #> 4   4 105.1364 11
    #> 6   6 183.3143  7
    #> 8   8 353.1000 14

    # + generalises easily to more/different summaries
    # - need to know about anonymous functions + do.call + rbind

    # by() = split() + lapply()
    mtcars_by <- lapply(split(mtcars, mtcars$cyl), function(df) {
      data.frame(cyl = df$cyl[[1]], mean = mean(df$disp), n = nrow(df))
    })
    do.call(rbind, mtcars_by)
    #>   cyl     mean  n
    #> 4   4 105.1364 11
    #> 6   6 183.3143  7
    #> 8   8 353.1000 14

    # Manual indexing approaches ----------------------------------------------

    # from https://twitter.com/fartmiasma/status/1231258479865647105
    cyl_counts <- sort(unique(mtcars$cyl))
    tabl <- sapply(cyl_counts, function(ct) {
      with(mtcars, c(cyl = ct, mean = mean(disp[cyl == ct]), n = sum(cyl == ct)))
    })
    as.data.frame(t(tabl))
    #>   cyl     mean  n
    #> 1   4 105.1364 11
    #> 2   6 183.3143  7
    #> 3   8 353.1000 14

    # - coerces all results (and grouping var) to common type

    # Similar approach from
    # https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec#gistcomment-3185680
    s <- lapply(cyl_counts, function(cyl) {
      indx <- mtcars$cyl == cyl
      data.frame(cyl = cyl, mean = mean(mtcars$disp[indx]), n = sum(indx))
    })
    do.call(rbind, s)
    #>   cyl     mean  n
    #> 1   4 105.1364 11
    #> 2   6 183.3143  7
    #> 3   8 353.1000 14

    # - harder to generalise to multiple grouping vars (need to use Map())
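
One way to sidestep the ordering question raised in the `tapply()` section is to take the group labels from the `tapply()` result itself instead of from a separate `sort(unique())` call. The following is only a sketch: the variable names `means`, `counts` and `res` are illustrative, and converting the labels back with `as.numeric()` assumes the grouping variable is numeric, as `cyl` is here.

```r
# Group labels come from the names of the tapply() result, so the labels
# and the summaries cannot end up in different orders.
means <- tapply(mtcars$disp, mtcars$cyl, mean)
counts <- tapply(mtcars$disp, mtcars$cyl, length)

res <- data.frame(
  cyl = as.numeric(names(means)), # assumption: the grouping variable is numeric
  mean = as.vector(means),        # as.vector() drops the array attributes
  n = as.vector(counts),
  row.names = NULL
)
res
```

The first drawback listed above still applies: with more than one grouping variable `tapply()` returns a multi-dimensional array, and this sketch does not handle that case.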

### llrs commented Feb 22, 2020 • edited

 The second example doesn't return the same result as the other solutions: you used the mpg column instead of disp for the mean. I would use this, or write a `for` loop to avoid both the final call to `rbind` and creating a new data.frame for each case.

```
s <- lapply(unique(mtcars$cyl), function(x) {
  k <- mtcars$cyl == x
  n <- sum(k)
  m <- mean(mtcars$disp[k]) # if several columns it could be used inside an apply call.
  data.frame(cyl = x, mean = m, n = n)
})
do.call(rbind, s)
```

### hadley commented Feb 22, 2020

 @llrs added your approach — thanks! How would you use a for loop here?

### llrs commented Feb 22, 2020

 Like this if you want to be memory efficient:

```
keys <- unique(mtcars$cyl)
n <- vector("numeric", length(keys))
m <- vector("numeric", length(keys))
for (x in seq_along(keys)) {
  k <- mtcars$cyl == keys[x]
  n[x] <- sum(k)
  m[x] <- mean(mtcars$disp[k]) # if several columns it could be used inside an apply call.
}
data.frame(cyl = keys, mean = m, n = n)
```

### TimTeaFan commented Feb 22, 2020

 The `aggregate` approach could be optimized the following way:

```
aggregate(disp ~ cyl, mtcars, function(x) c(mean = mean(x), n = length(x)))
#>   cyl disp.mean   disp.n
#> 1   4  105.1364  11.0000
#> 2   6  183.3143   7.0000
#> 3   8  353.1000  14.0000
```

++ It's much less verbose than the original aggregate approach above, and easier to generalize than the Twitter approach with separate `df <-` calls.

++ No need to adjust the naming of the variables.

-- It returns all summaries in the same type, which means double as soon as one of them can't be coerced to integer (note `disp.n` printing as `11.0000`).

-- The result is a `data.frame` with two columns (`cyl` and `disp`), the latter of which is a matrix.

To remedy the last point, we could wrap the `aggregate` in a `with` call, but that would again be more verbose:

```
with(aggregate(disp ~ cyl, mtcars, function(x) c(mean = mean(x), n = length(x))),
     as.data.frame(cbind(cyl, disp)))
#>   cyl     mean  n
#> 1   4 105.1364 11
#> 2   6 183.3143  7
#> 3   8 353.1000 14
```
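
Another way to flatten the matrix column, offered as a sketch and not as part of the suggestion above, is to pass the result through `do.call(data.frame, ...)`, which expands each matrix column into ordinary columns. The names `agg` and `flat` are illustrative.

```r
# aggregate() returns `disp` as a two-column matrix inside the data frame
agg <- aggregate(disp ~ cyl, mtcars, function(x) c(mean = mean(x), n = length(x)))

# data.frame() splits a matrix argument into one column per matrix column,
# prefixing each with the original column name
flat <- do.call(data.frame, agg)
names(flat)
```

This keeps the `disp.mean` / `disp.n` names from the printed output and does not change the types, so the count is still stored as a double.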

### AlbanOtt commented Feb 23, 2020

 Just after I left the university, I would probably have written something like that:

```
cyl_u = unique(mtcars$cyl)
res=c()
for(cyl in cyl_u){
  keepit=mtcars$cyl==cyl
  mean=mean(mtcars[keepit,"disp"])
  n=sum(keepit)
  res=rbind(res, data.frame(cyl,mean,n))
}
res
```

Yes it might be shameful but you said: "How would you use a for loop here?" so here I am...
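
Calling `rbind()` inside the loop copies the accumulated result on every iteration. A common variation, sketched here with the illustrative names `pieces` and `res`, collects the one-row data frames in a preallocated list and binds them once at the end.

```r
cyl_u <- unique(mtcars$cyl)
pieces <- vector("list", length(cyl_u)) # one slot per group, allocated up front

for (i in seq_along(cyl_u)) {
  keepit <- mtcars$cyl == cyl_u[i]
  pieces[[i]] <- data.frame(
    cyl = cyl_u[i],
    mean = mean(mtcars$disp[keepit]),
    n = sum(keepit)
  )
}

# Single rbind() over all pieces instead of one per iteration
res <- do.call(rbind, pieces)
res
```

With only three groups the difference is negligible; the pattern matters mainly when the number of groups is large. The rows come out in the order of `unique(mtcars$cyl)`, not sorted.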

### Myfanwy commented Feb 24, 2020 • edited

 I learned Base R mostly after the fact, but here's how I was taught to do it, FWIW: write a function as if there were only one group, and make sure it returns the answer in the format you want. Then apply it to all the groups. Can replace some of the below with `by`, or combine different parts, but just wanted to convey the thought process most importantly (do for one group first, make sure it works, then apply to all groups):

```
onecar = function(x) {
  data.frame(mean_per_cyl = mean(x$disp), n = nrow(x))
}

mtsplit = split(mtcars, mtcars$cyl) # could obviously move this step into the function
summ_cyls = do.call(rbind, lapply(mtsplit, onecar))
```
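
In this version the group label ends up only in the row names of `summ_cyls`, because `rbind()` uses the names of the split list. If a proper `cyl` column is wanted, one possible follow-up step is sketched below; converting the row names with `as.numeric()` assumes a numeric grouping variable.

```r
onecar <- function(x) {
  data.frame(mean_per_cyl = mean(x$disp), n = nrow(x))
}

mtsplit <- split(mtcars, mtcars$cyl)
summ_cyls <- do.call(rbind, lapply(mtsplit, onecar))

# Move the group labels from the row names into a regular column
summ_cyls <- cbind(cyl = as.numeric(rownames(summ_cyls)), summ_cyls)
rownames(summ_cyls) <- NULL
summ_cyls
```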

### dwoll commented Feb 25, 2020

 `merge()` has an argument `suffixes` which eliminates the need to manually rename the aggregated variables:

```
df_mean <- aggregate(mtcars["disp"], mtcars["cyl"], mean)
df_length <- aggregate(mtcars["disp"], mtcars["cyl"], length)
merge(df_mean, df_length, by = "cyl", suffixes = c("_mean", "_n"))
```
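
The same pattern appears to carry over to more than one grouping variable. The sketch below groups by `cyl` and `am`; the choice of `am` as the second variable and the names `groups` and `res` are illustrative assumptions, not taken from the thread.

```r
groups <- mtcars[c("cyl", "am")] # two grouping columns instead of one

df_mean <- aggregate(mtcars["disp"], groups, mean)
df_length <- aggregate(mtcars["disp"], groups, length)

# Both results have a `disp` column, so merge() disambiguates it with the suffixes
res <- merge(df_mean, df_length, by = c("cyl", "am"), suffixes = c("_mean", "_n"))
names(res)
```

The summary columns are named after the original column plus the suffix (`disp_mean`, `disp_n`), not `mean` and `n` as in the dplyr output.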

### alexpavlakis commented Feb 27, 2020 • edited

 imo this is the clearest base `R` approach to the problem (probably not the fastest though).

```
summariseByGroup <- function(groupData, numericData) {
  group <- sort(unique(groupData))
  nGroups <- length(group)
  n <- vector('integer', nGroups)
  mean <- vector('numeric', nGroups)
  for(g in seq_along(group)) {
    for(i in seq_along(numericData)) {
      if(groupData[i] == group[g]) {
        n[g] <- n[g] + 1
        mean[g] <- (mean[g]*(n[g] - 1) + numericData[i])/n[g]
      }
    }
  }
  data.frame(group, mean, n)
}

summariseByGroup(mtcars$cyl, mtcars$disp)
```
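
The update in the inner loop is an incremental mean: after the k-th matching value, the new mean is `(old_mean * (k - 1) + x_k) / k`. A small sketch that isolates the recurrence for a single group and compares it with `mean()`; the names `x` and `running` are illustrative.

```r
x <- mtcars$disp[mtcars$cyl == 4] # values for one group

running <- 0
for (k in seq_along(x)) {
  # Undo the previous division, add the new value, divide by the new count
  running <- (running * (k - 1) + x[k]) / k
}

all.equal(running, mean(x))
```

The two results should agree up to floating-point rounding, which is why the comparison uses `all.equal()` and not `==`.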

### romainfrancois commented Jan 13, 2021

 Adding the link of the tweet where this was also discussed, so that we can remove it from the vignette, because CRAN url checks. https://twitter.com/hadleywickham/status/1231252596712771585