Jenny Bryan
22 August, 2014
I posed a question on Twitter (click to see the figure!):
What is most elegant (d)plyr-ish way to do? #dplyr @hadleywickham pic.twitter.com/kXsfkq7Rkq
— Jennifer Bryan (@JennyBryan) August 22, 2014
that boiled down to this: you have a list of data.frames and the element names convey information. You want to row bind them together and, in the new data.frame, you want a variable for the list element each observation originated in.
I got back an embarrassment of riches, which I'll record here.
First, make an appropriate list of data.frames from the iris data. Note the Species information is carried only in the list names.
(my_list <-
lapply(split(subset(iris, select = -Species), iris$Species), "[", 1:2, ))
## $setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
##
## $versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 51 7.0 3.2 4.7 1.4
## 52 6.4 3.2 4.5 1.5
##
## $virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 101 6.3 3.3 6.0 2.5
## 102 5.8 2.7 5.1 1.9
Row binding with existing rbind()-type functions cannot recover Species.
do.call("rbind", my_list) # rownames have never looked so good ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa.1 5.1 3.5 1.4 0.2
## setosa.2 4.9 3.0 1.4 0.2
## versicolor.51 7.0 3.2 4.7 1.4
## versicolor.52 6.4 3.2 4.5 1.5
## virginica.101 6.3 3.3 6.0 2.5
## virginica.102 5.8 2.7 5.1 1.9
dplyr::rbind_all(my_list)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 7.0 3.2 4.7 1.4
## 4 6.4 3.2 4.5 1.5
## 5 6.3 3.3 6.0 2.5
## 6 5.8 2.7 5.1 1.9
Kara Woo provided this solution:
my_list2 <-
mapply(`[<-`, my_list, 'Species', value = names(my_list), SIMPLIFY = FALSE)
dplyr::rbind_all(my_list2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 7.0 3.2 4.7 1.4 versicolor
## 4 6.4 3.2 4.5 1.5 versicolor
## 5 6.3 3.3 6.0 2.5 virginica
## 6 5.8 2.7 5.1 1.9 virginica
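For comparison, here is a minimal base R sketch of the same idea as the one-liner above, written out longhand: attach each list name as a column, then row bind. It assumes only the built-in iris data; the names with_names and combined are made up for illustration and do not come from the Twitter thread.

```r
# Rebuild the example list from the built-in iris data (same construction as above).
my_list <- lapply(split(subset(iris, select = -Species), iris$Species), "[", 1:2, )

# Pair each data.frame with its list name and store the name in a new column.
with_names <- Map(function(df, nm) {
  df$Species <- nm
  df
}, my_list, names(my_list))

# Row bind and drop the "setosa.1"-style row names that rbind() creates.
combined <- do.call(rbind, with_names)
rownames(combined) <- NULL
combined
```

Map() is just mapply() with SIMPLIFY = FALSE, so this is the same technique with the assignment spelled out rather than passed as `[<-`.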
Hadley Wickham pegged this as a data tidying task and added experimental new functionality for tidyr::unnest(): tidyverse/tidyr#22. I installed tidyr from this commit to try out the new list method.
library(tidyr)
unnest(my_list) # by default, unnest() just row binds
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 7.0 3.2 4.7 1.4
## 4 6.4 3.2 4.5 1.5
## 5 6.3 3.3 6.0 2.5
## 6 5.8 2.7 5.1 1.9
unnest(my_list, Species) # but this creates the desired Species variable
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.1 3.5 1.4 0.2
## 2 setosa 4.9 3.0 1.4 0.2
## 3 versicolor 7.0 3.2 4.7 1.4
## 4 versicolor 6.4 3.2 4.5 1.5
## 5 virginica 6.3 3.3 6.0 2.5
## 6 virginica 5.8 2.7 5.1 1.9
Kevin Ushey proposed the rbindlistn() function from his data.table.extras package. I installed data.table.extras from this commit to try it out.
library(data.table.extras)
## Loading required package: data.table
rbindlistn(my_list, "Species")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 7.0 3.2 4.7 1.4 versicolor
## 4: 6.4 3.2 4.5 1.5 versicolor
## 5: 6.3 3.3 6.0 2.5 virginica
## 6: 5.8 2.7 5.1 1.9 virginica
Arun Srinivasan also proposed a data.table solution:
library(data.table)
rbindlist(my_list)[, Species := rep(names(my_list), vapply(my_list, nrow, 0L))][]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 7.0 3.2 4.7 1.4 versicolor
## 4: 6.4 3.2 4.5 1.5 versicolor
## 5: 6.3 3.3 6.0 2.5 virginica
## 6: 5.8 2.7 5.1 1.9 virginica
Oddly, the above does not print to the Console for me when run interactively, but, lo, here it is after I rmarkdown::render() this. Not sure what to make of that.
Kevin Ushey also shared a Python-inspired enumerate() function, but I still need to apply it to my problem. At this point, it seems superseded by his other solution.
enumerate <- function(X, FUN, ...) {
  result <- vector("list", length(X))
  for (i in seq_along(result)) {
    # call FUN with both the element and its index
    tmp <- FUN(X[[i]], i, ...)
    if (is.null(tmp))
      result[i] <- list(NULL)  # keep the slot; result[[i]] <- NULL would delete it
    else
      result[[i]] <- tmp
  }
  result
}
l <- list(a = 1, b = 2, c = 3)
enumerate(l, function(x, i) {
  cat("Name: ", names(l)[[i]], "\n")
  cat("Value: ", x, "\n")
  invisible()
})
## Name: a
## Value: 1
## Name: b
## Value: 2
## Name: c
## Value: 3
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
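As a rough sketch of how enumerate() might be applied to the original problem: use the index to look up each element's name, add it as a column, and row bind the result. This is one possible application, not something proposed in the thread; the names tagged and combined are made up for illustration, and enumerate() is repeated so the block runs on its own.

```r
# enumerate() as defined above, repeated so this block is self-contained.
enumerate <- function(X, FUN, ...) {
  result <- vector("list", length(X))
  for (i in seq_along(result)) {
    tmp <- FUN(X[[i]], i, ...)
    if (is.null(tmp))
      result[i] <- list(NULL)
    else
      result[[i]] <- tmp
  }
  result
}

# Same example list as at the top of the post.
my_list <- lapply(split(subset(iris, select = -Species), iris$Species), "[", 1:2, )

# Use the index i to fetch the element's name and record it in a new column.
tagged <- enumerate(my_list, function(df, i) {
  df$Species <- names(my_list)[[i]]
  df
})

# enumerate() returns an unnamed list, so rbind() keeps the original row names.
combined <- do.call(rbind, tagged)
combined
```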
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table.extras_1.0 data.table_1.9.3 tidyr_0.1.0.9000
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.4 dplyr_0.2.0.99 evaluate_0.5.5
## [5] formatR_0.10 htmltools_0.2.4 knitr_1.6 magrittr_1.0.1
## [9] parallel_3.1.0 plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4
## [13] rmarkdown_0.2.64 stringr_0.6.2 tools_3.1.0 yaml_2.1.13
@jennybc, great summary!
It's because of :=, which assigns by reference and returns invisibly. Just add a [] at the end.
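A minimal sketch of the behaviour described in this comment, assuming the data.table package is installed. The table dt and its columns are invented for illustration only.

```r
library(data.table)

dt <- data.table(x = 1:2)

# := adds the column by reference and returns invisibly,
# so nothing is printed at the interactive console.
dt[, y := x * 2]

# Chaining an empty [] onto the call makes the result visible, so the table prints.
dt[, z := x + 1L][]
```

Both calls modify dt in place; the trailing [] only changes whether the result is printed.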