jennybc/2014-08-22_rbind-and-store-as-var.Rmd

## 2014-08-22_rbind-and-store-as-var.md

      
### Row bind a list of data.frames with a key

Jenny Bryan

22 August, 2014

I posed a question on Twitter (click to see the figure!):

> What is most elegant (d)plyr-ish way to do? [#dplyr](https://twitter.com/hashtag/dplyr?src=hash) [@hadleywickham](https://twitter.com/hadleywickham) [pic.twitter.com/kXsfkq7Rkq](http://t.co/kXsfkq7Rkq)
>
> Jennifer Bryan (@JennyBryan), [August 22, 2014](https://twitter.com/JennyBryan/statuses/502864414266363904)

that boiled down to this: you have a list of data.frames and the element names convey information. You want to row bind them together and, in the new data.frame, you want a variable for the list element each observation originated in.

I got back an embarrassment of riches, which I'll record here.

#### Problem example

First, make an appropriate list of data.frames from the `iris` data. Note the `Species` information is carried only in the list names.

```r
(my_list <-
   lapply(split(subset(iris, select = -Species), iris$Species), "[", 1:2, ))
```

```
## $setosa
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 
## $versicolor
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 51          7.0         3.2          4.7         1.4
## 52          6.4         3.2          4.5         1.5
## 
## $virginica
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 101          6.3         3.3          6.0         2.5
## 102          5.8         2.7          5.1         1.9
```

Row binding with existing `rbind()`-type functions cannot recover `Species`.

```r
do.call("rbind", my_list) # rownames have never looked so good ...
##               Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa.1               5.1         3.5          1.4         0.2
## setosa.2               4.9         3.0          1.4         0.2
## versicolor.51          7.0         3.2          4.7         1.4
## versicolor.52          6.4         3.2          4.5         1.5
## virginica.101          6.3         3.3          6.0         2.5
## virginica.102          5.8         2.7          5.1         1.9
dplyr::rbind_all(my_list)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          7.0         3.2          4.7         1.4
## 4          6.4         3.2          4.5         1.5
## 5          6.3         3.3          6.0         2.5
## 6          5.8         2.7          5.1         1.9
```

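The row names produced by `do.call("rbind", ...)` do carry the list names, so one fragile workaround is to parse them back out. A minimal base R sketch, assuming the original row names contain no dots (`bound` is just an illustrative name, and the list is rebuilt here so the block runs on its own):

```r
# Rebuild the example list: first two rows per species, Species column dropped.
my_list <- lapply(split(subset(iris, select = -Species), iris$Species),
                  function(d) d[1:2, ])

bound <- do.call("rbind", my_list)

# Row names look like "setosa.1"; strip the final ".<original row name>" piece.
# Assumption: the original row names themselves contain no dots.
bound$Species <- sub("\\.[^.]*$", "", rownames(bound))
rownames(bound) <- NULL
bound
```

This leans on a side effect of `rbind()` rather than on anything designed for the job, which is part of why a dedicated solution is more appealing.
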
#### `dplyr::rbind_all()` + `mapply()`

[Kara Woo](https://twitter.com/kara_woo) provided [this solution](https://twitter.com/kara_woo/statuses/502867132049145858):

```r
my_list2 <-
  mapply(`[<-`, my_list, 'Species', value = names(my_list), SIMPLIFY = FALSE)
dplyr::rbind_all(my_list2)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          4.9         3.0          1.4         0.2     setosa
## 3          7.0         3.2          4.7         1.4 versicolor
## 4          6.4         3.2          4.5         1.5 versicolor
## 5          6.3         3.3          6.0         2.5  virginica
## 6          5.8         2.7          5.1         1.9  virginica
```
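
To unpack what the `mapply()` call is doing: `[<-` is the replacement form of `[`, so each data.frame gets a `Species` column filled with its own list name. A rough equivalent written as an explicit loop, as a sketch only (`looped` and `via_mapply` are illustrative names):

```r
# Rebuild the example list so this block is self-contained.
my_list <- lapply(split(subset(iris, select = -Species), iris$Species),
                  function(d) d[1:2, ])

# Explicit loop: assign each element's name into a new Species column.
looped <- my_list
for (nm in names(looped)) {
  # Same operation as `[<-`(looped[[nm]], "Species", value = nm)
  looped[[nm]]["Species"] <- nm
}

# The mapply() form from the solution above, for comparison.
via_mapply <- mapply(`[<-`, my_list, "Species", value = names(my_list),
                     SIMPLIFY = FALSE)

# Compare the two results.
all.equal(looped, via_mapply)
```
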

#### `tidyr::unnest()` extension

Hadley Wickham pegged this as a data tidying task and added experimental new functionality for `tidyr::unnest()`: <https://github.com/hadley/tidyr/issues/22>. I installed `tidyr` from [this commit](https://github.com/hadley/tidyr/commit/b44eeb66e683abc1f610b04962d00b3e91822f31) to try out the new list method.

```r
library(tidyr)
unnest_(my_list) # by default, unnest() just row binds
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          7.0         3.2          4.7         1.4
## 4          6.4         3.2          4.5         1.5
## 5          6.3         3.3          6.0         2.5
## 6          5.8         2.7          5.1         1.9
```

```r
unnest(my_list, Species) # but this creates the desired Species variable
```

```
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa          5.1         3.5          1.4         0.2
## 2     setosa          4.9         3.0          1.4         0.2
## 3 versicolor          7.0         3.2          4.7         1.4
## 4 versicolor          6.4         3.2          4.5         1.5
## 5  virginica          6.3         3.3          6.0         2.5
## 6  virginica          5.8         2.7          5.1         1.9
```

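#### `data.table.extras::rbindlistn()`

Kevin Ushey proposed the [`rbindlistn()` function](https://github.com/kevinushey/data.table.extras/blob/master/R/rbindlistn.R) from his [data.table.extras](https://github.com/kevinushey/data.table.extras) package. I installed `data.table.extras` from [this commit](https://github.com/kevinushey/data.table.extras/commit/99c27fe56ac8fe6abc0cf11fb87881a5f587ec42) to try it out.

```r
library(data.table.extras)
```

```
## Loading required package: data.table
```

```r
rbindlistn(my_list, "Species")
```

```
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1:          5.1         3.5          1.4         0.2     setosa
## 2:          4.9         3.0          1.4         0.2     setosa
## 3:          7.0         3.2          4.7         1.4 versicolor
## 4:          6.4         3.2          4.5         1.5 versicolor
## 5:          6.3         3.3          6.0         2.5  virginica
## 6:          5.8         2.7          5.1         1.9  virginica
```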
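
For comparison, newer releases of data.table give `rbindlist()` an `idcol` argument that covers the same ground without an extra package. A hedged sketch, assuming a data.table release newer than the 1.9.3 used in this session; the `requireNamespace()` guard and the `bound` name are additions for illustration:

```r
# Rebuild the example list so this block is self-contained.
my_list <- lapply(split(subset(iris, select = -Species), iris$Species),
                  function(d) d[1:2, ])

if (requireNamespace("data.table", quietly = TRUE)) {
  # idcol names a new column that holds each row's originating list name.
  bound <- data.table::rbindlist(my_list, idcol = "Species")
  print(bound)
} else {
  message("data.table is not installed; skipping this sketch")
}
```
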

#### `data.table::rbindlist` + `vapply()`

[Arun Srinivasan](https://twitter.com/arun_sriniv) also proposed [a `data.table` solution](https://twitter.com/arun_sriniv/statuses/503010269048872960):

```r
library(data.table)
rbindlist(my_list)[, Species := rep(names(my_list), vapply(my_list, nrow, 0L))][]
```

```
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1:          5.1         3.5          1.4         0.2     setosa
## 2:          4.9         3.0          1.4         0.2     setosa
## 3:          7.0         3.2          4.7         1.4 versicolor
## 4:          6.4         3.2          4.5         1.5 versicolor
## 5:          6.3         3.3          6.0         2.5  virginica
## 6:          5.8         2.7          5.1         1.9  virginica
```

~~*Oddly, the above does not print to Console for me when run interactively, but, lo, here it is after I `render()` this. Not sure what to make of that.*~~ Resolved.

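The `rep()` + `vapply()` piece is what lines the names up with the rows: each list name is repeated once per row of its data.frame, so it also works when the pieces have different sizes. A base R sketch of just that step, using made-up unequal sizes (`sizes`, `pieces`, `key`, and `bound` are illustrative names, and `do.call("rbind", ...)` stands in for `rbindlist()`):

```r
# Hypothetical unequal group sizes, to show the key is built per row.
sizes <- c(setosa = 2L, versicolor = 3L, virginica = 1L)
pieces <- Map(function(d, n) d[seq_len(n), ],
              split(subset(iris, select = -Species), iris$Species),
              sizes)

# Number of rows in each piece, as an integer vector.
n_rows <- vapply(pieces, nrow, 0L)

# Repeat each name once per row of the corresponding piece.
key <- rep(names(pieces), times = n_rows)

bound <- do.call("rbind", pieces)
bound$Species <- key
bound
```
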
#### Python-inspired `enumerate()`

Kevin Ushey also [shared](https://gist.github.com/kevinushey/7538142b5e16dd3b7200) a Python-inspired `enumerate()` function, *but I still need to apply it to my problem. At this point, seems superseded by his other solution.*

```r
enumerate <- function(X, FUN, ...) {
  result <- vector("list", length(X))
  for (i in seq_along(result)) {
    tmp <- FUN(X[[i]], i, ...)
    if (is.null(tmp))
      result[i] <- list(NULL)
    else
      result[[i]] <- tmp
  }
  result
}

l <- list(a = 1, b = 2, c = 3)
enumerate(l, function(x, i) {
  cat("Name:  ", names(l)[[i]], "\n")
  cat("Value: ", x, "\n")
  invisible()
})
```

```
## Name:   a 
## Value:  1 
## Name:   b 
## Value:  2 
## Name:   c 
## Value:  3
```

```
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
```
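
One way `enumerate()` might be applied to the row-binding problem, offered as a sketch rather than as Kevin Ushey's intended usage: the index `i` is used to look up each element's name, which is stored as a column before binding (`pieces` and `bound` are illustrative names):

```r
# enumerate() as defined above, repeated so this block runs on its own.
enumerate <- function(X, FUN, ...) {
  result <- vector("list", length(X))
  for (i in seq_along(result)) {
    tmp <- FUN(X[[i]], i, ...)
    if (is.null(tmp))
      result[i] <- list(NULL)
    else
      result[[i]] <- tmp
  }
  result
}

# Rebuild the example list: first two rows per species, Species column dropped.
my_list <- lapply(split(subset(iris, select = -Species), iris$Species),
                  function(d) d[1:2, ])

# Attach each element's name as a Species column, using the index i.
pieces <- enumerate(my_list, function(x, i) {
  x$Species <- names(my_list)[[i]]
  x
})

bound <- do.call("rbind", pieces)
bound
```
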

#### sessionInfo()

```r
sessionInfo()
```

```
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table.extras_1.0 data.table_1.9.3      tidyr_0.1.0.9000     
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   digest_0.6.4     dplyr_0.2.0.99   evaluate_0.5.5  
##  [5] formatR_0.10     htmltools_0.2.4  knitr_1.6        magrittr_1.0.1  
##  [9] parallel_3.1.0   plyr_1.8.1       Rcpp_0.11.1      reshape2_1.4    
## [13] rmarkdown_0.2.64 stringr_0.6.2    tools_3.1.0      yaml_2.1.13
```


## 2014-08-22_rbind-and-store-as-var.Rmd
---
title: "Row bind a list of data.frames with a key"
author: "Jenny Bryan"
date: "22 August, 2014"
output:
  html_document:
    keep_md: TRUE
---

I posed a question on Twitter (click to see the figure!):

<blockquote class="twitter-tweet" lang="en"><p>What is most elegant (d)plyr-ish way to do? <a href="https://twitter.com/hashtag/dplyr?src=hash">#dplyr</a> <a href="https://twitter.com/hadleywickham">@hadleywickham</a> <a href="http://t.co/kXsfkq7Rkq">pic.twitter.com/kXsfkq7Rkq</a></p>&mdash; Jennifer Bryan (@JennyBryan) <a href="https://twitter.com/JennyBryan/statuses/502864414266363904">August 22, 2014</a></blockquote>

that boiled down to this:  you have a list of data.frames and the element names convey information. You want to row bind them together and, in the new data.frame, you want a variable for the list element each observation originated in.

I got back an embarrassment of riches, which I'll record here.

#### Problem example

First, make an appropriate list of data.frames from the `iris` data. Note the `Species` information is carried only in the list names.

```{r}
(my_list <-
   lapply(split(subset(iris, select = -Species), iris$Species), "[", 1:2, ))
```

Row binding with existing `rbind()`-type functions cannot recover `Species`.

```{r collapse = TRUE}
do.call("rbind", my_list) # rownames have never looked so good ...
dplyr::rbind_all(my_list)
```

#### `dplyr::rbind_all()` + `mapply()`

[Kara Woo](https://twitter.com/kara_woo) provided [this solution](https://twitter.com/kara_woo/statuses/502867132049145858):

```{r}
my_list2 <-
  mapply(`[<-`, my_list, 'Species', value = names(my_list), SIMPLIFY = FALSE)
dplyr::rbind_all(my_list2)
```

#### `tidyr::unnest()` extension

Hadley Wickham pegged this as a data tidying task and added experimental new functionality for `tidyr::unnest()`: <https://github.com/hadley/tidyr/issues/22>. I installed `tidyr` from [this commit](https://github.com/hadley/tidyr/commit/b44eeb66e683abc1f610b04962d00b3e91822f31) to try out the new list method.

```{r collapse = FALSE}
library(tidyr)
unnest_(my_list) # by default, unnest() just row binds
unnest(my_list, Species) # but this creates the desired Species variable
```

#### `data.table.extras::rbindlistn()`

Kevin Ushey proposed the [`rbindlistn()` function](https://github.com/kevinushey/data.table.extras/blob/master/R/rbindlistn.R) from his [data.table.extras](https://github.com/kevinushey/data.table.extras) package. I installed `data.table.extras` from [this commit](https://github.com/kevinushey/data.table.extras/commit/99c27fe56ac8fe6abc0cf11fb87881a5f587ec42) to try it out.

```{r}
library(data.table.extras)
rbindlistn(my_list, "Species")
```

#### `data.table::rbindlist` + `vapply()`

[Arun Srinivasan](https://twitter.com/arun_sriniv) also proposed [a `data.table` solution](https://twitter.com/arun_sriniv/statuses/503010269048872960):

```{r}
library(data.table)
rbindlist(my_list)[, Species := rep(names(my_list), vapply(my_list, nrow, 0L))][]
```

~~*Oddly, the above does not print to Console for me when run interactively, but, lo, here it is after I `render()` this. Not sure what to make of that.*~~ Resolved.

#### Python-inspired `enumerate()`

Kevin Ushey also [shared](https://gist.github.com/kevinushey/7538142b5e16dd3b7200) a Python-inspired `enumerate()` function, *but I still need to apply it to my problem. At this point, seems superseded by his other solution.*

```{r}
enumerate <- function(X, FUN, ...) {
  result <- vector("list", length(X))
  for (i in seq_along(result)) {
    tmp <- FUN(X[[i]], i, ...)
    if (is.null(tmp))
      result[i] <- list(NULL)
    else
      result[[i]] <- tmp
  }
  result
}

l <- list(a = 1, b = 2, c = 3)
enumerate(l, function(x, i) {
  cat("Name:  ", names(l)[[i]], "\n")
  cat("Value: ", x, "\n")
  invisible()
})
```

#### sessionInfo()

```{r}
sessionInfo()
```