r-purrr-named-lists-of-dataframes-to-single-dataframe
# Ref: https://jennybc.github.io/purrr-tutorial/ls02_map-extraction-advanced.html#list_inside_a_data_frame
# Ref: https://github.com/tidyverse/tidyr/issues/22
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr))
# Make iris dataset into list of data frames split by `Species`.
my_list <- split(iris, iris$Species)
str(my_list)
#> List of 3
#> $ setosa :'data.frame': 50 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ versicolor:'data.frame': 50 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
#> ..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
#> ..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
#> ..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ virginica :'data.frame': 50 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
#> ..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
#> ..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
#> ..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
# I'd like to compute a `fivenum()` on each of the numeric columns.
# First, `fivenum()` gives a vector of 5 numbers as output, but
# doesn't label them. The documentation says what they are, but I'd
# like them in the output as well. Here we make a unary function
# that will compute the fivenum function, but the output structure is
# a data frame with useful column names.
fivenum2df <- . %>%
  fivenum() %>%
  set_names(c("min", "lower_hinge", "median", "upper_hinge", "max")) %>%
  as.list() %>%
  as_tibble()  # as_data_frame() is deprecated in favor of as_tibble()
# Here we do a nested map. The inner map (`map(fivenum2df)`) walks over the supplied
# list of columns and computes the fivenum2df function on them, creating a list of
# data frames.
#
# The outer map walks all of the different source data frames (one per Species)
# and feeds that data frame to the anonymous function in the outer map. That
# anonymous function looks at each data frame, keeps only the numeric columns,
# feeds it to the inner map which computes a list of fivenum data frames,
# and then recombines it into a single data frame with `bind_rows()`, with an
# added column called `col_nm`.
#
# Finally, each of the previous data frames (one per Species) is assembled into
# a single data frame, with a column called 'species_nm' to record which species
# it came from.
dat_fivenum <- my_list %>%
  map( ~ .x %>%
         keep(is.numeric) %>%
         map(fivenum2df) %>%
         bind_rows(.id = 'col_nm')) %>%
  bind_rows(.id = 'species_nm')
dat_fivenum
#> # A tibble: 12 x 7
#>    species_nm col_nm         min lower_hinge median upper_hinge   max
#>    <chr>      <chr>        <dbl>       <dbl>  <dbl>       <dbl> <dbl>
#>  1 setosa     Sepal.Length  4.30        4.80   5.00        5.20  5.80
#>  2 setosa     Sepal.Width   2.30        3.20   3.40        3.70  4.40
#>  3 setosa     Petal.Length  1.00        1.40   1.50        1.60  1.90
#>  4 setosa     Petal.Width  0.100       0.200  0.200       0.300 0.600
#>  5 versicolor Sepal.Length  4.90        5.60   5.90        6.30  7.00
#>  6 versicolor Sepal.Width   2.00        2.50   2.80        3.00  3.40
#>  7 versicolor Petal.Length  3.00        4.00   4.35        4.60  5.10
#>  8 versicolor Petal.Width   1.00        1.20   1.30        1.50  1.80
#>  9 virginica  Sepal.Length  4.90        6.20   6.50        6.90  7.90
#> 10 virginica  Sepal.Width   2.20        2.80   3.00        3.20  3.80
#> 11 virginica  Petal.Length  4.50        5.10   5.55        5.90  6.90
#> 12 virginica  Petal.Width   1.40        1.80   2.00        2.30  2.50

mpettis commented Apr 23, 2018

Jenny said on Twitter that, to paraphrase, she likes to be more disciplined about staying within the data frame. Here is my argument for why she'd say that; perhaps she'll comment on whether it makes sense.

My original method makes a named list of data frames that gets aggregated with bind_rows() into a single data frame. This pulls a data frame apart into the named list, and then puts it back together. Jenny's approach keeps the data frames intact, but changes their shape to get the structures she needs. So, rather than pulling the data apart and then stitching it back together, she nests the data first, then expands it back to the output she wants.
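
As a rough illustration of the nest-first pattern (a sketch only, not necessarily the code Jenny posted), the same fivenum table can be built without ever leaving the data frame. It assumes tidyr 1.0.0 or later for `pivot_longer()` and `nest(data = ...)`; the names `fivenum_names` and `dat_nested` are made up for this example.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(purrr))

# Labels for the five values returned by fivenum() (illustrative name).
fivenum_names <- c("min", "lower_hinge", "median", "upper_hinge", "max")

dat_nested <- iris %>%
  # One row per (Species, measurement column, value).
  pivot_longer(-Species, names_to = "col_nm", values_to = "value") %>%
  # Compact: one row per (Species, col_nm), values tucked into a list-column.
  nest(data = value) %>%
  # Compute a one-row tibble of labelled fivenum() values per group.
  mutate(stats = map(data, ~ as_tibble(as.list(
    set_names(fivenum(.x$value), fivenum_names)
  )))) %>%
  select(-data) %>%
  # Expand: spread the one-row tibbles back into ordinary columns.
  unnest(cols = stats)

dat_nested
```

Here `Species` and `col_nm` stay ordinary columns throughout, so nothing has to be recovered from list names with `.id`.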

These look like two equivalent approaches that just rearrange the order of operations (expanding then compacting). But it is probably a better strategy to compact (nest) first and then expand (unnest or extract). The reason I see is that when you initially expand the data into a named list of data frames, your keys can come from only one column of data. If you need two, you will need a nested list of depth 2, with each hierarchical level corresponding to a column you want to operate on. That can make it hard to keep the other columns of the data frame intact through the whole process, and it is hard to reason about highly nested structures, probably more so for key nesting than for value nesting. When you nest first, you can wrap your arbitrary objects inside of lists, and then when you need to pull out features of those wrapped objects, you can map over them and extract the pieces you need for a tabular structure. And, memory permitting, you can carry along those arbitrary objects in a list-column and extract pieces you may need at further processing steps with similar code patterns.
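
A small sketch of the two-key situation, using the built-in `mtcars` data as a stand-in (my choice, not part of the original example) and assuming tidyr 1.0.0 or later. Base `split()` can avoid a depth-2 list by splitting on an interaction of both columns, but then the two keys are fused into a single string name; nesting keeps both keys as typed columns.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Split-first: the two keys are fused into one string name per piece,
# e.g. "4.3" for cyl == 4 and gear == 3. Empty combinations are kept too.
by_two <- split(mtcars, list(mtcars$cyl, mtcars$gear))
names(by_two)

# Nest-first: cyl and gear remain ordinary numeric columns, and the
# remaining columns ride along in the list-column `data`.
nested <- as_tibble(mtcars) %>%
  nest(data = -c(cyl, gear))
nested
```

With the nested version, adding a third key is just one more column name in `-c(...)`, rather than another level of list nesting or another separator to parse out of the names.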

In short, though they seem similar, 'nest-then-unnest' or 'nest-then-extract' is likely the more robust pattern for most use cases.

jennybc commented Apr 23, 2018

Yes! I never felt very comfortable with the "map within map" paradigm. I don't love spread()ing either, but it seems more natural.

And yes, I'm developing pretty strong opinions that "nest >> split" for the reasons you say about the handling of the nest-ing or group-ing (or split-ting) variables. I explore some of this in a concrete example here: https://github.com/jennybc/row-oriented-workflows/blob/master/ex08_nesting-is-good.md. That shows how damaging it is for a factor variable to transit through list and row names, before being restored to the data.
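
One small way to see the effect on a factor (a minimal sketch, not the linked example itself; `pieces` and `roundtrip` are illustrative names): send `Species` through list names with `split()` and bring it back with `bind_rows(.id = ...)`.

```r
suppressPackageStartupMessages(library(dplyr))

# Split on the factor, then drop the column so it survives only as list names.
my_list <- split(iris, iris$Species)
pieces <- lapply(my_list, function(d) d[setdiff(names(d), "Species")])

# Restore the key from the list names.
roundtrip <- bind_rows(pieces, .id = "Species")

class(iris$Species)        # the original column is a factor
class(roundtrip$Species)   # the restored column is plain character
levels(roundtrip$Species)  # the level information is gone
```

The values come back, but the factor type and its level order have to be rebuilt by hand.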
