r-purrr-named-lists-of-dataframes-to-single-dataframe
# Ref: https://jennybc.github.io/purrr-tutorial/ls02_map-extraction-advanced.html#list_inside_a_data_frame
# Ref: https://github.com/tidyverse/tidyr/issues/22
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr))
# Make iris dataset into list of data frames split by `Species`.
my_list <- split(iris, iris$Species)
str(my_list)
#> List of 3
#> $ setosa :'data.frame': 50 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ versicolor:'data.frame': 50 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
#> ..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
#> ..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
#> ..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ virginica :'data.frame': 50 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
#> ..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
#> ..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
#> ..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
# I'd like to compute a `fivenum()` on each of the numeric columns.
# First, `fivenum()` gives a vector of 5 numbers as output, but
# doesn't label them. The documentation says what they are, but I'd
# like them in the output as well. Here we make a unary function
# that will compute the fivenum function, but the output structure is
# a data frame with useful column names.
fivenum2df <- . %>%
  fivenum() %>%
  set_names(c("min", "lower_hinge", "median", "upper_hinge", "max")) %>%
  as.list() %>%
  as_tibble()  # as_data_frame() is deprecated; as_tibble() gives the same one-row tibble
# Here we do a nested map. The inner map (`map(fivenum2df)`) walks over the supplied
# list of columns and computes the fivenum2df function on them, creating a list of
# data frames.
#
# The outer map walks all of the different source data frames (one per Species)
# and feeds that data frame to the anonymous function in the outer map. That
# anonymous function looks at each data frame, keeps only the numeric columns,
# feeds it to the inner map which computes a list of fivenum data frames,
# and then recombines it into a single data frame with `bind_rows()`, with an
# added column called 'col_nm'.
#
# Finally, each of the previous data frames (one per Species) is assembled into
# a single data frame, with a column called 'species_nm' to record which species
# it came from.
dat_fivenum <- my_list %>%
  map(~ .x %>%
    keep(is.numeric) %>%
    map(fivenum2df) %>%
    bind_rows(.id = "col_nm")) %>%
  bind_rows(.id = "species_nm")
dat_fivenum
#> # A tibble: 12 x 7
#> species_nm col_nm min lower_hinge median upper_hinge max
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa Sepal.Length 4.30 4.80 5.00 5.20 5.80
#> 2 setosa Sepal.Width 2.30 3.20 3.40 3.70 4.40
#> 3 setosa Petal.Length 1.00 1.40 1.50 1.60 1.90
#> 4 setosa Petal.Width 0.100 0.200 0.200 0.300 0.600
#> 5 versicolor Sepal.Length 4.90 5.60 5.90 6.30 7.00
#> 6 versicolor Sepal.Width 2.00 2.50 2.80 3.00 3.40
#> 7 versicolor Petal.Length 3.00 4.00 4.35 4.60 5.10
#> 8 versicolor Petal.Width 1.00 1.20 1.30 1.50 1.80
#> 9 virginica Sepal.Length 4.90 6.20 6.50 6.90 7.90
#> 10 virginica Sepal.Width 2.20 2.80 3.00 3.20 3.80
#> 11 virginica Petal.Length 4.50 5.10 5.55 5.90 6.90
#> 12 virginica Petal.Width 1.40 1.80 2.00 2.30 2.50

jennybc commented Apr 22, 2018

Nice! I like this example. There's lots of different ways to approach it. I've had this browser tab open for several days but just now got around to playing with it. I think I've started using different patterns since I wrote part of that purrr tutorial. Here are more alternative takes:

library(tidyverse)

nms <- c("min", "lower_hinge", "median", "upper_hinge", "max")

## Option 1
fivenum3 <- . %>%
  fivenum() %>% 
  set_names(nms) %>% 
  enframe() %>% 
  mutate(name = factor(name, levels = nms)) %>%
  list() ## needed to use this in summarize()

iris %>% 
  group_by(Species) %>% 
  summarize_if(is.numeric, fivenum3) %>% 
  gather(key = "variable", value = "value", -Species) %>% 
  unnest() %>% 
  spread(key = name, value = value)
#> # A tibble: 12 x 7
#>    Species    variable       min lower_hinge median upper_hinge   max
#>    <fct>      <chr>        <dbl>       <dbl>  <dbl>       <dbl> <dbl>
#>  1 setosa     Petal.Length 1.00        1.40   1.50        1.60  1.90 
#>  2 setosa     Petal.Width  0.100       0.200  0.200       0.300 0.600
#>  3 setosa     Sepal.Length 4.30        4.80   5.00        5.20  5.80 
#>  4 setosa     Sepal.Width  2.30        3.20   3.40        3.70  4.40 
#>  5 versicolor Petal.Length 3.00        4.00   4.35        4.60  5.10 
#>  6 versicolor Petal.Width  1.00        1.20   1.30        1.50  1.80 
#>  7 versicolor Sepal.Length 4.90        5.60   5.90        6.30  7.00 
#>  8 versicolor Sepal.Width  2.00        2.50   2.80        3.00  3.40 
#>  9 virginica  Petal.Length 4.50        5.10   5.55        5.90  6.90 
#> 10 virginica  Petal.Width  1.40        1.80   2.00        2.30  2.50 
#> 11 virginica  Sepal.Length 4.90        6.20   6.50        6.90  7.90 
#> 12 virginica  Sepal.Width  2.20        2.80   3.00        3.20  3.80
  

## Option 2
iris %>% 
  group_by(Species) %>% 
  summarize_if(is.numeric, compose(list, fivenum)) %>% 
  gather(key = "variable", value = "fivenum", -Species) %>% 
  mutate(fivenum = map(fivenum, ~ tibble(name = nms, value = .x))) %>% 
  unnest() %>% 
  mutate(name = factor(name, levels = nms)) %>% 
  spread(key = name, value = value)
#> # A tibble: 12 x 7
#>    Species    variable       min lower_hinge median upper_hinge   max
#>    <fct>      <chr>        <dbl>       <dbl>  <dbl>       <dbl> <dbl>
#>  1 setosa     Petal.Length 1.00        1.40   1.50        1.60  1.90 
#>  2 setosa     Petal.Width  0.100       0.200  0.200       0.300 0.600
#>  3 setosa     Sepal.Length 4.30        4.80   5.00        5.20  5.80 
#>  4 setosa     Sepal.Width  2.30        3.20   3.40        3.70  4.40 
#>  5 versicolor Petal.Length 3.00        4.00   4.35        4.60  5.10 
#>  6 versicolor Petal.Width  1.00        1.20   1.30        1.50  1.80 
#>  7 versicolor Sepal.Length 4.90        5.60   5.90        6.30  7.00 
#>  8 versicolor Sepal.Width  2.00        2.50   2.80        3.00  3.40 
#>  9 virginica  Petal.Length 4.50        5.10   5.55        5.90  6.90 
#> 10 virginica  Petal.Width  1.40        1.80   2.00        2.30  2.50 
#> 11 virginica  Sepal.Length 4.90        6.20   6.50        6.90  7.90 
#> 12 virginica  Sepal.Width  2.20        2.80   3.00        3.20  3.80

Created on 2018-04-22 by the reprex package (v0.2.0).

PS I think, if you rename your gist with .md extension (or strip the leading and trailing backticks and give .R extension), it might render in a prettier way.

mpettis (gist author) commented Apr 23, 2018

I am going to leave my original code as it is, so that the path from the original version to the newer, better patterns is explicitly visible in the comments.

I think I see the point of the newer patterns here. They move from a nesting of two map calls, which I think is harder to reason about as a coder, to handling that logic through the 'gather-spread' pattern that replaces it. I'm wondering if there is an explicit name for the 'gather-spread' pattern that replaces the 'nested map' pattern I started with. Or if 'pattern' is even the right word to be using...
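For later readers, here is a rough sketch of the same reshape-summarise-reshape idea written with `pivot_longer()` and `unnest_wider()`, which assumes tidyr 1.0.0 or later (where `gather()` and `spread()` are superseded) and dplyr 1.0.0 or later for `.groups`. It is not code from this thread; the object name `iris_fivenum` and the column name `stats` are made up for illustration.

```r
library(dplyr)
library(tidyr)

nms <- c("min", "lower_hinge", "median", "upper_hinge", "max")

iris_fivenum <- iris %>%
  # Long form: one row per (Species, variable, value).
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>%
  # Wrap the named length-5 result in list() so it fits in one cell.
  summarise(stats = list(setNames(fivenum(value), nms)), .groups = "drop") %>%
  # Spread the named vector in each cell into one column per name.
  unnest_wider(stats)

iris_fivenum
```

The grouping variable `Species` stays a column the whole way through, so there is no nested `map()` and no `.id` bookkeeping.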


mpettis commented Apr 23, 2018

... for posterity, directed mostly at myself...

I'm thinking that one of the key concepts in this type of processing is that non-scalar objects (vectors of length greater than 1, other lists, arbitrary objects) need to be wrapped as the value of a one-element list in order to be stored and processed as 'cells' of a data frame. That means complex things have an additional 'list' layer wrapping them, and need to be 'unwrapped' and accessed in the right way (such as by map). Tracking this level of wrapping, and how it gets used within lists and data frames, is a key concept that, once internalized, makes chains of these map and summarise_... statements read much more like English sentences.
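A small sketch of that wrapping and unwrapping, assuming the tibble and purrr packages. The names `cells`, `stats`, and `median` are only illustrative.

```r
library(tibble)
library(purrr)

# fivenum() returns a length-5 numeric vector, which cannot sit in a single
# cell of an ordinary column. Putting the vectors in a list makes `stats` a
# list-column with one wrapped vector per row.
cells <- tibble(
  variable = c("Sepal.Length", "Petal.Length"),
  stats    = list(fivenum(iris$Sepal.Length), fivenum(iris$Petal.Length))
)

# map_dbl() reaches through the list layer and pulls one number out of each
# wrapped vector (position 3 of fivenum() output is the median).
cells$median <- map_dbl(cells$stats, 3)

cells
```

The `list()` layer is what lets each length-5 vector occupy one row; `map_dbl()` is the matching step that removes that layer again.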


mpettis commented Apr 23, 2018

To try to put some narrative around Jenny's statement on Twitter that, to paraphrase, she likes to be more disciplined about staying within the data frame... I have an argument as to why she'd say that; perhaps she'll comment on whether it makes sense.

My original method makes named lists of data frames that get aggregated with bind_rows() into a single data frame. This pulls a data frame apart into the named list, and then puts it back together. Jenny's approach keeps the dataframes intact, but changes their shape to get structures she needs. So, in opposition to pulling apart and then stitching together the data, she nests the data first, then expands it back to the output she wants.

These seem like two equivalent approaches that just rearrange the order of operations (expanding then compacting, versus compacting then expanding). But it is probably a better strategy to compact (nest) first and then expand (unnest or extract). The reason I see is that when you initially expand the data into a named list of data frames, the list names can only encode one column of data. If you need two key columns, you need a nested list of depth 2, with each hierarchical level corresponding to a column you want to operate on. That can make it hard to keep the other columns of the data frame intact through the whole process, and it is hard to reason about highly nested structures, probably more so for key nesting than for value nesting. When you nest first, you can wrap your arbitrary objects inside lists, and then when you need to pull out features of those wrapped objects, you can map over them and extract the pieces you need for a tabular structure. And, memory permitting, you can carry those arbitrary objects along in a list-column and extract pieces you may need at later processing steps with similar code patterns.

In short, though they seem similar, 'nest-then-unnest' or 'nest-then-extract' is likely the more robust pattern for most use cases.
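A minimal sketch of 'nest-then-extract', assuming dplyr, tidyr, and purrr. The new column names `n` and `sepal_median` are hypothetical, chosen only to show pulling pieces out of the list-column.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Compact first: one row per Species, with the remaining columns wrapped
# into a list-column of data frames called `data`.
nested <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup()

# Then extract: map over the list-column and pull out the pieces needed.
# The `data` column is carried along and stays available for later steps.
result <- nested %>%
  mutate(
    n            = map_int(data, nrow),
    sepal_median = map_dbl(data, ~ median(.x$Sepal.Length))
  )

result
```

A second key column would just be another ordinary column next to `Species`, rather than a second level of list nesting.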


jennybc commented Apr 23, 2018

Yes! I never felt very comfortable with the "map within map" paradigm. I don't love spread()ing either, but it seems more natural.

And yes, I'm developing pretty strong opinions that "nest >> split", for the reasons you give about the handling of the nest-ing or group-ing (or split-ting) variables. I explore some of this in a concrete example here: https://github.com/jennybc/row-oriented-workflows/blob/master/ex08_nesting-is-good.md. That shows how damaging it is for a factor variable to transit through list and row names before being restored to the data.
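A quick way to see the factor problem with iris (a sketch, not the example from the linked document; `via_split` and `via_nest` are illustrative names):

```r
library(dplyr)
library(tidyr)

# split(): Species becomes the list names, and bind_rows(.id = ) restores
# it from those names as a character column.
via_split <- split(iris, iris$Species) %>%
  lapply(function(d) d[c("Sepal.Length", "Sepal.Width")]) %>%
  bind_rows(.id = "Species")
class(via_split$Species)  # "character"

# nest(): Species never leaves the data frame, so it stays a factor.
via_nest <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  unnest(data)
class(via_nest$Species)   # "factor"
```

The same thing is visible in the outputs above: `species_nm` is `<chr>` in the split-based result, while `Species` is `<fct>` in the grouped versions.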
