@mpettis
Last active October 12, 2016 16:21
plyr example reference, compared to base functions; no dplyr yet.

Introduction

This is a quick and dirty set of demonstrative examples for plyr and dplyr.

Setup

knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=9, warning=FALSE)
options(width=116)

suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))

plyr

First, both plyr and dplyr are toolsets that help enact the split-apply-combine strategy for data manipulation. For that background, see the plyr paper: http://vita.had.co.nz/papers/plyr.pdf

The nub of this is that we usually have a chunk of data that is naturally broken into sub-chunks that we want to apply a given function or manipulation to, each independently on each chunk, and then combine those results back into another aggregated chunk, possibly with a different shape.
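As a minimal base R sketch of that idea (the toy data frame `toy` and its columns `grp` and `val` are made up for illustration and are not part of the examples that follow): split the rows by a grouping column, apply a function to each piece, and combine the per-piece results.

```r
# Hypothetical toy data: two groups with a few values each
toy <- data.frame(grp = c("x", "x", "y", "y", "y"),
                  val = c(1, 3, 10, 20, 30))

# Split: one data frame per value of grp
pieces <- split(toy, toy$grp)

# Apply: compute a one-row summary for each piece
summaries <- lapply(pieces, function(df) {
  data.frame(grp = df$grp[1], mean_val = mean(df$val))
})

# Combine: stack the per-piece summaries into one data frame
combined <- do.call(rbind, summaries)
combined
```

The result has one row per group, which is the "possibly different shape" mentioned above: five input rows become two output rows.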

Now, R has native functions to deal with this (the *apply family of functions, the by function, and some others), but they are awkward in a lot of cases, and there really isn't a base function that works as nicely on data frames as the by statement does in SAS.

plyr was the first popular package that took this split-apply-combine idea and built a set of consistently styled functions for this generic problem. plyr has since mostly faded from use in favor of dplyr as the toolset for split-apply-combine data manipulation, due partly to speed and partly to dplyr's different, more functional-style API.

Some resources for plyr:

A quick map

Part of the uniform API of plyr is that it uses a two-letter prefix on the function names to specify what goes in and what comes out. There are three letters and one non-letter character. The letters are a, d, and l: a stands for array, d for dataframe, and l for list. The first letter of the function name is the data structure that is input, and the second is the data structure that is output. The _ character is also used, but only as the second character; it stands for 'no output', meaning that no data structure is returned.

For example:

adply will accept an array and output a dataframe.

ddply will accept a dataframe and return a dataframe.

l_ply will take a list and return nothing (presumably, the function is called just for its side effects, like printing to the console).

The paper cited above (http://vita.had.co.nz/papers/plyr.pdf) has a table showing all of the possible plyr functions this maps to.
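As a rough illustration of the naming scheme (this only enumerates the letter combinations described above using base R; it is not a substitute for the table in the paper), the grid of names can be generated like so:

```r
# Input and output codes as described above; "_" is only valid as output
inputs  <- c(a = "array", d = "data frame", l = "list")
outputs <- c(a = "array", d = "data frame", l = "list", "_" = "nothing")

# Rows are input types, columns are output types
grid <- outer(names(inputs), names(outputs),
              function(i, o) paste0(i, o, "ply"))
dimnames(grid) <- list(input = unname(inputs), output = unname(outputs))
grid
```

Reading across a row gives every function that accepts that input type; reading down a column gives every function that returns that output type.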

We will give examples below.

a.ply functions

Let's start with a simple array:

  # Make a vector, and give it names
v <- setNames(1:3, letters[1:3])
v

## a b c 
## 1 2 3

str(v)

##  Named int [1:3] 1 2 3
##  - attr(*, "names")= chr [1:3] "a" "b" "c"

a_ply

Let's say we want to print out each individual element of the vector:

a_ply(v, 1, print)

## a 
## 1 
## b 
## 2 
## c 
## 3

Observations:

  • That's ugly. You get the name on one line, value on following line. But it illustrates what the purpose of the function is.

  • The first argument to a_ply is the array that will be operated on. Arrays can be multidimensional, and we'll leverage that in a later example.

  • The second argument gives the margin of the array to operate on. 1 here indicates rows (and 1-d vectors, like v here, are considered row vectors, so it operates on each entry).

  • The last argument is a function that executes once for each value of the array v fed to it -- one element at a time.

aaply

Let's do something staggeringly simple to the array, like doubling each entry:

  ## Apply a doubling to each element of the vector
res <- aaply(v, 1, function(e) {2 * e})
res

## a b c 
## 2 4 6

str(res)

##  Named num [1:3] 2 4 6
##  - attr(*, "names")= chr [1:3] "a" "b" "c"

Now, this could have been simply done by vectorization:

2 * v

## a b c 
## 2 4 6

but the plyr way lays the groundwork for doing more complex things on a per-element basis. Some things to point out on the aaply method:

  • The last argument is a function. Here it is an inline function definition, function(e) {2 * e}. That is the way I mostly use the plyr functions, because it is the most flexible.

  • The function used as the last argument must take one argument, and that will be set in turn to each value from v for each invocation of the function. The function results are then aggregated into a single array and put into res.

  • The output is an array, with names coming from the element names of v.
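For comparison, a base R sketch of the same per-element pattern can use vapply, which also keeps the names and additionally checks that each call returns a single number. This is offered as an alternative, not as a description of how aaply works internally.

```r
v <- setNames(1:3, letters[1:3])

# Call the function once per element; numeric(1) declares the expected
# shape of each result (one number), and an error is raised otherwise
res <- vapply(v, function(e) {2 * e}, numeric(1))
res
```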

alply

We could also collect the results in a list with alply:

res <- alply(v, 1, function(e) {2 * e})
res

## $`1`
## a 
## 2 
## 
## $`2`
## b 
## 4 
## 
## $`3`
## c 
## 6 
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  a
## 2  b
## 3  c

str(res)

## List of 3
##  $ 1: Named num 2
##   ..- attr(*, "names")= chr "a"
##  $ 2: Named num 4
##   ..- attr(*, "names")= chr "b"
##  $ 3: Named num 6
##   ..- attr(*, "names")= chr "c"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "a","b","c": 1 2 3

This structure is a little more complex. Note that:

  • This returns a list, with the output list keys being numeric -- not the names of the array entries, which is too bad.
  • There is a split_labels attribute that tracks the names of the input array entries.
  • There is a split_type attribute that says that the input was an array.

We can set the names based on the attribute values on the output object explicitly:

names(res) <- attr(res, "split_labels")[[1]]
res

## $a
## a 
## 2 
## 
## $b
## b 
## 4 
## 
## $c
## c 
## 6 
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  a
## 2  b
## 3  c

str(res)

## List of 3
##  $ a: Named num 2
##   ..- attr(*, "names")= chr "a"
##  $ b: Named num 4
##   ..- attr(*, "names")= chr "b"
##  $ c: Named num 6
##   ..- attr(*, "names")= chr "c"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "a","b","c": 1 2 3

This is still messier than we'd like. We can get it into a cleaner form as follows (using a base R function):

res <- lapply(res, unname)
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

In fact, you can also cast it back to an array (like the result of aaply) like so:

unlist(res)

## a b c 
## 2 4 6

This just helps to figure out how some of the functions work and can map between some data forms.

However, in this case, it may just be easier to use the base R functions, as they seem to conform more to what is desired:

res <- sapply(v, function(e) {2 * e}, simplify=FALSE)
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

adply

You can use adply as well when what you want output is a dataframe:

res <- adply(v, 1, function(e) {data.frame(out=2 * e, stringsAsFactors=FALSE)})
res

##   X1 out
## 1  a   2
## 2  b   4
## 3  c   6

str(res)

## 'data.frame':    3 obs. of  2 variables:
##  $ X1 : Factor w/ 3 levels "a","b","c": 1 2 3
##  $ out: num  2 4 6

Note that:

  • It returns a data frame, not just with the cols specified in the function, but also an additional column noting the array value that was used when computing.

  • The column that identifies the original entry of v (here, its name) is returned as a factor, not a string.

You can name that output column specifically, as I do below:

res <- adply(v, 1, .id="orig", function(e) {data.frame(out=2 * e, stringsAsFactors=FALSE)})
res

##   orig out
## 1    a   2
## 2    b   4
## 3    c   6

str(res)

## 'data.frame':    3 obs. of  2 variables:
##  $ orig: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ out : num  2 4 6

Also, this marking of the output column is really nice because you may return data frames with multiple rows per single input value. Like so:

res <- adply(v, 1, .id="orig", function(e) {data.frame(out=1:e, stringsAsFactors=FALSE)})
res

##   orig out
## 1    a   1
## 2    b   1
## 3    b   2
## 4    c   1
## 5    c   2
## 6    c   3

str(res)

## 'data.frame':    6 obs. of  2 variables:
##  $ orig: Factor w/ 3 levels "a","b","c": 1 2 2 3 3 3
##  $ out : int  1 1 2 1 2 3

This demonstrates the flexibility you get from supplying your own function as the last argument. The connection with split-apply-combine is that:

  • We split the incoming array by row (i.e., each v entry in this case).
  • We apply the function, which makes a dataframe for each value of v.
  • We combine all of the output dataframes into one dataframe, with an index field that tracks which value of v it is associated with.
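To make those three steps explicit, here is a base R sketch that reproduces the same shape of result by hand. It is an analogy for what adply is doing, not its actual implementation, and it stores the tracking column as a character column rather than a factor.

```r
v <- setNames(1:3, letters[1:3])

# Split: iterate over the entries of v by name
pieces <- lapply(names(v), function(k) {
  # Apply: build a data frame for this entry, tagged with its name
  data.frame(orig = k, out = seq_len(v[[k]]), stringsAsFactors = FALSE)
})

# Combine: stack the per-entry data frames into one
res <- do.call(rbind, pieces)
res
```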

A note of warning: factors have issues

v <- factor(c("a", "b", "c"))

  ## It doesn't like operating on raw factors
tryCatch({a_ply(v, 1, print)}, error=function(e) {print("ERROR"); print(e)})

## [1] "ERROR"
## <simpleError in splitter_a(.data, .margins, .expand): Invalid margin>

  ## will do ok if converted to characters
a_ply(as.character(v), 1, print)

## [1] "a"
## [1] "b"
## [1] "c"

Multi-dimensional arrays

Multi-dimensional arrays can be operated on by rows or columns:

mat <- matrix(1:9, nrow=3)
mat

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

  ## Sum rows of matrix
aaply(mat, 1, sum)

##  1  2  3 
## 12 15 18

  ## Sum columns of matrix
aaply(mat, 2, sum)

##  1  2  3 
##  6 15 24

Arrays can have more than the 2 dimensions that matrices have, and the second argument can be used to deal with that appropriately. In both of the cases above, what is passed to the function in the last argument is a vector.
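A small sketch of margins on a 3-d array, using base R's apply so that it runs without plyr. The array `arr` is made up for illustration; as far as I know, the margin argument of the a.ply functions accepts the same kind of vector, but treat that as an assumption to check against the plyr documentation.

```r
# A 2 x 3 x 4 array holding the numbers 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))

# Margin 3: the function receives each 2 x 3 slice, giving 4 sums
slice_sums <- apply(arr, 3, sum)
slice_sums

# Margins c(1, 2): the function receives the length-4 vector along the
# third dimension for each (row, column) pair, giving a 2 x 3 matrix
cell_sums <- apply(arr, c(1, 2), sum)
cell_sums
```

Note that with margin 3 the function is handed a matrix rather than a vector, so what the function receives depends on which dimensions are left over after splitting.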

l.ply functions

We shouldn't have to do this as exhaustively as we did for the a.ply functions. But there are some differences to note.

l_ply

Starting simple, we can make a list and print out its values:

lst <- list(a=1, b=2, c=3)
l_ply(lst, print)

## [1] 1
## [1] 2
## [1] 3

llply

Or return the doubled values as its own list:

res <- llply(lst, function(e) {2 * e})
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

Note:

  • It returns a list with the list keys being the input list names and values as the result of the function.

So that behaves well, a bit better than the named a.ply functions did for us.

As an aside, here lapply (base R) will behave just as well as llply:

res <- lapply(lst, function(e) {2 * e})
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

One thing that can be difficult: llply (and the base R functions) iterates over the list values and returns a list in which each output is stored under the key of the input it came from. However, inside the function supplied as the last llply argument, you have no native way of accessing the input list's key for the current value. This is sometimes problematic. If you really want to access the key inside the function, as well as the value, do the following:

res <- alply(names(lst), 1, function(k) {
  e <- lst[[k]]
  sprintf("list key k is %s, list value for that key is %s", k, e)
})
res

## $`1`
## [1] "list key k is a, list value for that key is 1"
## 
## $`2`
## [1] "list key k is b, list value for that key is 2"
## 
## $`3`
## [1] "list key k is c, list value for that key is 3"
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3

str(res)

## List of 3
##  $ 1: chr "list key k is a, list value for that key is 1"
##  $ 2: chr "list key k is b, list value for that key is 2"
##  $ 3: chr "list key k is c, list value for that key is 3"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "1","2","3": 1 2 3

Now, however, you have the problem where the output list keys aren't the input names you provided via names(lst). That sucks. But you can stick the names on at the end.

names(res) <- names(lst)
res

## $a
## [1] "list key k is a, list value for that key is 1"
## 
## $b
## [1] "list key k is b, list value for that key is 2"
## 
## $c
## [1] "list key k is c, list value for that key is 3"
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3

str(res)

## List of 3
##  $ a: chr "list key k is a, list value for that key is 1"
##  $ b: chr "list key k is b, list value for that key is 2"
##  $ c: chr "list key k is c, list value for that key is 3"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "1","2","3": 1 2 3

  ## llply doesn't work either, by the way.
res <- llply(names(lst), function(k) {
  e <- lst[[k]]
  sprintf("list key k is %s, list value for that key is %s", k, e)
})
res

## [[1]]
## [1] "list key k is a, list value for that key is 1"
## 
## [[2]]
## [1] "list key k is b, list value for that key is 2"
## 
## [[3]]
## [1] "list key k is c, list value for that key is 3"

str(res)

## List of 3
##  $ : chr "list key k is a, list value for that key is 1"
##  $ : chr "list key k is b, list value for that key is 2"
##  $ : chr "list key k is c, list value for that key is 3"

Again, some complications:

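  • The result list doesn't get the elements of names(lst) as output list keys. I don't know why for certain; a likely reason is that names(lst) is itself an unnamed character vector, so there are no names to carry over. It doesn't work if I use llply either.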

My solution is that I use one of the solutions presented here:

http://stackoverflow.com/a/20546621/1022967

Here's the simplest such solution, which uses base R rather than plyr:

res <- sapply(names(lst), function(k) {
  e <- lst[[k]]
  sprintf("list key k is '%s', list value for that key is '%s'", k, e)
}, simplify=FALSE)
res

## $a
## [1] "list key k is 'a', list value for that key is '1'"
## 
## $b
## [1] "list key k is 'b', list value for that key is '2'"
## 
## $c
## [1] "list key k is 'c', list value for that key is '3'"

str(res)

## List of 3
##  $ a: chr "list key k is 'a', list value for that key is '1'"
##  $ b: chr "list key k is 'b', list value for that key is '2'"
##  $ c: chr "list key k is 'c', list value for that key is '3'"

The parameter simplify=FALSE is necessary; otherwise, in this case, sapply will reduce the result from a list to a vector. That may or may not be what you want.
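Another base R option, sketched here as an alternative rather than as one of the solutions from the linked answer, is Map, which hands the function the key and the value together so there is no need to index back into lst:

```r
lst <- list(a = 1, b = 2, c = 3)

# Map calls the function with the i-th key and the i-th value together;
# because the first vector is a character vector, its values are used as
# the names of the result
res <- Map(function(k, e) {
  sprintf("list key k is '%s', list value for that key is '%s'", k, e)
}, names(lst), lst)
res
```

Map always returns a list, so there is no simplify argument to remember.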

ldply

Let's return a dataframe from the list elements:

res <- ldply(lst, .id="original", function(e) {
  data.frame(doubled=2 * e)
})
res

##   original doubled
## 1        a       2
## 2        b       4
## 3        c       6

str(res)

## 'data.frame':    3 obs. of  2 variables:
##  $ original: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ doubled : num  2 4 6

laply

I don't usually use this -- I just use the base R sapply, as it does what I want, and there are examples of that above.

d.ply functions

So, a note: these were probably the most used functions in plyr, more than the a.ply and l.ply ones. That's because most people have their data in dataframes, and that is where the split-apply-combine model really gives you traction. I think its popularity is what drove Hadley Wickham to write the dplyr package, which is basically a rewrite of the API just for d.ply operations. So when I want to operate on dataframes and get dataframes out, I use dplyr. But sometimes I want either the input or the output not to be a dataframe, or I want a custom function that prints out diagnostic info or performs other side effects, and in that case plyr is still really, really handy.

First, let's get a decent dataframe.

  ## Called it `dat` because I hate typing `CO2`
dat <- CO2
str(dat)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"

dat %>% tbl_df()

## # A tibble: 84 x 5
##    Plant   Type  Treatment  conc uptake
## *  <ord> <fctr>     <fctr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## # ... with 74 more rows

d_ply

A very common task is to perform an aggregation function on a subset of rows of a dataframe for different values of a set of identifying columns. To be concrete, let's say for this dataframe we ultimately want to find the mean of uptake for each type of Treatment, regardless of Plant or Type values. The first thing I want to show you is how we tell the d.ply functions to split the input dataframe on different values of a column or set of columns. We'll do this by printing out the different 'splits' of the data by Treatment.

As an aside, I like the piping structure (%>%) that the magrittr package provides (which is loaded above via the dplyr package load), and will leverage that.

d_ply(dat, ~ Treatment, function(df) {
    # Following statement same as:
    # ltreat <- unique(as.character(df$Treatment))
  ltreat <- df$Treatment %>% as.character() %>% unique()

  cat("\n------------------------------")
  cat("\nTreatment: ", ltreat)
  cat("\n------------------------------\n")
  print(tbl_df(df))


  cat("\n..............................\n")
  cat("Mean uptake:")
  cat("\n..............................\n")
  df$uptake %>% mean() %>% print()
})

## 
## ------------------------------
## Treatment:  nonchilled
## ------------------------------
## # A tibble: 42 x 5
##    Plant   Type  Treatment  conc uptake
##    <ord> <fctr>     <fctr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## # ... with 32 more rows
## 
## ..............................
## Mean uptake:
## ..............................
## [1] 30.64286
## 
## ------------------------------
## Treatment:  chilled
## ------------------------------
## # A tibble: 42 x 5
##    Plant   Type Treatment  conc uptake
##    <ord> <fctr>    <fctr> <dbl>  <dbl>
## 1    Qc1 Quebec   chilled    95   14.2
## 2    Qc1 Quebec   chilled   175   24.1
## 3    Qc1 Quebec   chilled   250   30.3
## 4    Qc1 Quebec   chilled   350   34.6
## 5    Qc1 Quebec   chilled   500   32.5
## 6    Qc1 Quebec   chilled   675   35.4
## 7    Qc1 Quebec   chilled  1000   38.7
## 8    Qc2 Quebec   chilled    95    9.3
## 9    Qc2 Quebec   chilled   175   27.3
## 10   Qc2 Quebec   chilled   250   35.0
## # ... with 32 more rows
## 
## ..............................
## Mean uptake:
## ..............................
## [1] 23.78333

Observations:

  • We specify that we want to consider chunks of dat as split by the values of Treatment. We specify that with the second argument to d_ply: ~ Treatment. That is a formula interface, a common structure for specifying things like linear models to the lm function, among other places. It has many advantages I won't lay out here, but it is worth getting to know. The thing to know is that, for d.ply functions, you specify the columns you want to split on as a formula: a leading tilde (~), followed by the column names, separated by the + symbol. We'll see more examples of that below.

  • We extract the unique value of Treatment into ltreat, and use magrittr piping.

  • I like to use cat to control written output to the console to log what I am doing and critical variable values.

  • I am just printing out the subset of dat's rows that is provided to the last argument function via the df parameter in that function. So, d_ply is taking care of invoking that last function with the dat subset of rows once for each unique value of Treatment. That is powerful.

  • I calculate the mean uptake for each dat subset as split by Treatment.
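To see the splitting step on its own, here is a base R sketch using split on the same CO2 data. It is meant only to show what the pieces look like; it does not claim to be how d_ply builds them.

```r
dat <- as.data.frame(CO2)

# Split: one data frame per Treatment level, named by level
pieces <- split(dat, dat$Treatment)
names(pieces)

# Each piece holds the rows for one Treatment level
piece_rows <- sapply(pieces, nrow)
piece_rows

# The per-piece computation that the d_ply call above printed
piece_means <- sapply(pieces, function(df) mean(df$uptake))
piece_means
```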

ddply

Well, as said, it is nice to see the results printed, but we often want to store the results of this mean calculation in a data structure we can recall and pull values out of. Most commonly, we'd like a dataframe with one column marking the Treatment level considered and another holding the mean uptake calculated for it.

ddply, honestly, is probably the most commonly used plyr function. I, for one, was used to the SAS paradigm of all inputs and outputs being tables. Those in Matlab may be used to everything being vectors or matrices (I'm guessing here). But data and rectangular data structures seem to go hand in glove conceptually, and rectangular data structures are easy to iterate over. Here is the mean calculation done via ddply:

dat_res <- ddply(dat, ~ Treatment, function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment uptake_mean
## 1 nonchilled    30.64286
## 2    chilled    23.78333

dat_res %>% str()

## 'data.frame':    2 obs. of  2 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 2
##  $ uptake_mean: num  30.6 23.8

Observations:

  • We return a dataframe from the last function argument with just uptake_mean as a column, but ddply takes care to add a column for the value of the splitting column Treatment.
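For simple aggregations like this one, plyr also ships a summarise helper that can be passed to ddply in place of a hand-written function. The sketch below assumes plyr is installed and calls plyr::summarise explicitly, because dplyr exports a function of the same name and whichever package was loaded last would otherwise win.

```r
library(plyr)

dat <- as.data.frame(CO2)

# summarise is applied to each Treatment piece; uptake refers to the
# uptake column of that piece
dat_res <- ddply(dat, ~ Treatment, plyr::summarise,
                 uptake_mean = mean(uptake))
dat_res
```

The custom-function form remains the more flexible of the two, since the function body can log, validate, or return several rows per piece.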

Splitting on multiple columns

What if we want to split on Treatment and Type?

dat_res <- ddply(dat, ~ Treatment + Type, function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment        Type uptake_mean
## 1 nonchilled      Quebec    35.33333
## 2 nonchilled Mississippi    25.95238
## 3    chilled      Quebec    31.75238
## 4    chilled Mississippi    15.81429

dat_res %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
##  $ Type       : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
##  $ uptake_mean: num  35.3 26 31.8 15.8

Different ways of specifying the columns to use as levels to split analysis on

We've seen the formula interface. There are others, like specifying the column names as a character vector:

dat_res <- ddply(dat, c("Treatment", "Type"), function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment        Type uptake_mean
## 1 nonchilled      Quebec    35.33333
## 2 nonchilled Mississippi    25.95238
## 3    chilled      Quebec    31.75238
## 4    chilled Mississippi    15.81429

dat_res %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
##  $ Type       : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
##  $ uptake_mean: num  35.3 26 31.8 15.8

plyr provides a quoting operator .() that cleans up that expression a bit:

dat_res <- ddply(dat, .(Treatment, Type), function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment        Type uptake_mean
## 1 nonchilled      Quebec    35.33333
## 2 nonchilled Mississippi    25.95238
## 3    chilled      Quebec    31.75238
## 4    chilled Mississippi    15.81429

dat_res %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
##  $ Type       : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
##  $ uptake_mean: num  35.3 26 31.8 15.8

dlply

Another use case is when you want to operate on a dataframe a subset of rows at a time, but return an object that is not a dataframe, such as, say, a linear model for that subset of data. Here is how to do that:

lst_res <- dlply(dat, ~ Treatment + Type, function(df) {
  lm(uptake ~ conc, df)
})

lst_res %>% print()

## $nonchilled.Quebec
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##    25.58503      0.02241  
## 
## 
## $nonchilled.Mississippi
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##    18.45329      0.01724  
## 
## 
## $chilled.Quebec
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##    21.42104      0.02375  
## 
## 
## $chilled.Mississippi
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##   12.541791     0.007523  
## 
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##    Treatment        Type
## 1 nonchilled      Quebec
## 2 nonchilled Mississippi
## 3    chilled      Quebec
## 4    chilled Mississippi

#lst_res %>% str()

Observations:

  • This makes for very easy mass applications of a given modeling structure and collection of results.

  • The resulting list element names can be a bit cumbersome, but recall that the returned list also has the attribute split_labels, a dataframe holding the values of the splitting columns for each piece. Its rows correspond 1:1 with the list values (linear models here): the row number in split_labels is the list index of the object returned for those level values.
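One way to use that correspondence, sketched here under the assumption that plyr is installed, is to pull the coefficients out of each model and bind them next to the split_labels rows. The names `coefs` and `res` are just illustrative.

```r
library(plyr)

dat <- as.data.frame(CO2)
lst_res <- dlply(dat, ~ Treatment + Type, function(df) {
  lm(uptake ~ conc, df)
})

# One row of coefficients per model, in list order
coefs <- t(sapply(lst_res, coef))

# Pair each row of split_labels with the coefficients of the model at
# the same list index
res <- cbind(attr(lst_res, "split_labels"), coefs)
res
```

This relies on the rows of split_labels being in the same order as the list elements, which is the 1:1 association described above.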

daply

Assume you just want a vector of the means returned:

arr_res <- daply(dat, ~ Treatment, function(df) {
  df$uptake %>% mean()
})

arr_res %>% print()

## nonchilled    chilled 
##   30.64286   23.78333

arr_res %>% str()

##  Named num [1:2] 30.6 23.8
##  - attr(*, "names")= chr [1:2] "nonchilled" "chilled"

If it is more than one column used in splitting, a >1d array is returned:

arr_res <- daply(dat, ~ Treatment + Type, function(df) {
  df$uptake %>% mean()
})

arr_res %>% print()

##             Type
## Treatment      Quebec Mississippi
##   nonchilled 35.33333    25.95238
##   chilled    31.75238    15.81429

arr_res %>% str()

##  num [1:2, 1:2] 35.3 31.8 26 15.8
##  - attr(*, "dimnames")=List of 2
##   ..$ Treatment: chr [1:2] "nonchilled" "chilled"
##   ..$ Type     : chr [1:2] "Quebec" "Mississippi"

Grouping and Summarizing Alternatives

All of this can be done with base R functions, and had to be done that way before the plyr and dplyr packages existed. Here are some options that were (and are) available:

Base tapply

dat <- CO2
str(dat)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"

dat %>% tbl_df()

## # A tibble: 84 x 5
##    Plant   Type  Treatment  conc uptake
## *  <ord> <fctr>     <fctr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## # ... with 74 more rows

  ## tapply, one grouping col
res <- tapply(dat$uptake, dat$Treatment, mean)
res %>% print()

## nonchilled    chilled 
##   30.64286   23.78333

res %>% str()

##  num [1:2(1d)] 30.6 23.8
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:2] "nonchilled" "chilled"

  ## tapply, > 1 grouping col
res <- tapply(dat$uptake, list(dat$Treatment, dat$Type), mean)
res %>% print()

##              Quebec Mississippi
## nonchilled 35.33333    25.95238
## chilled    31.75238    15.81429

res %>% str()

##  num [1:2, 1:2] 35.3 31.8 26 15.8
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "nonchilled" "chilled"
##   ..$ : chr [1:2] "Quebec" "Mississippi"

Observations:

  • Least flexible. First arg must be an atomic vector. Argument to last function is then the subset of that first vector.
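If a dataframe is wanted rather than an array, the tapply result can be reshaped with base R. This is a sketch of one way to do it; naming the elements of the index list is my addition so that the output columns get meaningful names.

```r
dat <- as.data.frame(CO2)

# Naming the list elements carries the names through to the dimnames
res <- tapply(dat$uptake,
              list(Treatment = dat$Treatment, Type = dat$Type),
              mean)

# Reshape the 2-d array into one row per (Treatment, Type) combination
long <- as.data.frame(as.table(res), responseName = "uptake_mean")
long
```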

Base by function

The base by function can do a lot of this, but it is awkward. Here is an example:

  ## by
bydat <- by(dat, list(dat$Treatment, dat$Type), function(df) mean(df$uptake), simplify=FALSE)

bydat %>% print()

## : nonchilled
## : Quebec
## [1] 35.33333
## --------------------------------------------------------------------------------------- 
## : chilled
## : Quebec
## [1] 31.75238
## --------------------------------------------------------------------------------------- 
## : nonchilled
## : Mississippi
## [1] 25.95238
## --------------------------------------------------------------------------------------- 
## : chilled
## : Mississippi
## [1] 15.81429

bydat %>% str()

## List of 4
##  $ : num 35.3
##  $ : num 31.8
##  $ : num 26
##  $ : num 15.8
##  - attr(*, "dim")= int [1:2] 2 2
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "nonchilled" "chilled"
##   ..$ : chr [1:2] "Quebec" "Mississippi"
##  - attr(*, "call")= language by.data.frame(data = dat, INDICES = list(dat$Treatment, dat$Type), FUN = function(df) mean(df$uptake), simplify = FALSE)
##  - attr(*, "class")= chr "by"

Observations:

  • by seems to work a lot like dlply, but specifying the splitting columns is a bit clunky.

Base aggregate function

  ## Aggregate
aggdat <- aggregate(uptake ~ Treatment + Type, dat, mean)

aggdat %>% print()

##    Treatment        Type   uptake
## 1 nonchilled      Quebec 35.33333
## 2    chilled      Quebec 31.75238
## 3 nonchilled Mississippi 25.95238
## 4    chilled Mississippi 15.81429

aggdat %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
##  $ uptake   : num  35.3 31.8 26 15.8

  ## Aggregate, two vars
  ## Not so flexible in LHS -- can't operate individually on each LHS term --
  ## actually computes sum, not what is usually wanted.
aggdat <- aggregate(uptake + conc ~ Treatment + Type, dat, mean)

aggdat %>% print()

##    Treatment        Type uptake + conc
## 1 nonchilled      Quebec      470.3333
## 2    chilled      Quebec      466.7524
## 3 nonchilled Mississippi      460.9524
## 4    chilled Mississippi      450.8143

aggdat %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment    : Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
##  $ Type         : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
##  $ uptake + conc: num  470 467 461 451

  ## This is what is wanted, which is still not all that flexible:
aggdat <- aggregate(cbind(uptake, conc) ~ Treatment + Type, dat, mean)

aggdat %>% print()

##    Treatment        Type   uptake conc
## 1 nonchilled      Quebec 35.33333  435
## 2    chilled      Quebec 31.75238  435
## 3 nonchilled Mississippi 25.95238  435
## 4    chilled Mississippi 15.81429  435

aggdat %>% str()

## 'data.frame':    4 obs. of  4 variables:
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
##  $ uptake   : num  35.3 31.8 26 15.8
##  $ conc     : num  435 435 435 435

Observations:

  • Has some power, just has odd syntax, and even with that, not all that flexible. I favor plyr because it has a simpler interface.
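One more illustration of the odd corners: when the function passed to aggregate returns more than one value, the result is stored as a matrix column inside the dataframe, which is easy to miss when printing. A sketch, with one common way to flatten it:

```r
dat <- as.data.frame(CO2)

# FUN returns two named values per group
aggdat <- aggregate(uptake ~ Treatment, dat,
                    function(x) c(mean = mean(x), sd = sd(x)))

# The result has two columns, and uptake is a 2-column matrix
str(aggdat)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, aggdat)
names(flat)
```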