mpettis/plyr-and-dplyr.md

## plyr-and-dplyr.md

      
Introduction
Setup
plyr

A quick map
a.ply functions

a_ply
aaply
alply
adply
Multi-dimensional arrays


l.ply functions

l_ply
llply
ldply
laply


d.ply functions

d_ply
ddply
dlply
daply


Grouping and Summarizing
Alternatives

Base tapply
Base by function
Base aggregate function


Introduction

This is a quick and dirty set of demonstrative examples for plyr and
dplyr.
Setup

knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=9, warning=FALSE)
options(width=116)

suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))

plyr

First, both plyr and dplyr are toolsets that help enact the
split-apply-combine strategy for data manipulation. For that
background, see:

https://www.jstatsoft.org/index.php/jss/article/view/v040i01/v40i01.pdf
http://vita.had.co.nz/papers/plyr.pdf (same paper as above,
I think)

The nub of this is that we usually have a chunk of data that is
naturally broken into sub-chunks that we want to apply a given
function or manipulation to, each independently on each chunk, and
then combine those results back into another aggregated chunk,
possibly with a different shape.
Now, R has native functions to deal with this (the -apply family of
functions, the by function, and some others), but they are a bit
awkward in a lot of cases, and there isn't really a function that works
as nicely on data frames as the by statement does in SAS.
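As a rough illustration of the strategy using base R alone, here is a minimal sketch that computes a per-group mean by hand. It uses the built-in CO2 dataset (the same one used later in this document); the names pieces, means, and result are just placeholders for this example.

```r
# Split-apply-combine by hand, using only base R and the built-in CO2 data.
pieces <- split(CO2$uptake, CO2$Treatment)   # split: one numeric vector per Treatment level
means  <- lapply(pieces, mean)               # apply: mean of each piece
result <- data.frame(                        # combine: back into one data frame
  Treatment   = names(means),
  uptake_mean = unlist(means, use.names = FALSE)
)
result
```

It works, but each of the three steps has to be wired together manually, which is the bookkeeping plyr is meant to take over.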
plyr was the first popular package that took this
split-apply-combine idea and made a good set of consistently styled
functions to deal with this generic problem. plyr has mostly faded
from common use in favor of dplyr as the toolset for the
split-apply-combine strategy of data manipulation, due partly to
speed and partly to dplyr's different, more functional-oriented api.
Some resources for plyr:

http://plyr.had.co.nz/
http://plyr.had.co.nz/09-user/

A quick map

Part of the uniform api of plyr is that it uses a two-letter prefix on
the function names to specify what goes in and what comes out. There are
three letters and one non-letter character. The letters are a, d, and
l: a stands for array, d for dataframe, and l for list. The first
letter of the function name is the data structure that is input, and the
second letter is the data structure that is output. The _ character is
also used, but only as the second character. _ stands for 'no output',
and states that no data structure will be returned.
For example:
adply accepts an array and returns a dataframe.
ddply accepts a dataframe and returns a dataframe.
l_ply takes a list and returns nothing (presumably, the function is
called just for its side effects, like printing to the console).
The paper cited above (http://vita.had.co.nz/papers/plyr.pdf) has a
table showing all of the possible plyr functions this maps to.
We will give examples below.
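To make the naming rule concrete, here is a small sketch of a decoder for plyr function names. The helper describe_plyr is hypothetical (it is not part of plyr); it only restates the two-letter convention described above.

```r
# Hypothetical helper (not part of plyr): decode a plyr function name
# from its first two characters.
describe_plyr <- function(fname) {
  kinds <- c(a = "an array", d = "a data frame", l = "a list", "_" = "nothing")
  input  <- kinds[[substr(fname, 1, 1)]]   # first character: input structure
  output <- kinds[[substr(fname, 2, 2)]]   # second character: output structure
  sprintf("%s takes %s and returns %s", fname, input, output)
}

describe_plyr("adply")
describe_plyr("l_ply")
```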
a.ply functions

Let's start with a simple array:
  # Make a vector, and give it names
v <- setNames(1:3, letters[1:3])
v

## a b c 
## 1 2 3

str(v)

##  Named int [1:3] 1 2 3
##  - attr(*, "names")= chr [1:3] "a" "b" "c"

a_ply

Let's say we want to print out each individual element of the vector:
a_ply(v, 1, print)

## a 
## 1 
## b 
## 2 
## c 
## 3

Observations:


That's ugly. You get the name on one line, value on following line.
But it illustrates what the purpose of the function is.


The first argument to a_ply is the array that will be operated on.
Now, arrays can be multidimensional, and we'll leverage that in a
following example.


The second argument gives the margin of the array to operate on.
1 here indicates rows (and 1-d vectors, like v here, are
considered row vectors, so it operates on each entry).


The last argument is a function that executes once for each value of
the array v fed to it -- one element at a time.
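If the two-line output bothers you, one option is to pass a small function of your own instead of print. The sketch below relies on the behavior visible in the output above, namely that each piece arrives as a length-one named vector; the formatting itself is just one possible choice.

```r
library(plyr)

v <- setNames(1:3, letters[1:3])

# Each piece handed to the function is a length-one named vector,
# so names(e) recovers the label and e the value.
a_ply(v, 1, function(e) cat(names(e), "=", e, "\n"))
```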


aaply

Let's do something staggeringly simple to the array, like doubling each
entry:
  ## Apply a doubling to each element of the vector
res <- aaply(v, 1, function(e) {2 * e})
res

## a b c 
## 2 4 6

str(res)

##  Named num [1:3] 2 4 6
##  - attr(*, "names")= chr [1:3] "a" "b" "c"

Now, this could have been simply done by vectorization:
2 * v

## a b c 
## 2 4 6

but the plyr way lays the groundwork for doing more complex things on
a per-element basis. Some things to point out on the aaply method:


The last argument is a function. Here it is the inline function
definition function(e) {2 * e}. Supplying a function definition is
the way I mostly use the plyr functions, because it is the most
flexible.


The function used as the last argument must take one argument, and
that will be set in turn to each value from v for each invocation
of the function. The function results are then aggregated into a
single array and put into res.


The output is an array, with names coming from element names of v
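For comparison, a base R counterpart for this simple case is vapply, which also keeps the element names. This is only a sketch of an equivalent for one-dimensional input, not a general replacement for aaply.

```r
v <- setNames(1:3, letters[1:3])

# vapply applies the function to each element and checks that every
# result is a single numeric value; names are carried over from v.
res <- vapply(v, function(e) 2 * e, numeric(1))
res
```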


alply

We could also collect the results in a list with alply:
res <- alply(v, 1, function(e) {2 * e})
res

## $`1`
## a 
## 2 
## 
## $`2`
## b 
## 4 
## 
## $`3`
## c 
## 6 
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  a
## 2  b
## 3  c

str(res)

## List of 3
##  $ 1: Named num 2
##   ..- attr(*, "names")= chr "a"
##  $ 2: Named num 4
##   ..- attr(*, "names")= chr "b"
##  $ 3: Named num 6
##   ..- attr(*, "names")= chr "c"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "a","b","c": 1 2 3

This structure is a little more complex. Note that:

This returns a list, with the output list keys being numeric -- not
the names of the array entries, which is too bad.
There is a split_labels attribute that tracks the input
array values.
There is a split_type attribute that says that the input was
an array.

We can set the names based on the attribute values on the output object
explicitly:
names(res) <- attr(res, "split_labels")[[1]]
res

## $a
## a 
## 2 
## 
## $b
## b 
## 4 
## 
## $c
## c 
## 6 
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  a
## 2  b
## 3  c

str(res)

## List of 3
##  $ a: Named num 2
##   ..- attr(*, "names")= chr "a"
##  $ b: Named num 4
##   ..- attr(*, "names")= chr "b"
##  $ c: Named num 6
##   ..- attr(*, "names")= chr "c"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "a","b","c": 1 2 3

This is still more messy than we like. We can get it into a form that
seems right as follows (using a base R function):
res <- lapply(res, unname)
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

In fact, you can also cast it back to an array (like the result of
aaply) like so:
unlist(res)

## a b c 
## 2 4 6

This just helps show how some of the functions work and how to map
between some data forms.
However, in this case, it may just be easier to use the base R
functions, as they seem to conform more to what is desired:
res <- sapply(v, function(e) {2 * e}, simplify=FALSE)
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

adply

You can use adply as well when what you want output is a dataframe:
res <- adply(v, 1, function(e) {data.frame(out=2 * e, stringsAsFactors=FALSE)})
res

##   X1 out
## 1  a   2
## 2  b   4
## 3  c   6

str(res)

## 'data.frame':    3 obs. of  2 variables:
##  $ X1 : Factor w/ 3 levels "a","b","c": 1 2 3
##  $ out: num  2 4 6

Note that:


It returns a data frame, not just with the cols specified in the
function, but also an additional column noting the array value that
was used when computing.


The column with the value that holds the original vector v value
is returned as a factor, not a string.


You can name that output column specifically, as I do below:
res <- adply(v, 1, .id="orig", function(e) {data.frame(out=2 * e, stringsAsFactors=FALSE)})
res

##   orig out
## 1    a   2
## 2    b   4
## 3    c   6

str(res)

## 'data.frame':    3 obs. of  2 variables:
##  $ orig: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ out : num  2 4 6

Also, this marking of the output column is really nice because you may
return data frames with multiple rows per single input value. Like so:
res <- adply(v, 1, .id="orig", function(e) {data.frame(out=1:e, stringsAsFactors=FALSE)})
res

##   orig out
## 1    a   1
## 2    b   1
## 3    b   2
## 4    c   1
## 5    c   2
## 6    c   3

str(res)

## 'data.frame':    6 obs. of  2 variables:
##  $ orig: Factor w/ 3 levels "a","b","c": 1 2 2 3 3 3
##  $ out : int  1 1 2 1 2 3

This demonstrates the power you get from supplying your own
function as the last argument. The connection with
split-apply-combine is that:

We split the incoming array by row (i.e., each v entry in
this case).
We apply the function, which makes a dataframe for each value of
v.
We combine all of the output dataframes into one dataframe, with
an index field that tracks which value of v it is associated with.
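The three steps just listed can be spelled out by hand in base R. The sketch below reproduces the multi-row example without plyr, assuming the same vector v; the names pieces and res are placeholders.

```r
v <- setNames(1:3, letters[1:3])

# split + apply: build one small data frame per element of v
pieces <- lapply(names(v), function(k) {
  data.frame(orig = k, out = seq_len(v[[k]]), stringsAsFactors = FALSE)
})

# combine: stack the pieces into a single data frame
res <- do.call(rbind, pieces)
res
```

Here the index column has to be added inside the function, whereas adply adds it for you.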

A note of warning: factors have issues
v <- factor(c("a", "b", "c"))

  ## It doesn't like operating on raw factors
tryCatch({a_ply(v, 1, print)}, error=function(e) {print("ERROR"); print(e)})

## [1] "ERROR"
## <simpleError in splitter_a(.data, .margins, .expand): Invalid margin>

  ## will do ok if converted to characters
a_ply(as.character(v), 1, print)

## [1] "a"
## [1] "b"
## [1] "c"
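A quick base R check hints at why the factor is rejected. A factor is not a plain vector in R's sense and has no dim attribute, so there is plausibly no margin for plyr to split along. This is one reading of the "Invalid margin" message, not something confirmed from plyr's source.

```r
f <- factor(c("a", "b", "c"))

is.vector(f)                 # FALSE: the levels and class attributes disqualify it
dim(f)                       # NULL: no dimensions either
is.vector(as.character(f))   # TRUE: a plain character vector
```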

Multi-dimensional arrays

Multi-dimensional arrays can be operated on by rows or columns:
mat <- matrix(1:9, nrow=3)
mat

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

  ## Sum rows of matrix
aaply(mat, 1, sum)

##  1  2  3 
## 12 15 18

  ## Sum columns of matrix
aaply(mat, 2, sum)

##  1  2  3 
##  6 15 24

Arrays can have more than the 2 dimensions that matrices have, and the
second argument can be used to deal with that appropriately. What is
passed to the function in the last argument, in both of the cases above,
is a vector.
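To show what a margin means beyond two dimensions, here is a sketch using base R's apply, which uses the same margin numbering. The plyr documentation describes the margins argument as a vector of subscripts in the same spirit, but this example only exercises the base function.

```r
# A 2 x 3 x 4 array filled with 1..24
arr <- array(1:24, dim = c(2, 3, 4))

# Keep dimensions 1 and 2 and collapse over the third:
# the function receives a length-4 vector for each (row, column) pair.
res <- apply(arr, c(1, 2), sum)
dim(res)
res[1, 1]   # sum(arr[1, 1, ]) = 1 + 7 + 13 + 19
```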
l.ply functions

We shouldn't have to do this as exhaustively as we did for the a.ply
functions. But there are some differences to note.
l_ply

Starting simple, we can make a list and print out its values:
lst <- list(a=1, b=2, c=3)
l_ply(lst, print)

## [1] 1
## [1] 2
## [1] 3

llply

Or return the doubled values as its own list:
res <- llply(lst, function(e) {2 * e})
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

Note:

It returns a list with the list keys being the input list names and
values as the result of the function.

So that behaves well, a bit better than the named a.ply functions did
for us.
As an aside, here lapply (base R) will behave just as well as
llply:
res <- lapply(lst, function(e) {2 * e})
res

## $a
## [1] 2
## 
## $b
## [1] 4
## 
## $c
## [1] 6

str(res)

## List of 3
##  $ a: num 2
##  $ b: num 4
##  $ c: num 6

One thing that can be difficult: llply (and the base R functions)
iterates over the list values, and returns a list in which each
key of the input is associated with the corresponding function output.
However, inside the function supplied as the last llply
argument, you have no way of natively accessing the input list's key
that was used. This is sometimes problematic. However, if you really
want to access the key inside the function, as well as the value, do the
following:
res <- alply(names(lst), 1, function(k) {
  e <- lst[[k]]
  sprintf("list key k is %s, list value for that key is %s", k, e)
})
res

## $`1`
## [1] "list key k is a, list value for that key is 1"
## 
## $`2`
## [1] "list key k is b, list value for that key is 2"
## 
## $`3`
## [1] "list key k is c, list value for that key is 3"
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3

str(res)

## List of 3
##  $ 1: chr "list key k is a, list value for that key is 1"
##  $ 2: chr "list key k is b, list value for that key is 2"
##  $ 3: chr "list key k is c, list value for that key is 3"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "1","2","3": 1 2 3

Now, however, you have the problem where the output list keys aren't the
input names you provided via names(lst). That sucks. But you can stick
names on at the end.
names(res) <- names(lst)
res

## $a
## [1] "list key k is a, list value for that key is 1"
## 
## $b
## [1] "list key k is b, list value for that key is 2"
## 
## $c
## [1] "list key k is c, list value for that key is 3"
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3

str(res)

## List of 3
##  $ a: chr "list key k is a, list value for that key is 1"
##  $ b: chr "list key k is b, list value for that key is 2"
##  $ c: chr "list key k is c, list value for that key is 3"
##  - attr(*, "split_type")= chr "array"
##  - attr(*, "split_labels")='data.frame': 3 obs. of  1 variable:
##   ..$ X1: Factor w/ 3 levels "1","2","3": 1 2 3

  ## llply doesn't work either, by the way.
res <- llply(names(lst), function(k) {
  e <- lst[[k]]
  sprintf("list key k is %s, list value for that key is %s", k, e)
})
res

## [[1]]
## [1] "list key k is a, list value for that key is 1"
## 
## [[2]]
## [1] "list key k is b, list value for that key is 2"
## 
## [[3]]
## [1] "list key k is c, list value for that key is 3"

str(res)

## List of 3
##  $ : chr "list key k is a, list value for that key is 1"
##  $ : chr "list key k is b, list value for that key is 2"
##  $ : chr "list key k is c, list value for that key is 3"

Again, some complications:

The result list doesn't get the elements of names(lst) as output
list keys, with either alply or llply. The likely reason is that
names(lst) is itself an unnamed character vector, so there are no
names for the output to inherit.

My solution is that I use one of the solutions presented here:
http://stackoverflow.com/a/20546621/1022967
Here's the simplest such solution, which uses base R rather than
plyr:
res <- sapply(names(lst), function(k) {
  e <- lst[[k]]
  sprintf("list key k is '%s', list value for that key is '%s'", k, e)
}, simplify=FALSE)
res

## $a
## [1] "list key k is 'a', list value for that key is '1'"
## 
## $b
## [1] "list key k is 'b', list value for that key is '2'"
## 
## $c
## [1] "list key k is 'c', list value for that key is '3'"

str(res)

## List of 3
##  $ a: chr "list key k is 'a', list value for that key is '1'"
##  $ b: chr "list key k is 'b', list value for that key is '2'"
##  $ c: chr "list key k is 'c', list value for that key is '3'"

And the parameter simplify=FALSE is necessary, otherwise, in this
case, it will reduce it to a vector from a list. Which may be what you
want, or maybe not...
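Another base R route to the same result is Map, which walks the keys and the values in parallel so the function receives both directly. This is a sketch of an alternative, with a shortened message string of my own choosing.

```r
lst <- list(a = 1, b = 2, c = 3)

# Map walks names(lst) and lst in parallel; because the first vector
# is a character vector, its values become the names of the result.
res <- Map(function(k, e) sprintf("key %s has value %s", k, e), names(lst), lst)
str(res)
```

Map always returns a list, so there is no simplify argument to remember.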
ldply

Let's return a dataframe from the list elements
res <- ldply(lst, .id="original", function(e) {
  data.frame(doubled=2 * e)
})
res

##   original doubled
## 1        a       2
## 2        b       4
## 3        c       6

str(res)

## 'data.frame':    3 obs. of  2 variables:
##  $ original: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ doubled : num  2 4 6

laply

I don't usually use this -- I just use the base R sapply, as it does
what I want, and there are examples of that above.
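For completeness, here is a minimal sketch of the sapply call that stands in for laply on the list used in this section. With the default simplification it returns a named vector rather than a list.

```r
lst <- list(a = 1, b = 2, c = 3)

# With the default simplify = TRUE, sapply collapses the
# length-one results into a named numeric vector.
res <- sapply(lst, function(e) 2 * e)
res
```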
d.ply functions

So, a note: these were probably the most used functions in plyr,
more than the a.ply and l.ply ones. That's because most people
have their data in dataframes, and that is where the
split-apply-combine model really gives you traction. I think their
popularity is what drove Hadley Wickham to write the dplyr package,
which is basically a rewrite of the api just for d.ply operations.
So when I want to operate on dataframes and get dataframes
out, I use dplyr, but sometimes I want to have either the input or
output not be a dataframe, or I want a custom function that prints out
diagnostic info or performs other side effects, and in that case, plyr
is still really, really handy.
First, let's get a decent dataframe.
  ## Called it `dat` because I hate typing `CO2`
dat <- CO2
str(dat)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"

dat %>% tbl_df()

## # A tibble: 84 x 5
##    Plant   Type  Treatment  conc uptake
## *  <ord> <fctr>     <fctr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## # ... with 74 more rows

d_ply

A very common task is to perform an aggregation function on a subset of
rows of a dataframe for different values of a set of identifying
columns. To be concrete, let's say for this dataframe we ultimately want
to find the mean of uptake for each type of Treatment, regardless of
Plant or Type values. The first thing I want to show you is how we
tell the d.ply functions to split the input dataframe on different
values of a column or set of columns. We'll do this by printing out the
different 'splits' of the data by Treatment.
As an aside, I like the piping operator (%>%) that the magrittr
package provides (dplyr re-exports it, so it is available after the
dplyr package load above), and will leverage that.
d_ply(dat, ~ Treatment, function(df) {
    # Following statement same as:
    # ltreat <- unique(as.character(df$Treatment))
  ltreat <- df$Treatment %>% as.character() %>% unique()

  cat("\n------------------------------")
  cat("\nTreatment: ", ltreat)
  cat("\n------------------------------\n")
  print(tbl_df(df))


  cat("\n..............................\n")
  cat("Mean uptake:")
  cat("\n..............................\n")
  df$uptake %>% mean() %>% print()
})

## 
## ------------------------------
## Treatment:  nonchilled
## ------------------------------
## # A tibble: 42 x 5
##    Plant   Type  Treatment  conc uptake
##    <ord> <fctr>     <fctr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## # ... with 32 more rows
## 
## ..............................
## Mean uptake:
## ..............................
## [1] 30.64286
## 
## ------------------------------
## Treatment:  chilled
## ------------------------------
## # A tibble: 42 x 5
##    Plant   Type Treatment  conc uptake
##    <ord> <fctr>    <fctr> <dbl>  <dbl>
## 1    Qc1 Quebec   chilled    95   14.2
## 2    Qc1 Quebec   chilled   175   24.1
## 3    Qc1 Quebec   chilled   250   30.3
## 4    Qc1 Quebec   chilled   350   34.6
## 5    Qc1 Quebec   chilled   500   32.5
## 6    Qc1 Quebec   chilled   675   35.4
## 7    Qc1 Quebec   chilled  1000   38.7
## 8    Qc2 Quebec   chilled    95    9.3
## 9    Qc2 Quebec   chilled   175   27.3
## 10   Qc2 Quebec   chilled   250   35.0
## # ... with 32 more rows
## 
## ..............................
## Mean uptake:
## ..............................
## [1] 23.78333

Observations:


We specify that we want to consider chunks of dat as split by the
values of Treatment. In this case, we specify that by the second
argument to d_ply: ~ Treatment. That is a formula interface,
which is a common way of specifying things like linear models
to the lm function, among other places. It has many advantages
I won't lay out here, but it is useful to start to get to know. The
thing to know is that, for d.ply functions, you specify the
columns you want to split on as a formula with a leading tilde (~)
followed by the column names, separated by the + symbol. We'll see
more examples of that below.


We extract the unique value of Treatment into ltreat, and use
magrittr piping.


I like to use cat to control written output to the console to log
what I am doing and critical variable values.


I am just printing out the subset of dat's rows that is provided
to the last argument function via the df parameter in
that function. So, d_ply is taking care of invoking that last
function with the dat subset of rows once for each unique value of
Treatment. That is powerful.


I calculate the mean uptake for each dat subset as split by
Treatment.
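To see what d_ply is doing on your behalf, the same walk over the splits can be sketched in base R with split and a for loop. The loop body plays the role of the function passed to d_ply; the output format is simplified for brevity.

```r
dat <- CO2

# split() returns a named list of data frames, one per Treatment level.
pieces <- split(dat, dat$Treatment)

for (level in names(pieces)) {
  df <- pieces[[level]]
  cat("Treatment:", level,
      "| rows:", nrow(df),
      "| mean uptake:", mean(df$uptake), "\n")
}
```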


ddply

Well, as said, this is a nice way to see results, but we often want to
store the results of this mean calculation in a nice data structure we
can recall and pull values out of. Most commonly, we'd like a dataframe
with one column marking the Treatment level considered and another
holding the mean uptake calculated. That can be done via ddply.
This, honestly, is probably the most common plyr function. I, for one,
was used to the SAS paradigm of all inputs and outputs being a table.
Those in Matlab may be used to everything being vectors or matrices
(I'm guessing here). But data and rectangular data structures seem to go
hand in glove conceptually. And rectangular data structures are easy to
iterate over. Here is the ddply call:
dat_res <- ddply(dat, ~ Treatment, function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment uptake_mean
## 1 nonchilled    30.64286
## 2    chilled    23.78333

dat_res %>% str()

## 'data.frame':    2 obs. of  2 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 2
##  $ uptake_mean: num  30.6 23.8

Observations:

We return a dataframe from the last function argument with just
uptake_mean as a column, but ddply takes care to add a column
for the value of the splitting column Treatment.
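Since dplyr is the usual successor for this dataframe-in, dataframe-out case, here is a sketch of what the same summary might look like there. It assumes a reasonably recent dplyr; summarise is called with the dplyr:: prefix so that it cannot be confused with plyr's function of the same name when both packages are loaded.

```r
library(dplyr)

# Coerce CO2 with as.data.frame() first, then group and summarise.
dat_res <- CO2 %>%
  as.data.frame() %>%
  group_by(Treatment) %>%
  dplyr::summarise(uptake_mean = mean(uptake))

dat_res
```

As with ddply, the grouping column is carried into the result alongside the computed column.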

Splitting on multiple columns

What if we want to split on Treatment and Type?
dat_res <- ddply(dat, ~ Treatment + Type, function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment        Type uptake_mean
## 1 nonchilled      Quebec    35.33333
## 2 nonchilled Mississippi    25.95238
## 3    chilled      Quebec    31.75238
## 4    chilled Mississippi    15.81429

dat_res %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
##  $ Type       : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
##  $ uptake_mean: num  35.3 26 31.8 15.8

Different ways of specifying the columns to use as levels to split analysis on

We've seen the formula interface. There are others, like specifying
character columns:
dat_res <- ddply(dat, c("Treatment", "Type"), function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment        Type uptake_mean
## 1 nonchilled      Quebec    35.33333
## 2 nonchilled Mississippi    25.95238
## 3    chilled      Quebec    31.75238
## 4    chilled Mississippi    15.81429

dat_res %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
##  $ Type       : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
##  $ uptake_mean: num  35.3 26 31.8 15.8

plyr provides a quoting operator .() that cleans up that expression
a bit:
dat_res <- ddply(dat, .(Treatment, Type), function(df) {
  uptake_mean <- df$uptake %>% mean()
  data.frame(uptake_mean, stringsAsFactors=FALSE)
})

dat_res %>% print()

##    Treatment        Type uptake_mean
## 1 nonchilled      Quebec    35.33333
## 2 nonchilled Mississippi    25.95238
## 3    chilled      Quebec    31.75238
## 4    chilled Mississippi    15.81429

dat_res %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment  : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
##  $ Type       : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
##  $ uptake_mean: num  35.3 26 31.8 15.8

dlply

Another use case is when you want to operate on a dataframe a subset of
rows at a time, but return an object that is not a dataframe, such as,
say, a linear model for that subset of data. Here is how to do that:
lst_res <- dlply(dat, ~ Treatment + Type, function(df) {
  lm(uptake ~ conc, df)
})

lst_res %>% print()

## $nonchilled.Quebec
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##    25.58503      0.02241  
## 
## 
## $nonchilled.Mississippi
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##    18.45329      0.01724  
## 
## 
## $chilled.Quebec
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##    21.42104      0.02375  
## 
## 
## $chilled.Mississippi
## 
## Call:
## lm(formula = uptake ~ conc, data = df)
## 
## Coefficients:
## (Intercept)         conc  
##   12.541791     0.007523  
## 
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##    Treatment        Type
## 1 nonchilled      Quebec
## 2 nonchilled Mississippi
## 3    chilled      Quebec
## 4    chilled Mississippi

#lst_res %>% str()

Observations:


This makes for very easy mass applications of a given modeling
structure and collection of results.


The resulting list element names can be a bit cumbersome, but recall
that the returned list also has the attribute split_labels: a
dataframe holding the values of the two splitting columns. Its rows
correspond 1:1 with the list values (linear models here): the row
number in split_labels is the list index of the object returned for
those level values.
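Because dlply returns an ordinary list, base R tools can collect pieces of each model afterwards. The sketch below refits the same models and gathers the coefficients into a matrix; the name coefs is a placeholder.

```r
library(plyr)

dat <- CO2
lst_res <- dlply(dat, ~ Treatment + Type, function(df) {
  lm(uptake ~ conc, df)
})

# sapply() pulls the coefficient vector from each model and binds them
# as columns; t() turns that into one row per Treatment/Type combination.
coefs <- t(sapply(lst_res, coef))
coefs
```

The row names of coefs are the list keys, so they can be matched back to split_labels by position.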


daply

Assume you just want a vector of the means returned:
arr_res <- daply(dat, ~ Treatment, function(df) {
  df$uptake %>% mean()
})

arr_res %>% print()

## nonchilled    chilled 
##   30.64286   23.78333

arr_res %>% str()

##  Named num [1:2] 30.6 23.8
##  - attr(*, "names")= chr [1:2] "nonchilled" "chilled"

If it is more than one column used in splitting, a >1d array is
returned:
arr_res <- daply(dat, ~ Treatment + Type, function(df) {
  df$uptake %>% mean()
})

arr_res %>% print()

##             Type
## Treatment      Quebec Mississippi
##   nonchilled 35.33333    25.95238
##   chilled    31.75238    15.81429

arr_res %>% str()

##  num [1:2, 1:2] 35.3 31.8 26 15.8
##  - attr(*, "dimnames")=List of 2
##   ..$ Treatment: chr [1:2] "nonchilled" "chilled"
##   ..$ Type     : chr [1:2] "Quebec" "Mississippi"

Grouping and Summarizing Alternatives

All of this can be done with base R functions, and had to be
done that way prior to the plyr and dplyr packages. Here are some
options that were (and are) available...
Base tapply

dat <- CO2
str(dat)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"

dat %>% tbl_df()

## # A tibble: 84 x 5
##    Plant   Type  Treatment  conc uptake
## *  <ord> <fctr>     <fctr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## # ... with 74 more rows

  ## tapply, one grouping col
res <- tapply(dat$uptake, dat$Treatment, mean)
res %>% print()

## nonchilled    chilled 
##   30.64286   23.78333

res %>% str()

##  num [1:2(1d)] 30.6 23.8
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:2] "nonchilled" "chilled"

  ## tapply, > 1 grouping col
res <- tapply(dat$uptake, list(dat$Treatment, dat$Type), mean)
res %>% print()

##              Quebec Mississippi
## nonchilled 35.33333    25.95238
## chilled    31.75238    15.81429

res %>% str()

##  num [1:2, 1:2] 35.3 31.8 26 15.8
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "nonchilled" "chilled"
##   ..$ : chr [1:2] "Quebec" "Mississippi"

Observations:

Least flexible. First arg must be an atomic vector. Argument to last
function is then the subset of that first vector.

Base by function

The base by function can do a lot of this, but it is awkward. Here is
an example:
  ## by
bydat <- by(dat, list(dat$Treatment, dat$Type), function(df) mean(df$uptake), simplify=FALSE)

bydat %>% print()

## : nonchilled
## : Quebec
## [1] 35.33333
## --------------------------------------------------------------------------------------- 
## : chilled
## : Quebec
## [1] 31.75238
## --------------------------------------------------------------------------------------- 
## : nonchilled
## : Mississippi
## [1] 25.95238
## --------------------------------------------------------------------------------------- 
## : chilled
## : Mississippi
## [1] 15.81429

bydat %>% str()

## List of 4
##  $ : num 35.3
##  $ : num 31.8
##  $ : num 26
##  $ : num 15.8
##  - attr(*, "dim")= int [1:2] 2 2
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "nonchilled" "chilled"
##   ..$ : chr [1:2] "Quebec" "Mississippi"
##  - attr(*, "call")= language by.data.frame(data = dat, INDICES = list(dat$Treatment, dat$Type), FUN = function(df) mean(df$uptake), simplify = FALSE)
##  - attr(*, "class")= chr "by"

Observations

by seems to work a lot like dlply, but specifying the splitting
columns is a bit clunky.

Base aggregate function

  ## Aggregate
aggdat <- aggregate(uptake ~ Treatment + Type, dat, mean)

aggdat %>% print()

##    Treatment        Type   uptake
## 1 nonchilled      Quebec 35.33333
## 2    chilled      Quebec 31.75238
## 3 nonchilled Mississippi 25.95238
## 4    chilled Mississippi 15.81429

aggdat %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
##  $ uptake   : num  35.3 31.8 26 15.8

  ## Aggregate, two vars
  ## Not so flexible in LHS -- can't operate individually on each LHS term --
  ## actually computes sum, not what is usually wanted.
aggdat <- aggregate(uptake + conc ~ Treatment + Type, dat, mean)

aggdat %>% print()

##    Treatment        Type uptake + conc
## 1 nonchilled      Quebec      470.3333
## 2    chilled      Quebec      466.7524
## 3 nonchilled Mississippi      460.9524
## 4    chilled Mississippi      450.8143

aggdat %>% str()

## 'data.frame':    4 obs. of  3 variables:
##  $ Treatment    : Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
##  $ Type         : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
##  $ uptake + conc: num  470 467 461 451

  ## This is what is wanted, which is still not all that flexible:
aggdat <- aggregate(cbind(uptake, conc) ~ Treatment + Type, dat, mean)

aggdat %>% print()

##    Treatment        Type   uptake conc
## 1 nonchilled      Quebec 35.33333  435
## 2    chilled      Quebec 31.75238  435
## 3 nonchilled Mississippi 25.95238  435
## 4    chilled Mississippi 15.81429  435

aggdat %>% str()

## 'data.frame':    4 obs. of  4 variables:
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
##  $ uptake   : num  35.3 31.8 26 15.8
##  $ conc     : num  435 435 435 435

Observations

Has some power, just has odd syntax, and even with that, not all
that flexible. I favor plyr because it has a simpler interface.
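To illustrate the flexibility point, here is a sketch of applying a different summary to each column with ddply, which the aggregate formula interface above does not express directly. The column names uptake_mean and conc_max are placeholders chosen for this example.

```r
library(plyr)

dat <- CO2

# A custom function can compute any mix of summaries per group.
agg <- ddply(dat, ~ Treatment + Type, function(df) {
  data.frame(
    uptake_mean = mean(df$uptake),  # mean of one column ...
    conc_max    = max(df$conc)      # ... and max of another
  )
})

agg
```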