This is a quick and dirty set of demonstrative examples for plyr and dplyr.
knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=9, warning=FALSE)
options(width=116)
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))
First, both `plyr` and `dplyr` are toolsets that help enact the split-apply-combine strategy for data manipulation. For that background, see:
- https://www.jstatsoft.org/index.php/jss/article/view/v040i01/v40i01.pdf
- http://vita.had.co.nz/papers/plyr.pdf (same paper as above, I think)
The nub of this is that we usually have a chunk of data that is naturally broken into sub-chunks. We want to apply a given function or manipulation to each sub-chunk independently, and then combine those results back into another aggregated chunk, possibly with a different shape.
Now, R has native functions to deal with this (the `-apply` family of functions, also the `by` function, and some others), but they are a bit awkward in a lot of cases, and there really isn't a function that works nicely on data frames, much in the same way the `by` statement does for SAS.
`plyr` was the first popular package that took this split-apply-combine idea and made a good set of consistently styled functions to deal with this generic problem. `plyr` has mostly faded from public use in favor of `dplyr` as the toolset to use with the split-apply-combine strategy of data manipulation, due partially to speed and partially to the different, more functional-oriented API of `dplyr`.
Some resources for `plyr`:
Part of the uniform API of `plyr` is that it uses a two-letter prefix on the function names to specify what goes in and what comes out. There are three letters and one non-letter character. The letters are `a`, `d`, and `l`: `a` stands for array, `d` for dataframe, and `l` for list. The first letter of the function name is the data structure that is input, and the second letter is what is output. The `_` character is also used, and only as the second character. `_` stands for 'no output', and states that no data structure will be returned.
For example:
- `adply` will accept an array and output a dataframe.
- `ddply` will accept a dataframe and return a dataframe.
- `l_ply` will take a list and return nothing (presumably, the things that are operated on are done just for side effects, like printing to the console).
The paper cited above (http://vita.had.co.nz/papers/plyr.pdf) has a table showing all of the possible plyr functions this maps to.
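As a quick sanity check on the naming scheme, here is a minimal sketch that builds every input/output prefix combination and checks whether a function by that name is exported by `plyr`. The variable names (`inputs`, `outputs`, `grid`) are made up for illustration, and it assumes `plyr` is installed.

```r
library(plyr)

# Input types (first letter) and output types (second letter).
inputs  <- c("a", "d", "l")
outputs <- c("a", "d", "l", "_")

# Every input/output combination, e.g. "a" + "d" + "ply" -> "adply".
grid <- expand.grid(input = inputs, output = outputs, stringsAsFactors = FALSE)
grid$fn <- paste0(grid$input, grid$output, "ply")

# Check each generated name against the functions plyr exports.
grid$exported <- grid$fn %in% getNamespaceExports("plyr")
print(grid)
```

If the naming scheme holds across the board, every row should show `exported` as `TRUE`.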
We will give examples below.
Let's start with a simple array:
# Make a vector, and give it names
v <- setNames(1:3, letters[1:3])
v
## a b c
## 1 2 3
str(v)
## Named int [1:3] 1 2 3
## - attr(*, "names")= chr [1:3] "a" "b" "c"
Let's say we want to print out each individual element of the vector:
a_ply(v, 1, print)
## a
## 1
## b
## 2
## c
## 3
Observations:
- That's ugly. You get the name on one line and the value on the following line. But it illustrates what the purpose of the function is.
- The first argument to `a_ply` is the array that will be operated on. Now, arrays can be multidimensional, and we'll leverage that in a following example.
- The second argument gives the margin of the array to operate on. `1` here indicates rows (and 1-d vectors, like `v` here, are considered row vectors, so it operates on each entry).
- The last argument is a function that executes once for each value of the array `v` fed to it, one element at a time.
Let's do something staggeringly simple to the array, like doubling each entry:
## Apply a doubling to each element of the vector
res <- aaply(v, 1, function(e) {2 * e})
res
## a b c
## 2 4 6
str(res)
## Named num [1:3] 2 4 6
## - attr(*, "names")= chr [1:3] "a" "b" "c"
Now, this could have been simply done by vectorization:
2 * v
## a b c
## 2 4 6
but the `plyr` way lays the groundwork for doing more complex things on a per-element basis. Some things to point out on the `aaply` method:
- The last argument is a function. Here it is the actual function definition `function(e) {2 * e}`. Passing an inline function definition is the way I mostly use the `plyr` functions because it is the most flexible.
- The function used as the last argument must take one argument, and that will be set in turn to each value from `v` for each invocation of the function. The function results are then aggregated into a single array and put into `res`.
- The output is an array, with names coming from the element names of `v`.
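The last argument doesn't have to be an inline definition. Here is a small sketch, assuming the same `v` as above, that passes a named function and also forwards an extra argument through `...`. The helper `scale_by` and its `k` parameter are made up for illustration.

```r
library(plyr)

v <- setNames(1:3, letters[1:3])

# A named helper (hypothetical); `e` receives one element of `v` at a time.
scale_by <- function(e, k) {k * e}

# Extra named arguments after the function are forwarded to it via `...`.
res_scaled <- aaply(v, 1, scale_by, k = 3)
res_scaled

# An existing function can be passed by name as well.
res_sqrt <- aaply(v, 1, sqrt)
res_sqrt
```

I still reach for the inline form most of the time, but a named helper is handy once the per-element logic gets longer than a line or two.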
We could also collect the results in a list with `alply`:
res <- alply(v, 1, function(e) {2 * e})
res
## $`1`
## a
## 2
##
## $`2`
## b
## 4
##
## $`3`
## c
## 6
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1
## 1 a
## 2 b
## 3 c
str(res)
## List of 3
## $ 1: Named num 2
## ..- attr(*, "names")= chr "a"
## $ 2: Named num 4
## ..- attr(*, "names")= chr "b"
## $ 3: Named num 6
## ..- attr(*, "names")= chr "c"
## - attr(*, "split_type")= chr "array"
## - attr(*, "split_labels")='data.frame': 3 obs. of 1 variable:
## ..$ X1: Factor w/ 3 levels "a","b","c": 1 2 3
This structure is a little more complex. Note that:
- This returns a list, with the output list keys being numeric, not the names of the array entries, which is too bad.
- There is a `split_labels` attribute that tracks the labels of the input array (here, its element names).
- There is a `split_type` attribute that says that the input was an array.
We can set the names based on the attribute values on the output object explicitly:
names(res) <- attr(res, "split_labels")[[1]]
res
## $a
## a
## 2
##
## $b
## b
## 4
##
## $c
## c
## 6
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1
## 1 a
## 2 b
## 3 c
str(res)
## List of 3
## $ a: Named num 2
## ..- attr(*, "names")= chr "a"
## $ b: Named num 4
## ..- attr(*, "names")= chr "b"
## $ c: Named num 6
## ..- attr(*, "names")= chr "c"
## - attr(*, "split_type")= chr "array"
## - attr(*, "split_labels")='data.frame': 3 obs. of 1 variable:
## ..$ X1: Factor w/ 3 levels "a","b","c": 1 2 3
This is still messier than we'd like. We can get it into a form that seems right as follows (using a base R function):
res <- lapply(res, unname)
res
## $a
## [1] 2
##
## $b
## [1] 4
##
## $c
## [1] 6
str(res)
## List of 3
## $ a: num 2
## $ b: num 4
## $ c: num 6
In fact, you can also cast it back to an array (like the result of `aaply`) like so:
unlist(res)
## a b c
## 2 4 6
This just helps to figure out how some of the functions work and how to map between some data forms.
However, in this case, it may just be easier to use the base R functions, as they seem to conform more to what is desired:
res <- sapply(v, function(e) {2 * e}, simplify=FALSE)
res
## $a
## [1] 2
##
## $b
## [1] 4
##
## $c
## [1] 6
str(res)
## List of 3
## $ a: num 2
## $ b: num 4
## $ c: num 6
You can use `adply` as well when what you want as output is a dataframe:
res <- adply(v, 1, function(e) {data.frame(out=2 * e, stringsAsFactors=FALSE)})
res
## X1 out
## 1 a 2
## 2 b 4
## 3 c 6
str(res)
## 'data.frame': 3 obs. of 2 variables:
## $ X1 : Factor w/ 3 levels "a","b","c": 1 2 3
## $ out: num 2 4 6
Note that:
- It returns a data frame, not just with the columns specified in the function, but also with an additional column noting the array label that was used when computing.
- The column that holds the original vector `v` label is returned as a factor, not a string.
You can name that output column specifically, as I do below:
res <- adply(v, 1, .id="orig", function(e) {data.frame(out=2 * e, stringsAsFactors=FALSE)})
res
## orig out
## 1 a 2
## 2 b 4
## 3 c 6
str(res)
## 'data.frame': 3 obs. of 2 variables:
## $ orig: Factor w/ 3 levels "a","b","c": 1 2 3
## $ out : num 2 4 6
Also, this marking of the output column is really nice because you may return data frames with multiple rows per single input value. Like so:
res <- adply(v, 1, .id="orig", function(e) {data.frame(out=1:e, stringsAsFactors=FALSE)})
res
## orig out
## 1 a 1
## 2 b 1
## 3 b 2
## 4 c 1
## 5 c 2
## 6 c 3
str(res)
## 'data.frame': 6 obs. of 2 variables:
## $ orig: Factor w/ 3 levels "a","b","c": 1 2 2 3 3 3
## $ out : int 1 1 2 1 2 3
This demonstrates the power you get by customizing your own function for the last argument. The connection with split-apply-combine is that:
- We split the incoming array by row (i.e., each `v` entry in this case).
- We apply the function, which makes a dataframe for each value of `v`.
- We combine all of the output dataframes into one dataframe, with an index field that tracks which value of `v` it is associated with.
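To make those three steps explicit, here is a rough base R sketch of the same computation done by hand. It is only meant to illustrate the idea, not to mirror what `adply` does internally; the names `pieces`, `frames`, and `res_base` are my own.

```r
v <- setNames(1:3, letters[1:3])

# Split: one piece per element of v, keyed by element name.
pieces <- split(v, names(v))

# Apply: build a small data frame for each piece, tagging it with its key.
frames <- lapply(names(pieces), function(k) {
  data.frame(orig = k, out = seq_len(pieces[[k]]), stringsAsFactors = FALSE)
})

# Combine: stack the per-piece data frames into one.
res_base <- do.call(rbind, frames)
res_base
```

The `plyr` call does the tagging and stacking for you, which is most of the appeal.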
A note of warning: factors have issues
v <- factor(c("a", "b", "c"))
## It doesn't like operating on raw factors
tryCatch({a_ply(v, 1, print)}, error=function(e) {print("ERROR"); print(e)})
## [1] "ERROR"
## <simpleError in splitter_a(.data, .margins, .expand): Invalid margin>
## will do ok if converted to characters
a_ply(as.character(v), 1, print)
## [1] "a"
## [1] "b"
## [1] "c"
Multi-dimensional arrays can be operated on by rows or columns:
mat <- matrix(1:9, nrow=3)
mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## Sum rows of matrix
aaply(mat, 1, sum)
## 1 2 3
## 12 15 18
## Sum columns of matrix
aaply(mat, 2, sum)
## 1 2 3
## 6 15 24
Arrays can have more than the 2 dimensions that matrices have, and the second argument can be used to deal with that appropriately. In both of the cases above, what is passed to the function in the last argument is a vector.
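Here is a small sketch of the margin argument on a 3-d array, with a base R `apply` call alongside for comparison. The array and its dimensions are arbitrary, chosen just for illustration.

```r
library(plyr)

# A 2 x 3 x 4 array (dimensions chosen arbitrarily for illustration).
arr <- array(1:24, dim = c(2, 3, 4))

# Margins c(1, 2): split on the first two dimensions, so the function
# receives the values along the third dimension for each (row, column) cell.
res_arr <- aaply(arr, c(1, 2), sum)
res_arr

# Base R equivalent, for comparison.
res_apply <- apply(arr, c(1, 2), sum)
res_apply
```

Passing a vector of margins is the same convention base `apply` uses, so anything you know from there should carry over.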
We shouldn't have to do this as exhaustively as we did for the a.ply functions. But there are some differences to note.
Starting simple, we can make a list and print out its values:
lst <- list(a=1, b=2, c=3)
l_ply(lst, print)
## [1] 1
## [1] 2
## [1] 3
Or return the doubled values as its own list:
res <- llply(lst, function(e) {2 * e})
res
## $a
## [1] 2
##
## $b
## [1] 4
##
## $c
## [1] 6
str(res)
## List of 3
## $ a: num 2
## $ b: num 4
## $ c: num 6
Note:
- It returns a list with the list keys being the input list names and values as the result of the function.
So that behaves well, a bit better than the named a.ply functions did for us.
As an aside, here `lapply` (base R) will behave just as well as `llply`:
res <- lapply(lst, function(e) {2 * e})
res
## $a
## [1] 2
##
## $b
## [1] 4
##
## $c
## [1] 6
str(res)
## List of 3
## $ a: num 2
## $ b: num 4
## $ c: num 6
One thing that can be difficult: `llply` (like the base R functions) iterates over the list values, and returns an object where each key of the output list is associated with the right function output. However, inside the function supplied as the last `llply` argument, you have no way of natively accessing the input list's key that was used. This is sometimes problematic. However, if you really want to access it inside, as well as the value, do the following:
res <- alply(names(lst), 1, function(k) {
e <- lst[[k]]
sprintf("list key k is %s, list value for that key is %s", k, e)
})
res
## $`1`
## [1] "list key k is a, list value for that key is 1"
##
## $`2`
## [1] "list key k is b, list value for that key is 2"
##
## $`3`
## [1] "list key k is c, list value for that key is 3"
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1
## 1 1
## 2 2
## 3 3
str(res)
## List of 3
## $ 1: chr "list key k is a, list value for that key is 1"
## $ 2: chr "list key k is b, list value for that key is 2"
## $ 3: chr "list key k is c, list value for that key is 3"
## - attr(*, "split_type")= chr "array"
## - attr(*, "split_labels")='data.frame': 3 obs. of 1 variable:
## ..$ X1: Factor w/ 3 levels "1","2","3": 1 2 3
Now, however, you have the problem where the output list keys aren't the input names you provided via `names(lst)`. That sucks. But you can stick names on at the end.
names(res) <- names(lst)
res
## $a
## [1] "list key k is a, list value for that key is 1"
##
## $b
## [1] "list key k is b, list value for that key is 2"
##
## $c
## [1] "list key k is c, list value for that key is 3"
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1
## 1 1
## 2 2
## 3 3
str(res)
## List of 3
## $ a: chr "list key k is a, list value for that key is 1"
## $ b: chr "list key k is b, list value for that key is 2"
## $ c: chr "list key k is c, list value for that key is 3"
## - attr(*, "split_type")= chr "array"
## - attr(*, "split_labels")='data.frame': 3 obs. of 1 variable:
## ..$ X1: Factor w/ 3 levels "1","2","3": 1 2 3
## llply doesn't work either, by the way.
res <- llply(names(lst), function(k) {
e <- lst[[k]]
sprintf("list key k is %s, list value for that key is %s", k, e)
})
res
## [[1]]
## [1] "list key k is a, list value for that key is 1"
##
## [[2]]
## [1] "list key k is b, list value for that key is 2"
##
## [[3]]
## [1] "list key k is c, list value for that key is 3"
str(res)
## List of 3
## $ : chr "list key k is a, list value for that key is 1"
## $ : chr "list key k is b, list value for that key is 2"
## $ : chr "list key k is c, list value for that key is 3"
Again, some complications:
- The result list doesn't get the elements of `names(lst)` as output list keys. I don't know why. It doesn't work if I use `llply` either.
My solution is that I use one of the solutions presented here:
http://stackoverflow.com/a/20546621/1022967
Here's the simplest such solution, which uses base R rather than `plyr`:
res <- sapply(names(lst), function(k) {
e <- lst[[k]]
sprintf("list key k is '%s', list value for that key is '%s'", k, e)
}, simplify=FALSE)
res
## $a
## [1] "list key k is 'a', list value for that key is '1'"
##
## $b
## [1] "list key k is 'b', list value for that key is '2'"
##
## $c
## [1] "list key k is 'c', list value for that key is '3'"
str(res)
## List of 3
## $ a: chr "list key k is 'a', list value for that key is '1'"
## $ b: chr "list key k is 'b', list value for that key is '2'"
## $ c: chr "list key k is 'c', list value for that key is '3'"
And the parameter `simplify=FALSE` is necessary; otherwise, in this case, `sapply` will reduce the result from a list to a vector. Which may be what you want, or maybe not...
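Another base R option, sketched here under the assumption that `lst` is the same named list as above, is `Map`, which walks over the keys and the values in parallel. Since the first vector handed to it is a character vector, the output list gets named after it. The name `res_map` is mine.

```r
lst <- list(a = 1, b = 2, c = 3)

# Map calls the function with one key and the matching value each time.
res_map <- Map(function(k, e) {
  sprintf("list key k is '%s', list value for that key is '%s'", k, e)
}, names(lst), lst)
res_map
str(res_map)
```

This avoids the `lst[[k]]` lookup inside the function, at the cost of passing the list twice.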
Let's return a dataframe from the list elements
res <- ldply(lst, .id="original", function(e) {
data.frame(doubled=2 * e)
})
res
## original doubled
## 1 a 2
## 2 b 4
## 3 c 6
str(res)
## 'data.frame': 3 obs. of 2 variables:
## $ original: Factor w/ 3 levels "a","b","c": 1 2 3
## $ doubled : num 2 4 6
I don't usually use this. I just use the base R `sapply`, as it does what I want, and there are examples of that above.
So, a note... the `d.ply` functions were probably the most used functions in `plyr`, more than the `a.ply` and `l.ply` ones. That's because most people have their data in dataframes, and that is where the split-apply-combine model really gives you traction. I think its popularity is what drove Hadley Wickham to write the `dplyr` package, which is basically a rewrite of the API but just for `d.ply` operations. So when I really want to operate on dataframes and get dataframes out, I use `dplyr`, but sometimes I want to have either the input or output not be a dataframe, or I want a custom function that prints out diagnostic info or performs other side effects, and in that case, `plyr` is still really, really handy.
First, let's get a decent dataframe.
## Called it `dat` because I hate typing `CO2`
dat <- CO2
str(dat)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 5 variables:
## $ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
## $ conc : num 95 175 250 350 500 675 1000 95 175 250 ...
## $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## - attr(*, "formula")=Class 'formula' language uptake ~ conc | Plant
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Treatment * Type
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Ambient carbon dioxide concentration"
## ..$ y: chr "CO2 uptake rate"
## - attr(*, "units")=List of 2
## ..$ x: chr "(uL/L)"
## ..$ y: chr "(umol/m^2 s)"
dat %>% tbl_df()
## # A tibble: 84 x 5
## Plant Type Treatment conc uptake
## * <ord> <fctr> <fctr> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## # ... with 74 more rows
A very common task is to perform an aggregation function on a subset of rows of a dataframe for different values of a set of identifying columns. To be concrete, let's say for this dataframe we ultimately want to find the mean of `uptake` for each type of `Treatment`, regardless of `Plant` or `Type` values. The first thing I want to show you is how we tell the `d.ply` functions to split the input dataframe on different values of a column or set of columns. We'll do this by printing out the different 'splits' of the data by `Treatment`.
As an aside, I like the piping structure (`%>%`) that the `magrittr` package provides (which is loaded above via the `dplyr` package load), and will leverage that.
d_ply(dat, ~ Treatment, function(df) {
# Following statement same as:
# ltreat <- unique(as.character(df$Treatment))
ltreat <- df$Treatment %>% as.character() %>% unique()
cat("\n------------------------------")
cat("\nTreatment: ", ltreat)
cat("\n------------------------------\n")
print(tbl_df(df))
cat("\n..............................\n")
cat("Mean uptake:")
cat("\n..............................\n")
df$uptake %>% mean() %>% print()
})
##
## ------------------------------
## Treatment: nonchilled
## ------------------------------
## # A tibble: 42 x 5
## Plant Type Treatment conc uptake
## <ord> <fctr> <fctr> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## # ... with 32 more rows
##
## ..............................
## Mean uptake:
## ..............................
## [1] 30.64286
##
## ------------------------------
## Treatment: chilled
## ------------------------------
## # A tibble: 42 x 5
## Plant Type Treatment conc uptake
## <ord> <fctr> <fctr> <dbl> <dbl>
## 1 Qc1 Quebec chilled 95 14.2
## 2 Qc1 Quebec chilled 175 24.1
## 3 Qc1 Quebec chilled 250 30.3
## 4 Qc1 Quebec chilled 350 34.6
## 5 Qc1 Quebec chilled 500 32.5
## 6 Qc1 Quebec chilled 675 35.4
## 7 Qc1 Quebec chilled 1000 38.7
## 8 Qc2 Quebec chilled 95 9.3
## 9 Qc2 Quebec chilled 175 27.3
## 10 Qc2 Quebec chilled 250 35.0
## # ... with 32 more rows
##
## ..............................
## Mean uptake:
## ..............................
## [1] 23.78333
Observations:
- We specify that we want to consider chunks of `dat` as split by the values of `Treatment`. In this case, we specify that by the second argument to `d_ply`: `~ Treatment`. That is a formula interface, which is a common structure in specifying things like linear models to the `lm` function and in other locations. It has many advantages I won't lay out here, but it is useful to start to get to know. The thing to know is that, for `d.ply` functions, you specify the columns you want to split on as a formula with a leading tilde (`~`) followed by the column names, separated by the `+` symbol. We'll see more examples of that below.
- We extract the unique value of `Treatment` into `ltreat`, using `magrittr` piping.
- I like to use `cat` to control written output to the console to log what I am doing and critical variable values.
- I am just printing out the subset of `dat`'s rows that is provided to the last argument function via the `df` parameter in that function. So, `d_ply` is taking care of invoking that last function with the `dat` subset of rows once for each unique value of `Treatment`. That is powerful.
- I calculate the mean `uptake` for each `dat` subset as split by `Treatment`.
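To see what the split step produces on its own, here is a rough base R sketch using `split()`. This is only an illustration of the idea, not what `d_ply` does internally; the names `chunks` and `chunk_means` are mine.

```r
# CO2 ships with R; drop the extra grouping classes for a plain data frame.
dat <- as.data.frame(CO2)

# One data frame per Treatment level: the same kind of chunks that get
# handed to the function in the last argument.
chunks <- split(dat, dat$Treatment)
names(chunks)
sapply(chunks, nrow)

# Applying a function to each chunk by hand.
chunk_means <- sapply(chunks, function(df) mean(df$uptake))
chunk_means
```

The means should agree with the ones printed by the `d_ply` call above.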
Well, as said, this is nice for seeing results, but we often want to store the results of this mean calculation in a nice data structure we can recall and pull values out of. Most commonly, we'd like a dataframe with one column marking the `Treatment` level considered and another holding the mean uptake actually calculated. That can be done via `ddply`.
This, honestly, is probably the most common `plyr` function. I, for one, was used to the SAS paradigm of all inputs and outputs being a table. Those in Matlab may be used to everything being vectors or matrices (I'm guessing here). But data and rectangular data structures seem to go hand in glove conceptually. And rectangular data structures are easy to iterate over. Here is the `ddply` call:
dat_res <- ddply(dat, ~ Treatment, function(df) {
uptake_mean <- df$uptake %>% mean()
data.frame(uptake_mean, stringsAsFactors=FALSE)
})
dat_res %>% print()
## Treatment uptake_mean
## 1 nonchilled 30.64286
## 2 chilled 23.78333
dat_res %>% str()
## 'data.frame': 2 obs. of 2 variables:
## $ Treatment : Factor w/ 2 levels "nonchilled","chilled": 1 2
## $ uptake_mean: num 30.6 23.8
Observations:
- We return a dataframe from the last function argument with just `uptake_mean` as a column, but `ddply` takes care to add a column for the value of the splitting column `Treatment`.
What if we want to split on `Treatment` and `Type`?
dat_res <- ddply(dat, ~ Treatment + Type, function(df) {
uptake_mean <- df$uptake %>% mean()
data.frame(uptake_mean, stringsAsFactors=FALSE)
})
dat_res %>% print()
## Treatment Type uptake_mean
## 1 nonchilled Quebec 35.33333
## 2 nonchilled Mississippi 25.95238
## 3 chilled Quebec 31.75238
## 4 chilled Mississippi 15.81429
dat_res %>% str()
## 'data.frame': 4 obs. of 3 variables:
## $ Treatment : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
## $ uptake_mean: num 35.3 26 31.8 15.8
We've seen the formula interface. There are others, like specifying the column names as a character vector:
dat_res <- ddply(dat, c("Treatment", "Type"), function(df) {
uptake_mean <- df$uptake %>% mean()
data.frame(uptake_mean, stringsAsFactors=FALSE)
})
dat_res %>% print()
## Treatment Type uptake_mean
## 1 nonchilled Quebec 35.33333
## 2 nonchilled Mississippi 25.95238
## 3 chilled Quebec 31.75238
## 4 chilled Mississippi 15.81429
dat_res %>% str()
## 'data.frame': 4 obs. of 3 variables:
## $ Treatment : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
## $ uptake_mean: num 35.3 26 31.8 15.8
`plyr` provides a quoting operator `.()` that cleans up that expression a bit:
dat_res <- ddply(dat, .(Treatment, Type), function(df) {
uptake_mean <- df$uptake %>% mean()
data.frame(uptake_mean, stringsAsFactors=FALSE)
})
dat_res %>% print()
## Treatment Type uptake_mean
## 1 nonchilled Quebec 35.33333
## 2 nonchilled Mississippi 25.95238
## 3 chilled Quebec 31.75238
## 4 chilled Mississippi 15.81429
dat_res %>% str()
## 'data.frame': 4 obs. of 3 variables:
## $ Treatment : Factor w/ 2 levels "nonchilled","chilled": 1 1 2 2
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 2 1 2
## $ uptake_mean: num 35.3 26 31.8 15.8
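For comparison, here is roughly how I would write the same two-column aggregation with `dplyr`. This is a sketch, assuming `dplyr` is installed; I prefix the verbs with `dplyr::` because `plyr` exports a `summarise` of its own, and which one you get unprefixed depends on load order. The name `res_dplyr` is mine.

```r
library(dplyr)

# CO2 ships with R; drop the extra grouping classes for a plain data frame.
dat <- as.data.frame(CO2)

res_dplyr <- dat %>%
  dplyr::group_by(Treatment, Type) %>%               # split
  dplyr::summarise(uptake_mean = mean(uptake)) %>%   # apply + combine
  dplyr::ungroup()
res_dplyr
```

The numbers should match the `ddply` output above; the difference is mostly in how the split and the per-group computation are spelled.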
Another use case is when you want to operate on a dataframe a subset of rows at a time, but return an object that is not a dataframe, such as, say, a linear model for that subset of data. Here is how to do that:
lst_res <- dlply(dat, ~ Treatment + Type, function(df) {
lm(uptake ~ conc, df)
})
lst_res %>% print()
## $nonchilled.Quebec
##
## Call:
## lm(formula = uptake ~ conc, data = df)
##
## Coefficients:
## (Intercept) conc
## 25.58503 0.02241
##
##
## $nonchilled.Mississippi
##
## Call:
## lm(formula = uptake ~ conc, data = df)
##
## Coefficients:
## (Intercept) conc
## 18.45329 0.01724
##
##
## $chilled.Quebec
##
## Call:
## lm(formula = uptake ~ conc, data = df)
##
## Coefficients:
## (Intercept) conc
## 21.42104 0.02375
##
##
## $chilled.Mississippi
##
## Call:
## lm(formula = uptake ~ conc, data = df)
##
## Coefficients:
## (Intercept) conc
## 12.541791 0.007523
##
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## Treatment Type
## 1 nonchilled Quebec
## 2 nonchilled Mississippi
## 3 chilled Quebec
## 4 chilled Mississippi
#lst_res %>% str()
Observations:
- This makes for very easy mass applications of a given modeling structure and collection of results.
- The resulting list element names can be a bit cumbersome, but recall that the returned list also has the attribute `split_labels`, a dataframe that stores the values of the two splitting columns. Its rows correspond 1:1 with the list values (linear models here) returned: the row number in the `split_labels` dataframe is the list index of the object returned for those level values.
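One way to flatten a list of models like this into a table is to run it through `ldply` with `coef`. A minimal sketch, assuming the same data and model as above; the name `coef_df` is mine.

```r
library(plyr)

# CO2 ships with R; drop the extra grouping classes for a plain data frame.
dat <- as.data.frame(CO2)

# One linear model per Treatment/Type combination, as above.
lst_res <- dlply(dat, ~ Treatment + Type, function(df) {
  lm(uptake ~ conc, df)
})

# coef() returns a named numeric vector per model; ldply stacks those
# into one row per model.
coef_df <- ldply(lst_res, coef)
coef_df
```

If the split labels are carried along as I expect, the `Treatment` and `Type` columns identify each row without having to parse the list names.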
Assume you just want a vector of the means returned:
arr_res <- daply(dat, ~ Treatment, function(df) {
df$uptake %>% mean()
})
arr_res %>% print()
## nonchilled chilled
## 30.64286 23.78333
arr_res %>% str()
## Named num [1:2] 30.6 23.8
## - attr(*, "names")= chr [1:2] "nonchilled" "chilled"
If more than one column is used in splitting, an array with more than one dimension is returned:
arr_res <- daply(dat, ~ Treatment + Type, function(df) {
df$uptake %>% mean()
})
arr_res %>% print()
## Type
## Treatment Quebec Mississippi
## nonchilled 35.33333 25.95238
## chilled 31.75238 15.81429
arr_res %>% str()
## num [1:2, 1:2] 35.3 31.8 26 15.8
## - attr(*, "dimnames")=List of 2
## ..$ Treatment: chr [1:2] "nonchilled" "chilled"
## ..$ Type : chr [1:2] "Quebec" "Mississippi"
All of this can be done with base R functions, and had to be done that way prior to the `plyr` and `dplyr` packages. Here are some options that were (and are) available...
dat <- CO2
str(dat)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 5 variables:
## $ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
## $ conc : num 95 175 250 350 500 675 1000 95 175 250 ...
## $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## - attr(*, "formula")=Class 'formula' language uptake ~ conc | Plant
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Treatment * Type
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Ambient carbon dioxide concentration"
## ..$ y: chr "CO2 uptake rate"
## - attr(*, "units")=List of 2
## ..$ x: chr "(uL/L)"
## ..$ y: chr "(umol/m^2 s)"
dat %>% tbl_df()
## # A tibble: 84 x 5
## Plant Type Treatment conc uptake
## * <ord> <fctr> <fctr> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## # ... with 74 more rows
## tapply, one grouping col
res <- tapply(dat$uptake, dat$Treatment, mean)
res %>% print()
## nonchilled chilled
## 30.64286 23.78333
res %>% str()
## num [1:2(1d)] 30.6 23.8
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:2] "nonchilled" "chilled"
## tapply, > 1 grouping col
res <- tapply(dat$uptake, list(dat$Treatment, dat$Type), mean)
res %>% print()
## Quebec Mississippi
## nonchilled 35.33333 25.95238
## chilled 31.75238 15.81429
res %>% str()
## num [1:2, 1:2] 35.3 31.8 26 15.8
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2] "nonchilled" "chilled"
## ..$ : chr [1:2] "Quebec" "Mississippi"
Observations:
- Least flexible. The first argument must be an atomic vector. The function in the last argument then receives the subset of that first vector for each group.
The base `by` function can do a lot of this, but it is awkward. Here is an example:
## by
bydat <- by(dat, list(dat$Treatment, dat$Type), function(df) mean(df$uptake), simplify=FALSE)
bydat %>% print()
## : nonchilled
## : Quebec
## [1] 35.33333
## ---------------------------------------------------------------------------------------
## : chilled
## : Quebec
## [1] 31.75238
## ---------------------------------------------------------------------------------------
## : nonchilled
## : Mississippi
## [1] 25.95238
## ---------------------------------------------------------------------------------------
## : chilled
## : Mississippi
## [1] 15.81429
bydat %>% str()
## List of 4
## $ : num 35.3
## $ : num 31.8
## $ : num 26
## $ : num 15.8
## - attr(*, "dim")= int [1:2] 2 2
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2] "nonchilled" "chilled"
## ..$ : chr [1:2] "Quebec" "Mississippi"
## - attr(*, "call")= language by.data.frame(data = dat, INDICES = list(dat$Treatment, dat$Type), FUN = function(df) mean(df$uptake), simplify = FALSE)
## - attr(*, "class")= chr "by"
Observations:
- `by` seems to work a lot like `dlply`, but specifying the splitting columns is a bit clunky.
## Aggregate
aggdat <- aggregate(uptake ~ Treatment + Type, dat, mean)
aggdat %>% print()
## Treatment Type uptake
## 1 nonchilled Quebec 35.33333
## 2 chilled Quebec 31.75238
## 3 nonchilled Mississippi 25.95238
## 4 chilled Mississippi 15.81429
aggdat %>% str()
## 'data.frame': 4 obs. of 3 variables:
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
## $ uptake : num 35.3 31.8 26 15.8
## Aggregate, two vars
## Not so flexible in LHS -- can't operate individually on each LHS term --
## actually computes sum, not what is usually wanted.
aggdat <- aggregate(uptake + conc ~ Treatment + Type, dat, mean)
aggdat %>% print()
## Treatment Type uptake + conc
## 1 nonchilled Quebec 470.3333
## 2 chilled Quebec 466.7524
## 3 nonchilled Mississippi 460.9524
## 4 chilled Mississippi 450.8143
aggdat %>% str()
## 'data.frame': 4 obs. of 3 variables:
## $ Treatment : Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
## $ uptake + conc: num 470 467 461 451
## This is what is wanted, which is still not all that flexible:
aggdat <- aggregate(cbind(uptake, conc) ~ Treatment + Type, dat, mean)
aggdat %>% print()
## Treatment Type uptake conc
## 1 nonchilled Quebec 35.33333 435
## 2 chilled Quebec 31.75238 435
## 3 nonchilled Mississippi 25.95238 435
## 4 chilled Mississippi 15.81429 435
aggdat %>% str()
## 'data.frame': 4 obs. of 4 variables:
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 2 1 2
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 2 2
## $ uptake : num 35.3 31.8 26 15.8
## $ conc : num 435 435 435 435
Observations:
- `aggregate` has some power, but it has odd syntax, and even with that, it is not all that flexible. I favor `plyr` because it has a simpler interface.