## summarize-group-data-talk.md

      
Introduction
Dataset to work with
  What is the total number of observations made?
Review and such
  How does 'mean' work?
Base functions for summarization
  tapply
  by
  ave
  aggregate
CRAN packages for summarization
  plyr
  dplyr
Conclusion

Author: Matt Pettis
email: matthew.pettis@gmail.com
github: mpettis
Introduction

It's one thing to get data into R in a form you can work with; it's
another to do the things you want to do with that data. As R is at
heart a statistical language, one of the more common things users want
to do with their data is summarize it. And they don't just want simple
means, totals, etc.; they want R to compute them within naturally
occurring groups.

For instance, if you have a list of people's heights, along with
whether each person is male or female, you may want to find the average
height for females separately from the average height for males. This
is what is meant by doing 'summaries by groups.' Further, you may want
more granular groupings, such as the decade in which people were born
(1970s, 1980s, etc.) as well as the female/male distinction. So you will
want to be able to tell R 'do your summaries by sex and by birth
decade.'

This document is intended to walk you through the different ways you can
do this in the R system.
Dataset to work with

We will look at the dataset warpbreaks that is included with R. You
can run help("warpbreaks") to see what is in the dataset. For
instructional purposes, what we care about is:

There is one measurement variable: breaks.
There are two categorical variables, called wool (2 levels
indicating types) and tension (3 levels: L=Low, M=Medium, H=High).

We will be able to use this dataset to illustrate a variety of
techniques for common analysis needs.
I'm also going to add a second, made-up numeric column called qscore
just for the sake of having a second numeric variable to play with. You
can look at the code, but it is not necessary to understand it for the
sake of this tutorial. Within each combination of wool and tension,
this variable is normally distributed (sd = 1) around a mean that is
picked at random, between 1 and 100, for that combination.
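  ## dplyr supplies %>%, group_by(), do(), and mutate(), which are used below
suppressPackageStartupMessages(library(dplyr))
data("warpbreaks")
dat <- warpbreaks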

  ## Add a made-up variable called `qscore` to have a second numeric variable.
  ## Each wool/tension combo has qscore as a normal variate of sd = 1
  ## and mean as a random selection between 1 and 100 (for each group)
set.seed(1234)
dat <- dat %>%
  group_by(wool, tension) %>%
  do({ldf <- .; rmean <- sample(1:100, 1); ldf %>% mutate(qscore=rnorm(n(), rmean))}) %>%
  ungroup()

  # Sample of data (head)
head(dat)

## # A tibble: 6 × 4
##   breaks   wool tension   qscore
##    <dbl> <fctr>  <fctr>    <dbl>
## 1     26      A       L 12.31153
## 2     30      A       L 12.31437
## 3     54      A       L 12.35929
## 4     25      A       L 11.26953
## 5     70      A       L 12.03573
## 6     52      A       L 12.11298

What is the total number of observations made?

nrow(dat)

## [1] 54

Review and such

How does 'mean' work?

Recall a few things:

A dataframe is just a collection of same-length vectors
(the columns)
... all stored in one group (a list)
... with an attribute recording the fact it is a collection of
same-length vectors (class = "data.frame")

To repeat, a dataframe is just a list of vectors that are all of the
same length.
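As a quick check of that claim, here is a minimal sketch using the built-in warpbreaks data. The names df and col_classes are just illustrative choices, not part of the original walkthrough.

```r
df <- warpbreaks            # built-in dataset, the same one used in this document

is.list(df)                 # a data frame is a list underneath
class(df)                   # ... with a class attribute of "data.frame"
lengths(df)                 # every column (list element) has the same length

# Because it is a list, list tools work column by column
col_classes <- sapply(df, class)
col_classes

# List-style extraction hands back an ordinary vector
identical(df[["breaks"]], df$breaks)
```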
You can take out the individual vectors and store them in what you
consider 'normal' looking vectors, like so:
vBreaks <- dat$breaks
vBreaks

##  [1] 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27 14 29 19 29 31 41 20 44 42
## [38] 26 19 16 39 28 21 39 29 20 21 24 17 13 15 15 16 28

vWool <- dat$wool
vWool

##  [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B B B B
## Levels: A B

vTension <- dat$tension
vTension

##  [1] L L L L L L L L L M M M M M M M M M H H H H H H H H H L L L L L L L L L M M M M M M M M M H H H H H H H H H
## Levels: L M H

When you use the function mean(), you take the mean of all of the
values stored in a vector:
mean(vBreaks)

## [1] 28.14815

Let's also be clear: mean() computes the mean of a single argument,
which is a vector. So be sure you know the difference between what
happens in each of these two calls:
mean(c(1, 9))

## [1] 5

mean(1,9)

## [1] 1

In the first, you pass a single vector as the argument. In the second,
you pass 2 arguments, and mean() takes the mean of the first argument
(1) only. The 9 is not treated as data: it is matched by position to
mean()'s second parameter, trim.
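To see where the second argument goes, the sketch below inspects the parameters of mean.default() and then uses a slightly larger made-up vector x (an illustrative value, not from the dataset) where the effect of trim is visible.

```r
# The default method's parameters, in positional order
names(formals(mean.default))

mean(c(1, 9))    # one vector holding two values
mean(1, 9)       # x = 1, trim = 9: only the 1 is data

# With more data, the trim parameter has a visible effect
x <- c(1, 2, 3, 100)
plain_mean   <- mean(x)                # all four values
trimmed_mean <- mean(x, trim = 0.25)   # drops the lowest and highest quarter first
by_position  <- mean(x, 0.25)          # same call, trim supplied by position
c(plain_mean, trimmed_mean, by_position)
```

If you really do want the mean of several separate numbers, wrap them in c() so that they arrive as one vector.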
Base functions for summarization

tapply

I personally never use this method, as it is more for dealing with data
in vectors, and there are better APIs for data stored in data frames.
But as you may encounter it, we will illustrate it.
For instance, above, we stored the breaks data in the vBreaks
vector, and the vector that records the wool type for each entry in
vBreaks in vWool. Sometimes the data is just stored in vectors and
not dataframes, and you just have to deal with it.
tapply() is a base function that deals with making summaries of numbers
in one vector when a second vector indicates which category each
number in the first vector belongs to. It is easier to see with the
data:
vBreaks

##  [1] 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27 14 29 19 29 31 41 20 44 42
## [38] 26 19 16 39 28 21 39 29 20 21 24 17 13 15 15 16 28

vWool

##  [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B B B B
## Levels: A B

tapply(vBreaks, vWool, mean)

##        A        B 
## 31.03704 25.25926

So, this is the mean of each of the two wool groups, and tapply()
knows which numbers belong to which group from the entries in the vWool
vector.
If you are so inclined, you can actually do multi-dimensional summaries
with tapply(). Because tension is also a factor, you can still use
tapply() to find the means of all of the wool/tension level
combinations:
tapply(vBreaks, list(vWool, vTension), mean)

##          L        M        H
## A 44.55556 24.00000 24.55556
## B 28.22222 28.77778 18.77778

It is an array that you get as an output (like a matrix). This may be
what you want. I rarely want this.
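If you do end up with such an array, here is a hedged sketch of two things you can do with it: index it by level name, or flatten it to one row per combination. The names m and long, and the column name breaks_mean, are illustrative choices.

```r
# Naming the list elements gives the array named dimensions
m <- tapply(warpbreaks$breaks,
            list(wool = warpbreaks$wool, tension = warpbreaks$tension),
            mean)

m["B", "M"]   # a single cell, addressed by wool level and tension level
m["A", ]      # all tension levels for wool A

# Flatten to long format: one row per wool/tension combination
long <- as.data.frame(as.table(m), responseName = "breaks_mean")
long
```

The long form is usually easier to merge with other data frames than the array is.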
by

by() is another aggregating function, again, one I don't use much.
According to its documentation, by() is an object-oriented wrapper for
tapply() applied to dataframes. I've had some trouble in that
department. What it really seems better at is returning general
objects per level-combination.
Note that:

You have to extract the columns you want as a dataframe (or list
of vectors).
Your function gets passed a dataframe with the subset of rows
related to a particular combination of factor levels dictated by the
values in the second argument. Therefore, your function needs to
deal with dataframes, and not vectors -- you may need to unpack a
vector from the dataframe passed to it.


byObject <- by(dat
   , dat[,c("wool", "tension")]
   , function(df) mean(df$breaks))
byObject

## wool: A
## tension: L
## [1] 44.55556
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: L
## [1] 28.22222
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: M
## [1] 24
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: M
## [1] 28.77778
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: H
## [1] 24.55556
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: H
## [1] 18.77778

Since by() can handle more complex objects than tapply(), we should
look at what beast by() actually returns in this case:
str(byObject)

##  by [1:2, 1:3] 44.6 28.2 24 28.8 24.6 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ wool   : chr [1:2] "A" "B"
##   ..$ tension: chr [1:3] "L" "M" "H"
##  - attr(*, "call")= language by.data.frame(data = dat, INDICES = dat[, c("wool", "tension")], FUN = function(df) mean(df$breaks))

You get back an object that by() tools know how to handle, and so it
is a little complex under the hood. For instance, how would you extract
the value where wool = 'B' and tension = 'M'?
byObject["B", "M"]

## [1] 28.77778

I don't like that -- you have to really know the internals of the
byObject, and that it has wool as the first index and tension as the
second one.
summary() computes means and medians on a dataframe easily, and to get
it to do that on a per wool/tension combination, you can use by():
byObject <- by(dat[,c("breaks", "qscore")]
   , dat[,c("wool", "tension")]
   , summary)
byObject

## wool: A
## tension: L
##      breaks          qscore     
##  Min.   :25.00   Min.   :11.27  
##  1st Qu.:26.00   1st Qu.:12.04  
##  Median :51.00   Median :12.31  
##  Mean   :44.56   Mean   :12.24  
##  3rd Qu.:54.00   3rd Qu.:12.36  
##  Max.   :70.00   Max.   :13.43  
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: L
##      breaks          qscore     
##  Min.   :14.00   Min.   :73.82  
##  1st Qu.:20.00   1st Qu.:74.66  
##  Median :29.00   Median :75.06  
##  Mean   :28.22   Mean   :75.13  
##  3rd Qu.:31.00   3rd Qu.:75.50  
##  Max.   :44.00   Max.   :77.10  
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: M
##      breaks       qscore     
##  Min.   :12   Min.   :23.00  
##  1st Qu.:18   1st Qu.:23.16  
##  Median :21   Median :23.49  
##  Mean   :24   Mean   :23.60  
##  3rd Qu.:30   3rd Qu.:23.89  
##  Max.   :36   Max.   :24.96  
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: M
##      breaks          qscore     
##  Min.   :16.00   Min.   :37.52  
##  1st Qu.:21.00   1st Qu.:38.51  
##  Median :28.00   Median :39.11  
##  Mean   :28.78   Mean   :39.13  
##  3rd Qu.:39.00   3rd Qu.:40.26  
##  Max.   :42.00   Max.   :40.28  
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: H
##      breaks          qscore     
##  Min.   :10.00   Min.   : 99.5  
##  1st Qu.:18.00   1st Qu.:100.0  
##  Median :24.00   Median :100.0  
##  Mean   :24.56   Mean   :100.2  
##  3rd Qu.:28.00   3rd Qu.:100.4  
##  Max.   :43.00   Max.   :100.9  
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: H
##      breaks          qscore     
##  Min.   :13.00   Min.   :50.19  
##  1st Qu.:15.00   1st Qu.:50.99  
##  Median :17.00   Median :51.48  
##  Mean   :18.78   Mean   :51.61  
##  3rd Qu.:21.00   3rd Qu.:51.84  
##  Max.   :28.00   Max.   :53.65

What does that object look like?
str(byObject)

## List of 6
##  $ : 'table' chr [1:6, 1:2] "Min.   :25.00  " "1st Qu.:26.00  " "Median :51.00  " "Mean   :44.56  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :14.00  " "1st Qu.:20.00  " "Median :29.00  " "Mean   :28.22  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :12  " "1st Qu.:18  " "Median :21  " "Mean   :24  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :16.00  " "1st Qu.:21.00  " "Median :28.00  " "Mean   :28.78  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :10.00  " "1st Qu.:18.00  " "Median :24.00  " "Mean   :24.56  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :13.00  " "1st Qu.:15.00  " "Median :17.00  " "Mean   :18.78  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  - attr(*, "dim")= int [1:2] 2 3
##  - attr(*, "dimnames")=List of 2
##   ..$ wool   : chr [1:2] "A" "B"
##   ..$ tension: chr [1:3] "L" "M" "H"
##  - attr(*, "call")= language by.data.frame(data = dat[, c("breaks", "qscore")], INDICES = dat[, c("wool", "tension")], FUN = summary)
##  - attr(*, "class")= chr "by"

It's a little more difficult to get at the numerical parts of that
output easily, so I don't do things this way. Frankly, I don't ever use
by()...
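When the goal is numbers you can index directly, one alternative is to skip by() and summary() and combine split() with sapply(). This is a different technique from the by() calls above, sketched here on the breaks column only; the names groups and stats are illustrative.

```r
# One numeric vector per wool/tension combination, named like "A.L", "B.M", ...
groups <- split(warpbreaks$breaks,
                list(warpbreaks$wool, warpbreaks$tension))

# Apply a function that returns a named numeric vector to each group;
# sapply() binds the results into a numeric matrix (statistics x groups)
stats <- sapply(groups, function(x) c(mean = mean(x), median = median(x)))
stats

stats["mean", "B.M"]     # mean of breaks for wool B, tension M
stats["median", "A.L"]   # median of breaks for wool A, tension L
```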
ave

Very much like tapply() but with one twist -- it doesn't return a
summary vector, but a vector the same length as the original numeric
vector, with the group mean (the default summary function) repeated at
each of the appropriate indices:
ave(vBreaks, list(vWool, vTension))

##  [1] 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 24.00000 24.00000 24.00000
## [13] 24.00000 24.00000 24.00000 24.00000 24.00000 24.00000 24.55556 24.55556 24.55556 24.55556 24.55556 24.55556
## [25] 24.55556 24.55556 24.55556 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222
## [37] 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 18.77778 18.77778 18.77778
## [49] 18.77778 18.77778 18.77778 18.77778 18.77778 18.77778

This is helpful if you want to attach a column of mean-per-level values
on the original raw or granular dataset:
datt <- dat
datt$breaks_mean <- ave(vBreaks, list(vWool, vTension))
head(datt, 11)

## # A tibble: 11 × 5
##    breaks   wool tension   qscore breaks_mean
##     <dbl> <fctr>  <fctr>    <dbl>       <dbl>
## 1      26      A       L 12.31153    44.55556
## 2      30      A       L 12.31437    44.55556
## 3      54      A       L 12.35929    44.55556
## 4      25      A       L 11.26953    44.55556
## 5      70      A       L 12.03573    44.55556
## 6      52      A       L 12.11298    44.55556
## 7      51      A       L 13.42855    44.55556
## 8      26      A       L 12.98340    44.55556
## 9      67      A       L 11.37754    44.55556
## 10     18      A       M 23.52281    24.00000
## 11     21      A       M 23.00161    24.00000

aggregate

If your data is in a dataframe, which it usually is, it is often easier
to use functions tailored for use in data frames, like aggregate().
Below is an example:
aggregate(formula=breaks ~ wool
          , data = dat
          , FUN=mean)

##   wool   breaks
## 1    A 31.03704
## 2    B 25.25926

Here, we feed aggregate() three parameters.
First is formula, which tells R which numeric variables you
want to compute statistics on (here, breaks), and which columns are the
classification factors (here, wool). For aggregate(), the ~
separates the numeric variables, which are on the left side of ~, from
the classification factors, which are to the right of the ~. We'll do
a more complex example of a formula in a later example.
The second parameter is data, and it names the dataframe in which we
find the column names we use in the formula parameter.
The third parameter is FUN, which is the function we apply
to the numeric variable. Here we apply mean to the column breaks.
Note that the parameter names are usually omitted, because R then
matches the arguments by position. This will give the same thing:
aggregate(breaks ~ wool
          , dat
          , mean)

##   wool   breaks
## 1    A 31.03704
## 2    B 25.25926

What if I want the mean by the levels of wool and tension together?
aggregate(breaks ~ wool + tension
          , dat
          , mean)

##   wool tension   breaks
## 1    A       L 44.55556
## 2    B       L 28.22222
## 3    A       M 24.00000
## 4    B       M 28.77778
## 5    A       H 24.55556
## 6    B       H 18.77778

It is as simple as putting all of the columns you want to use as
discriminating factors, separated by + symbols, on the right side of the
~.
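A related shorthand, sketched here on the unmodified warpbreaks data: a dot on the right side of the ~ stands for every column not already named in the formula. The names by_both and by_dot are illustrative.

```r
# Grouping columns spelled out
by_both <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

# The dot expands to all remaining columns (here: wool and tension)
by_dot <- aggregate(breaks ~ ., data = warpbreaks, FUN = mean)

by_both
by_dot
```

The shorthand is only safe when every remaining column really is a grouping factor. On a dataframe with an extra numeric column, such as the qscore column added earlier, the dot would group by that column too.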
How about if we want the mean of breaks and qscore in the output?
  ## WARNING: THIS IS INCORRECT
aggregate(breaks + qscore ~ wool + tension
          , dat
          , mean)

##   wool tension breaks + qscore
## 1    A       L        56.79921
## 2    B       L       103.35137
## 3    A       M        47.60027
## 4    B       M        67.91183
## 5    A       H       124.75222
## 6    B       H        70.39066

OK, that didn't work as expected. It took the mean of the result of
adding breaks to qscore. We can fix that with:
aggregate(cbind(breaks, qscore) ~ wool + tension
          , dat
          , mean)

##   wool tension   breaks    qscore
## 1    A       L 44.55556  12.24366
## 2    B       L 28.22222  75.12915
## 3    A       M 24.00000  23.60027
## 4    B       M 28.77778  39.13406
## 5    A       H 24.55556 100.19666
## 6    B       H 18.77778  51.61288

So, you have to cbind() together the columns you want to apply means
to. Ugh.
What if you want to calculate the mean and median of breaks in one
output?
aggregate(breaks ~ wool + tension
          , dat
          , function(e) c("xbar"=mean(e), "xm"=median(e)))

##   wool tension breaks.xbar breaks.xm
## 1    A       L    44.55556  51.00000
## 2    B       L    28.22222  29.00000
## 3    A       M    24.00000  21.00000
## 4    B       M    28.77778  28.00000
## 5    A       H    24.55556  24.00000
## 6    B       H    18.77778  17.00000

The output names get appended to the variable name, but you can rename
in post-processing if you want to.
Note here that for the first time we used an anonymous function.
Instead of the name of an existing function (mean), we created a
function on the fly that took a vector (e) and returned a vector with
different computations done to that vector. Using anonymous functions is
a very powerful tool to help you get the output you want.
You can use this in conjunction with cbind() if you want:
aggregate(cbind(breaks, qscore) ~ wool + tension
          , dat
          , function(e) c("xbar"=mean(e), "xm"=median(e)))

##   wool tension breaks.xbar breaks.xm qscore.xbar qscore.xm
## 1    A       L    44.55556  51.00000    12.24366  12.31153
## 2    B       L    28.22222  29.00000    75.12915  75.06405
## 3    A       M    24.00000  21.00000    23.60027  23.48899
## 4    B       M    28.77778  28.00000    39.13406  39.11120
## 5    A       H    24.55556  24.00000   100.19666 100.01140
## 6    B       H    18.77778  17.00000    51.61288  51.47617

And in this case, appending the output name to the variable name is
helpful.
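One thing to be aware of before renaming: when FUN returns more than one value, aggregate() stores the results for each variable as a single matrix column, and the dotted names only appear when the result is printed. A minimal sketch on the breaks column, with illustrative names res and flat:

```r
res <- aggregate(breaks ~ wool + tension, data = warpbreaks,
                 FUN = function(e) c(xbar = mean(e), xm = median(e)))

ncol(res)              # wool, tension, and ONE column called breaks
is.matrix(res$breaks)  # that column is itself a two-column matrix
colnames(res$breaks)

# Flatten the matrix column into ordinary columns, then rename them
flat <- do.call(data.frame, res)
names(flat)
names(flat)[3:4] <- c("breaks_mean", "breaks_median")
flat
```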
Using a formula for telling aggregate() what to do isn't the only way
of doing it. Here, the first argument is just a data frame of the numeric
columns to operate on, the second argument is a list of vectors that
represent the combination of factor levels, and the third argument is
still the function to use on the numeric data.
aggregate(dat[,c("breaks", "qscore")]
          , list(wool=dat$wool, tension=dat$tension)
          , function(e) c("xbar"=mean(e), "xm"=median(e)))

##   wool tension breaks.xbar breaks.xm qscore.xbar qscore.xm
## 1    A       L    44.55556  51.00000    12.24366  12.31153
## 2    B       L    28.22222  29.00000    75.12915  75.06405
## 3    A       M    24.00000  21.00000    23.60027  23.48899
## 4    B       M    28.77778  28.00000    39.13406  39.11120
## 5    A       H    24.55556  24.00000   100.19666 100.01140
## 6    B       H    18.77778  17.00000    51.61288  51.47617

To be honest, even if I get data in vectors, and I just have base R,
instead of using tapply() on the vectors to do summaries, I'll turn
the vectors into a dataframe and use aggregate(). It takes a bit more
code, and is a little less efficient, but I like the formula interface
for aggregate() so much that it usually outweighs these cons.
You just shove the vectors into a dataframe, and then use the aggregate
formula interface. Here, I shove the vBreaks, vWool, and vTension
vectors back into a dataframe:
aggregate( breaks ~ wool + tension
          , data.frame(breaks=vBreaks, wool=vWool, tension=vTension)
          , mean)

##   wool tension   breaks
## 1    A       L 44.55556
## 2    B       L 28.22222
## 3    A       M 24.00000
## 4    B       M 28.77778
## 5    A       H 24.55556
## 6    B       H 18.77778

CRAN packages for summarization

The base functions have their peculiarities and difficulties, so people
have attempted to fix, or augment, the language with packages that
provide a smoother API for doing summary-by-group processing. Below are
the ones I've used.
plyr

A Hadley Wickham package that tries to provide a uniform API for group
processing of arrays, lists, and dataframes. It is worth exploring the
whole set of these functions, which I explain here in a very similar
document:
https://gist.github.com/mpettis/70dcb33f7328e21ec485fdf8727c97ef .
For now, we will just look at the following.
suppressPackageStartupMessages(library(plyr))

Let's look at making a dataframe of per wool/tension means for our
dataset. We'll use the plyr function ddply(). You can read about it at
the above link, but the mnemonic for this function is that the first two
letters, dd, tell you that it expects a dataframe as input (the first
d) and returns a dataframe (the second d). Here's how it works:
ddply(dat, ~ wool + tension, function(df) mean(df$breaks)) 

##   wool tension       V1
## 1    A       L 44.55556
## 2    A       M 24.00000
## 3    A       H 24.55556
## 4    B       L 28.22222
## 5    B       M 28.77778
## 6    B       H 18.77778

What's going on here:

The first argument is the dataframe to do computations on.
The second argument is a formula, starting with a ~, and the terms
to the right of it indicate the factor columns.
The third argument is a function that takes a dataframe. ddply()
calls that function once per wool/tension level combination, each
time passing it the subset of rows for that combination. The function
returns a value in this case, and the values are reassembled into a
dataframe: the leading columns come from the second argument, with
the right values, and the last column holds the function's return
value for that wool/tension combination.
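The three steps just described (split, apply, combine) can be imitated in base R. This is only a conceptual sketch under my own naming (pieces, results, combined), not how plyr is implemented internally, and the row order differs from ddply()'s.

```r
# 1. Split: one small dataframe per wool/tension combination
pieces <- split(warpbreaks, list(warpbreaks$wool, warpbreaks$tension))

# 2. Apply: run a function on each piece, returning a one-row dataframe
results <- lapply(pieces, function(df) {
  data.frame(wool        = df$wool[1],
             tension     = df$tension[1],
             breaks_mean = mean(df$breaks))
})

# 3. Combine: stack the one-row dataframes back together
combined <- do.call(rbind, results)
rownames(combined) <- NULL
combined
```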

However, I don't like that the last column is named V1 (a default
name). But I can easily fix that by recoding the function to return a
dataframe:
ddply(dat, ~ wool + tension, function(df) data.frame(breaks_mean=mean(df$breaks))) 

##   wool tension breaks_mean
## 1    A       L    44.55556
## 2    A       M    24.00000
## 3    A       H    24.55556
## 4    B       L    28.22222
## 5    B       M    28.77778
## 6    B       H    18.77778

Note that ddply() does the right thing and merges the returned
dataframe to the wool/tension columns. This is a nice behavior.
One nice thing is that, with anonymous functions, you can have
multi-step code in that last argument:
ddply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          data.frame(breaks_mean)
        })

##   wool tension breaks_mean
## 1    A       L    44.55556
## 2    A       M    24.00000
## 3    A       H    24.55556
## 4    B       L    28.22222
## 5    B       M    28.77778
## 6    B       H    18.77778

This allows me a lot of flexibility on operating on the groups.
I can return means and medians, like above:
ddply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_median <- median(vlBreaks)
          data.frame(breaks_mean, breaks_median)
        })

##   wool tension breaks_mean breaks_median
## 1    A       L    44.55556            51
## 2    A       M    24.00000            21
## 3    A       H    24.55556            24
## 4    B       L    28.22222            29
## 5    B       M    28.77778            28
## 6    B       H    18.77778            17

And it does the right thing.
Note that you need not use a formula interface for the second argument.
You can wrap your factor column names in .(), like the list interface
of aggregate() above:
ddply(dat
      , .(wool, tension)
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_median <- median(vlBreaks)
          data.frame(breaks_mean, breaks_median)
        })

##   wool tension breaks_mean breaks_median
## 1    A       L    44.55556            51
## 2    A       M    24.00000            21
## 3    A       H    24.55556            24
## 4    B       L    28.22222            29
## 5    B       M    28.77778            28
## 6    B       H    18.77778            17
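For simple column summaries you may not need to write the function yourself. plyr ships a helper called summarise() that can be handed to ddply() as the function, with the summary expressions as further arguments. The sketch below assumes plyr is installed, uses the unmodified warpbreaks data, and gives the grouping columns as a character vector; res is an illustrative name.

```r
# plyr:: prefixes avoid any clash with dplyr's function of the same name
res <- plyr::ddply(warpbreaks, c("wool", "tension"), plyr::summarise,
                   breaks_mean   = mean(breaks),
                   breaks_median = median(breaks))
res
```

The expressions are evaluated inside each group's subset, so bare column names such as breaks work without the df$ prefix.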

These functions allow you some flexibility. If you want the answer back
as a matrix, and not a dataframe, you can use daply(), where the
second letter a means 'return an array':
daply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_mean
        })

##     tension
## wool        L        M        H
##    A 44.55556 24.00000 24.55556
##    B 28.22222 28.77778 18.77778

Or a list:
dlply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_mean
        })

## $A.L
## [1] 44.55556
## 
## $A.M
## [1] 24
## 
## $A.H
## [1] 24.55556
## 
## $B.L
## [1] 28.22222
## 
## $B.M
## [1] 28.77778
## 
## $B.H
## [1] 18.77778
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   wool tension
## 1    A       L
## 2    A       M
## 3    A       H
## 4    B       L
## 5    B       M
## 6    B       H

Use an output object type that is convenient for your needs.
dplyr

This is Hadley Wickham's sequel to the plyr package. He was unhappy
with a few aspects of it, and, as most people like to operate on data
frames, he decided to make a package that focused solely on dataframes.
It introduces some significant conceptual changes. One is the concept
of 'piping', where the result of one calculation is sent on to the next
function and used implicitly as that function's first argument. That'll
be shown below. The other is the concept of 'verbs', where functions
are treated as 'verbs', or 'transformations', that can be applied
sequentially to a dataframe.
Here's an example:
suppressPackageStartupMessages(library(dplyr))

First, let's take the mean of breaks over all of the rows at once:
summarise(dat, breaks_mean=mean(breaks))

##   breaks_mean
## 1    28.14815

That's about as simple as it gets. To get a flavor for piping, we
rewrite this with the %>% operator, which takes what is on its left side
(which may be a calculation result) and inserts it as the first
argument of the function on its right side:
dat %>% summarise(breaks_mean=mean(breaks))

##   breaks_mean
## 1    28.14815

Compare the syntax, and the answers should be identical.
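The same rewriting rule can be tried without any packages. The sketch below assumes R 4.1 or later, which has a native pipe |> that behaves like %>% in this simple first-argument case; nested and piped are illustrative names.

```r
# Nested form: read from the inside out
nested <- nrow(head(warpbreaks, 10))

# Piped form: the left side becomes the first argument of the right side
piped <- warpbreaks |> head(10) |> nrow()

identical(nested, piped)
```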
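Now, to do this summarization by groups, we need to tell the dataframe
what columns to group by, and then pass it to the summarise()
function. One caution: plyr also has a function named summarise(), and
it ignores dplyr's grouping. If plyr was attached after dplyr, as
happens in this document, a bare summarise() call resolves to plyr's
version and returns only the overall mean (28.14815) instead of one row
per group. Calling dplyr::summarise() explicitly avoids that:
dat %>%
  group_by(wool, tension) %>%
  dplyr::summarise(breaks_mean=mean(breaks))

##   wool tension breaks_mean
## 1    A       L    44.55556
## 2    A       M    24.00000
## 3    A       H    24.55556
## 4    B       L    28.22222
## 5    B       M    28.77778
## 6    B       H    18.77778

The rows are shown here in plain dataframe form; dplyr prints the result
as a tibble, with header lines that vary by version.
Note the piping structure. That is equivalent to the following code,
which is written in recognizable nested function calls, but the former
is arguably easier to read:
dplyr::summarise(group_by(dat, wool, tension), breaks_mean=mean(breaks))

##   wool tension breaks_mean
## 1    A       L    44.55556
## 2    A       M    24.00000
## 3    A       H    24.55556
## 4    B       L    28.22222
## 5    B       M    28.77778
## 6    B       H    18.77778

In this case, you have to read from the inside out ("group_by is the
inner call, then summarise is the outer one..."). Compare that to the
previous example, where you can read %>% as the word 'then', and your
flow becomes, "Take dat, then group it by wool and tension, then
summarise the breaks column with a mean."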
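To round out the comparison with aggregate() and ddply(), here is a hedged sketch of returning a mean and a median per group with dplyr. It assumes dplyr is installed, runs on the unmodified warpbreaks data, and converts the result to a plain dataframe at the end so it prints the same way across dplyr versions; res is an illustrative name.

```r
suppressPackageStartupMessages(library(dplyr))

res <- warpbreaks %>%
  group_by(wool, tension) %>%
  # dplyr:: makes sure plyr's summarise() is not picked up by accident
  dplyr::summarise(breaks_mean   = mean(breaks),
                   breaks_median = median(breaks)) %>%
  ungroup() %>%
  as.data.frame()
res
```

Each named argument to summarise() becomes one output column, so adding another statistic is a matter of adding another argument.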
Conclusion

I am sure there are more approaches, but this document should take you
through some of the more common ways of summarizing data that you may
see. In addition, it should give you a cookbook for starting to write
your own summary code.