@mpettis
Last active March 16, 2017 21:28
TCRUG talk on different ways to summarize data

Author: Matt Pettis

email: matthew.pettis@gmail.com

github: mpettis

Introduction

It's one thing to get data into R in a form you can work with; it's another to do the things you want with that data. As R is at heart a statistical language, one of the most common things users want to do with their data is summarize it. And they want not only simple means, totals, etc., but also to have R compute them within naturally occurring groups.

For instance, if you have a list of heights of people, along with whether or not they are male or female, you may like to find the average heights for females separately from the average heights for males. This is what is meant by doing 'summaries by groups.' Further, you may want more granular groupings, such as what decade the people were born (1970s, 1980s, etc.) as well as by female/male distinction. So you will want to be able to tell R 'do your summaries by sex and by birth decade.'

This document is intended to walk you through the different ways you can do this in the R system.

Dataset to work with

We will look at the dataset warpbreaks that is included with R. You can run help("warpbreaks") to see what is in the dataset. For instructional purposes, what we care about is:

  • There is one measurement variable: breaks.
  • There are two categorical variables, called wool (2 levels indicating types) and tension (3 levels: L=Low, M=Medium, H=High).

We will be able to use this dataset to illustrate a variety of techniques for common analysis needs.

I'm also going to add a second, made-up numeric column called qscore just for the sake of having a second numeric variable to play with. You can look at the code, but it is not necessary to understand it for the sake of this tutorial. This variable is normally distributed (sd = 1) about a mean that is chosen at random, separately for each combination of wool and tension.

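  ## dplyr supplies %>%, group_by(), do(), mutate() and ungroup() used below
suppressPackageStartupMessages(library(dplyr))
data("warpbreaks")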
dat <- warpbreaks

  ## Add a made-up variable called `qscore` to have a second numeric variable.
  ## Each wool/tension combo has qscore as a normal variate of sd = 1
  ## and mean as a random selection between 1 and 100 (for each group)
set.seed(1234)
dat <- dat %>%
  group_by(wool, tension) %>%
  do({ldf <- .; rmean <- sample(1:100, 1); ldf %>% mutate(qscore=rnorm(n(), rmean))}) %>%
  ungroup()

  # Sample of data (head)
head(dat)

## # A tibble: 6 × 4
##   breaks   wool tension   qscore
##    <dbl> <fctr>  <fctr>    <dbl>
## 1     26      A       L 12.31153
## 2     30      A       L 12.31437
## 3     54      A       L 12.35929
## 4     25      A       L 11.26953
## 5     70      A       L 12.03573
## 6     52      A       L 12.11298

What is the total number of observations?

nrow(dat)

## [1] 54

Review and such

How does 'mean' work?

Recall a few things:

  • A dataframe is just a collection of same-length vectors (the columns)
  • ... all stored in one group (a list)
  • ... with an attribute recording the fact it is a collection of same-length vectors (class = "data.frame")

To repeat, a dataframe is just a list of vectors that are all of the same length.
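A quick way to convince yourself of this is sketched below. It uses the built-in warpbreaks data directly rather than dat, so that it runs on its own; the variable names are just for illustration.

```r
# A data frame is a list of equal-length vectors with a class attribute.
df <- warpbreaks

is_a_list   <- is.list(df)   # the columns are the list elements
df_class    <- class(df)     # the attribute that marks it as a data frame
col_lengths <- lengths(df)   # length of each column

print(is_a_list)
print(df_class)
print(col_lengths)

# Because it is a list, list-style extraction works on columns too.
same_column <- identical(df[["breaks"]], df$breaks)
print(same_column)
```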

You can take out the individual vectors and store them in what you consider 'normal' looking vectors, like so:

vBreaks <- dat$breaks
vBreaks

##  [1] 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27 14 29 19 29 31 41 20 44 42
## [38] 26 19 16 39 28 21 39 29 20 21 24 17 13 15 15 16 28

vWool <- dat$wool
vWool

##  [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B B B B
## Levels: A B

vTension <- dat$tension
vTension

##  [1] L L L L L L L L L M M M M M M M M M H H H H H H H H H L L L L L L L L L M M M M M M M M M H H H H H H H H H
## Levels: L M H

When you use the function mean(), you take the mean of all of the values stored in a vector:

mean(vBreaks)

## [1] 28.14815

Let's also be clear: mean() computes the mean of a single argument, which is a vector. So be careful that you know the difference between what happens in each of the following:

mean(c(1, 9))

## [1] 5

mean(1,9)

## [1] 1

In the first, you pass a single vector as the argument, whereas in the second you provide two arguments, and mean() averages only the first one (1); the 9 is not treated as data.
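Strictly speaking, the extra argument is not thrown away: for plain numbers the call ends up in mean.default(), whose second parameter is trim, so the 9 is matched to trim by position. A trim of 0.5 or more makes mean.default() return the median of x, which for a single value is that value. A small sketch of the three cases:

```r
# In mean(1, 9) the 9 is matched to `trim`, it is not part of the data.
a    <- mean(1, 9)          # x = 1, trim = 9
b    <- mean(1, trim = 9)   # the same call with the argument named
c_ok <- mean(c(1, 9))       # both numbers supplied as one vector

print(c(a = a, b = b, c_ok = c_ok))
```

If you really have loose numbers rather than a vector, wrap them in c() first.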

Base functions for summarization

tapply

I personally never use this method, as it is more for dealing with data in vectors, and there are better APIs for data stored in data frames. But as you may encounter it, we will illustrate it.

For instance, above, we stored the breaks data in the vBreaks vector, and the wool type for each entry of vBreaks in the vWool vector. Sometimes the data is just stored in vectors and not dataframes, and you just have to deal with it.

tapply() is a base function that summarizes the numbers in one vector when a second vector indicates which category each number belongs to. It is easier to see with the data:

vBreaks

##  [1] 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27 14 29 19 29 31 41 20 44 42
## [38] 26 19 16 39 28 21 39 29 20 21 24 17 13 15 15 16 28

vWool

##  [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B B B B
## Levels: A B

tapply(vBreaks, vWool, mean)

##        A        B 
## 31.03704 25.25926

So, this is the mean of the two different wool groups, and tapply() knows which numbers belong to which group from the entries in the vWool vector.

If you are so pressed, you can actually do multi-dimensional summaries with tapply(). Because tension is also a factor, you can use tapply() to find the means of all of the wool/tension level combinations:

tapply(vBreaks, list(vWool, vTension), mean)

##          L        M        H
## A 44.55556 24.00000 24.55556
## B 28.22222 28.77778 18.77778

It is an array that you get as an output (like a matrix). This may be what you want. I rarely want this.
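If you do end up with such an array, two things are handy: indexing a cell by level names, and flattening the array into one row per combination. The sketch below uses warpbreaks directly; naming the list elements (wool, tension) and the column name breaks_mean are my own choices for the example.

```r
# Means of breaks for every wool/tension combination, as a 2 x 3 array.
m <- with(warpbreaks,
          tapply(breaks, list(wool = wool, tension = tension), mean))

# Pick out a single cell by level names rather than by position.
b_m <- m["B", "M"]
print(b_m)

# Flatten the array into a data frame with one row per combination.
long <- as.data.frame(as.table(m), responseName = "breaks_mean")
print(long)
```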

by

by() is another aggregating function, again, one I don't use much. According to its documentation, by() is a wrapper for tapply() meant to make it easy to apply to dataframes. I've had some trouble in that department. What it really seems better at is returning general objects per level-combination.

Note that:

  • You have to extract the columns you want as a dataframe (or list of vectors).
  • Your function gets passed a dataframe with the subset of rows related to a particular combination of factor levels dictated by the values in the second argument. Therefore, your function needs to deal with dataframes, and not vectors -- you may need to unpack a vector from the dataframe passed to it.

byObject <- by(dat
   , dat[,c("wool", "tension")]
   , function(df) mean(df$breaks))
byObject

## wool: A
## tension: L
## [1] 44.55556
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: L
## [1] 28.22222
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: M
## [1] 24
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: M
## [1] 28.77778
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: H
## [1] 24.55556
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: H
## [1] 18.77778

Since by() can handle more complex objects than tapply(), we should look at what beast by() actually returns in this case:

str(byObject)

##  by [1:2, 1:3] 44.6 28.2 24 28.8 24.6 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ wool   : chr [1:2] "A" "B"
##   ..$ tension: chr [1:3] "L" "M" "H"
##  - attr(*, "call")= language by.data.frame(data = dat, INDICES = dat[, c("wool", "tension")], FUN = function(df) mean(df$breaks))

You get back an object that by() tools know how to handle, and so it is a little complex under the hood. For instance, how would you extract the value where wool = 'B' and tension = 'M'?

byObject["B", "M"]

## [1] 28.77778

I don't like that -- you have to really know the internals of the byObject, and that it has wool as the first index and tension as the second one.

summary() computes means and medians on a dataframe easily, and to get it to do that on a per wool/tension combination, you can use by():

byObject <- by(dat[,c("breaks", "qscore")]
   , dat[,c("wool", "tension")]
   , summary)
byObject

## wool: A
## tension: L
##      breaks          qscore     
##  Min.   :25.00   Min.   :11.27  
##  1st Qu.:26.00   1st Qu.:12.04  
##  Median :51.00   Median :12.31  
##  Mean   :44.56   Mean   :12.24  
##  3rd Qu.:54.00   3rd Qu.:12.36  
##  Max.   :70.00   Max.   :13.43  
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: L
##      breaks          qscore     
##  Min.   :14.00   Min.   :73.82  
##  1st Qu.:20.00   1st Qu.:74.66  
##  Median :29.00   Median :75.06  
##  Mean   :28.22   Mean   :75.13  
##  3rd Qu.:31.00   3rd Qu.:75.50  
##  Max.   :44.00   Max.   :77.10  
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: M
##      breaks       qscore     
##  Min.   :12   Min.   :23.00  
##  1st Qu.:18   1st Qu.:23.16  
##  Median :21   Median :23.49  
##  Mean   :24   Mean   :23.60  
##  3rd Qu.:30   3rd Qu.:23.89  
##  Max.   :36   Max.   :24.96  
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: M
##      breaks          qscore     
##  Min.   :16.00   Min.   :37.52  
##  1st Qu.:21.00   1st Qu.:38.51  
##  Median :28.00   Median :39.11  
##  Mean   :28.78   Mean   :39.13  
##  3rd Qu.:39.00   3rd Qu.:40.26  
##  Max.   :42.00   Max.   :40.28  
## --------------------------------------------------------------------------------------- 
## wool: A
## tension: H
##      breaks          qscore     
##  Min.   :10.00   Min.   : 99.5  
##  1st Qu.:18.00   1st Qu.:100.0  
##  Median :24.00   Median :100.0  
##  Mean   :24.56   Mean   :100.2  
##  3rd Qu.:28.00   3rd Qu.:100.4  
##  Max.   :43.00   Max.   :100.9  
## --------------------------------------------------------------------------------------- 
## wool: B
## tension: H
##      breaks          qscore     
##  Min.   :13.00   Min.   :50.19  
##  1st Qu.:15.00   1st Qu.:50.99  
##  Median :17.00   Median :51.48  
##  Mean   :18.78   Mean   :51.61  
##  3rd Qu.:21.00   3rd Qu.:51.84  
##  Max.   :28.00   Max.   :53.65

What does that object look like?

str(byObject)

## List of 6
##  $ : 'table' chr [1:6, 1:2] "Min.   :25.00  " "1st Qu.:26.00  " "Median :51.00  " "Mean   :44.56  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :14.00  " "1st Qu.:20.00  " "Median :29.00  " "Mean   :28.22  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :12  " "1st Qu.:18  " "Median :21  " "Mean   :24  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :16.00  " "1st Qu.:21.00  " "Median :28.00  " "Mean   :28.78  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :10.00  " "1st Qu.:18.00  " "Median :24.00  " "Mean   :24.56  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  $ : 'table' chr [1:6, 1:2] "Min.   :13.00  " "1st Qu.:15.00  " "Median :17.00  " "Mean   :18.78  " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "" "" "" "" ...
##   .. ..$ : chr [1:2] "    breaks" "    qscore"
##  - attr(*, "dim")= int [1:2] 2 3
##  - attr(*, "dimnames")=List of 2
##   ..$ wool   : chr [1:2] "A" "B"
##   ..$ tension: chr [1:3] "L" "M" "H"
##  - attr(*, "call")= language by.data.frame(data = dat[, c("breaks", "qscore")], INDICES = dat[, c("wool", "tension")], FUN = summary)
##  - attr(*, "class")= chr "by"

It's a little more difficult to get at the numerical parts of that output easily, so I don't do things this way. Frankly, I don't ever use by()...
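If what you want is numbers you can index, one alternative is to skip by() and combine base split() with sapply(). This is a sketch of a different technique, not something by() does for you, and it uses warpbreaks directly so it is self-contained.

```r
# Split breaks into one vector per wool/tension combination.
pieces <- split(warpbreaks$breaks,
                list(warpbreaks$wool, warpbreaks$tension))

# Summarise each piece; the result is a numeric matrix with one
# column per combination ("A.L", "B.L", ...) and one row per statistic.
stats <- sapply(pieces, function(v) c(mean = mean(v), median = median(v)))
print(round(stats, 2))

# Plain numeric indexing, no parsing of formatted text required.
median_A_L <- stats["median", "A.L"]
print(median_A_L)
```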

ave

ave() is very much like tapply(), with one twist: it doesn't return a summary vector, but a vector the same length as the original numeric vector, with the group mean repeated at each of the appropriate indices:

ave(vBreaks, list(vWool, vTension))

##  [1] 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 24.00000 24.00000 24.00000
## [13] 24.00000 24.00000 24.00000 24.00000 24.00000 24.00000 24.55556 24.55556 24.55556 24.55556 24.55556 24.55556
## [25] 24.55556 24.55556 24.55556 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222
## [37] 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 18.77778 18.77778 18.77778
## [49] 18.77778 18.77778 18.77778 18.77778 18.77778 18.77778

This is helpful if you want to attach a column of mean-per-level values on the original raw or granular dataset:

datt <- dat
datt$breaks_mean <- ave(vBreaks, list(vWool, vTension))
head(datt, 11)

## # A tibble: 11 × 5
##    breaks   wool tension   qscore breaks_mean
##     <dbl> <fctr>  <fctr>    <dbl>       <dbl>
## 1      26      A       L 12.31153    44.55556
## 2      30      A       L 12.31437    44.55556
## 3      54      A       L 12.35929    44.55556
## 4      25      A       L 11.26953    44.55556
## 5      70      A       L 12.03573    44.55556
## 6      52      A       L 12.11298    44.55556
## 7      51      A       L 13.42855    44.55556
## 8      26      A       L 12.98340    44.55556
## 9      67      A       L 11.37754    44.55556
## 10     18      A       M 23.52281    24.00000
## 11     21      A       M 23.00161    24.00000
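Two variations that come up often are sketched here: using a statistic other than the mean, and centering each observation on its own group mean. The column names breaks_median and breaks_centered are made up for the example, and the code works on a copy of warpbreaks so it runs on its own.

```r
df <- warpbreaks

# Group mean repeated on every row (FUN defaults to mean).
df$breaks_mean <- ave(df$breaks, df$wool, df$tension)

# Any other function must be passed by name as FUN; an unnamed
# function would be taken as one more grouping variable.
df$breaks_median <- ave(df$breaks, df$wool, df$tension, FUN = median)

# Deviation of each observation from its own group mean.
df$breaks_centered <- df$breaks - df$breaks_mean

print(head(df, 3))
```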

aggregate

If your data is in a dataframe, which it usually is, it is often easier to use functions tailored for use in data frames, like aggregate(). Below is an example:

aggregate(formula=breaks ~ wool
          , data = dat
          , FUN=mean)

##   wool   breaks
## 1    A 31.03704
## 2    B 25.25926

Here, we feed aggregate() three parameters.

First is formula, which tells R what the numeric variables are that you want to compute statistics on (here, breaks), and what the classification factors are (here, wool). For aggregate(), the ~ separates the numeric variables, which are on the left side of ~, from the classification factors, which are to the right of the ~. We'll do a more complex example of a formula in a later example.

The second parameter is data, and it names the dataframe in which we find the column names we use in the formula parameter.

The third parameter is FUN, which is the name of a function we apply to the numeric variable. Here we apply mean to the column breaks.

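Note that the parameter names are usually omitted, since the arguments are then matched by position in that order. Leaving the first argument unnamed is also the more portable choice: as far as I know, R 4.2.0 renamed the formula method's first parameter from formula to x, so formula= may not be accepted by newer versions. This will give the same thing: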

aggregate(breaks ~ wool
          , dat
          , mean)

##   wool   breaks
## 1    A 31.03704
## 2    B 25.25926

What if I want the mean by the levels of wool and tension together?

aggregate(breaks ~ wool + tension
          , dat
          , mean)

##   wool tension   breaks
## 1    A       L 44.55556
## 2    B       L 28.22222
## 3    A       M 24.00000
## 4    B       M 28.77778
## 5    A       H 24.55556
## 6    B       H 18.77778

It is as simple as putting all of the columns you want to use as discriminating factors, separated by + symbols, on the right side of the ~.

How about if we want the mean of breaks and qscore in the output?

  ## WARNING: THIS IS INCORRECT
aggregate(breaks + qscore ~ wool + tension
          , dat
          , mean)

##   wool tension breaks + qscore
## 1    A       L        56.79921
## 2    B       L       103.35137
## 3    A       M        47.60027
## 4    B       M        67.91183
## 5    A       H       124.75222
## 6    B       H        70.39066

OK, that didn't work as expected. It took the mean of the result of adding breaks to qscore. We can fix that with:

aggregate(cbind(breaks, qscore) ~ wool + tension
          , dat
          , mean)

##   wool tension   breaks    qscore
## 1    A       L 44.55556  12.24366
## 2    B       L 28.22222  75.12915
## 3    A       M 24.00000  23.60027
## 4    B       M 28.77778  39.13406
## 5    A       H 24.55556 100.19666
## 6    B       H 18.77778  51.61288

So, you have to cbind() together the columns you want to apply means to. Ugh.
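When you want every remaining column summarized, there is a shortcut that avoids cbind(): a dot on the left-hand side of the formula stands for all columns not named on the right. A minimal sketch, using a copy of warpbreaks with a made-up second numeric column (breaks_sq exists only for this illustration):

```r
# A small copy with a second numeric column to aggregate.
df <- warpbreaks
df$breaks_sq <- df$breaks^2

# `.` on the left-hand side means "all columns not used on the right".
res <- aggregate(. ~ wool + tension, data = df, FUN = mean)
print(res)
```

This only makes sense when every remaining column is numeric; otherwise cbind() with an explicit list of columns is the safer route.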

What if you want to calculate the mean and median of breaks in one output?

aggregate(breaks ~ wool + tension
          , dat
          , function(e) c("xbar"=mean(e), "xm"=median(e)))

##   wool tension breaks.xbar breaks.xm
## 1    A       L    44.55556  51.00000
## 2    B       L    28.22222  29.00000
## 3    A       M    24.00000  21.00000
## 4    B       M    28.77778  28.00000
## 5    A       H    24.55556  24.00000
## 6    B       H    18.77778  17.00000

The output names get appended to the variable name, but you can rename in post-processing if you want to.
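One wrinkle worth knowing before you rename: when FUN returns more than one value, the printed result looks like it has separate columns, but the statistics are stored as a single matrix column. The sketch below flattens it with do.call(data.frame, ...) and then renames; the new names are my own choice.

```r
res <- aggregate(breaks ~ wool + tension, warpbreaks,
                 function(e) c(xbar = mean(e), xm = median(e)))

# Three columns, not four: `breaks` is one matrix column.
print(ncol(res))
print(is.matrix(res$breaks))

# Flatten the matrix column into ordinary columns, then rename them.
flat <- do.call(data.frame, res)
names(flat)[3:4] <- c("breaks_mean", "breaks_median")
print(flat)
```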

Note here that for the first time we used an anonymous function. Instead of the name of an existing function (mean), we created a function on the fly that took a vector (e) and returned a vector with different computations done to that vector. Using anonymous functions is a very powerful tool to help you get the output you want.

You can use this in conjunction with cbind() if you want:

aggregate(cbind(breaks, qscore) ~ wool + tension
          , dat
          , function(e) c("xbar"=mean(e), "xm"=median(e)))

##   wool tension breaks.xbar breaks.xm qscore.xbar qscore.xm
## 1    A       L    44.55556  51.00000    12.24366  12.31153
## 2    B       L    28.22222  29.00000    75.12915  75.06405
## 3    A       M    24.00000  21.00000    23.60027  23.48899
## 4    B       M    28.77778  28.00000    39.13406  39.11120
## 5    A       H    24.55556  24.00000   100.19666 100.01140
## 6    B       H    18.77778  17.00000    51.61288  51.47617

And in this case, appending the output name to the variable name is helpful.

Using a formula to tell aggregate() what to do isn't the only way of doing it. Here, the first argument is just a data frame of numeric columns to operate on, the second argument is a list of vectors that represent the combination of factor levels, and the third argument is still the function to use on the numeric data.

aggregate(dat[,c("breaks", "qscore")]
          , list(wool=dat$wool, tension=dat$tension)
          , function(e) c("xbar"=mean(e), "xm"=median(e)))

##   wool tension breaks.xbar breaks.xm qscore.xbar qscore.xm
## 1    A       L    44.55556  51.00000    12.24366  12.31153
## 2    B       L    28.22222  29.00000    75.12915  75.06405
## 3    A       M    24.00000  21.00000    23.60027  23.48899
## 4    B       M    28.77778  28.00000    39.13406  39.11120
## 5    A       H    24.55556  24.00000   100.19666 100.01140
## 6    B       H    18.77778  17.00000    51.61288  51.47617

To be honest, even if I get data in vectors, and I just have base R, instead of using tapply() on the vectors to do summaries, I'll turn the vectors into a dataframe and use aggregate(). It takes a bit more code, and is a little less efficient, but I like the formula interface for aggregate() so much that it usually outweighs these cons.

You just shove the vectors into a dataframe, and then use the aggregate formula interface. Here, I shove the vBreaks, vWool, and vTension vectors back into a dataframe:

aggregate( breaks ~ wool + tension
          , data.frame(breaks=vBreaks, wool=vWool, tension=vTension)
          , mean)

##   wool tension   breaks
## 1    A       L 44.55556
## 2    B       L 28.22222
## 3    A       M 24.00000
## 4    B       M 28.77778
## 5    A       H 24.55556
## 6    B       H 18.77778

CRAN packages for summarization

The base functions have their peculiarities and difficulties, so people have attempted to fix, or augment, the language with packages that provide a smoother API for doing summary-by-group processing. Below are the ones I've used.

plyr

A Hadley Wickham package that tries to provide a uniform API for group processing of arrays, lists, and dataframes. It is worth exploring the whole set of these functions, which I explain here in a very similar document: https://gist.github.com/mpettis/70dcb33f7328e21ec485fdf8727c97ef .

For now, we will just look at the following.

suppressPackageStartupMessages(library(plyr))

## Warning: package 'plyr' was built under R version 3.3.2

Let's look at making a dataframe of per wool/tension means for our dataset. We'll use the plyr function ddply(). You can read about it at the above link, but the mnemonic for this function is that the first two letters, dd, tell you that it expects a dataframe as input (the first d) and returns a dataframe (the second d). Here's how it works:

ddply(dat, ~ wool + tension, function(df) mean(df$breaks)) 

##   wool tension       V1
## 1    A       L 44.55556
## 2    A       M 24.00000
## 3    A       H 24.55556
## 4    B       L 28.22222
## 5    B       M 28.77778
## 6    B       H 18.77778

What's going on here:

  • The first argument is the dataframe to do computations on.
  • The second argument is a formula, starting with a ~, and the terms to the right of it indicate the factor columns.
  • The third argument is a function that takes a dataframe. ddply() splits the data into one subsetted dataframe per wool/tension level combination and feeds each one to that function. The function returns a value in this case, and the results are reassembled into a dataframe whose first columns come from the second argument, with the right values, and whose last column holds the return value of the function for that wool/tension combination.

However, I don't like that the last column is named V1 (a default name). But I can easily fix that by recoding the function to return a dataframe:

ddply(dat, ~ wool + tension, function(df) data.frame(breaks_mean=mean(df$breaks))) 

##   wool tension breaks_mean
## 1    A       L    44.55556
## 2    A       M    24.00000
## 3    A       H    24.55556
## 4    B       L    28.22222
## 5    B       M    28.77778
## 6    B       H    18.77778

Note that ddply() does the right thing and merges the returned dataframe to the wool/tension columns. This is a nice behavior.

One nice thing is that, with anonymous functions, you can have multi-step code in that last argument:

ddply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          data.frame(breaks_mean)
        })

##   wool tension breaks_mean
## 1    A       L    44.55556
## 2    A       M    24.00000
## 3    A       H    24.55556
## 4    B       L    28.22222
## 5    B       M    28.77778
## 6    B       H    18.77778

This allows me a lot of flexibility on operating on the groups.

I can return means and medians, like above:

ddply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_median <- median(vlBreaks)
          data.frame(breaks_mean, breaks_median)
        })

##   wool tension breaks_mean breaks_median
## 1    A       L    44.55556            51
## 2    A       M    24.00000            21
## 3    A       H    24.55556            24
## 4    B       L    28.22222            29
## 5    B       M    28.77778            28
## 6    B       H    18.77778            17

And it does the right thing.

Note that you need not use a formula interface for the second argument. You can wrap your factor column names in .(), like the list interface of aggregate() above:

ddply(dat
      , .(wool, tension)
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_median <- median(vlBreaks)
          data.frame(breaks_mean, breaks_median)
        })

##   wool tension breaks_mean breaks_median
## 1    A       L    44.55556            51
## 2    A       M    24.00000            21
## 3    A       H    24.55556            24
## 4    B       L    28.22222            29
## 5    B       M    28.77778            28
## 6    B       H    18.77778            17
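For simple summaries like these you may not need to write the function yourself. A common plyr idiom, sketched here, hands plyr's own summarise() to ddply() and lets it evaluate the expressions inside each subset. The calls are namespace-qualified and the grouping columns are given as a character vector, so the example runs without attaching plyr; it assumes the plyr package is installed.

```r
# ddply() passes the named expressions on to plyr::summarise(),
# which evaluates them within each wool/tension subset.
res <- plyr::ddply(warpbreaks, c("wool", "tension"), plyr::summarise,
                   breaks_mean   = mean(breaks),
                   breaks_median = median(breaks))
print(res)
```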

These functions allow you some flexibility. If you want the answer back as a matrix, and not a dataframe, you can use daply(), where the second letter a means 'return an array':

daply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_mean
        })

##     tension
## wool        L        M        H
##    A 44.55556 24.00000 24.55556
##    B 28.22222 28.77778 18.77778

Or a list:

dlply(dat
      , ~ wool + tension
      , function(df) {
          vlBreaks <- df$breaks
          breaks_mean <- mean(vlBreaks)
          breaks_mean
        })

## $A.L
## [1] 44.55556
## 
## $A.M
## [1] 24
## 
## $A.H
## [1] 24.55556
## 
## $B.L
## [1] 28.22222
## 
## $B.M
## [1] 28.77778
## 
## $B.H
## [1] 18.77778
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   wool tension
## 1    A       L
## 2    A       M
## 3    A       H
## 4    B       L
## 5    B       M
## 6    B       H

Use an output object type that is convenient for your needs.

dplyr

This is Hadley Wickham's sequel to the plyr package. He was unhappy with a few aspects of it, and, as most people like to operate on dataframes, he decided to make a package that focused solely on dataframes.

It brings some significant conceptual changes. One is the concept of 'piping', where results of calculations can be sent on to the next function and used implicitly as the first argument to that next function. That'll be shown below. The other is the concept of 'verbs', where the functions that are used are considered 'verbs', or 'transformations' that can be sequentially applied to a dataframe.

Here's an example:

suppressPackageStartupMessages(library(dplyr))

First, let's take the mean of breaks over everything at once:

summarise(dat, breaks_mean=mean(breaks))

##   breaks_mean
## 1    28.14815

That's about as simple as you get. To get a flavor for piping, we rewrite this with the %>% operator, which takes what is on its left side (which may be a calculation result) and sticks it in as the first argument of the function on its right side:

dat %>% summarise(breaks_mean=mean(breaks))

##   breaks_mean
## 1    28.14815

Compare the syntax, and the answers should be identical.

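Now, to do this summarization by groups, we need to tell the dataframe what columns to group by, and then pass it to the summarise() function. One catch: plyr also exports a summarise(), and dplyr was already attached when plyr was loaded above (the %>% pipeline that built dat needed it). A bare summarise() therefore resolves to plyr's version, which ignores the grouping and returns the overall mean (28.14815) again. Writing dplyr::summarise() avoids the clash:

dat %>%
  group_by(wool, tension) %>%
  dplyr::summarise(breaks_mean=mean(breaks))

This returns one row per wool/tension combination, with the same six means that aggregate() and ddply() produced above (44.55556 for A/L, 24.00000 for A/M, and so on).

Note the piping structure. That is equivalent to the following code, which is written in recognizable nested function calls, but the former is arguably easier to read:

dplyr::summarise(group_by(dat, wool, tension), breaks_mean=mean(breaks))

In this case, you have to read from the inside out ("group_by is the inner call, then summarise is the outer one..."). Compare that to the previous example, where you can read %>% as the word 'then', and your flow becomes, "Take dat, then group it by wool and tension, then summarise the breaks column with a mean."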
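The same pattern extends to several summaries per group. The sketch below works on warpbreaks directly and qualifies summarise() with dplyr:: as a precaution, since plyr exports a function of the same name and whichever package was attached last wins. The trailing ungroup() just returns a plain, ungrouped result.

```r
library(dplyr)

res <- warpbreaks %>%
  group_by(wool, tension) %>%
  dplyr::summarise(breaks_mean   = mean(breaks),
                   breaks_median = median(breaks)) %>%
  ungroup()

print(res)
```

Recent dplyr versions may print an informational message about how the result is grouped; it does not affect the values.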

Conclusion

I am sure there are more examples, but this document should take you through some of the more common ways of summarizing data you may see. In addition, it should give you a cookbook for starting to write your own summary code.
