- Introduction
- Dataset to work with
- Review and such
- Base functions for summarization
- CRAN packages for summarization
- Conclusion
Author: Matt Pettis
email: matthew.pettis@gmail.com
github: mpettis
It's one thing to get data into R in a form you can work with; it's another to do to the data the things you want to do. As R is at heart a statistical language, one of the most common things users want to do with their data is summarize it. And they want not only simple means, totals, etc., but also to have R compute them within naturally occurring groups.
For instance, if you have a list of heights of people, along with whether or not they are male or female, you may like to find the average heights for females separately from the average heights for males. This is what is meant by doing 'summaries by groups.' Further, you may want more granular groupings, such as what decade the people were born (1970s, 1980s, etc.) as well as by female/male distinction. So you will want to be able to tell R 'do your summaries by sex and by birth decade.'
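As a minimal sketch of the idea, the snippet below builds a small made-up table of heights and summarizes it by sex, then by sex and birth decade. The data frame people and every value in it are invented for illustration; aggregate() is a base R function covered later in this document.

```r
# Hypothetical data: heights (cm), sex, and birth decade are all invented.
people <- data.frame(
  height = c(160, 165, 170, 175, 180, 185),
  sex    = c("F", "F", "F", "M", "M", "M"),
  decade = c("1970s", "1980s", "1980s", "1970s", "1970s", "1980s")
)

# One mean per sex.
by_sex <- aggregate(height ~ sex, people, mean)
print(by_sex)

# One mean per sex/decade combination (the more granular grouping).
by_sex_decade <- aggregate(height ~ sex + decade, people, mean)
print(by_sex_decade)
```

The first call returns one row per sex; the second returns one row per combination of sex and decade that actually occurs in the data.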
This document is intended to walk you through the different ways you can do this in the R system.
We will look at the dataset warpbreaks that is included with R. You can run help("warpbreaks") to see what is in the dataset. For instructional purposes, what we care about is:
- There is one measurement variable: breaks.
- There are two categorical variables, called wool (2 levels indicating types) and tension (3 levels: L=Low, M=Medium, H=High).
We will be able to use this dataset to illustrate a variety of techniques for common analysis needs.
I'm also going to add a second, made-up numeric column called qscore just for the sake of having a second numeric variable to play with. You can look at the code, but it is not necessary to understand it for the sake of this tutorial. Within each combination of wool and tension, qscore is drawn from a normal distribution with standard deviation 1 around a randomly chosen mean for that group.
## dplyr provides %>%, group_by(), do(), mutate(), n() and ungroup() used below
suppressPackageStartupMessages(library(dplyr))
data("warpbreaks")
dat <- warpbreaks
## Add a made-up variable called `qscore` to have a second numeric variable.
## Each wool/tension combo has qscore as a normal variate of sd = 1
## and mean as a random selection between 1 and 100 (for each group)
set.seed(1234)
dat <- dat %>%
  group_by(wool, tension) %>%
  do({ldf <- .; rmean <- sample(1:100, 1); ldf %>% mutate(qscore=rnorm(n(), rmean))}) %>%
  ungroup()
# Sample of data (head)
head(dat)
## # A tibble: 6 × 4
## breaks wool tension qscore
## <dbl> <fctr> <fctr> <dbl>
## 1 26 A L 12.31153
## 2 30 A L 12.31437
## 3 54 A L 12.35929
## 4 25 A L 11.26953
## 5 70 A L 12.03573
## 6 52 A L 12.11298
nrow(dat)
## [1] 54
Recall a few things:
- A dataframe is just a collection of same-length vectors (the columns)
- ... all stored in one group (a list)
- ... with an attribute recording the fact it is a collection of same-length vectors (class = "data.frame")
To repeat, a dataframe is just a list of vectors that are all of the same length.
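A quick way to convince yourself of this is to ask R directly. The sketch below uses the built-in warpbreaks dataset (without the made-up qscore column) and only base R functions; the variable name df is just a local stand-in.

```r
df <- warpbreaks

# Under the hood a data frame is a list ...
is.list(df)     # TRUE
typeof(df)      # "list"

# ... with a class attribute marking it as a data frame ...
class(df)       # "data.frame"

# ... whose elements (the columns) all have the same length.
lengths(df)

# Columns can be pulled out with list syntax as well as with $.
identical(df$breaks, df[["breaks"]])   # TRUE
```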
You can take out the individual vectors and store them in what you consider 'normal' looking vectors, like so:
vBreaks <- dat$breaks
vBreaks
## [1] 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27 14 29 19 29 31 41 20 44 42
## [38] 26 19 16 39 28 21 39 29 20 21 24 17 13 15 15 16 28
vWool <- dat$wool
vWool
## [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B B B B
## Levels: A B
vTension <- dat$tension
vTension
## [1] L L L L L L L L L M M M M M M M M M H H H H H H H H H L L L L L L L L L M M M M M M M M M H H H H H H H H H
## Levels: L M H
When you use the function mean(), you take the mean of all of the values stored in a vector:
mean(vBreaks)
## [1] 28.14815
Let's also be clear: mean() computes the mean of a single argument, which is a vector. So be careful that you know the difference between what happens when you do:
mean(c(1, 9))
## [1] 5
mean(1,9)
## [1] 1
In the first, you pass a single vector as the argument, whereas in the second, you are providing 2 arguments. mean() takes the mean of the first argument only (1); the 9 is not treated as data (the default method matches it to its trim parameter instead).
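To make the pitfall concrete, here is a short sketch using only base R. It shows that the second positional argument of mean() lands in the trim parameter of mean.default(), and contrasts this with sum(), which does treat every argument as data.

```r
# One argument: a vector holding both numbers.
mean(c(1, 9))        # 5

# Two arguments: only the first is data; 9 is matched to `trim`.
mean(1, 9)           # 1
mean(1, trim = 9)    # the same call, written out explicitly

# The formal arguments of the default method show where the 9 goes.
args(mean.default)

# By contrast, sum() accepts its data through `...`.
sum(1, 9)            # 10
```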
I personally never use tapply(), the first method covered here, as it is more for dealing with data in vectors, and there are better APIs for data stored in data frames. But as you may encounter it, we will illustrate it.
For instance, above, we stored the breaks data in the vBreaks vector, and the vector that records the wool type for each entry in vBreaks in vWool. Sometimes the data is just stored in vectors and not dataframes, and you just have to deal with it.
tapply() is a base function that deals with making summaries of the numbers in one vector when a second vector indicates which category each number in the first vector belongs to. It is easier to see with the data:
vBreaks
## [1] 26 30 54 25 70 52 51 26 67 18 21 29 17 12 18 35 30 36 36 21 24 18 10 43 28 15 26 27 14 29 19 29 31 41 20 44 42
## [38] 26 19 16 39 28 21 39 29 20 21 24 17 13 15 15 16 28
vWool
## [1] A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B B B B B B B
## Levels: A B
tapply(vBreaks, vWool, mean)
## A B
## 31.03704 25.25926
So, these are the means of the two different wool groups, and tapply() knows which numbers belong to which group from the entries in the vWool vector.
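One way to picture what tapply() is doing is to perform the same steps by hand: split the numeric vector into one piece per group, then apply the function to each piece. The sketch below rebuilds vBreaks and vWool from the built-in warpbreaks data so that it runs on its own; it illustrates the idea and is not a description of how tapply() is implemented internally.

```r
vBreaks <- warpbreaks$breaks
vWool   <- warpbreaks$wool

# Step 1: split the numbers into a list with one element per wool type.
groups <- split(vBreaks, vWool)
str(groups)

# Step 2: apply the summary function to each element.
by_hand <- sapply(groups, mean)

# tapply() does both steps in one call.
via_tapply <- tapply(vBreaks, vWool, mean)

all.equal(as.vector(by_hand), as.vector(via_tapply))   # TRUE
```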
If you are so pressed, you can actually do multi-dimensional summaries with tapply(). Because tension is also a factor, you can still use tapply() to find the means of all of the wool/tension level combinations:
tapply(vBreaks, list(vWool, vTension), mean)
## L M H
## A 44.55556 24.00000 24.55556
## B 28.22222 28.77778 18.77778
It is an array that you get as an output (like a matrix). This may be what you want. I rarely want this.
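If the matrix form is not what you want, it can be reshaped into a long data frame with one row per combination. The following is a sketch using base R only; naming the list elements (wool, tension) and the column name breaks_mean are my own choices for illustration.

```r
# Naming the list elements gives the result named dimensions.
m <- tapply(warpbreaks$breaks,
            list(wool = warpbreaks$wool, tension = warpbreaks$tension),
            mean)

# Matrix-style lookup: row = wool, column = tension.
m["B", "M"]

# Reshape to long form: one row per wool/tension combination.
long <- as.data.frame(as.table(m), responseName = "breaks_mean")
print(long)
```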
by() is another aggregating function, again one I don't use much. According to its documentation, by() is a convenience wrapper for tapply() in order to easily apply it to dataframes. I've had some trouble in that department. What it really seems better at is returning general objects per level-combination.
Note that:
- You have to extract the columns you want as a dataframe (or list of vectors).
- Your function gets passed a dataframe with the subset of rows related to a particular combination of factor levels dictated by the values in the second argument. Therefore, your function needs to deal with dataframes, and not vectors -- you may need to unpack a vector from the dataframe passed to it.
byObject <- by(dat
, dat[,c("wool", "tension")]
, function(df) mean(df$breaks))
byObject
## wool: A
## tension: L
## [1] 44.55556
## ---------------------------------------------------------------------------------------
## wool: B
## tension: L
## [1] 28.22222
## ---------------------------------------------------------------------------------------
## wool: A
## tension: M
## [1] 24
## ---------------------------------------------------------------------------------------
## wool: B
## tension: M
## [1] 28.77778
## ---------------------------------------------------------------------------------------
## wool: A
## tension: H
## [1] 24.55556
## ---------------------------------------------------------------------------------------
## wool: B
## tension: H
## [1] 18.77778
Since by() can handle more complex objects than tapply(), we should look at what beast by() actually returns in this case:
str(byObject)
## by [1:2, 1:3] 44.6 28.2 24 28.8 24.6 ...
## - attr(*, "dimnames")=List of 2
## ..$ wool : chr [1:2] "A" "B"
## ..$ tension: chr [1:3] "L" "M" "H"
## - attr(*, "call")= language by.data.frame(data = dat, INDICES = dat[, c("wool", "tension")], FUN = function(df) mean(df$breaks))
You get back an object that by() tools know how to handle, and so it is a little complex under the hood. For instance, how would you extract the value where wool = 'B' and tension = 'M'?
byObject["B", "M"]
## [1] 28.77778
I don't like that -- you have to really know the internals of byObject, and that it has wool as the first index and tension as the second one.
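One possible workaround, sketched here with base R only, is to flatten the by object into a data frame so that values can be looked up by column name rather than by index position. It relies on the object being a plain numeric array with named dimnames, as the str() output above shows; the names long and breaks_mean are hypothetical.

```r
byObject <- by(warpbreaks,
               warpbreaks[, c("wool", "tension")],
               function(df) mean(df$breaks))

# expand.grid() varies the first factor fastest, which matches the
# column-major order in which the array stores its values.
long <- data.frame(expand.grid(dimnames(byObject)),
                   breaks_mean = as.vector(byObject))
print(long)

# Look up a value by name instead of by position.
long$breaks_mean[long$wool == "B" & long$tension == "M"]
```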
summary() computes means and medians on a dataframe easily, and to get it to do that on a per wool/tension combination, you can use by():
byObject <- by(dat[,c("breaks", "qscore")]
, dat[,c("wool", "tension")]
, summary)
byObject
## wool: A
## tension: L
## breaks qscore
## Min. :25.00 Min. :11.27
## 1st Qu.:26.00 1st Qu.:12.04
## Median :51.00 Median :12.31
## Mean :44.56 Mean :12.24
## 3rd Qu.:54.00 3rd Qu.:12.36
## Max. :70.00 Max. :13.43
## ---------------------------------------------------------------------------------------
## wool: B
## tension: L
## breaks qscore
## Min. :14.00 Min. :73.82
## 1st Qu.:20.00 1st Qu.:74.66
## Median :29.00 Median :75.06
## Mean :28.22 Mean :75.13
## 3rd Qu.:31.00 3rd Qu.:75.50
## Max. :44.00 Max. :77.10
## ---------------------------------------------------------------------------------------
## wool: A
## tension: M
## breaks qscore
## Min. :12 Min. :23.00
## 1st Qu.:18 1st Qu.:23.16
## Median :21 Median :23.49
## Mean :24 Mean :23.60
## 3rd Qu.:30 3rd Qu.:23.89
## Max. :36 Max. :24.96
## ---------------------------------------------------------------------------------------
## wool: B
## tension: M
## breaks qscore
## Min. :16.00 Min. :37.52
## 1st Qu.:21.00 1st Qu.:38.51
## Median :28.00 Median :39.11
## Mean :28.78 Mean :39.13
## 3rd Qu.:39.00 3rd Qu.:40.26
## Max. :42.00 Max. :40.28
## ---------------------------------------------------------------------------------------
## wool: A
## tension: H
## breaks qscore
## Min. :10.00 Min. : 99.5
## 1st Qu.:18.00 1st Qu.:100.0
## Median :24.00 Median :100.0
## Mean :24.56 Mean :100.2
## 3rd Qu.:28.00 3rd Qu.:100.4
## Max. :43.00 Max. :100.9
## ---------------------------------------------------------------------------------------
## wool: B
## tension: H
## breaks qscore
## Min. :13.00 Min. :50.19
## 1st Qu.:15.00 1st Qu.:50.99
## Median :17.00 Median :51.48
## Mean :18.78 Mean :51.61
## 3rd Qu.:21.00 3rd Qu.:51.84
## Max. :28.00 Max. :53.65
What does that object look like?
str(byObject)
## List of 6
## $ : 'table' chr [1:6, 1:2] "Min. :25.00 " "1st Qu.:26.00 " "Median :51.00 " "Mean :44.56 " ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "" "" "" "" ...
## .. ..$ : chr [1:2] " breaks" " qscore"
## $ : 'table' chr [1:6, 1:2] "Min. :14.00 " "1st Qu.:20.00 " "Median :29.00 " "Mean :28.22 " ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "" "" "" "" ...
## .. ..$ : chr [1:2] " breaks" " qscore"
## $ : 'table' chr [1:6, 1:2] "Min. :12 " "1st Qu.:18 " "Median :21 " "Mean :24 " ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "" "" "" "" ...
## .. ..$ : chr [1:2] " breaks" " qscore"
## $ : 'table' chr [1:6, 1:2] "Min. :16.00 " "1st Qu.:21.00 " "Median :28.00 " "Mean :28.78 " ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "" "" "" "" ...
## .. ..$ : chr [1:2] " breaks" " qscore"
## $ : 'table' chr [1:6, 1:2] "Min. :10.00 " "1st Qu.:18.00 " "Median :24.00 " "Mean :24.56 " ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "" "" "" "" ...
## .. ..$ : chr [1:2] " breaks" " qscore"
## $ : 'table' chr [1:6, 1:2] "Min. :13.00 " "1st Qu.:15.00 " "Median :17.00 " "Mean :18.78 " ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "" "" "" "" ...
## .. ..$ : chr [1:2] " breaks" " qscore"
## - attr(*, "dim")= int [1:2] 2 3
## - attr(*, "dimnames")=List of 2
## ..$ wool : chr [1:2] "A" "B"
## ..$ tension: chr [1:3] "L" "M" "H"
## - attr(*, "call")= language by.data.frame(data = dat[, c("breaks", "qscore")], INDICES = dat[, c("wool", "tension")], FUN = summary)
## - attr(*, "class")= chr "by"
It's a little more difficult to get at the numerical parts of that output easily, so I don't do things this way. Frankly, I don't ever use by()...
ave() is very much like tapply(), but with one twist -- it doesn't return a summary vector, but a vector the same length as the original numeric vector, with the group mean repeated at each of the appropriate indices:
ave(vBreaks, list(vWool, vTension))
## [1] 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 44.55556 24.00000 24.00000 24.00000
## [13] 24.00000 24.00000 24.00000 24.00000 24.00000 24.00000 24.55556 24.55556 24.55556 24.55556 24.55556 24.55556
## [25] 24.55556 24.55556 24.55556 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222 28.22222
## [37] 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 28.77778 18.77778 18.77778 18.77778
## [49] 18.77778 18.77778 18.77778 18.77778 18.77778 18.77778
This is helpful if you want to attach a column of mean-per-level values on the original raw or granular dataset:
datt <- dat
datt$breaks_mean <- ave(vBreaks, list(vWool, vTension))
head(datt, 11)
## # A tibble: 11 × 5
## breaks wool tension qscore breaks_mean
## <dbl> <fctr> <fctr> <dbl> <dbl>
## 1 26 A L 12.31153 44.55556
## 2 30 A L 12.31437 44.55556
## 3 54 A L 12.35929 44.55556
## 4 25 A L 11.26953 44.55556
## 5 70 A L 12.03573 44.55556
## 6 52 A L 12.11298 44.55556
## 7 51 A L 13.42855 44.55556
## 8 26 A L 12.98340 44.55556
## 9 67 A L 11.37754 44.55556
## 10 18 A M 23.52281 24.00000
## 11 21 A M 23.00161 24.00000
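ave() is not limited to means. As a sketch, the snippet below uses a different summary function via the FUN argument and then uses the repeated group means to center each observation within its group. Vectors are rebuilt from the built-in warpbreaks data so the block runs on its own; the short names b, w and tn are just local stand-ins.

```r
b  <- warpbreaks$breaks
w  <- warpbreaks$wool
tn <- warpbreaks$tension

# FUN must be passed by name, because every unnamed argument after the
# first is treated as a grouping variable.
group_median <- ave(b, w, tn, FUN = median)
head(group_median)

# Group-centering: subtract each row's own group mean.
centered <- b - ave(b, w, tn)

# Within every wool/tension group the centered values now sum to ~0.
tapply(centered, list(w, tn), sum)
```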
If your data is in a dataframe, which it usually is, it is often easier to use functions tailored for use with data frames, like aggregate(). Below is an example:
aggregate(formula=breaks ~ wool
, data = dat
, FUN=mean)
## wool breaks
## 1 A 31.03704
## 2 B 25.25926
Here, we feed aggregate() three parameters.
First is formula, which tells R what the numeric variables are that you want to compute statistics on (here, breaks), and what the classification factors are (here, wool). For aggregate(), the ~ separates the numeric variables, which are on the left side of ~, from the classification factors, which are to the right of the ~. We'll do a more complex example of a formula in a later example.
The second parameter is data, and it names the dataframe in which we find the column names we use in the formula parameter.
The third parameter is FUN, which is the function we apply to the numeric variable. Here we apply mean to the column breaks.
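Note that the parameter names are usually omitted, since the arguments are then matched by position. Leaving the formula unnamed is also the more portable choice: as far as I know, from R 4.2.0 onward the formula method calls its first argument x rather than formula, so the named form above only runs on older versions. This will give the same thing: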
aggregate(breaks ~ wool
, dat
, mean)
## wool breaks
## 1 A 31.03704
## 2 B 25.25926
What if I want the mean by the levels of wool and tension together?
aggregate(breaks ~ wool + tension
, dat
, mean)
## wool tension breaks
## 1 A L 44.55556
## 2 B L 28.22222
## 3 A M 24.00000
## 4 B M 28.77778
## 5 A H 24.55556
## 6 B H 18.77778
It is as simple as putting all of the columns you want to use as discriminating factors, separated by + symbols, on the right side of the ~.
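Because aggregate() hands back an ordinary data frame, picking out one group's value is a plain row filter on named columns. The sketch below uses the built-in warpbreaks data so it runs standalone; the name res is arbitrary.

```r
res <- aggregate(breaks ~ wool + tension, warpbreaks, mean)
class(res)    # "data.frame"

# Look up the wool B / tension M mean by column values.
res[res$wool == "B" & res$tension == "M", "breaks"]

# Rows come back with the first factor varying fastest; reorder if you
# prefer them sorted by wool and then tension.
res[order(res$wool, res$tension), ]
```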
How about if we want the mean of breaks and qscore in the output?
## WARNING: THIS IS INCORRECT
aggregate(breaks + qscore ~ wool + tension
, dat
, mean)
## wool tension breaks + qscore
## 1 A L 56.79921
## 2 B L 103.35137
## 3 A M 47.60027
## 4 B M 67.91183
## 5 A H 124.75222
## 6 B H 70.39066
OK, that didn't work as expected. It took the mean of the result of adding breaks to qscore. We can fix that with:
aggregate(cbind(breaks, qscore) ~ wool + tension
, dat
, mean)
## wool tension breaks qscore
## 1 A L 44.55556 12.24366
## 2 B L 28.22222 75.12915
## 3 A M 24.00000 23.60027
## 4 B M 28.77778 39.13406
## 5 A H 24.55556 100.19666
## 6 B H 18.77778 51.61288
So, you have to cbind() together the columns you want to apply means to. Ugh.
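When you want every remaining column summarized, the formula interface offers a shorthand: a dot on the left side stands for all columns not named on the right. The sketch below demonstrates this on a stand-in data frame; breaks_sq is a hypothetical second numeric column invented here so the block does not depend on the randomly generated qscore.

```r
# Stand-in for `dat`: warpbreaks plus a made-up second numeric column.
d <- warpbreaks
d$breaks_sq <- d$breaks^2

# Explicit version: cbind() the columns to summarize.
explicit <- aggregate(cbind(breaks, breaks_sq) ~ wool + tension, d, mean)

# Shorthand: "." means "all the other columns".
shorthand <- aggregate(. ~ wool + tension, d, mean)

all.equal(explicit$breaks, shorthand$breaks)         # TRUE
all.equal(explicit$breaks_sq, shorthand$breaks_sq)   # TRUE
```

The shorthand only makes sense when every non-grouping column is numeric, since FUN is applied to all of them.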
What if you want to calculate the mean and median of breaks in one output?
aggregate(breaks ~ wool + tension
, dat
, function(e) c("xbar"=mean(e), "xm"=median(e)))
## wool tension breaks.xbar breaks.xm
## 1 A L 44.55556 51.00000
## 2 B L 28.22222 29.00000
## 3 A M 24.00000 21.00000
## 4 B M 28.77778 28.00000
## 5 A H 24.55556 24.00000
## 6 B H 18.77778 17.00000
The output names get appended to the variable name, but you can rename in post-processing if you want to.
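One thing to watch for in that post-processing: although the printout looks like it has separate breaks.xbar and breaks.xm columns, the result actually holds a single column named breaks that contains a two-column matrix. The sketch below shows this on the built-in warpbreaks data and uses a common base R idiom, do.call(data.frame, ...), to flatten it; the names res and flat are arbitrary.

```r
res <- aggregate(breaks ~ wool + tension, warpbreaks,
                 function(e) c(xbar = mean(e), xm = median(e)))

names(res)              # "wool" "tension" "breaks"
is.matrix(res$breaks)   # TRUE
colnames(res$breaks)    # "xbar" "xm"

# Reach into the matrix column ...
res$breaks[, "xbar"]

# ... or flatten it into ordinary columns.
flat <- do.call(data.frame, res)
names(flat)             # "wool" "tension" "breaks.xbar" "breaks.xm"
```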
Note here that for the first time we used an anonymous function. Instead of the name of an existing function (mean), we created a function on the fly that took a vector (e) and returned a vector with different computations done to that vector. Using anonymous functions is a very powerful tool to help you get the output you want.
You can use this in conjunction with cbind() if you want:
aggregate(cbind(breaks, qscore) ~ wool + tension
, dat
, function(e) c("xbar"=mean(e), "xm"=median(e)))
## wool tension breaks.xbar breaks.xm qscore.xbar qscore.xm
## 1 A L 44.55556 51.00000 12.24366 12.31153
## 2 B L 28.22222 29.00000 75.12915 75.06405
## 3 A M 24.00000 21.00000 23.60027 23.48899
## 4 B M 28.77778 28.00000 39.13406 39.11120
## 5 A H 24.55556 24.00000 100.19666 100.01140
## 6 B H 18.77778 17.00000 51.61288 51.47617
And in this case, appending the output name to the variable name is helpful.
Using a formula to tell aggregate() what to do isn't the only way of doing it. Here, the first argument is a data frame of the numeric columns to operate on, the second argument is a list of vectors that represent the combination of factor levels, and the third argument is still the function to use on the numeric data.
aggregate(dat[,c("breaks", "qscore")]
, list(wool=dat$wool, tension=dat$tension)
, function(e) c("xbar"=mean(e), "xm"=median(e)))
## wool tension breaks.xbar breaks.xm qscore.xbar qscore.xm
## 1 A L 44.55556 51.00000 12.24366 12.31153
## 2 B L 28.22222 29.00000 75.12915 75.06405
## 3 A M 24.00000 21.00000 23.60027 23.48899
## 4 B M 28.77778 28.00000 39.13406 39.11120
## 5 A H 24.55556 24.00000 100.19666 100.01140
## 6 B H 18.77778 17.00000 51.61288 51.47617
To be honest, even if I get data in vectors, and I just have base R, instead of using tapply() on the vectors to do summaries, I'll turn the vectors into a dataframe and use aggregate(). It takes a bit more code, and is a little less efficient, but I like the formula interface for aggregate() so much that it usually outweighs these cons.
You just shove the vectors into a dataframe, and then use the aggregate() formula interface. Here, I shove the vBreaks, vWool, and vTension vectors back into a dataframe:
aggregate( breaks ~ wool + tension
, data.frame(breaks=vBreaks, wool=vWool, tension=vTension)
, mean)
## wool tension breaks
## 1 A L 44.55556
## 2 B L 28.22222
## 3 A M 24.00000
## 4 B M 28.77778
## 5 A H 24.55556
## 6 B H 18.77778
The base functions have their peculiarities and difficulties, so people have attempted to fix, or augment, the language with packages that provide a smoother API for doing summary-by-group processing. Below are the ones I've used.
plyr is a Hadley Wickham package that tries to provide a uniform API for group processing of arrays, lists, and dataframes. It is worth exploring the whole set of these functions, which I explain here in a very similar document: https://gist.github.com/mpettis/70dcb33f7328e21ec485fdf8727c97ef .
For now, we will just look at the following.
suppressPackageStartupMessages(library(plyr))
## Warning: package 'plyr' was built under R version 3.3.2
Let's look at making a dataframe of per wool/tension means for our dataset. We'll use the plyr function ddply(). You can read about it at the above link, but the mnemonic for this function is that the first two letters, dd, tell you that it expects a dataframe as input (the first d) and returns a dataframe (the second d). Here's how it works:
ddply(dat, ~ wool + tension, function(df) mean(df$breaks))
## wool tension V1
## 1 A L 44.55556
## 2 A M 24.00000
## 3 A H 24.55556
## 4 B L 28.22222
## 5 B M 28.77778
## 6 B H 18.77778
What's going on here:
- The first argument is the dataframe to do computations on.
- The second argument is a formula, starting with a ~, and the terms to the right of it indicate the factor columns.
- The third argument is a function that takes a dataframe. ddply() feeds it one subsetted dataframe for each wool/tension level combination. The function returns a value in this case, and the results are reassembled into a dataframe whose leading columns come from the second argument, with the last column holding the function's return value for that wool/tension combination.
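The split, apply, and reassemble steps in that list can be mimicked in base R, which may help make the mechanics less magical. This is only a sketch of the idea and not how plyr is implemented; the names pieces, results and combined are hypothetical, and the built-in warpbreaks data is used so the block runs without plyr installed.

```r
# Split: one sub-dataframe per wool/tension combination.
pieces <- split(warpbreaks,
                list(warpbreaks$wool, warpbreaks$tension),
                drop = TRUE)
names(pieces)

# Apply: the function receives a dataframe and returns a one-row dataframe
# that carries the group labels along with the summary.
results <- lapply(pieces, function(df) {
  data.frame(wool        = df$wool[1],
             tension     = df$tension[1],
             breaks_mean = mean(df$breaks))
})

# Combine: stack the one-row results back into a single dataframe.
combined <- do.call(rbind, results)
rownames(combined) <- NULL
print(combined)
```

With ddply() the label columns are attached for you, so the function only has to return the summary itself.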
However, I don't like that the last column is named V1 (a default name). But I can easily fix that by recoding the function to return a dataframe:
ddply(dat, ~ wool + tension, function(df) data.frame(breaks_mean=mean(df$breaks)))
## wool tension breaks_mean
## 1 A L 44.55556
## 2 A M 24.00000
## 3 A H 24.55556
## 4 B L 28.22222
## 5 B M 28.77778
## 6 B H 18.77778
Note that ddply() does the right thing and merges the returned dataframe with the wool/tension columns. This is a nice behavior.
One nice thing is that, with anonymous functions, you can have multi-step code in that last argument:
ddply(dat
, ~ wool + tension
, function(df) {
vlBreaks <- df$breaks
breaks_mean <- mean(vlBreaks)
data.frame(breaks_mean)
})
## wool tension breaks_mean
## 1 A L 44.55556
## 2 A M 24.00000
## 3 A H 24.55556
## 4 B L 28.22222
## 5 B M 28.77778
## 6 B H 18.77778
This allows me a lot of flexibility on operating on the groups.
I can return means and medians, like above:
ddply(dat
, ~ wool + tension
, function(df) {
vlBreaks <- df$breaks
breaks_mean <- mean(vlBreaks)
breaks_median <- median(vlBreaks)
data.frame(breaks_mean, breaks_median)
})
## wool tension breaks_mean breaks_median
## 1 A L 44.55556 51
## 2 A M 24.00000 21
## 3 A H 24.55556 24
## 4 B L 28.22222 29
## 5 B M 28.77778 28
## 6 B H 18.77778 17
And it does the right thing.
Note that you need not use a formula interface for the second argument. You can wrap your factor column names in .(), like the list interface of aggregate() above:
ddply(dat
, .(wool, tension)
, function(df) {
vlBreaks <- df$breaks
breaks_mean <- mean(vlBreaks)
breaks_median <- median(vlBreaks)
data.frame(breaks_mean, breaks_median)
})
## wool tension breaks_mean breaks_median
## 1 A L 44.55556 51
## 2 A M 24.00000 21
## 3 A H 24.55556 24
## 4 B L 28.22222 29
## 5 B M 28.77778 28
## 6 B H 18.77778 17
These functions allow you some flexibility. If you want the answer back as a matrix, and not a dataframe, you can use daply(), where the second letter a means 'return an array':
daply(dat
, ~ wool + tension
, function(df) {
vlBreaks <- df$breaks
breaks_mean <- mean(vlBreaks)
breaks_mean
})
## tension
## wool L M H
## A 44.55556 24.00000 24.55556
## B 28.22222 28.77778 18.77778
Or a list:
dlply(dat
, ~ wool + tension
, function(df) {
vlBreaks <- df$breaks
breaks_mean <- mean(vlBreaks)
breaks_mean
})
## $A.L
## [1] 44.55556
##
## $A.M
## [1] 24
##
## $A.H
## [1] 24.55556
##
## $B.L
## [1] 28.22222
##
## $B.M
## [1] 28.77778
##
## $B.H
## [1] 18.77778
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## wool tension
## 1 A L
## 2 A M
## 3 A H
## 4 B L
## 5 B M
## 6 B H
Use an output object type that is convenient for your needs.
dplyr is Hadley Wickham's sequel to the plyr package. He was unhappy with a few aspects of plyr, and, as most people like to operate on data frames, he decided to make a package that focused solely on dataframes.
dplyr introduces some significant conceptual changes. One is the concept of 'piping', where the result of one calculation is sent on to the next function and used implicitly as that function's first argument. That'll be shown below. The other is the concept of 'verbs': the functions are thought of as 'verbs', or 'transformations', that can be applied sequentially to a dataframe.
Here's an example:
suppressPackageStartupMessages(library(dplyr))
First, let's take the mean of everything at once for breaks:
summarise(dat, breaks_mean=mean(breaks))
## breaks_mean
## 1 28.14815
That's about as simple as you get. To get the flavor for piping, we rewrite this with a %>% operator that takes what is on the left side (which may be a calculation result), and sticks it in as the first argument of the function on its right side:
dat %>% summarise(breaks_mean=mean(breaks))
## breaks_mean
## 1 28.14815
Compare the syntax, and the answers should be identical.
Now, to do this summarization by groups, we need to tell the dataframe what columns to group by, and then pass it to the summarise() function:
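## Call dplyr::summarise() explicitly. If plyr was attached after dplyr,
## a bare summarise() resolves to plyr's version, which ignores the grouping
## and returns only the single overall mean (28.14815).
dat %>%
  group_by(wool, tension) %>%
  dplyr::summarise(breaks_mean=mean(breaks))
## wool tension breaks_mean
## 1 A L 44.55556
## 2 A M 24.00000
## 3 A H 24.55556
## 4 B L 28.22222
## 5 B M 28.77778
## 6 B H 18.77778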
Note the piping structure. That is equivalent to the following code, which is written in recognizable nested function calls, but the former is arguably easier to read:
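## dplyr:: prefix used so that plyr's summarise(), which ignores groups, is not picked up
dplyr::summarise(group_by(dat, wool, tension), breaks_mean=mean(breaks))
## wool tension breaks_mean
## 1 A L 44.55556
## 2 A M 24.00000
## 3 A H 24.55556
## 4 B L 28.22222
## 5 B M 28.77778
## 6 B H 18.77778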
In this case, you have to read from the inside out ("group_by is the inner call, then summarise is the outer one..."). Compare that to the previous example, where you can read %>% as the word 'then', and your flow becomes, "Take dat, then group it by wool and tension, then summarise the breaks column with a mean."
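The 'then' reading is not specific to dplyr. As a small sketch of the same idea without any packages, the snippet below uses R's native pipe |>, which assumes R 4.1 or newer and is a different operator from %>%, although for simple first-argument insertion like this the two behave alike. The built-in warpbreaks data is used so the block runs standalone.

```r
# Nested form: read from the inside out.
nested <- nrow(subset(warpbreaks, tension == "L"))

# Piped form: "take warpbreaks, then keep tension L, then count rows".
piped <- warpbreaks |>
  subset(tension == "L") |>
  nrow()

identical(nested, piped)   # TRUE

# The left side always becomes the first argument of the call on the right.
wool_a_means <- warpbreaks |>
  subset(wool == "A") |>
  with(tapply(breaks, tension, mean))
wool_a_means
```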
I am sure there are more examples, but this document should take you through some of the more common ways of summarizing data you may see. In addition, it should give you a cookbook for starting to write your own summary code.