The dplyr
package is a great toolset to do data manipulation and
exploration. It presents a well-crafted api that allows you to flow data
through a chain of computations that can reduce code complexity (and
cognitive overhead). It also reduces typing because it can use
non-standard evaluation (NSE), which means, in practical terms, that you
can refer to things like column names in a dataframe without having to
surround it in quotes, and make computations on those columns just as
you would without using dplyr
.
NSE is great for interactive work. It is especially suited to times when
you know exactly which columns of a data frame you want to compute on.
For instance, let's look at the CO2
dataset, which comes with R
:
CO2 %>% tbl_df()
## Source: local data frame [84 x 5]
##
## Plant Type Treatment conc uptake
## <fctr> <fctr> <fctr> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## .. ... ... ... ... ...
If you want to compute the mean of the conc
field with dplyr
interactively, that is very straightforward:
CO2 %>%
summarise(metric_mean = mean(conc))
## metric_mean
## 1 435
However, what happens when you want to configure or dynamically discover
the columns you want to compute the mean on, or loop over all numeric
columns of a dataframe, and compute the means of all of the numeric
columns? That isn't straightforward with dplyr
as we know it so far:
## Configure the metric you want to compute on
target_metric <- "conc"
## Won't work: tries to find column in CO2 called 'target_metric',
## doesn't resolve 'target_metric' to the column name 'conc' and
## try to take the mean of the 'conc' column.
tryCatch({
CO2 %>%
summarise(metric_mean = mean(target_metric))
}, error=function(e) print(e))
## Warning in mean.default("conc"): argument is not numeric or logical: returning NA
## metric_mean
## 1 NA
That is the problem we address here -- how to get that second, dynamic
example to work with dplyr
.
This solution depends on the differences between Non-Standard Evaluation (NSE) and Standard Evaluation (SE). You can read about NSE and SE at the following links:
- https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
- http://adv-r.had.co.nz/Computing-on-the-language.html
Those links are for more comprehensive understandings of what is going on, and I leave it to them to do the explaining with that. Here, we will try to understand as much as we can, but focus on finding forms and templates that will get the job done. To that end, we will be giving mostly a set of examples that can serve as templates, with explanations being sparse and deferred to the links above.
This problem is well understood by the dplyr
author (Hadley Wickham),
and is addressed at the links above. He frames his solution in terms of
having a NSE function and a corresponding SE function. The NSE functions
are the ones that are the common ones, used in interactive analysis
(like select()
, mutate()
, summarise()
), and each of those
functions has a corresponding SE escape hatch that is a function with
the same name, but with a trailing underscore (_
) at the end of the
name (which are select_()
, mutate_()
, and summarise_()
,
respectively, for the examples above).
The nice thing about this is that it is pretty straightforward to
translate a NSE statement into a SE one by changing the NSE function to
it's corresponding SE one (by adding the _
) and modifying the function
arguments slightly.
To the examples...
The key thing to observe when translating from NSE to SE is moving from
select()
to select_()
and changing putting in bare column names to
passing a vector of string column names to the .dots
parameter of
select_()
.
NSE:
select(CO2, Plant, Type) %>% tbl_df()
## Source: local data frame [84 x 2]
##
## Plant Type
## <fctr> <fctr>
## 1 Qn1 Quebec
## 2 Qn1 Quebec
## 3 Qn1 Quebec
## 4 Qn1 Quebec
## 5 Qn1 Quebec
## 6 Qn1 Quebec
## 7 Qn1 Quebec
## 8 Qn2 Quebec
## 9 Qn2 Quebec
## 10 Qn2 Quebec
## .. ... ...
SE:
target_cols <- c("Plant", "Type")
select_(CO2, .dots=target_cols) %>% tbl_df()
## Source: local data frame [84 x 2]
##
## Plant Type
## <fctr> <fctr>
## 1 Qn1 Quebec
## 2 Qn1 Quebec
## 3 Qn1 Quebec
## 4 Qn1 Quebec
## 5 Qn1 Quebec
## 6 Qn1 Quebec
## 7 Qn1 Quebec
## 8 Qn2 Quebec
## 9 Qn2 Quebec
## 10 Qn2 Quebec
## .. ... ...
I prefer to use the magrittr
piping style of using dplyr
, as do many
others, and will rewrite these examples in that style, and do future
examples in this style.
## NSE
CO2 %>%
select(Plant, Type) %>%
tbl_df()
## Source: local data frame [84 x 2]
##
## Plant Type
## <fctr> <fctr>
## 1 Qn1 Quebec
## 2 Qn1 Quebec
## 3 Qn1 Quebec
## 4 Qn1 Quebec
## 5 Qn1 Quebec
## 6 Qn1 Quebec
## 7 Qn1 Quebec
## 8 Qn2 Quebec
## 9 Qn2 Quebec
## 10 Qn2 Quebec
## .. ... ...
## SE:
target_cols <- c("Plant", "Type")
CO2 %>%
select_(.dots=target_cols) %>%
tbl_df()
## Source: local data frame [84 x 2]
##
## Plant Type
## <fctr> <fctr>
## 1 Qn1 Quebec
## 2 Qn1 Quebec
## 3 Qn1 Quebec
## 4 Qn1 Quebec
## 5 Qn1 Quebec
## 6 Qn1 Quebec
## 7 Qn1 Quebec
## 8 Qn2 Quebec
## 9 Qn2 Quebec
## 10 Qn2 Quebec
## .. ... ...
Here we want to compute some summarized columns. I'll show the NSE way, and then a few different possible SE approaches.
NSE way:
CO2 %>%
summarise(metric_mean = mean(conc)
, metrix_max = max(uptake)) %>%
tbl_df()
## Source: local data frame [1 x 2]
##
## metric_mean metrix_max
## <dbl> <dbl>
## 1 435 45.5
Some things to note:
summarise()
computes two separate summary columns:metric_mean
andmetric_max
.- The computations are all bare code, not explicitly quoted or escaped in any way. This is what reduces typing.
SE solution 1: creating code as a string to be evaluated.
mean_col_to_compute <- "conc"
max_col_to_compute <- "uptake"
CO2 %>%
summarise_(.dots=c(metric_mean = sprintf("mean(%s)", mean_col_to_compute)
, metrix_max = sprintf("max(%s)", max_col_to_compute))) %>%
tbl_df()
## Source: local data frame [1 x 2]
##
## metric_mean metrix_max
## <dbl> <dbl>
## 1 435 45.5
Note:
- We use
sprintf()
to construct a string that has theR
expression that ends up getting evaluated. - The argument to
.dots
is a named vector, which mixes bare names (metric_mean
) and a string containing the code to evaluate.
Evaluating a string gives very good flexibility in constructing customizible functions to compute.
SE solution 2: Using formulas
A second approach is to use formulas rather than evaluated strings. One of the advantages to this approach is that there is that formulas can take an environment, and can leverage environment hierarchies to get at variable values in complex situations. I don't leverage this (yet), but it is good to be aware of:
mean_col_to_compute <- "conc"
max_col_to_compute <- "uptake"
dots <- setNames(
# Formulas to compute
list(lazyeval::interp(~ mean(x), x=as.name(mean_col_to_compute))
, lazyeval::interp(~ max(x), x=as.name(max_col_to_compute)))
# names to assign to formula computations
, c("metric_mean", "metric_max"))
## Do the computation
CO2 %>%
summarise_(.dots=dots) %>%
tbl_df()
## Source: local data frame [1 x 2]
##
## metric_mean metric_max
## <dbl> <dbl>
## 1 435 45.5
Notes:
- We create
dots
as it's own object, and use that object to pass directly to the.dots
argument ofsummarise_()
. - We create
dots
as a named list, rather than a named vector, as in the preivous string example. - The key function is
lazyeval::interp()
, and you need to install thelazyeval
package. This is where the specific construction of the formula (function) happens.
dplyr
isn't the only package that deals with formulas. You may want to
construct formulas for other functions. For a reason I don't yet know,
you cannot always use lazyeval::interp()
, but you can use a different
function from a different package (see below).
NSE-like example (using a formula):
aggregate( uptake ~ Type
, data = CO2
, mean)
## Type uptake
## 1 Quebec 33.54286
## 2 Mississippi 20.88333
What if you want to dynamically construct the first formula?
tvar <- "uptake"
bvar <- "Type"
aggregate( eval(pryr::subs(tvar ~ bvar, list(tvar = as.name(tvar), bvar = as.name(bvar))))
, data = CO2
, mean)
## Type uptake
## 1 Quebec 33.54286
## 2 Mississippi 20.88333
Notes:
pryr::subs()
is what we use instead oflazyeval::interp()
. I cannot yet explain yet why the latter won't work.- You need to
eval
the return value ofsubs()
, becausesubs()
returns a quoted value. Thequote()
/eval()
semantics takes some work to get to know when you need a quoted vs. an evaluated expression. The first two links help you start to get a feel for this.
One common use case I have is that I get a data set with a lot of columns, and I'd like to repeat a type of analysis on each column of a certain type, such as all of the character columns, or all of the numeric columns. We can use SE to repeat analysis when we don't know the column names ahead of time.
For example, let's count the unique values in the character columns of the provided data set...
One caveat: CO2
has factors, and I prefer to work with character
vectors, not factors in general, so we will convert the factor columns
to character cols and store it as dat
. Here, I use the purrr
package, another one by Hadley Wickham, and it is great for getting
tools into R
for better functional programming. See:
https://github.com/hadley/purrr .
## http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
dat <- CO2 %>% purrr::map_if(is.factor, as.character) %>% as_data_frame()
## Observe different types of columns before and after transform.
## CO2 has factors, while dat has character columns.
CO2 %>% tbl_df()
## Source: local data frame [84 x 5]
##
## Plant Type Treatment conc uptake
## <fctr> <fctr> <fctr> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## .. ... ... ... ... ...
dat %>% tbl_df()
## Source: local data frame [84 x 5]
##
## Plant Type Treatment conc uptake
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## .. ... ... ... ... ...
Now, let's process each of the character columns and compute how many unique values there are:
First, the NSE way.
dat %>%
summarise(nuniq_Plant = Plant %>% unique() %>% length()
, nuniq_Type = Type %>% unique() %>% length()
, nuniq_Treatment = Treatment %>% unique() %>% length())
## Source: local data frame [1 x 3]
##
## nuniq_Plant nuniq_Type nuniq_Treatment
## <int> <int> <int>
## 1 12 2 2
The obvious drawback here is that I had to already know that Plant
,
Type
, and Treatment
were the character columns. I could discover
them though. For fun, let's demo some possible ways to do this:
## Classic, sapply and match on column class
dat[,sapply(dat, class) == "character"] %>% names()
## [1] "Plant" "Type" "Treatment"
#dat[,sapply(dat, is.character)] %>% names()
## More functional, use `Filter`, still base R
## This is my preferred approach, though the api of argument values
## is backwards from dplyr, in that the data goes positionally in the second
## argument, not the first.
#Filter(function(e) class(dat[[e]]) == "character", names(dat))
#Filter(is.character, dat) %>% names()
#dat %>% {Filter(is.character, .)} %>% names()
dat %>% Filter(is.character, .) %>% names()
## [1] "Plant" "Type" "Treatment"
## A tricksy way using catcolwise from the plyr package
charnames <- catcolwise(identity)(dat) %>% names()
charnames
## [1] "Plant" "Type" "Treatment"
Now that we have the vector of strings of column names that are for characters, let's construct our functions:
## Using strings
dat %>%
summarise_(.dots=sapply(charnames, function(e) sprintf("length(unique(%s))", e))) %>%
tbl_df()
## Source: local data frame [1 x 3]
##
## Plant Type Treatment
## <int> <int> <int>
## 1 12 2 2
## Using formulas
dat %>%
summarise_(.dots=sapply(charnames, function(e) lazyeval::interp(~ length(unique(x)), x=as.name(e)))) %>%
tbl_df()
## Source: local data frame [1 x 3]
##
## Plant Type Treatment
## <int> <int> <int>
## 1 12 2 2
So, SE is a bit more convoluted than NSE, but not too badly. It is a cost I'm willing to pay in exchange for dynamically discovering the columns I want to operate on.
One thing I like to do is to loop over a set of columns and do the same
analysis on them, much like the above. Here we are going to use the
plyr
functions to wrap dplyr
operations.
In this example, we will walk over the character columns of CO2
(or
dat
) and print out the unique values found there.
## Again, recreate dat dataset with character columns
dat <- CO2 %>% purrr::map_if(is.factor, as.character) %>% as_data_frame()
## Get the names of the character columns
charnames <- dat %>% Filter(is.character, .) %>% names()
## Use dplyr function to loop over these names
## and print results.
a_ply(charnames, 1, function(e) {
cat("\n----------")
cat("\nColumn: ", e)
cat("\n----------")
cat("\n")
## Select the particular column, get the unique values,
## and print them.
dat %>%
select_(.dots=e) %>%
distinct() %>%
print()
})
##
## ----------
## Column: Plant
## ----------
## Source: local data frame [12 x 1]
##
## Plant
## <chr>
## 1 Qn1
## 2 Qn2
## 3 Qn3
## 4 Qc1
## 5 Qc2
## 6 Qc3
## 7 Mn1
## 8 Mn2
## 9 Mn3
## 10 Mc1
## 11 Mc2
## 12 Mc3
##
## ----------
## Column: Type
## ----------
## Source: local data frame [2 x 1]
##
## Type
## <chr>
## 1 Quebec
## 2 Mississippi
##
## ----------
## Column: Treatment
## ----------
## Source: local data frame [2 x 1]
##
## Treatment
## <chr>
## 1 nonchilled
## 2 chilled
Notes
- The
plyr
functiona_ply
allows you to put more complex code in the last function call, calling it once for each value in thecharnames
array. - We need to use the SE version of the functions so that they can use
the string value passed in with the
e
argument.
Non-Standard Evaluation is fantastic for interactive analysis. A few modifications need to be made to make code to work in a non-interactive setting. Those modifications are not unreasonable, but some guiding templates are nice to have, and should encourage the user to learn about NSE vs. SE at the links above.
I hope these examples are useful to the reader. Questions, comments, additions, and corrections are all welcome.