mpettis/nse-se-examples.md

## nse-se-examples.md

      
    Raw
  

              nse-se-examples.md
            
          
Introduction

Standard Evaluation escape
hatches


Examples

Selecting columns
Creating summarized computed
columns
Non-dplyr examples with
formulas
Using dplyr when discovering columns in a data
set
Example: Using plyr functions to wrap dplyr
functions


Conclusion

Introduction

The dplyr package is a great toolset to do data manipulation and
exploration. It presents a well-crafted api that allows you to flow data
through a chain of computations that can reduce code complexity (and
cognitive overhead). It also reduces typing because it can use
non-standard evaluation (NSE), which means, in practical terms, that you
can refer to things like column names in a dataframe without having to
surround it in quotes, and make computations on those columns just as
you would without using dplyr.
NSE is great for interactive work. It is especially suited to times when
you know exactly which columns of a data frame you want to compute on.
For instance, let's look at the CO2 dataset, which comes with R:
CO2 %>% tbl_df()

## Source: local data frame [84 x 5]
## 
##     Plant   Type  Treatment  conc uptake
##    <fctr> <fctr>     <fctr> <dbl>  <dbl>
## 1     Qn1 Quebec nonchilled    95   16.0
## 2     Qn1 Quebec nonchilled   175   30.4
## 3     Qn1 Quebec nonchilled   250   34.8
## 4     Qn1 Quebec nonchilled   350   37.2
## 5     Qn1 Quebec nonchilled   500   35.3
## 6     Qn1 Quebec nonchilled   675   39.2
## 7     Qn1 Quebec nonchilled  1000   39.7
## 8     Qn2 Quebec nonchilled    95   13.6
## 9     Qn2 Quebec nonchilled   175   27.3
## 10    Qn2 Quebec nonchilled   250   37.1
## ..    ...    ...        ...   ...    ...

If you want to compute the mean of the conc field with dplyr
interactively, that is very straightforward:
CO2 %>%
  summarise(metric_mean = mean(conc))

##   metric_mean
## 1         435

However, what happens when you want to configure or dynamically discover
the columns you want to compute the mean on, or loop over all numeric
columns of a dataframe, and compute the means of all of the numeric
columns? That isn't straightforward with dplyr as we know it so far:
  ## Configure the metric you want to compute on
target_metric <- "conc"

  ## Won't work: tries to find column in CO2 called 'target_metric',
  ## doesn't resolve 'target_metric' to the column name 'conc' and
  ## try to take the mean of the 'conc' column.
tryCatch({
  CO2 %>%
    summarise(metric_mean = mean(target_metric))
  }, error=function(e) print(e))

## Warning in mean.default("conc"): argument is not numeric or logical: returning NA

##   metric_mean
## 1          NA

That is the problem we address here -- how to get that second, dynamic
example to work with dplyr.
This solution depends on the differences between Non-Standard Evaluation
(NSE) and Standard Evaluation (SE). You can read about NSE and SE at the
following links:

https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
http://adv-r.had.co.nz/Computing-on-the-language.html

Those links are for more comprehensive understandings of what is going
on, and I leave it to them to do the explaining with that. Here, we will
try to understand as much as we can, but focus on finding forms and
templates that will get the job done. To that end, we will be giving
mostly a set of examples that can serve as templates, with explanations
being sparse and deferred to the links above.
Standard Evaluation escape hatches

This problem is well understood by the dplyr author (Hadley Wickham),
and is addressed at the links above. He frames his solution in terms of
having a NSE function and a corresponding SE function. The NSE functions
are the ones that are the common ones, used in interactive analysis
(like select(), mutate(), summarise()), and each of those
functions has a corresponding SE escape hatch that is a function with
the same name, but with a trailing underscore (_) at the end of the
name (which are select_(), mutate_(), and summarise_(),
respectively, for the examples above).
The nice thing about this is that it is pretty straightforward to
translate a NSE statement into a SE one by changing the NSE function to
it's corresponding SE one (by adding the _) and modifying the function
arguments slightly.
To the examples...
Examples

Selecting columns

The key thing to observe when translating from NSE to SE is moving from
select() to select_() and changing putting in bare column names to
passing a vector of string column names to the .dots parameter of
select_().
NSE:
select(CO2, Plant, Type) %>% tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

SE:
target_cols <- c("Plant", "Type")
select_(CO2, .dots=target_cols) %>% tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

I prefer to use the magrittr piping style of using dplyr, as do many
others, and will rewrite these examples in that style, and do future
examples in this style.
  ## NSE
CO2 %>%
  select(Plant, Type) %>%
  tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

  ## SE:
target_cols <- c("Plant", "Type")
CO2 %>%
  select_(.dots=target_cols) %>%
  tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

Creating summarized computed columns

Here we want to compute some summarized columns. I'll show the NSE way,
and then a few different possible SE approaches.
NSE way:
CO2 %>%
  summarise(metric_mean = mean(conc)
            , metrix_max = max(uptake)) %>%
  tbl_df()

## Source: local data frame [1 x 2]
## 
##   metric_mean metrix_max
##         <dbl>      <dbl>
## 1         435       45.5

Some things to note:

summarise() computes two separate summary columns: metric_mean
and metric_max.
The computations are all bare code, not explicitly quoted or escaped
in any way. This is what reduces typing.

SE solution 1: creating code as a string to be evaluated.
mean_col_to_compute <- "conc"
max_col_to_compute <- "uptake"

CO2 %>%
  summarise_(.dots=c(metric_mean = sprintf("mean(%s)", mean_col_to_compute)
            , metrix_max = sprintf("max(%s)", max_col_to_compute))) %>%
  tbl_df()

## Source: local data frame [1 x 2]
## 
##   metric_mean metrix_max
##         <dbl>      <dbl>
## 1         435       45.5

Note:

We use sprintf() to construct a string that has the R expression
that ends up getting evaluated.
The argument to .dots is a named vector, which mixes bare names
(metric_mean) and a string containing the code to evaluate.

Evaluating a string gives very good flexibility in constructing
customizible functions to compute.
SE solution 2: Using formulas
A second approach is to use formulas rather than evaluated strings. One
of the advantages to this approach is that there is that formulas can
take an environment, and can leverage environment hierarchies to get at
variable values in complex situations. I don't leverage this (yet), but
it is good to be aware of:
mean_col_to_compute <- "conc"
max_col_to_compute <- "uptake"

dots <- setNames(
    # Formulas to compute
  list(lazyeval::interp(~ mean(x), x=as.name(mean_col_to_compute))
       , lazyeval::interp(~ max(x), x=as.name(max_col_to_compute)))
    # names to assign to formula computations
  , c("metric_mean", "metric_max"))


  ## Do the computation
CO2 %>%
  summarise_(.dots=dots) %>%
  tbl_df()

## Source: local data frame [1 x 2]
## 
##   metric_mean metric_max
##         <dbl>      <dbl>
## 1         435       45.5

Notes:

We create dots as it's own object, and use that object to pass
directly to the .dots argument of summarise_().
We create dots as a named list, rather than a named vector, as in
the preivous string example.
The key function is lazyeval::interp(), and you need to install
the lazyeval package. This is where the specific construction of
the formula (function) happens.

Non-dplyr examples with formulas

dplyr isn't the only package that deals with formulas. You may want to
construct formulas for other functions. For a reason I don't yet know,
you cannot always use lazyeval::interp(), but you can use a different
function from a different package (see below).
NSE-like example (using a formula):
aggregate( uptake ~ Type
          , data = CO2
          , mean)

##          Type   uptake
## 1      Quebec 33.54286
## 2 Mississippi 20.88333

What if you want to dynamically construct the first formula?
tvar <- "uptake"
bvar <- "Type"
aggregate( eval(pryr::subs(tvar ~ bvar, list(tvar = as.name(tvar), bvar = as.name(bvar))))
          , data = CO2
          , mean)

##          Type   uptake
## 1      Quebec 33.54286
## 2 Mississippi 20.88333

Notes:

pryr::subs() is what we use instead of lazyeval::interp(). I
cannot yet explain yet why the latter won't work.
You need to eval the return value of subs(), because subs()
returns a quoted value. The quote()/eval() semantics takes some
work to get to know when you need a quoted vs. an
evaluated expression. The first two links help you start to get a
feel for this.

Using dplyr when discovering columns in a data set

One common use case I have is that I get a data set with a lot of
columns, and I'd like to repeat a type of analysis on each column of a
certain type, such as all of the character columns, or all of the
numeric columns. We can use SE to repeat analysis when we don't know the
column names ahead of time.
For example, let's count the unique values in the character columns of
the provided data set...
One caveat: CO2 has factors, and I prefer to work with character
vectors, not factors in general, so we will convert the factor columns
to character cols and store it as dat. Here, I use the purrr
package, another one by Hadley Wickham, and it is great for getting
tools into R for better functional programming. See:
https://github.com/hadley/purrr .
  ## http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
dat <- CO2 %>% purrr::map_if(is.factor, as.character) %>% as_data_frame()

  ## Observe different types of columns before and after transform.
  ## CO2 has factors, while dat has character columns.
CO2 %>% tbl_df()

## Source: local data frame [84 x 5]
## 
##     Plant   Type  Treatment  conc uptake
##    <fctr> <fctr>     <fctr> <dbl>  <dbl>
## 1     Qn1 Quebec nonchilled    95   16.0
## 2     Qn1 Quebec nonchilled   175   30.4
## 3     Qn1 Quebec nonchilled   250   34.8
## 4     Qn1 Quebec nonchilled   350   37.2
## 5     Qn1 Quebec nonchilled   500   35.3
## 6     Qn1 Quebec nonchilled   675   39.2
## 7     Qn1 Quebec nonchilled  1000   39.7
## 8     Qn2 Quebec nonchilled    95   13.6
## 9     Qn2 Quebec nonchilled   175   27.3
## 10    Qn2 Quebec nonchilled   250   37.1
## ..    ...    ...        ...   ...    ...

dat %>% tbl_df()

## Source: local data frame [84 x 5]
## 
##    Plant   Type  Treatment  conc uptake
##    <chr>  <chr>      <chr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## ..   ...    ...        ...   ...    ...

Now, let's process each of the character columns and compute how many
unique values there are:
First, the NSE way.
dat %>%
  summarise(nuniq_Plant = Plant %>% unique() %>% length()
            , nuniq_Type = Type %>% unique() %>% length()
            , nuniq_Treatment = Treatment %>% unique() %>% length())

## Source: local data frame [1 x 3]
## 
##   nuniq_Plant nuniq_Type nuniq_Treatment
##         <int>      <int>           <int>
## 1          12          2               2

The obvious drawback here is that I had to already know that Plant,
Type, and Treatment were the character columns. I could discover
them though. For fun, let's demo some possible ways to do this:
  ## Classic, sapply and match on column class
dat[,sapply(dat, class) == "character"] %>% names()

## [1] "Plant"     "Type"      "Treatment"

#dat[,sapply(dat, is.character)] %>% names()

  ## More functional, use `Filter`, still base R
  ## This is my preferred approach, though the api of argument values
  ## is backwards from dplyr, in that the data goes positionally in the second
  ## argument, not the first.
#Filter(function(e) class(dat[[e]]) == "character", names(dat))
#Filter(is.character, dat) %>% names()
#dat %>% {Filter(is.character, .)} %>% names()
dat %>% Filter(is.character, .) %>% names()

## [1] "Plant"     "Type"      "Treatment"

  ## A tricksy way using catcolwise from the plyr package
charnames <- catcolwise(identity)(dat) %>% names()
charnames

## [1] "Plant"     "Type"      "Treatment"

Now that we have the vector of strings of column names that are for
characters, let's construct our functions:
  ## Using strings
dat %>%
  summarise_(.dots=sapply(charnames, function(e) sprintf("length(unique(%s))", e))) %>%
  tbl_df()

## Source: local data frame [1 x 3]
## 
##   Plant  Type Treatment
##   <int> <int>     <int>
## 1    12     2         2

  ## Using formulas
dat %>%
  summarise_(.dots=sapply(charnames, function(e) lazyeval::interp(~ length(unique(x)), x=as.name(e)))) %>%
  tbl_df()

## Source: local data frame [1 x 3]
## 
##   Plant  Type Treatment
##   <int> <int>     <int>
## 1    12     2         2

So, SE is a bit more convoluted than NSE, but not too badly. It is a
cost I'm willing to pay in exchange for dynamically discovering the
columns I want to operate on.
Example: Using plyr functions to wrap dplyr functions

One thing I like to do is to loop over a set of columns and do the same
analysis on them, much like the above. Here we are going to use the
plyr functions to wrap dplyr operations.
In this example, we will walk over the character columns of CO2 (or
dat) and print out the unique values found there.
  ## Again, recreate dat dataset with character columns
dat <- CO2 %>% purrr::map_if(is.factor, as.character) %>% as_data_frame()

  ## Get the names of the character columns
charnames <- dat %>% Filter(is.character, .) %>% names()

  ## Use dplyr function to loop over these names
  ## and print results.
a_ply(charnames, 1, function(e) {
  cat("\n----------")
  cat("\nColumn: ", e)
  cat("\n----------")
  cat("\n")

    ## Select the particular column, get the unique values,
    ## and print them.
  dat %>%
    select_(.dots=e) %>%
    distinct() %>%
    print()
})

## 
## ----------
## Column:  Plant
## ----------
## Source: local data frame [12 x 1]
## 
##    Plant
##    <chr>
## 1    Qn1
## 2    Qn2
## 3    Qn3
## 4    Qc1
## 5    Qc2
## 6    Qc3
## 7    Mn1
## 8    Mn2
## 9    Mn3
## 10   Mc1
## 11   Mc2
## 12   Mc3
## 
## ----------
## Column:  Type
## ----------
## Source: local data frame [2 x 1]
## 
##          Type
##         <chr>
## 1      Quebec
## 2 Mississippi
## 
## ----------
## Column:  Treatment
## ----------
## Source: local data frame [2 x 1]
## 
##    Treatment
##        <chr>
## 1 nonchilled
## 2    chilled

Notes

The plyr function a_ply allows you to put more complex code in
the last function call, calling it once for each value in the
charnames array.
We need to use the SE version of the functions so that they can use
the string value passed in with the e argument.

Conclusion

Non-Standard Evaluation is fantastic for interactive analysis. A few
modifications need to be made to make code to work in a non-interactive
setting. Those modifications are not unreasonable, but some guiding
templates are nice to have, and should encourage the user to learn about
NSE vs. SE at the links above.
I hope these examples are useful to the reader. Questions, comments,
additions, and corrections are all welcome.