Skip to content

Instantly share code, notes, and snippets.

@mpettis
Created July 29, 2016 03:00
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mpettis/5fc5e2d2d7d54291e047e59fb5baf1ea to your computer and use it in GitHub Desktop.
Save mpettis/5fc5e2d2d7d54291e047e59fb5baf1ea to your computer and use it in GitHub Desktop.
nse-se-examples

Introduction

The dplyr package is a great toolset to do data manipulation and exploration. It presents a well-crafted api that allows you to flow data through a chain of computations that can reduce code complexity (and cognitive overhead). It also reduces typing because it can use non-standard evaluation (NSE), which means, in practical terms, that you can refer to things like column names in a dataframe without having to surround it in quotes, and make computations on those columns just as you would without using dplyr.

NSE is great for interactive work. It is especially suited to times when you know exactly which columns of a data frame you want to compute on. For instance, let's look at the CO2 dataset, which comes with R:

CO2 %>% tbl_df()

## Source: local data frame [84 x 5]
## 
##     Plant   Type  Treatment  conc uptake
##    <fctr> <fctr>     <fctr> <dbl>  <dbl>
## 1     Qn1 Quebec nonchilled    95   16.0
## 2     Qn1 Quebec nonchilled   175   30.4
## 3     Qn1 Quebec nonchilled   250   34.8
## 4     Qn1 Quebec nonchilled   350   37.2
## 5     Qn1 Quebec nonchilled   500   35.3
## 6     Qn1 Quebec nonchilled   675   39.2
## 7     Qn1 Quebec nonchilled  1000   39.7
## 8     Qn2 Quebec nonchilled    95   13.6
## 9     Qn2 Quebec nonchilled   175   27.3
## 10    Qn2 Quebec nonchilled   250   37.1
## ..    ...    ...        ...   ...    ...

If you want to compute the mean of the conc field with dplyr interactively, that is very straightforward:

CO2 %>%
  summarise(metric_mean = mean(conc))

##   metric_mean
## 1         435

However, what happens when you want to configure or dynamically discover the columns you want to compute the mean on, or loop over all numeric columns of a dataframe, and compute the means of all of the numeric columns? That isn't straightforward with dplyr as we know it so far:

  ## Configure the metric you want to compute on
target_metric <- "conc"

  ## Won't work: tries to find column in CO2 called 'target_metric',
  ## doesn't resolve 'target_metric' to the column name 'conc' and
  ## try to take the mean of the 'conc' column.
tryCatch({
  CO2 %>%
    summarise(metric_mean = mean(target_metric))
  }, error=function(e) print(e))

## Warning in mean.default("conc"): argument is not numeric or logical: returning NA

##   metric_mean
## 1          NA

That is the problem we address here -- how to get that second, dynamic example to work with dplyr.

This solution depends on the differences between Non-Standard Evaluation (NSE) and Standard Evaluation (SE). You can read about NSE and SE at the following links:

Those links are for more comprehensive understandings of what is going on, and I leave it to them to do the explaining with that. Here, we will try to understand as much as we can, but focus on finding forms and templates that will get the job done. To that end, we will be giving mostly a set of examples that can serve as templates, with explanations being sparse and deferred to the links above.

Standard Evaluation escape hatches

This problem is well understood by the dplyr author (Hadley Wickham), and is addressed at the links above. He frames his solution in terms of having a NSE function and a corresponding SE function. The NSE functions are the ones that are the common ones, used in interactive analysis (like select(), mutate(), summarise()), and each of those functions has a corresponding SE escape hatch that is a function with the same name, but with a trailing underscore (_) at the end of the name (which are select_(), mutate_(), and summarise_(), respectively, for the examples above).

The nice thing about this is that it is pretty straightforward to translate a NSE statement into a SE one by changing the NSE function to it's corresponding SE one (by adding the _) and modifying the function arguments slightly.

To the examples...

Examples

Selecting columns

The key thing to observe when translating from NSE to SE is moving from select() to select_() and changing putting in bare column names to passing a vector of string column names to the .dots parameter of select_().

NSE:

select(CO2, Plant, Type) %>% tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

SE:

target_cols <- c("Plant", "Type")
select_(CO2, .dots=target_cols) %>% tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

I prefer to use the magrittr piping style of using dplyr, as do many others, and will rewrite these examples in that style, and do future examples in this style.

  ## NSE
CO2 %>%
  select(Plant, Type) %>%
  tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

  ## SE:
target_cols <- c("Plant", "Type")
CO2 %>%
  select_(.dots=target_cols) %>%
  tbl_df()

## Source: local data frame [84 x 2]
## 
##     Plant   Type
##    <fctr> <fctr>
## 1     Qn1 Quebec
## 2     Qn1 Quebec
## 3     Qn1 Quebec
## 4     Qn1 Quebec
## 5     Qn1 Quebec
## 6     Qn1 Quebec
## 7     Qn1 Quebec
## 8     Qn2 Quebec
## 9     Qn2 Quebec
## 10    Qn2 Quebec
## ..    ...    ...

Creating summarized computed columns

Here we want to compute some summarized columns. I'll show the NSE way, and then a few different possible SE approaches.

NSE way:

CO2 %>%
  summarise(metric_mean = mean(conc)
            , metrix_max = max(uptake)) %>%
  tbl_df()

## Source: local data frame [1 x 2]
## 
##   metric_mean metrix_max
##         <dbl>      <dbl>
## 1         435       45.5

Some things to note:

  • summarise() computes two separate summary columns: metric_mean and metric_max.
  • The computations are all bare code, not explicitly quoted or escaped in any way. This is what reduces typing.

SE solution 1: creating code as a string to be evaluated.

mean_col_to_compute <- "conc"
max_col_to_compute <- "uptake"

CO2 %>%
  summarise_(.dots=c(metric_mean = sprintf("mean(%s)", mean_col_to_compute)
            , metrix_max = sprintf("max(%s)", max_col_to_compute))) %>%
  tbl_df()

## Source: local data frame [1 x 2]
## 
##   metric_mean metrix_max
##         <dbl>      <dbl>
## 1         435       45.5

Note:

  • We use sprintf() to construct a string that has the R expression that ends up getting evaluated.
  • The argument to .dots is a named vector, which mixes bare names (metric_mean) and a string containing the code to evaluate.

Evaluating a string gives very good flexibility in constructing customizible functions to compute.

SE solution 2: Using formulas

A second approach is to use formulas rather than evaluated strings. One of the advantages to this approach is that there is that formulas can take an environment, and can leverage environment hierarchies to get at variable values in complex situations. I don't leverage this (yet), but it is good to be aware of:

mean_col_to_compute <- "conc"
max_col_to_compute <- "uptake"

dots <- setNames(
    # Formulas to compute
  list(lazyeval::interp(~ mean(x), x=as.name(mean_col_to_compute))
       , lazyeval::interp(~ max(x), x=as.name(max_col_to_compute)))
    # names to assign to formula computations
  , c("metric_mean", "metric_max"))


  ## Do the computation
CO2 %>%
  summarise_(.dots=dots) %>%
  tbl_df()

## Source: local data frame [1 x 2]
## 
##   metric_mean metric_max
##         <dbl>      <dbl>
## 1         435       45.5

Notes:

  • We create dots as it's own object, and use that object to pass directly to the .dots argument of summarise_().
  • We create dots as a named list, rather than a named vector, as in the preivous string example.
  • The key function is lazyeval::interp(), and you need to install the lazyeval package. This is where the specific construction of the formula (function) happens.

Non-dplyr examples with formulas

dplyr isn't the only package that deals with formulas. You may want to construct formulas for other functions. For a reason I don't yet know, you cannot always use lazyeval::interp(), but you can use a different function from a different package (see below).

NSE-like example (using a formula):

aggregate( uptake ~ Type
          , data = CO2
          , mean)

##          Type   uptake
## 1      Quebec 33.54286
## 2 Mississippi 20.88333

What if you want to dynamically construct the first formula?

tvar <- "uptake"
bvar <- "Type"
aggregate( eval(pryr::subs(tvar ~ bvar, list(tvar = as.name(tvar), bvar = as.name(bvar))))
          , data = CO2
          , mean)

##          Type   uptake
## 1      Quebec 33.54286
## 2 Mississippi 20.88333

Notes:

  • pryr::subs() is what we use instead of lazyeval::interp(). I cannot yet explain yet why the latter won't work.
  • You need to eval the return value of subs(), because subs() returns a quoted value. The quote()/eval() semantics takes some work to get to know when you need a quoted vs. an evaluated expression. The first two links help you start to get a feel for this.

Using dplyr when discovering columns in a data set

One common use case I have is that I get a data set with a lot of columns, and I'd like to repeat a type of analysis on each column of a certain type, such as all of the character columns, or all of the numeric columns. We can use SE to repeat analysis when we don't know the column names ahead of time.

For example, let's count the unique values in the character columns of the provided data set...

One caveat: CO2 has factors, and I prefer to work with character vectors, not factors in general, so we will convert the factor columns to character cols and store it as dat. Here, I use the purrr package, another one by Hadley Wickham, and it is great for getting tools into R for better functional programming. See: https://github.com/hadley/purrr .

  ## http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
dat <- CO2 %>% purrr::map_if(is.factor, as.character) %>% as_data_frame()

  ## Observe different types of columns before and after transform.
  ## CO2 has factors, while dat has character columns.
CO2 %>% tbl_df()

## Source: local data frame [84 x 5]
## 
##     Plant   Type  Treatment  conc uptake
##    <fctr> <fctr>     <fctr> <dbl>  <dbl>
## 1     Qn1 Quebec nonchilled    95   16.0
## 2     Qn1 Quebec nonchilled   175   30.4
## 3     Qn1 Quebec nonchilled   250   34.8
## 4     Qn1 Quebec nonchilled   350   37.2
## 5     Qn1 Quebec nonchilled   500   35.3
## 6     Qn1 Quebec nonchilled   675   39.2
## 7     Qn1 Quebec nonchilled  1000   39.7
## 8     Qn2 Quebec nonchilled    95   13.6
## 9     Qn2 Quebec nonchilled   175   27.3
## 10    Qn2 Quebec nonchilled   250   37.1
## ..    ...    ...        ...   ...    ...

dat %>% tbl_df()

## Source: local data frame [84 x 5]
## 
##    Plant   Type  Treatment  conc uptake
##    <chr>  <chr>      <chr> <dbl>  <dbl>
## 1    Qn1 Quebec nonchilled    95   16.0
## 2    Qn1 Quebec nonchilled   175   30.4
## 3    Qn1 Quebec nonchilled   250   34.8
## 4    Qn1 Quebec nonchilled   350   37.2
## 5    Qn1 Quebec nonchilled   500   35.3
## 6    Qn1 Quebec nonchilled   675   39.2
## 7    Qn1 Quebec nonchilled  1000   39.7
## 8    Qn2 Quebec nonchilled    95   13.6
## 9    Qn2 Quebec nonchilled   175   27.3
## 10   Qn2 Quebec nonchilled   250   37.1
## ..   ...    ...        ...   ...    ...

Now, let's process each of the character columns and compute how many unique values there are:

First, the NSE way.

dat %>%
  summarise(nuniq_Plant = Plant %>% unique() %>% length()
            , nuniq_Type = Type %>% unique() %>% length()
            , nuniq_Treatment = Treatment %>% unique() %>% length())

## Source: local data frame [1 x 3]
## 
##   nuniq_Plant nuniq_Type nuniq_Treatment
##         <int>      <int>           <int>
## 1          12          2               2

The obvious drawback here is that I had to already know that Plant, Type, and Treatment were the character columns. I could discover them though. For fun, let's demo some possible ways to do this:

  ## Classic, sapply and match on column class
dat[,sapply(dat, class) == "character"] %>% names()

## [1] "Plant"     "Type"      "Treatment"

#dat[,sapply(dat, is.character)] %>% names()

  ## More functional, use `Filter`, still base R
  ## This is my preferred approach, though the api of argument values
  ## is backwards from dplyr, in that the data goes positionally in the second
  ## argument, not the first.
#Filter(function(e) class(dat[[e]]) == "character", names(dat))
#Filter(is.character, dat) %>% names()
#dat %>% {Filter(is.character, .)} %>% names()
dat %>% Filter(is.character, .) %>% names()

## [1] "Plant"     "Type"      "Treatment"

  ## A tricksy way using catcolwise from the plyr package
charnames <- catcolwise(identity)(dat) %>% names()
charnames

## [1] "Plant"     "Type"      "Treatment"

Now that we have the vector of strings of column names that are for characters, let's construct our functions:

  ## Using strings
dat %>%
  summarise_(.dots=sapply(charnames, function(e) sprintf("length(unique(%s))", e))) %>%
  tbl_df()

## Source: local data frame [1 x 3]
## 
##   Plant  Type Treatment
##   <int> <int>     <int>
## 1    12     2         2

  ## Using formulas
dat %>%
  summarise_(.dots=sapply(charnames, function(e) lazyeval::interp(~ length(unique(x)), x=as.name(e)))) %>%
  tbl_df()

## Source: local data frame [1 x 3]
## 
##   Plant  Type Treatment
##   <int> <int>     <int>
## 1    12     2         2

So, SE is a bit more convoluted than NSE, but not too badly. It is a cost I'm willing to pay in exchange for dynamically discovering the columns I want to operate on.

Example: Using plyr functions to wrap dplyr functions

One thing I like to do is to loop over a set of columns and do the same analysis on them, much like the above. Here we are going to use the plyr functions to wrap dplyr operations.

In this example, we will walk over the character columns of CO2 (or dat) and print out the unique values found there.

  ## Again, recreate dat dataset with character columns
dat <- CO2 %>% purrr::map_if(is.factor, as.character) %>% as_data_frame()

  ## Get the names of the character columns
charnames <- dat %>% Filter(is.character, .) %>% names()

  ## Use dplyr function to loop over these names
  ## and print results.
a_ply(charnames, 1, function(e) {
  cat("\n----------")
  cat("\nColumn: ", e)
  cat("\n----------")
  cat("\n")

    ## Select the particular column, get the unique values,
    ## and print them.
  dat %>%
    select_(.dots=e) %>%
    distinct() %>%
    print()
})

## 
## ----------
## Column:  Plant
## ----------
## Source: local data frame [12 x 1]
## 
##    Plant
##    <chr>
## 1    Qn1
## 2    Qn2
## 3    Qn3
## 4    Qc1
## 5    Qc2
## 6    Qc3
## 7    Mn1
## 8    Mn2
## 9    Mn3
## 10   Mc1
## 11   Mc2
## 12   Mc3
## 
## ----------
## Column:  Type
## ----------
## Source: local data frame [2 x 1]
## 
##          Type
##         <chr>
## 1      Quebec
## 2 Mississippi
## 
## ----------
## Column:  Treatment
## ----------
## Source: local data frame [2 x 1]
## 
##    Treatment
##        <chr>
## 1 nonchilled
## 2    chilled

Notes

  • The plyr function a_ply allows you to put more complex code in the last function call, calling it once for each value in the charnames array.
  • We need to use the SE version of the functions so that they can use the string value passed in with the e argument.

Conclusion

Non-Standard Evaluation is fantastic for interactive analysis. A few modifications need to be made to make code to work in a non-interactive setting. Those modifications are not unreasonable, but some guiding templates are nice to have, and should encourage the user to learn about NSE vs. SE at the links above.

I hope these examples are useful to the reader. Questions, comments, additions, and corrections are all welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment