Creating custom functions to streamline a series of operations involving tidyverse functions.
One of the key skills in R programming involves creating custom functions to encapsulate complicated and repeated tasks or operations. Custom functions with clear and descriptive names enable us to write our R code in a concise and intuitive fashion.
The basics of defining a function in R are as follows:
- Use function() to define a new function
- Write the function body consisting of R expressions to be evaluated
- Assign the resulting function to a name to make it easier to use
The last step is optional, and we can create an anonymous function without a binding. In practice, creating a new function looks like the following:
# single-line function body
add_one <- function(x) x + 1
# multi-line function body
add_one <- function(x) {
  y <- x + 1
  y
}
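As a small illustration of the optional last step, an anonymous function can be handed straight to another function, or called on the spot, without ever being bound to a name. The sketch below uses only base R; the input values are made up for illustration.

```r
# an anonymous function passed directly to sapply(), never assigned a name
incremented <- sapply(c(1, 2, 3), function(x) x + 1)

# the same anonymous function, called immediately after being defined
four <- (function(x) x + 1)(3)
```

Binding the function to a name, as with add_one above, mostly pays off when the same operation is needed in more than one place.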
Using well-defined (pun intended!) R functions with carefully chosen names, we can organize our code in a way that the code as a whole clearly and succinctly expresses our intent.
When we are working with tabular data, we are likely using dplyr, a popular R package for intuitive data manipulation, perhaps along with other tidyverse packages. Also, if we are working with dplyr (and tidyverse in general), we are likely using the magrittr/dplyr pipe operator, %>%, to chain a series of operations we want to apply to our data object in order to make the whole process easier to follow and reason about.
Embracing the words of wisdom from the section above ("use functions to organize your code"), we would like to wrap a certain routine into an R function, which might look like this:
# create a data frame to play with
df <- data.frame(x = c(1, NA, 3, 4), y = 5:8)
df
#> x y
#> 1 1 5
#> 2 NA 6
#> 3 3 7
#> 4 4 8
# install.packages("dplyr")
library(dplyr)
apply_routine <- function(df) {
  df %>%
    filter(!is.na(x)) %>%
    mutate(z = x + y) %>%
    select(x, z)
}
That is a perfectly reasonable choice. However, if the function 1) takes only a single argument and 2) consists solely of a series of operations on it, then there is an alternative syntax to encapsulate such a "pipeline": a "functional sequence". For example, the same series of operations in apply_routine() can be re-implemented as the following functional sequence:
apply_routine_fseq <- . %>%
  filter(!is.na(x)) %>%
  mutate(z = x + y) %>%
  select(x, z)
A functional sequence must start with a special placeholder, ., followed by a "piped" chain of operations.
While apply_routine() and apply_routine_fseq() are of different classes, in effect they do exactly the same thing: they apply the same series of operations to a data.frame object. See below:
class(apply_routine)
#> [1] "function"
class(apply_routine_fseq)
#> [1] "fseq" "function"
apply_routine(df)
#> x z
#> 1 1 6
#> 2 3 10
#> 3 4 12
apply_routine_fseq(df)
#> x z
#> 1 1 6
#> 2 3 10
#> 3 4 12
Sometimes, we want to encapsulate a more flexible pipeline using tidyverse functions, which cannot (easily) be implemented as a functional sequence.
Consider the following function:
apply_routine_flex <- function(df, col1, col2, col3) {
  df %>%
    filter(!is.na(col1)) %>%
    mutate(col3 = col1 + col2) %>%
    select(col1, col3)
}
Both apply_routine_flex() and apply_routine() offer basically the same pipeline, but the former comes with greater flexibility. That is, apply_routine_flex() allows us to choose specific column names to use. If we try to use apply_routine_flex(), however, it fails:
apply_routine_flex(df, x, y, z)
#> Error in filter_impl(.data, quo): Evaluation error: object 'x' not found.
Why? Because dplyr functions, as well as some other tidyverse functions, rely on what is called "tidy evaluation", tidyverse's version of non-standard evaluation. In fact, tidy evaluation is what enables us to pass raw symbols to dplyr functions as arguments. Tidy evaluation is a somewhat advanced topic, and explaining how it works is beyond the scope of this writing. However, the key point here is that the function arguments meant for column names (col1, col2, and col3) are not in the format expected by the dplyr functions when the dplyr functions within apply_routine_flex() receive them.
More technically, this particular error has to do with R trying to 1) evaluate the function arguments as the dplyr function (filter) in the function body accesses them and 2) subsequently failing to do so because R cannot find the first evaluated argument, x, in the search path. However, even if all evaluated function arguments existed somewhere in the search path, filter() would either 1) still complain because they are not proper expressions filter() expects as its own arguments or 2) return an unexpected output without throwing an error.
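To make the failure a little more concrete, here is a base R analogy. It is not how dplyr is implemented, only a sketch of the underlying idea. It contrasts an ordinary function, which evaluates its argument like any other variable, with one that captures the argument unevaluated via substitute() and then evaluates it with the data frame's columns in scope. The names use_value() and use_column() are hypothetical.

```r
df <- data.frame(x = c(1, NA, 3, 4), y = 5:8)

# standard evaluation: `col` is looked up like any other variable
use_value <- function(df, col) col

# capture the argument as an expression, then evaluate it with the
# data frame's columns in scope (roughly the idea behind non-standard evaluation)
use_column <- function(df, col) {
  expr <- substitute(col)
  eval(expr, df, parent.frame())
}

# fails as long as no object named `x` exists outside the data frame
std_result <- tryCatch(use_value(df, x), error = function(e) e)

# succeeds because `x` is resolved as a column of df
nse_result <- use_column(df, x)
```

If an object named x did happen to exist in the calling environment, use_value() would silently return that object instead, which mirrors the "unexpected output" case described above.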
To help the dplyr functions within our custom function receive the function arguments as intended, we can use utility functions provided by the rlang package. One implementation of apply_routine_flex(), now called apply_routine_tidy(), is as follows:
# install.packages("rlang")
library(rlang)
apply_routine_tidy <- function(df, col1, col2, col3) {
  col1 <- enquo(col1)
  col2 <- enquo(col2)
  col3 <- enquo(col3)
  col3_name <- quo_name(col3)
  df %>%
    filter(!is.na(!!col1)) %>%
    mutate(!!col3_name := !!col1 + !!col2) %>%
    select(!!col1, !!col3)
}
Here, crudely speaking, enquo() sort of wraps ("quotes") the function arguments before R tries to evaluate them, and the !! operator (pronounced "bang-bang") unwraps the "quoted" arguments so that dplyr functions can use them as their own arguments. If a function argument is to be used as a name, for instance, for a newly created column within a dplyr function, we additionally convert the "quoted" expression into a string by using quo_name(); if a function argument is supposed to be an already existing column in the data object to be manipulated, enquo() will do. Last but not least, the := operator, instead of the normal = operator, should be used for assigning some expression to a new column when the column name itself is unquoted with !!.
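The quoting and unquoting steps can also be tried outside of dplyr. The sketch below assumes only that the rlang package is installed. capture_arg() is a hypothetical helper, and the small data frames are made up for illustration.

```r
library(rlang)

# capture_arg() is a hypothetical helper that returns its argument "quoted"
capture_arg <- function(arg) enquo(arg)

q <- capture_arg(x + y)

# the quoted expression as text; nothing has been evaluated yet
expr_text <- as_label(q)

# evaluate the quoted expression with a data frame supplying x and y
value <- eval_tidy(q, data = data.frame(x = c(1, 3), y = c(5, 7)))

# !! inserts the quoted expression into a larger quoted expression,
# which is then evaluated against another data frame
keep <- eval_tidy(
  quo(!is.na(!!q)),
  data = data.frame(x = c(1, NA), y = c(5, 6))
)
```

Here expr_text holds the string "x + y", value holds the row-wise sums, and keep flags the rows where the sum is not missing, which is the same kind of logical vector filter() works with.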
This is certainly a lot more to chew on, but the reward can be well worth the effort: a flexible and programmable pipeline with dplyr functions!
# original use
apply_routine_tidy(df, x, y, z)
#> x z
#> 1 1 6
#> 2 3 10
#> 3 4 12
# being flexible and trying something different
apply_routine_tidy(df, x, x, x2)
#> x x2
#> 1 1 2
#> 2 3 6
#> 3 4 8
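For readers on more recent package versions, the same pipeline can be sketched with the "embrace" operator, {{ }}, which combines the quote and unquote steps into one. This is offered as an alternative spelling rather than a replacement for the approach above. It assumes dplyr 1.0.0 or later and rlang 0.4.3 or later, and the function name apply_routine_embrace() is hypothetical.

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 3, 4), y = 5:8)

# {{ col }} quotes and unquotes the argument in a single step;
# "{{ col3 }}" on the left-hand side of := uses the argument as the new column name
# (assumes dplyr >= 1.0.0 and rlang >= 0.4.3)
apply_routine_embrace <- function(df, col1, col2, col3) {
  df %>%
    filter(!is.na({{ col1 }})) %>%
    mutate("{{ col3 }}" := {{ col1 }} + {{ col2 }}) %>%
    select({{ col1 }}, {{ col3 }})
}

res_xyz <- apply_routine_embrace(df, x, y, z)
res_xx <- apply_routine_embrace(df, x, x, x2)
```

Under these assumptions, res_xyz and res_xx should contain the same columns and values as the two apply_routine_tidy() calls above.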
Lastly, please note that there are alternatives to the rlang syntax for handling dplyr's tidy evaluation. One such alternative is provided by the wrapr::let() function (originally replyr::let()). Implementing apply_routine_flex() with wrapr::let() might look like the following:
# install.packages("wrapr")
apply_routine_let <- function(df, col1, col2, col3) {
  params <- c(
    col1 = substitute(col1),
    col2 = substitute(col2),
    col3 = substitute(col3)
  )
  wrapr::let(
    params,
    df %>%
      filter(!is.na(col1)) %>%
      mutate(col3 = col1 + col2) %>%
      select(col1, col3)
  )
}
You can read this blog post and this article to learn more about using wrapr::let().
- Use custom functions to organize your code into reusable pipelines
- Learn the basics of tidy evaluation to create flexible functions with dplyr
On R function basics:
- "R functions" section in the R basics module from my ICJIA R Workshop.
- Chapter 19 "Functions" in R for Data Science by Garrett Grolemund and Hadley Wickham.
On functional sequence:
On tidy evaluation:
- Chapters 18 to 20 in Advanced R by Hadley Wickham.
- "Let" by Nina Zumel, John Mount.
- "Programming with dplyr" on the dplyr website.
- "Tidy evaluation" on the rlang website.
- "Tidy eval: Programming with dplyr, tidyr, and ggplot2" (video) by Hadley Wickham at rstudio::conf 2018.
- "Using replyr::let to Parameterize dplyr Expressions" by Nina Zumel