bobaekang/creating-custom-functions-for-data-manipulation-r.md

## creating-custom-functions-for-data-manipulation-r.md

      
Summary

Creating custom functions to streamline a series of operations involving tidyverse functions.

Main

R function basics

One of the key skills in R programming involves creating custom functions to encapsulate complicated and repeated tasks or operations. Custom functions with clear and descriptive names enable us to write our R code in a concise and intuitive fashion.
The basics of defining a function in R are as follows:

Use function() to define a new function
Write the function body consisting of R expressions to be evaluated
Assign the resulting function to a name to make it easier to use

The last step is optional, and we can create an anonymous function without a binding. In practice, creating a new function looks like the following:
# single-line function body
add_one <- function(x) x + 1

# multi-line function body
add_one <- function(x) {
  y <- x + 1
  y
}
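As a small illustration of the optional last step, here is a minimal sketch of an anonymous function, assuming only base R. The vector values and the name incremented are made up for the example.

```r
# anonymous function passed directly to sapply(); it is never bound to a name
incremented <- sapply(c(1, 2, 3), function(x) x + 1)
incremented
#> [1] 2 3 4

# an anonymous function can also be called immediately after it is defined
(function(x) x + 1)(10)
#> [1] 11
```

Anonymous functions like these are handy for one-off operations, while anything reused more than once usually deserves a name.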
Using well-defined (pun intended!) R functions with carefully chosen names, we can organize our code in a way that the code as a whole clearly and succinctly expresses our intent.

Custom function and functional sequence

When we are working with tabular data, we are likely using dplyr, a popular R package for intuitive data manipulation, perhaps along with other tidyverse packages. And if we are working with dplyr (and tidyverse in general), we are likely using the magrittr/dplyr pipe operator, %>%, to chain a series of operations we want to apply to our data object, which makes the whole process easier to follow and reason about.
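To make the benefit of chaining concrete, the following sketch compares a nested call with the equivalent piped version. It assumes the magrittr package is installed; the vector values and the names nested and piped are chosen only for illustration.

```r
library(magrittr)

values <- c(1, 4, 9)

# nested calls are read from the inside out
nested <- round(sqrt(sum(values)), 1)

# piped calls are read from left to right, in the order they are applied
piped <- values %>% sum() %>% sqrt() %>% round(1)

identical(nested, piped)
#> [1] TRUE
```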
Embracing the words of wisdom we found just a section above ("use functions to organize your code"), we would like to wrap a certain routine into an R function, which might look like this:
# create a data frame to play with
df <- data.frame(x = c(1, NA, 3, 4), y = 5:8)
df                                           
#>    x y
#> 1  1 5
#> 2 NA 6
#> 3  3 7
#> 4  4 8

# install.packages("dplyr")
library(dplyr)

apply_routine <- function(df) {
  df %>%
    filter(!is.na(x)) %>%
    mutate(z = x + y) %>%
    select(x, z)
}
That is a perfectly reasonable choice. However, if the function 1) takes only a single argument and 2) consists solely of a series of operations on it, then there is an alternative syntax to encapsulate such a "pipeline": a "functional sequence". For example, the same series of operations in apply_routine() can be re-implemented as the following functional sequence:
apply_routine_fseq <- . %>%
    filter(!is.na(x)) %>%
    mutate(z = x + y) %>%
    select(x, z)
A functional sequence must start with a special placeholder, ., followed by a "piped" chain of operations.
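Functional sequences are not limited to data frames. As a minimal sketch, assuming the magrittr package is attached, the example below builds one over a numeric vector and inspects its steps with magrittr's functions() helper. The name sum_of_roots is hypothetical.

```r
library(magrittr)

# a functional sequence: starts with the placeholder, then a chain of steps
sum_of_roots <- . %>% sqrt() %>% sum()

sum_of_roots(c(1, 4, 9))
#> [1] 6

# functions() returns the individual steps stored in the sequence
length(functions(sum_of_roots))
#> [1] 2
```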
While apply_routine() and apply_routine_fseq() are of different classes, in effect they do exactly the same thing: they apply the same series of operations to a data.frame object. See below:
class(apply_routine)
#> [1] "function"

class(apply_routine_fseq)
#> [1] "fseq"     "function"

apply_routine(df)
#>   x  z
#> 1 1  6
#> 2 3 10
#> 3 4 12

apply_routine_fseq(df)
#>   x  z
#> 1 1  6
#> 2 3 10
#> 3 4 12

Tidy evaluation and flexible pipelines

Sometimes, we want to encapsulate a more flexible pipeline using tidyverse functions, which cannot (easily) be implemented as a functional sequence.
Consider the following function:
apply_routine_flex <- function(df, col1, col2, col3) {
  df %>%
    filter(!is.na(col1)) %>%
    mutate(col3 = col1 + col2) %>%
    select(col1, col3)
}
Both apply_routine_flex() and apply_routine() offer basically the same pipeline, but the former comes with greater flexibility. That is, apply_routine_flex() allows us to choose specific column names to use. If we try to use apply_routine_flex(), however, it fails:
apply_routine_flex(df, x, y, z)                       
#> Error in filter_impl(.data, quo): Evaluation error: object 'x' not found.
Why? Because dplyr functions, as well as some other tidyverse functions, rely on what is called "tidy evaluation", tidyverse's version of non-standard evaluation. In fact, tidy evaluation is what enables us to pass raw symbols to dplyr functions as arguments. Tidy evaluation is a somewhat advanced topic, and explaining how it works is beyond the scope of this writing. However, the key point here is that the function arguments meant for column names (col1, col2, and col3) are not in the format expected by the dplyr functions when the dplyr functions within apply_routine_flex() receive them.
More technically, this particular error has to do with R trying to 1) evaluate the function arguments as the dplyr function (filter) in the function body accesses them and 2) subsequently failing to do so because R cannot find the first evaluated argument, x, in the search path. However, even if all evaluated function arguments existed somewhere in the search path, filter() would either 1) still complain because they are not proper expressions filter() expects as its own arguments or 2) return an unexpected output without throwing an error.
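The first half of that explanation can be reproduced in base R without dplyr. The sketch below uses two hypothetical helpers, use_arg() and capture_arg(), and a deliberately undefined name, no_such_column, to contrast evaluating an argument with merely capturing its expression.

```r
# a function that uses its argument forces R to evaluate it
use_arg <- function(arg) arg

# a function that only captures the argument's expression never evaluates it
capture_arg <- function(arg) substitute(arg)

# evaluating fails because no object of that name exists in the search path
msg <- tryCatch(
  use_arg(no_such_column),
  error = function(e) conditionMessage(e)
)
msg
#> [1] "object 'no_such_column' not found"

# capturing succeeds and returns the bare symbol
captured <- capture_arg(no_such_column)
class(captured)
#> [1] "name"
```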
To help the dplyr functions within our custom function to receive the function arguments as intended, we can use utility functions provided by the rlang package. One implementation of apply_routine_flex(), now called apply_routine_tidy(), is as follows:
# install.packages("rlang")
library(rlang)

apply_routine_tidy <- function(df, col1, col2, col3) {
  col1 <- enquo(col1)
  col2 <- enquo(col2)
  col3 <- enquo(col3)
  col3_name <- quo_name(col3)
  
  df %>%
    filter(!is.na(!!col1)) %>%
    mutate(!!col3_name := !!col1 + !!col2) %>%
    select(!!col1, !!col3)
}
Here, roughly speaking, enquo() captures ("quotes") each function argument as an unevaluated expression before R tries to evaluate it, and the !! operator (pronounced "bang-bang") unquotes the captured expression so that dplyr functions can use it as their own argument. If a function argument is to be used as a name, for instance for a newly created column within a dplyr function, we additionally convert the quoted expression to a string with quo_name(); if a function argument refers to an already existing column in the data object to be manipulated, enquo() alone will do. Last but not least, when the name on the left-hand side is supplied through !!, the := operator must be used instead of the normal = operator, because R's syntax does not allow an expression such as !!col3_name on the left of =.
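To see what enquo() and quo_name() produce, here is a small sketch assuming the rlang package is installed. The helper inspect_column() is hypothetical and exists only to expose the intermediate objects.

```r
library(rlang)

inspect_column <- function(col) {
  col <- enquo(col)              # capture the argument without evaluating it
  list(
    is_quosure = is_quosure(col),
    expr = quo_get_expr(col),    # the captured expression, here a bare symbol
    name = quo_name(col)         # the same expression converted to a string
  )
}

# x does not need to exist anywhere, because it is never evaluated
info <- inspect_column(x)
info$is_quosure
#> [1] TRUE
info$name
#> [1] "x"
```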
This is certainly a lot more to chew on, but the reward can be well worth the effort: a flexible and programmable pipeline with dplyr functions!
# original use
apply_routine_tidy(df, x, y, z)                       
#>   x  z
#> 1 1  6
#> 2 3 10
#> 3 4 12

# being flexible and trying something different
apply_routine_tidy(df, x, x, x2)                       
#>   x x2
#> 1 1  2
#> 2 3  6
#> 3 4  8
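Newer releases of rlang (0.4.0 and later) also provide the "embrace" operator, {{ }}, which combines the enquo() and !! steps into one. The following is a sketch of the same pipeline in that style, assuming dplyr 1.0.0 or later for the "{{ col3 }}" := naming syntax. The name apply_routine_embrace() is hypothetical.

```r
library(dplyr)

apply_routine_embrace <- function(df, col1, col2, col3) {
  df %>%
    filter(!is.na({{ col1 }})) %>%
    # the quoted "{{ col3 }}" on the left-hand side names the new column
    mutate("{{ col3 }}" := {{ col1 }} + {{ col2 }}) %>%
    select({{ col1 }}, {{ col3 }})
}

df <- data.frame(x = c(1, NA, 3, 4), y = 5:8)
result <- apply_routine_embrace(df, x, y, z)
result
#>   x  z
#> 1 1  6
#> 2 3 10
#> 3 4 12
```

Whether to prefer this form over the explicit enquo() and !! pair is largely a matter of taste and of which package versions you need to support.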
Lastly, please note that there are alternatives to the rlang syntax for handling dplyr's tidy evaluation. One such alternative is provided by the wrapr::let() function (originally replyr::let()). Implementing apply_routine_flex() with wrapr::let() might look like the following:
# install.packages("wrapr")

apply_routine_let <- function(df, col1, col2, col3) {
  params <- c(
    col1 = substitute(col1),
    col2 = substitute(col2),
    col3 = substitute(col3)
  )

  wrapr::let(
    params,
    df %>%
      filter(!is.na(col1)) %>%
      mutate(col3 = col1 + col2) %>%
      select(col1, col3)
  )
}
To learn more about using wrapr::let(), see "Let" by Nina Zumel and John Mount and "Using replyr::let to Parameterize dplyr Expressions" by Nina Zumel, both listed under Resources.

TL;DR


Use custom functions to organize your code into reusable pipelines
Learn basics of tidy evaluation to create flexible functions with dplyr

Resources

On R function basics:

"R functions" section in the R basics module from my ICJIA R Workshop.
Chapter 19 "Functions" in R for Data Science by Garrett Grolemund and Hadley Wickham

On functional sequence:

magrittr website

On tidy evaluation:

Chapters 18 to 20 in Advanced R by Hadley Wickham.
"Let" by Nina Zumel, John Mount.
"Programming with dplyr" on dplyr website.
"Tidy evaluation" on rlang website.
"Tidy eval: Programming with dplyr, tidyr, and ggplot2" (video) by Hadley Wickham at rstudio::conf 2018.
"Using replyr::let to Parameterize dplyr Expressions" by Nina Zumel