Skip to content

Instantly share code, notes, and snippets.

@dgrapov
Created April 28, 2015 23:21
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save dgrapov/18b1a50216fda9ccfec9 to your computer and use it in GitHub Desktop.
Save dgrapov/18b1a50216fda9ccfec9 to your computer and use it in GitHub Desktop.
hands on with dplyr
---
title: "Hands-on with dplyr"
author: "Dmitry Grapov"
output:
html_document:
keep_md: yes
---
## Introduction
This is meant to be an introduction to [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html) which covers dplyr basics, gets into a little bit of programming with dplyr and ends with brief mention of some gotchas and a benchmarking comparison to base for the split-apply strategy. You might also find [Rstudio](http://www.rstudio.com/)'s [Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) featuring dplyr useful (this is also where I borrowed some of the images used in this tutorial).
The [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html) package from [Hadley Wickham](http://had.co.nz/) is plain awesome. It features consistent and succinct syntax, is computationally fast and getting better with every release. The `dplyr` package has replaced many common more verbose R idioms which I had to previously rely upon for most common data analysis tasks.
For example, many data analysis tasks involve the procedure of splitting the data set based on a grouping variable and then applying a function to each of the groups (split-apply). Lets say I want to calculate the median values for a few parameters for cars with different numbers of cylinders using the [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) data set.
Set up the data for the example.
```{r,warning=FALSE,message=FALSE}
#some data prep
data(mtcars)
data<-mtcars
data$cyl<-factor(data$cyl)
```
Split-lapply-apply in `base`:
```{r,warning=FALSE,message=FALSE}
#select some variable of interest
vars<-c("mpg","wt","qsec")
tmp.data<-data[,colnames(data)%in%vars]
#split the data on the number of cylinders
big.l<-split(tmp.data,data$cyl)
#apply some function of interest to all columns
results<-lapply(big.l, function(x) apply(x,2,median))
#bind results and add splitting info
data.frame(cyl=names(results),do.call("rbind",results))
```
Now the same process using `dplyr`:
```{r,warning=FALSE,message=FALSE}
suppressPackageStartupMessages(library(dplyr))
#variables of interest
vars<-c("mpg","wt","qsec")
data %>% group_by(cyl) %>% select(one_of(vars)) %>% summarise_each(funs(median(.)))
```
Switching from `base` to `dplyr` for data manipulation feels a little like this:
`base` ![](images/BD_wasted.gif)
`dplyr`![](images/AD_wasted.gif)
Each of the individual `dplyr` verbs are discussed in more detail below, but the use of `%>%` or the pipe operator is worth mentioning now. The `%>%` operator is imported from `magrittr` and for the purpose of this tutorial we can simply think of it as `then`. From the cheatsheet referenced above:
![](images/pipe.png)
****
## Overview of common dplyr functions and verbs
I highly recommend that you take a look at the `dplyr` [vignetts](http://cran.r-project.org/web/packages/dplyr/index.html) for more detailed description of all of this packages capabilities.
One immediate addition in `dplyr` you might notice is `tbl_df` which is a local data frame and mostly behaves like the classical `data.frame` but is more convenient for working with large data.
```{r}
tbl_df(mtcars)
# control the number of rows
print(tbl_df(mtcars),n=5)
```
You can can make sure all columns are output to the screen using `options(dplyr.width = Inf)`.
`glimpse` is another useful function which is an analogue of `str` but tries to show you more of the data.
```{r}
str(mtcars)
glimpse(mtcars)
```
For the purpose of this tutorial we will be mostly working with data.frames, however it should be noted that dplyr syntax abstracts away the need to specify the kind of object being manipulated and most everything we will cover can also be applied to interact with a variety of database objects.
The most common `dplyr` functions also referred to as verbs are as follows (see more in the [introduction](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) vignette):
* filter() and slice()
* arrange()
* select() and rename()
* mutate() and transmute()
* summarise()
* group_by()
The following commands will be demonstrated using the [hflights](http://cran.r-project.org/web/packages/hflights/index.html) data set.
```{r}
suppressPackageStartupMessages(library(hflights))
(flights <- tbl_df(hflights))
```
Which contains `r nrow(flights)` records for `r ncol(flights)` variables for flights departing Houston airport for 2011.
### Filter
Use `filter` to keep or select rows matching some criteria or condition(s).
![](images/filter.png)
`base`
```{r}
flights[flights$Month == 1 & flights$DayofMonth == 1, ]
#could have also used subset
subset(flights, Month == 1 & DayofMonth == 1)
#be wary of using subset programmatically: http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset
```
`dplyr`
```{r}
#comma is the same as an ampersand (&)
filter(flights, Month == 1, DayofMonth == 1)
# use pipe(|) for or
filter(flights, Month == 1 | DayofMonth == 1)
```
We can also include any of the following operators in filter.
![](images/logic.png)
Slice is a variant of filter used to extract rows based on position.
`base`
```{r}
flights[1:10,]
```
`dplyr`
```{r}
slice(flights, 1:10)
```
****
## Arrange
Order data based on specified columns.
`base`
```{r}
flights[order(flights$Month),]
```
`dplyr`
```{r}
arrange(flights,Month)
#decreasing order
arrange(flights,desc(Month))
#break ties using more columns
arrange(flights,desc(Month),DayOfWeek)
```
****
### Select
Select columns from the data.
`base`
```{r}
flights[,colnames(flights)%in%c("Month","DayOfWeek")]
```
`dplyr`
```{r}
select(flights,Month,DayOfWeek)
#select using a dynamic variable
variables<-c("Month","DayOfWeek")
select(flights,one_of(variables))
#remove variables
select(flights,one_of(variables),-Month)
```
Select also provides many regular expression wrappers.
![](images/select.png)
Use `rename` to change column names.
```{r}
rename(flights,diverted=Diverted)
```
****
### Mutate and Transmute
Use theses verbs to create a new column variable which in the case of `mutate` will be added to the or created as a stand-alone variables `transmute`.
![](images/mutate_1.png)
Lets calculate the wait time based on the difference between `ArrTime` and `DepTime`.
`base`
```{r}
#transmute like
head(wait<-flights$ArrTime - flights$DepTime)
#mutate like
head(flights2<-cbind(flights,wait))
```
`dplyr`
```{r}
#stand alone
transmute(flights,diff = ArrTime - DepTime)
#added to data
mutate(flights,diff = ArrTime - DepTime)
```
Many dplyr functions will let you use newly create variables in the same function which is creating the variable in the first place.
![](images/inception.png)
```{r}
transmute(flights,diff = ArrTime - DepTime, ratio = ArrTime/diff, ratio2 = diff/ratio)
```
The `mutate_each` function can be used to apply a function to every column in the dataframe. Lets bin each column into quartiles using the `ntile` function.
```{r}
glimpse(mutate_each(flights,funs(ntile(.,n=4))))
```
Here is a list of some other `dplyr` convenience functions.
![](images/mutate_2.png)
****
### Summarise
Carry out a function on the data frame returning a single value.
![](images/summarise_1.png)
Lets calculate the range for wait times and use this as an opportunity to involve `%>%`.
```{r}
transmute(flights,diff = ArrTime - DepTime) %>% summarise(.,min=min(diff,na.rm=TRUE),max=max(diff,na.rm=TRUE))
```
Either we just discovered time travel or we should have been referencing the day and time in our calculation of the difference between arrival and departure. Lets count how many calculations we may have screwed up.
```{r}
(bad<-transmute(flights,diff = ArrTime - DepTime) %>% filter(diff<0) %>% count(.))
```
An error rate of `r round(bad/nrow(flights)*100,1)` %may or may not be acceptable.
We can use `summarise_each` to apply a function to every column in the data set. Lets calculate the median and median absolute deviation for every numeric variable.
```{r}
vars<-colnames(flights)[sapply(flights,is.numeric)]
flights %>% select(one_of(vars)) %>% summarise_each(.,funs(median=median(.,na.rm=TRUE),mad=mad(.,na.rm=TRUE)))
```
Here are some additional functions which can be used with summarise.
![](images/summarise_2.png)
*****
##Group_by
Break the data sets into groups of rows.
![](images/group_by.png)
`group_by` adds the final piece of the puzzle we need to execute the split-apply strategy to our hearts content. This function becomes very powerful when combined with the previously discussed `dplyr` verbs. For example lets calculate which day of the week has the most cancellations.
```{r}
flights %>% group_by(DayOfWeek) %>%
select(Cancelled) %>% summarise_each(funs(canceled=sum(.,na.rm=TRUE),
total=n(),
percent_cancelled=round(canceled/total*100,1)))
```
We can also use `group_by` to generate groups using more than one variable. For example lets calculate the median `AirTime` times by `Distance` and `TailNum`.
```{r}
(slowest<-flights %>% group_by(Distance,TailNum) %>%
select(AirTime) %>% summarise_each(funs(mean(.,na.rm=TRUE))))
```
Lets identify the 3 slowest and fastest planes based on `AirTime` for some arbitrary `Distance`.
```{r}
#choose arbitrary distance
tmp<-slowest %>% na.omit(.) %>%
filter(.,Distance==781) %>%
arrange(AirTime)
#not clear why, but the results can't be bound directly
tmp %>% head(.,3)
tmp %>% tail(.,3) %>% arrange(desc(.))
```
A common data analysis task might be to carry out some group-wise normalization or adjustments of the data. For example we may want to calculate the day of the week with the slowest flights, but also adjust for differences between individual planes. To do this we will start by calculating the average speed for each plane.
```{r}
#calculate speed
flights<-flights %>% mutate(hrs=AirTime/60, speed=Distance/hrs)
(averages<-flights %>%
group_by(TailNum) %>%
select(.,speed) %>%
summarise_each(funs(mean(.,na.rm=TRUE))) %>%
rename(.,mean_speed=speed))
```
Next lets express the overall speed for each plane as ratio to the mean plane speed. To do this we will use one of the powerful join capabilities in `dplyr`.
![](images/join_1.png)
We will join with the original data set based on `TailNum` and calculate the plane-adjusted measure of speed.
```{r}
right_join(flights,averages,by="TailNum") %>%
mutate(norm_speed = speed / mean_speed) %>%
group_by(DayOfWeek) %>%
select(contains("speed")) %>%
summarise_each(funs(mean(.,na.rm=TRUE)))
```
So if things worked out like we expected it looks like Saturday flights are fastest and Tuesday the slowest.
### Using dplyr programmatically
Most of the examples up to this point featured using `dplyr` in interactive mode. However there are variants of nearly every verb which are best suited for use inside other functions. To see what theses are take a look at `verb_` versions of each function (e.g. `summarise_`).
### Benchmarking split-apply in base and dplyr
Finally I will wrap with a relatively non-sophisticated benchmarking head-to-head comparison of `base` and `dplyr` speed for the almighty split-apply strategy.
Lets set up the data.
```{r,warning=FALSE,message=FALSE}
rows<-10000
cols<-100
groups<-100
samples<-rows/groups
tmp.data<-data.frame(matrix(rnorm(rows),rows,cols))
tmp.data$group<-rep(1:groups,each=samples)
```
This data set has `r rows` rows, `r cols` columns and `r groups` groups with `r samples` samples each.
`base` ![](images/sad_face.png)
```{r,warning=FALSE,message=FALSE}
ptm <- Sys.time()
#split the data on the number of cylinders
big.l<-split(tmp.data,tmp.data$group)
#apply some function of interest to all columns
results<-lapply(big.l, function(x) apply(x,2,median))
#bind results and add splitting info
results<-data.frame(group=names(results),do.call("rbind",results))
#elapsed time
(bd<-Sys.time()-ptm )
```
`dplyr` ![](images/happy_face.gif)
```{r,warning=FALSE,message=FALSE}
ptm <- Sys.time()
results<-tmp.data %>% group_by(group) %>% summarise_each(funs(median(.)))
#elapsed time
(ad<-Sys.time()-ptm )
```
Wow I just saved `r bd - ad ` seconds of my life!
Create a benchmark visualization comparing base to dplyr for differing number of groups, rows and columns. Uncomment the code in the appendix below and modify as needed to re-run the benchmark.
```{r,warning=FALSE,message=FALSE,fig.width = 10,fig.width = 10}
load(file="benchmark results")
#create a plot
library(reshape2)
library(ggplot2)
tmp.data<-melt(res,id.vars=c("rows","columns","groups","samples")) %>% mutate(seconds=value*60)
ggplot(tmp.data, aes(y=seconds,x=groups,group=variable,color=variable)) + geom_line() +geom_point()+ facet_grid(rows ~ columns) +scale_y_log10()
```
The plot above shows the calculation time for 10 replications in seconds (y-axis) for calculating the median of varying number of groups (x-axis), rows (y-facet) and columns (y-facet).
*****
### Gotchas
1. In `dplyr` rownames are a second class citizen and are not stored.
2. Even though nearly every `dplyr` tutorial features pipes (`%>%`) it may be easier to learn both `dplyr` and `%>%` separately.
3. Most `dplyr` functions only work on objects coercible to ~ `data.frames`. Lots of my debugging sessions start with trying to understand the data structure of objects I am passing to `dplyr`.
*****
### Appendix
```{r,warning=FALSE,message=FALSE}
#
# #set up functions to time
# base_fun<-function(data){
# #split the data on the number of cylinders
# big.l<-split(data,data$group)
#
# #apply some function of interest to all columns
# results<-lapply(big.l, function(x) apply(x,2,median))
#
# #bind results and add splitting info
# data.frame(group=names(results),do.call("rbind",results))
# }
#
# dplyr_fun<-function(data){
#
# data %>% group_by(group) %>% summarise_each(funs(median(.)))
# }
#
# #benchmark function
# benchmark_fun<-function(rows,cols,groups){
#
# #set up data
# samples<-floor(rows/groups)
# tmp.data<-data.frame(matrix(rnorm(rows),rows,cols))
# tmp.data$group<-rep(1:groups,length.out=rows)
#
# #base
# base.time<-system.time(replicate(10,base_fun(tmp.data)))
#
# #dplyr
# dplyr.time<-system.time(replicate(10,dplyr_fun(tmp.data)))
#
# data.frame(rows=rows,columns=cols,groups=groups,samples=samples,base=signif(base.time["elapsed"]/10,3),dplyr=signif(dplyr.time["elapsed"]/10,3))
#
# }
#
# #run benchmarks
# len<-5
# groups<-seq(5, 100,length.out=len) %>% signif(.,0)
# rows<-seq(100, 10000,length.out=len) %>% signif(.,0)
# cols<-seq(10, 100,length.out=len) %>% signif(.,0)
#
# #benchmarks
# results<-list()
# counter<-1
# for(i in 1:length(groups)){
# .group<-groups[i]
# for(j in 1:length(rows)){
# .row<-rows[j]
# for(k in 1:length(cols)){
# .col<-cols[k]
# results[[counter]]<- benchmark_fun(.row,.col,.group)
# counter<-counter+1
# }
# }
# }
#
#
# res<-do.call("rbind",results)
# save(res,file="benchmark results")
# #create a plot
# library(reshape2)
# library(ggplot2)
# tmp.data<-melt(res,id.vars=c("rows","columns","groups","samples")) %>% mutate(seconds=value*60)
#
# ggplot(tmp.data, aes(y=seconds,x=groups,group=variable,color=variable)) + geom_line() +geom_point()+ facet_grid(columns ~ rows) +scale_y_log10()
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment