Skip to content

Instantly share code, notes, and snippets.

@swayson
Created March 15, 2015 12:43
Show Gist options
  • Save swayson/362025719f1841512280 to your computer and use it in GitHub Desktop.
Save swayson/362025719f1841512280 to your computer and use it in GitHub Desktop.
---
title: 'Going deeper with dplyr: New features in 0.3 and 0.4'
output: html_document
---
## Introduction
In August 2014, I created a [40-minute video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) introducing the key functionality of the dplyr package in R, using dplyr version 0.2. Since then, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features.
This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality that I didn't cover last time (though it is not necessarily new). My [new video tutorial](https://www.youtube.com/watch?v=2mh1PqfsXVI) walks through the code below in detail.
**If you have not watched the [previous tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc)**, I recommend you do so first since it covers some dplyr basics that will not be covered in this tutorial.
## Loading dplyr and the nycflights13 dataset
Although my last tutorial used data from the hflights package, Hadley Wickham has rewritten the [dplyr vignettes](http://cran.r-project.org/web/packages/dplyr/index.html) to use the nycflights13 package instead, and so I'm also using nycflights13 for the sake of consistency.
```{r eval=FALSE}
# remove flights data if you just finished my previous tutorial
rm(flights)
```
```{r}
# load packages
suppressMessages(library(dplyr))
library(nycflights13)
# print the flights dataset from nycflights13
flights
```
## Choosing columns: select, rename
```{r}
# besides just using select() to pick columns...
flights %>% select(carrier, flight)
# ...you can use the minus sign to hide columns
flights %>% select(-month, -day)
```
```{r results='hide'}
# hide a range of columns
flights %>% select(-(dep_time:arr_delay))
# hide any column with a matching name
flights %>% select(-contains("time"))
```
```{r}
# pick columns using a character vector of column names
cols <- c("carrier", "flight", "tailnum")
flights %>% select(one_of(cols))
```
```{r}
# select() can be used to rename columns, though all columns not mentioned are dropped
flights %>% select(tail = tailnum)
# rename() does the same thing, except all columns not mentioned are kept
flights %>% rename(tail = tailnum)
```
## Choosing rows: filter, between, slice, sample_n, top_n, distinct
```{r}
# filter() supports the use of multiple conditions
flights %>% filter(dep_time >= 600, dep_time <= 605)
```
```{r results='hide'}
# between() is a concise alternative for determing if numeric values fall in a range
flights %>% filter(between(dep_time, 600, 605))
# side note: is.na() can also be useful when filtering
flights %>% filter(!is.na(dep_time))
```
```{r}
# slice() filters rows by position
flights %>% slice(1000:1005)
# keep the first three rows within each group
flights %>% group_by(month, day) %>% slice(1:3)
# sample three rows from each group
flights %>% group_by(month, day) %>% sample_n(3)
# keep three rows from each group with the top dep_delay
flights %>% group_by(month, day) %>% top_n(3, dep_delay)
# also sort by dep_delay within each group
flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))
```
```{r}
# unique rows can be identified using unique() from base R
flights %>% select(origin, dest) %>% unique()
```
```{r results='hide'}
# dplyr provides an alternative that is more "efficient"
flights %>% select(origin, dest) %>% distinct()
# side note: when chaining, you don't have to include the parentheses if there are no arguments
flights %>% select(origin, dest) %>% distinct
```
## Adding new variables: mutate, transmute, add_rownames
```{r}
# mutate() creates a new variable (and keeps all existing variables)
flights %>% mutate(speed = distance/air_time*60)
# transmute() only keeps the new variables
flights %>% transmute(speed = distance/air_time*60)
```
```{r}
# example data frame with row names
mtcars %>% head()
# add_rownames() turns row names into an explicit variable
mtcars %>% add_rownames("model") %>% head()
# side note: dplyr no longer prints row names (ever) for local data frames
mtcars %>% tbl_df()
```
## Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup
```{r}
# summarise() can be used to count the number of rows in each group
flights %>% group_by(month) %>% summarise(cnt = n())
```
```{r results='hide'}
# tally() and count() can do this more concisely
flights %>% group_by(month) %>% tally()
flights %>% count(month)
```
```{r}
# you can sort by the count
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
```
```{r results='hide'}
# tally() and count() have a sort parameter for this purpose
flights %>% group_by(month) %>% tally(sort=TRUE)
flights %>% count(month, sort=TRUE)
```
```{r}
# you can sum over a specific variable instead of simply counting rows
flights %>% group_by(month) %>% summarise(dist = sum(distance))
```
```{r results='hide'}
# tally() and count() have a wt parameter for this purpose
flights %>% group_by(month) %>% tally(wt = distance)
flights %>% count(month, wt = distance)
```
```{r}
# group_size() returns the counts as a vector
flights %>% group_by(month) %>% group_size()
# n_groups() simply reports the number of groups
flights %>% group_by(month) %>% n_groups()
```
```{r}
# group by two variables, summarise, arrange (output is possibly confusing)
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)
# ungroup() before arranging to arrange across all groups
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
```
## Creating data frames: data_frame
`data_frame()` is a better way than `data.frame()` for creating data frames. Benefits of `data_frame()`:
* You can use previously defined columns to compute new columns.
* It never coerces column types.
* It never munges column names.
* It never adds row names.
* It only recycles length 1 input.
* It returns a local data frame (a tbl_df).
```{r}
# data_frame() example
data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse()
# data.frame() example
data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse()
```
## Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join
```{r}
# create two simple data frames
(a <- data_frame(color = c("green","yellow","red"), num = 1:3))
(b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L")))
# only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables)
inner_join(a, b)
# include observations found in either "a" or "b"
full_join(a, b)
# include all observations found in "a"
left_join(a, b)
# include all observations found in "b"
right_join(a, b)
# right_join(a, b) is identical to left_join(b, a) except for column ordering
left_join(b, a)
# filter "a" to only show observations that match "b"
semi_join(a, b)
# filter "a" to only show observations that don't match "b"
anti_join(a, b)
```
```{r}
# sometimes matching variables don't have identical names
b <- b %>% rename(col = color)
# specify that the join should occur by matching "color" in "a" with "col" in "b"
inner_join(a, b, by=c("color" = "col"))
```
## Viewing more output: print, View
```{r}
# specify that you want to see more rows
flights %>% print(n = 15)
```
```{r eval=FALSE}
# specify that you want to see ALL rows (don't run this!)
flights %>% print(n = Inf)
```
```{r}
# specify that you want to see all columns
flights %>% print(width = Inf)
```
```{r eval=FALSE}
# show up to 1000 rows and all columns
flights %>% View()
# set option to see all columns and fewer rows
options(dplyr.width = Inf, dplyr.print_min = 6)
# reset options (or just close R)
options(dplyr.width = NULL, dplyr.print_min = 10)
```
## Resources
* Release announcements for [version 0.3](http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/) and [version 0.4](http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/)
* [dplyr reference manual and vignettes](http://cran.r-project.org/web/packages/dplyr/)
* [Two-table vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) covering joins and set operations
* [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for dplyr and tidyr
* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)
## Data School
* [Blog](http://www.dataschool.io/)
* [Email newsletter](http://www.dataschool.io/subscribe/)
* [YouTube channel](http://youtube.com/dataschool)
< END OF DOCUMENT >
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment