MrFlick/dplyr_tidyverse.Rmd

## dplyr_tidyverse.Rmd
---
title: "Dplyr Introduction"
author: "Matthew Flickinger"
date: "July 12, 2017"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE)
```

## Introduction to Dplyr

This document gives an overview of many of the features of the dplyr library include in the "tidyverse" of related R pacakges. First we will load the library and a sample dataset.

```{r init}
#install.packages("tidyverse")
library(tidyverse)
#install.packages("nycflights13")
library(nycflights13)
# Show fewer rows by default in this document
options(tibble.print_min = 5L, tibble.print_max = 5L)
```

We will primarly be using the `flights` data

```{r flights}
flights
```

## Filtering Rows
Find all flights from Detroit in June (in 2013)
```{r filtering1}
# Same as using base R
# flights[flights$dest=="DTW" & flights$month==6, ]
# subset(flights, dest=="DTW" & month==6)
filter(flights, dest=="DTW" & month==6)
```
`filter()` expects a data source as the first parameter, and a single expression as the second parameter. Combine multiple criteria with `&` for "and" -- `|` for "or".

## Selecting Columns

```{r selection1}
# List columns
select(flights, dep_time, arr_time, carrier)
# Exclude columns
select(flights, -year, -tailnum)
# Select column range (in data.frame order)
select(flights, month:dep_delay)
# Name starts with
select(flights, starts_with("d"))
# Name ends with
select(flights, ends_with("time"))
# Name contains
select(flights, contains("arr"))
# Name doesn't start with
select(flights, -starts_with("d"))
# Move column to the beginning
select(flights, flight, everything())
```

Look at the `?select` help page for a list of function to help you select multiple columns.

## Verb composition with pipes

Traditioanlly, we combine functions via nesting, which works but is hard to read

```{r pipe1}
select(filter(flights, dest=="DTW"), carrier)
```
The `%>%` allows us to take an object, and pass it as the first parameter to another function. The above is the same as
```{r pipe2}
flights %>%
  filter(dest=="DTW") %>%
  select(carrier)
```
You can unroll any function with this operator
```{r pipe3}
round(exp(sin(.5)),2)
.5 %>% sin() %>% exp %>% round(2)
```

## Sorting Data
Use `arrange()` to sort data. You just specify the column names you want to sort by, use `desc()` to reverse the sort order for a given column.
```{r arrange1}
flights %>% arrange(sched_dep_time)
flights %>% arrange(month, desc(day))
flights %>% arrange(desc(dep_time-sched_dep_time ))
```

## Creating New Variables

Use `mutate()` to create columns from existing columns or values

```{r mutate1}
flights %>%
  mutate(speed = distance/(air_time/60)) %>%
  arrange(desc(speed)) %>%
  select(flight, speed)
```

You can create multiple columns by separating them with a comma; you can use any previously created columns as well

```{r mutate2}
flights %>%
  mutate(
    dist_km = distance * 1.61,
    hours = air_time / 60,
    kph = dist_km/hours ) %>%
  select(flight, kph)
```

Use `summarize()` to collapse observations (only keeps columns for which you specified a summarization strategy)
```{r mutate3}
flights %>%
  filter(!is.na(arr_delay)) %>%
  summarize(avg_arr_delay = mean(arr_delay))
```

## Grouping Data

Perhaps the most powerful feature of `dplyr` is its grouping abilities. You can specify a column (or columns) for which `mutate()` and `summarize()` happen independently for each unique value in that column (or unique combination or values).

Using `summarize()` will reduce the total number of rows
```{r groupby1}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay))
```
Using `mutate()` will keep the same number of rows and won't drop any columns
```{r groupby2}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  mutate(avg_arr_delay = mean(arr_delay)) %>%
  select(carrier, arr_delay, avg_arr_delay)
```

## Joining data

When finding carriers with the largest flight delay, we were left with a carrier code rather than a carrier name; but who exactly is `9E`?

```{r join1}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay))
```

There is another table that has a lookup from carrier code to carrier name called airlines

```{r join2}
airlines
```

We can use `left_join` to merge in the carrier name
```{r join3}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay)) %>%
  left_join(airlines)
```
Here we use two sample tables `x` and `y` to demonstrate the other types of joins
```{r join4}
x <- tribble(
  ~key, ~xval,
  1, "x1",
  2, "x2",
  3, "x3")
y <- tribble(
  ~key, ~yval,
  1, "y1",
  2, "y2",
  4, "y3")
inner_join(x, y)
left_join(x, y)
right_join(x, y)
full_join(x, y)
```

And you can use non-merging joins to keep or drop rows that match keys from another table. Note that no new columns are added, just the rows of the input tables are filtered
```{r join5}
z <- tribble(
  ~key, ~zval,
  1, "z1",
  3, "z2")
semi_join(x,z)
semi_join(y,z)
anti_join(x,z)
```

The join commands will join on all matching column names. You can more explicitly control this as well. The planes table has information about the aircraft used during the flight. It also happens to have a column named "year" indicating when the aircraft was built. When joining this data to flights, we only want to join on "tailnum" -- not "tailnum" and "year".
```{r join6}
flights %>%
  inner_join(planes) %>%
  nrow()  # wrong, only planes from 2013 are selected
flights %>%
  inner_join(planes, "tailnum") %>%
  nrow() # right
```

## Subsetting functions

`distinct()` will return unique combinations of column values and nothing else
```{r}
flights %>%
  distinct(tailnum, carrier)
```

The `count()` is like `distinct()` except it also returns the number of times each value was observed. It's basically a shortcut for `group_by() %>% summarize()`. For example
```{r}
flights %>% count(carrier)
flights %>% group_by(carrier) %>% summarize(n=n())
```


`sample_n()` will randomly choose a set of rows from your table (different each time)
```{r}
flights %>% sample_n(3)
flights %>% sample_n(3)
```

You might also consider `anti_join` and `semi_join` to be subsetting commands rather than joining commands.

## _at/_if/_all

The `summarize()`, `mutate()` and `group_by()` functions all have `_all()`, `_at()` and `_if()` variants that make it easier to apply the same function or functions to multiple columns.


`mutate_at()`, `summarize_at()` and `group_by_at()` allow you to choose columns in the same way you can do with select using the `vars()` helper function. This will take the mean of all columns that end in "_time"

```{r}
flights %>%
  summarize_at(vars(ends_with("time")), mean, na.rm=T)
```

`mutate_if()`, `summarize_if()` and `group_by_if()` allow you run a function on each column to choose only columns that meet a certain criteria. This can use useful for extracting columns of a certain class. Note you can also apply more than one function to these columns if you use the `funs()` helper function. This example will calculate the mean and variance for all numeric columns.

```{r}
flights %>%
  summarize_if(is.numeric, funs(mean, var), na.rm=T)

```

The `_all()` versions of these functions will apply the same transformations to call non-grouped columns in the data source.

## Other useful functions

The `lead()` and `lag()` functions are useful for selecting the next or previous values in a sequence (especially for time series data).
```{r}
x<-1:5
lead(x)
lag(x)
```

The `coalesce()` function will return the first non-missing value from the vectors you pass to it. This is useful when you have multiple columns where only one column contains a value and you want to collapse them to a single vector
```{r}
coalesce(c(NA,2,NA), c(1, NA, NA), 3)
```

When using other `dplyr` verbs, the `n()` and `n_distinct()` functions will return the total number of observations or the number of unique observations respectively. In this example we look at the tail number for each plane to see how many total flights it took and also look at the number of distinct flight numbers that plane was a part of

```{r}
flights %>%
  group_by(tailnum) %>%
  summarize(flights=n(), routes=n_distinct(flight))
```

The `recode()` function allows you to swap out certain values in a vector with different values.
```{r}
recode(letters[1:5], b="boo")
```

The `case_when()` function allows more complex transformations than `recode()`. It's a good alternative to a bunch of nested `ifelse()` calls that you might need to use in base R. Each parameter should be a formula with a left-hand side value that evaulates to TRUE or FALSE and a right-hand side to return when that boolean value is TRUE. Only the value for the first TRUE is returned.

Here's a classic example of the "fizz buzz" problem where you are supposed to return the numbers 1-50 but replace all those values divisible by 5 with "fizz" and the values divisible by 7 with "buzz" and those divisible by both 5 and 7 by "fizz buzz"
```{r}
x <- 1:50
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
```

## Combining data frames

The `bind_rows()` and `bind_columns()` functions are alternatives to the base functions `rbind()` and `cbind()` that are list-friendly. Many times you end up with data.frames in a list that you want to combine in a single data.frame. These functions can help.

In this example, we have a list of two tibbles. We can combine them with `bind_rows`

```{r}
x <- list(
  data_frame(a=1:2, z=letters[1:2]),
  data_frame(a=14:20, z=letters[14:20])
)
bind_rows(x)
bind_rows(x[[1]], x[[2]])
```

## Programming with dplyr

Since dplyr uses non-standard evaluation to allow you to soecify data.frame column names without quotes, it can be tricky to write functions that use dply commands. Note that the first attempt at writting a function doesn't work

```{r}
# Normal command, works fun
flights %>%
  group_by(carrier) %>%
  summarize(delay=mean(arr_delay, na.rm=T))

# DOESN'T WORK
f <- function(x) {
  flights %>%
    group_by(x) %>%
    summarize(delay=mean(arr_delay, na.rm=T))
}
f(carrier)
```

The latest version of dplyr (0.7) introduced new way to write functions. Previously you would use the standard-evaluation version of functions that ended in an underscore (use `mutate_` rather than `mutate`); but the new version now uses "quosures" to allow you to pass column names. Here are two examples of functions that will work

```{r}
f <- function(x) {
  flights %>% group_by(!!x) %>%
  summarize(delay = mean(arr_delay, na.rm=T))
}
f(quo(carrier))

g <- function(x) {
  x <- enquo(x)
  flights %>% group_by(!!x) %>%
  summarize(delay = mean(arr_delay, na.rm=T))
}
g(carrier)
```

We can either use `quo()` to create our own quosure with the column name, or we can use `enquo()` to turn a function parameter into a quosure.

Finally, in base R it's complicated to dynamically set the name of a parameter to a function (the name being the part to the left of the `=` in a call like `f(a=b)`). The latest dplyr functions now also allow you to use the value of a variable as a parameter name if you use `:=` rather than `=`. For example

```{r}
h <- function(x) {
  x <- enquo(x)
  outname <- paste(quo_name(x), "delay", sep="_")
  flights %>% group_by(!!x) %>%
  summarize(!!outname := mean(arr_delay, na.rm=T))
}
h(carrier)
```
	---
	title: "Dplyr Introduction"
	author: "Matthew Flickinger"
	date: "July 12, 2017"
	output:
	pdf_document: default
	html_document: default
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE, error=TRUE)
	```

	## Introduction to Dplyr

	This document gives an overview of many of the features of the dplyr library include in the "tidyverse" of related R pacakges. First we will load the library and a sample dataset.

	```{r init}
	#install.packages("tidyverse")
	library(tidyverse)
	#install.packages("nycflights13")
	library(nycflights13)
	# Show fewer rows by default in this document
	options(tibble.print_min = 5L, tibble.print_max = 5L)
	```

	We will primarly be using the `flights` data

	```{r flights}
	flights
	```

	## Filtering Rows
	Find all flights from Detroit in June (in 2013)
	```{r filtering1}
	# Same as using base R
	# flights[flights$dest=="DTW" & flights$month==6, ]
	# subset(flights, dest=="DTW" & month==6)
	filter(flights, dest=="DTW" & month==6)
	```
	`filter()` expects a data source as the first parameter, and a single expression as the second parameter. Combine multiple criteria with `&` for "and" -- `\|` for "or".

	## Selecting Columns

	```{r selection1}
	# List columns
	select(flights, dep_time, arr_time, carrier)
	# Exclude columns
	select(flights, -year, -tailnum)
	# Select column range (in data.frame order)
	select(flights, month:dep_delay)
	# Name starts with
	select(flights, starts_with("d"))
	# Name ends with
	select(flights, ends_with("time"))
	# Name contains
	select(flights, contains("arr"))
	# Name doesn't start with
	select(flights, -starts_with("d"))
	# Move column to the beginning
	select(flights, flight, everything())
	```

	Look at the `?select` help page for a list of function to help you select multiple columns.

	## Verb composition with pipes

	Traditioanlly, we combine functions via nesting, which works but is hard to read

	```{r pipe1}
	select(filter(flights, dest=="DTW"), carrier)
	```
	The `%>%` allows us to take an object, and pass it as the first parameter to another function. The above is the same as
	```{r pipe2}
	flights %>%
	filter(dest=="DTW") %>%
	select(carrier)
	```
	You can unroll any function with this operator
	```{r pipe3}
	round(exp(sin(.5)),2)
	.5 %>% sin() %>% exp %>% round(2)
	```

	## Sorting Data
	Use `arrange()` to sort data. You just specify the column names you want to sort by, use `desc()` to reverse the sort order for a given column.
	```{r arrange1}
	flights %>% arrange(sched_dep_time)
	flights %>% arrange(month, desc(day))
	flights %>% arrange(desc(dep_time-sched_dep_time ))
	```

	## Creating New Variables

	Use `mutate()` to create columns from existing columns or values

	```{r mutate1}
	flights %>%
	mutate(speed = distance/(air_time/60)) %>%
	arrange(desc(speed)) %>%
	select(flight, speed)
	```

	You can create multiple columns by separating them with a comma; you can use any previously created columns as well

	```{r mutate2}
	flights %>%
	mutate(
	dist_km = distance * 1.61,
	hours = air_time / 60,
	kph = dist_km/hours ) %>%
	select(flight, kph)
	```

	Use `summarize()` to collapse observations (only keeps columns for which you specified a summarization strategy)
	```{r mutate3}
	flights %>%
	filter(!is.na(arr_delay)) %>%
	summarize(avg_arr_delay = mean(arr_delay))
	```

	## Grouping Data

	Perhaps the most powerful feature of `dplyr` is its grouping abilities. You can specify a column (or columns) for which `mutate()` and `summarize()` happen independently for each unique value in that column (or unique combination or values).

	Using `summarize()` will reduce the total number of rows
	```{r groupby1}
	flights %>%
	filter(!is.na(arr_delay)) %>%
	group_by(carrier) %>%
	summarize(avg_arr_delay = mean(arr_delay))
	```
	Using `mutate()` will keep the same number of rows and won't drop any columns
	```{r groupby2}
	flights %>%
	filter(!is.na(arr_delay)) %>%
	group_by(carrier) %>%
	mutate(avg_arr_delay = mean(arr_delay)) %>%
	select(carrier, arr_delay, avg_arr_delay)
	```

	## Joining data

	When finding carriers with the largest flight delay, we were left with a carrier code rather than a carrier name; but who exactly is `9E`?

	```{r join1}
	flights %>%
	filter(!is.na(arr_delay)) %>%
	group_by(carrier) %>%
	summarize(avg_arr_delay = mean(arr_delay))
	```

	There is another table that has a lookup from carrier code to carrier name called airlines

	```{r join2}
	airlines
	```

	We can use `left_join` to merge in the carrier name
	```{r join3}
	flights %>%
	filter(!is.na(arr_delay)) %>%
	group_by(carrier) %>%
	summarize(avg_arr_delay = mean(arr_delay)) %>%
	left_join(airlines)
	```
	Here we use two sample tables `x` and `y` to demonstrate the other types of joins
	```{r join4}
	x <- tribble(
	~key, ~xval,
	1, "x1",
	2, "x2",
	3, "x3")
	y <- tribble(
	~key, ~yval,
	1, "y1",
	2, "y2",
	4, "y3")
	inner_join(x, y)
	left_join(x, y)
	right_join(x, y)
	full_join(x, y)
	```

	And you can use non-merging joins to keep or drop rows that match keys from another table. Note that no new columns are added, just the rows of the input tables are filtered
	```{r join5}
	z <- tribble(
	~key, ~zval,
	1, "z1",
	3, "z2")
	semi_join(x,z)
	semi_join(y,z)
	anti_join(x,z)
	```

	The join commands will join on all matching column names. You can more explicitly control this as well. The planes table has information about the aircraft used during the flight. It also happens to have a column named "year" indicating when the aircraft was built. When joining this data to flights, we only want to join on "tailnum" -- not "tailnum" and "year".
	```{r join6}
	flights %>%
	inner_join(planes) %>%
	nrow() # wrong, only planes from 2013 are selected
	flights %>%
	inner_join(planes, "tailnum") %>%
	nrow() # right
	```

	## Subsetting functions

	`distinct()` will return unique combinations of column values and nothing else
	```{r}
	flights %>%
	distinct(tailnum, carrier)
	```

	The `count()` is like `distinct()` except it also returns the number of times each value was observed. It's basically a shortcut for `group_by() %>% summarize()`. For example
	```{r}
	flights %>% count(carrier)
	flights %>% group_by(carrier) %>% summarize(n=n())
	```


	`sample_n()` will randomly choose a set of rows from your table (different each time)
	```{r}
	flights %>% sample_n(3)
	flights %>% sample_n(3)
	```

	You might also consider `anti_join` and `semi_join` to be subsetting commands rather than joining commands.

	## _at/_if/_all

	The `summarize()`, `mutate()` and `group_by()` functions all have `_all()`, `_at()` and `_if()` variants that make it easier to apply the same function or functions to multiple columns.


	`mutate_at()`, `summarize_at()` and `group_by_at()` allow you to choose columns in the same way you can do with select using the `vars()` helper function. This will take the mean of all columns that end in "_time"

	```{r}
	flights %>%
	summarize_at(vars(ends_with("time")), mean, na.rm=T)
	```

	`mutate_if()`, `summarize_if()` and `group_by_if()` allow you run a function on each column to choose only columns that meet a certain criteria. This can use useful for extracting columns of a certain class. Note you can also apply more than one function to these columns if you use the `funs()` helper function. This example will calculate the mean and variance for all numeric columns.

	```{r}
	flights %>%
	summarize_if(is.numeric, funs(mean, var), na.rm=T)

	```

	The `_all()` versions of these functions will apply the same transformations to call non-grouped columns in the data source.

	## Other useful functions

	The `lead()` and `lag()` functions are useful for selecting the next or previous values in a sequence (especially for time series data).
	```{r}
	x<-1:5
	lead(x)
	lag(x)
	```

	The `coalesce()` function will return the first non-missing value from the vectors you pass to it. This is useful when you have multiple columns where only one column contains a value and you want to collapse them to a single vector
	```{r}
	coalesce(c(NA,2,NA), c(1, NA, NA), 3)
	```

	When using other `dplyr` verbs, the `n()` and `n_distinct()` functions will return the total number of observations or the number of unique observations respectively. In this example we look at the tail number for each plane to see how many total flights it took and also look at the number of distinct flight numbers that plane was a part of

	```{r}
	flights %>%
	group_by(tailnum) %>%
	summarize(flights=n(), routes=n_distinct(flight))
	```

	The `recode()` function allows you to swap out certain values in a vector with different values.
	```{r}
	recode(letters[1:5], b="boo")
	```

	The `case_when()` function allows more complex transformations than `recode()`. It's a good alternative to a bunch of nested `ifelse()` calls that you might need to use in base R. Each parameter should be a formula with a left-hand side value that evaulates to TRUE or FALSE and a right-hand side to return when that boolean value is TRUE. Only the value for the first TRUE is returned.

	Here's a classic example of the "fizz buzz" problem where you are supposed to return the numbers 1-50 but replace all those values divisible by 5 with "fizz" and the values divisible by 7 with "buzz" and those divisible by both 5 and 7 by "fizz buzz"
	```{r}
	x <- 1:50
	case_when(
	x %% 35 == 0 ~ "fizz buzz",
	x %% 5 == 0 ~ "fizz",
	x %% 7 == 0 ~ "buzz",
	TRUE ~ as.character(x)
	)
	```

	## Combining data frames

	The `bind_rows()` and `bind_columns()` functions are alternatives to the base functions `rbind()` and `cbind()` that are list-friendly. Many times you end up with data.frames in a list that you want to combine in a single data.frame. These functions can help.

	In this example, we have a list of two tibbles. We can combine them with `bind_rows`

	```{r}
	x <- list(
	data_frame(a=1:2, z=letters[1:2]),
	data_frame(a=14:20, z=letters[14:20])
	)
	bind_rows(x)
	bind_rows(x[[1]], x[[2]])
	```

	## Programming with dplyr

	Since dplyr uses non-standard evaluation to allow you to soecify data.frame column names without quotes, it can be tricky to write functions that use dply commands. Note that the first attempt at writting a function doesn't work

	```{r}
	# Normal command, works fun
	flights %>%
	group_by(carrier) %>%
	summarize(delay=mean(arr_delay, na.rm=T))

	# DOESN'T WORK
	f <- function(x) {
	flights %>%
	group_by(x) %>%
	summarize(delay=mean(arr_delay, na.rm=T))
	}
	f(carrier)
	```

	The latest version of dplyr (0.7) introduced new way to write functions. Previously you would use the standard-evaluation version of functions that ended in an underscore (use `mutate_` rather than `mutate`); but the new version now uses "quosures" to allow you to pass column names. Here are two examples of functions that will work

	```{r}
	f <- function(x) {
	flights %>% group_by(!!x) %>%
	summarize(delay = mean(arr_delay, na.rm=T))
	}
	f(quo(carrier))

	g <- function(x) {
	x <- enquo(x)
	flights %>% group_by(!!x) %>%
	summarize(delay = mean(arr_delay, na.rm=T))
	}
	g(carrier)
	```

	We can either use `quo()` to create our own quosure with the column name, or we can use `enquo()` to turn a function parameter into a quosure.

	Finally, in base R it's complicated to dynamically set the name of a parameter to a function (the name being the part to the left of the `=` in a call like `f(a=b)`). The latest dplyr functions now also allow you to use the value of a variable as a parameter name if you use `:=` rather than `=`. For example

	```{r}
	h <- function(x) {
	x <- enquo(x)
	outname <- paste(quo_name(x), "delay", sep="_")
	flights %>% group_by(!!x) %>%
	summarize(!!outname := mean(arr_delay, na.rm=T))
	}
	h(carrier)
	```