datalove/preso.R

## preso.R
```{r, echo = FALSE}
library(knitr)
library(plyr)
library(data.table)
library(dplyr)

gears <- mtcars$gear
mtcars <- mtcars[,1:6]
mtcars$gear <- gears

```
Welcome to dplyr
========================================================
author: Tommy M O'Dell (tommy.odell@gmail.com)
date: September 10th, 2015
transition: rotate
transition-speed: fast
width: 1440
height: 900

dplyr?
========================================================
Frictionless manipulation of data frames in R
***
Package goals:

1. simple interface
2. good performance
3. same interface for many 'backends'

<!-- Notes:
- the ordering of the points is relevant (Hadley will trade off performance for a clean interface)
- familiar with plyr? Think of dplyr as plyr specialised for data frames
-->

Data manipulation in base R
========================================================
type: section


Filter the rows of a data frame
========================================================

Keep rows where mpg > 30 and cyl >= 4
```{r}
mtcars[mtcars$mpg > 30 & mtcars$cyl >= 4,]
```

Add or modify columns
========================================================

Create a new column for displacement per cylinder
```{r}
mtcars$disp_cyl <- mtcars$disp / mtcars$cyl
head(mtcars)
```

Sort a data frame by its values
========================================================

```{r}
mt <- mtcars[order(mtcars$mpg, mtcars$cyl),]
head(mt)
```

Select columns from a data frame
========================================================

Method 1 - Numerical indices
```{r}
head(mtcars[,1:4])
```

Select columns from a data frame
========================================================
title: false

Method 2 - Named indices
```{r}
head(mtcars[,c('mpg','cyl','disp','gear')])
```

Summarise (aggregate) a data frame
========================================================

```{r}
aggregate(mpg ~ cyl + gear, data = mtcars, mean)
```

What about CRAN?
========================================================
type: section


========================================================

Using `ddply` from **plyr** to get the mean mpg per cylinder and gear
```{r}
ddply(
  mtcars,         # data frame
  .(cyl, gear),   # grouping columns
  summarise,      # function applied to each group
  mpg = mean(mpg) # aggregation to compute per group
)
```

========================================================

Using `data.table` to get the mean mpg per cylinder and gear
```{r}
dtcars <- as.data.table(mtcars)
dtcars[, mean(mpg), by = list(cyl,gear)]
```

If it ain't broke?
========================================================
type: section

Are any of these both **readable** and **fast**?


The dplyr promise
========================================================
99% of data manipulation can be described with 6 key operations ('verbs')

1. **filter**: filter the rows of a data frame
2. **mutate**: modify or create new columns
3. **group by**: set grouping variables
4. **summarise**: aggregate a data frame
5. **arrange**: sort the rows of a data frame
6. **select**: select a set of columns

```{r, echo = FALSE}
mtcars <- as_tibble(mtcars) # tbl_df() is deprecated in current dplyr
```


filter
========================================================
```{r}
filter(mtcars, mpg > 30, cyl >= 4)
```

mutate
========================================================
(modify or create columns)

Same result as the base R version
```{r}
mutate(mtcars, disp_cyl = disp/cyl)
```

========================================================
Multiple columns in one
```{r}
mutate(
  mtcars,
  disp_cyl = disp/cyl,
  kw = hp * 0.746 # 1 hp is roughly 0.746 kW
)
```

========================================================
Can even refer to newly created columns immediately...
```{r}
mutate(
  mtcars,
  disp_cyl = disp/cyl,
  k_watt = hp * 0.746, # 1 hp is roughly 0.746 kW
  watts  = k_watt * 1000
)
```
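
A minimal sketch of why this matters, assuming base R's `transform()` as the point of comparison (the `cars` subset and the `double_it` column are made up for illustration). `mutate()` evaluates its arguments in order, while `transform()` evaluates every expression against the original columns only.
```{r}
library(dplyr)

# hypothetical small subset, taken from the untouched built-in data set
cars <- datasets::mtcars[, c("disp", "cyl")]

# mutate(): disp_cyl already exists when double_it is computed
m <- mutate(cars, disp_cyl = disp / cyl, double_it = disp_cyl * 2)
head(m)

# transform(): disp_cyl is not a column of `cars` yet, so this call fails
res <- try(
  transform(cars, disp_cyl = disp / cyl, double_it = disp_cyl * 2),
  silent = TRUE
)
inherits(res, "try-error")
```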

Group by and Summarise
========================================================
```{r}
mtcars <- group_by(mtcars, cyl, gear)
mtcars
```

========================================================
```{r}
summarise(mtcars, mpg = mean(mpg)) # uses the grouping set previously
```
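
A small sketch of what "uses the grouping" means in practice, assuming a dplyr version that provides `group_vars()` (the names `by_cyl_gear` and `means` are illustrative only). By default `summarise()` returns one row per group and drops the last grouping level from the result.
```{r}
library(dplyr)

# start from the untouched built-in data set so the chunk stands alone
by_cyl_gear <- group_by(datasets::mtcars, cyl, gear)
group_vars(by_cyl_gear)    # "cyl" "gear"

# one row per (cyl, gear) pair; the result is still grouped by cyl
means <- summarise(by_cyl_gear, mpg = mean(mpg))
group_vars(means)          # "cyl"

# ungroup() removes whatever grouping is left
group_vars(ungroup(means)) # character(0)
```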

========================================================
Multiple aggregations in one
```{r}
summarise(
  mtcars,
  mpg = mean(mpg),
  hp = mean(hp)
)
```

arrange
========================================================
```{r}
arrange(mtcars, mpg, cyl)
```

========================================================
```{r}
arrange(mtcars, mpg, desc(cyl)) # descending!
```

select
========================================================
```{r, echo = FALSE}
mtcars <- ungroup(mtcars)
```

```{r}
select(mtcars, 1:4)
```

========================================================
```{r}
select(mtcars, mpg, cyl, disp, gear)
```

========================================================
```{r}
select(mtcars, mpg:hp)
```

========================================================
```{r}
select(mtcars, contains('a'))
```

========================================================
```{r}
select(mtcars, starts_with('d'))
```

========================================================
```{r}
select(mtcars, -starts_with('d'))
```

Putting it all together
========================================================
type: section


========================================================
Let's say during our exploratory analysis we want to create a new column, filter on that new column,
then get the mean and minimum of that new column for each cylinder and gear

```{r}
mt <- mutate(mtcars, disp_cyl = disp/cyl)
mt <- filter(mt, disp_cyl > 30, mpg > 23)
mt <- group_by(mt, cyl, gear)
```

```{r}
summarise(mt, avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))

```


That's a lot of repetition, and we now have an extra variable **mt** sitting around taking up space...

========================================================

If we just want to print the answer without intermediary variables...
```{r}
summarise(group_by(filter(mutate(mtcars, disp_cyl = disp/cyl), disp_cyl > 30,  mpg > 23), cyl, gear), avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))
```
(Say what???!)

========================================================

We can make that a bit easier to read... but not great
```{r}
summarise(
  group_by(
    filter(
      mutate(mtcars, disp_cyl = disp/cyl),
      disp_cyl > 30,
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
)
```
(Butt ugly!)


Ceci n'est pas une pipe
=======================================================
type: section


=======================================================
Our last example was barely readable. What can we do? **Pipes** to the rescue!

  * Introduced through the **magrittr** and **dplyr** packages around the same time
  * Inspired by Unix pipes and F# pipes

A pipe ('`%>%`') takes the left-hand side and passes it to the right-hand side as the first argument.
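
A minimal sketch of that rule on a plain vector, assuming only that **dplyr** is loaded (it re-exports `%>%` from **magrittr**). The names `x`, `piped`, `nested` and `rounded` are made up for illustration.
```{r}
library(dplyr)

x <- c(1, 4, 9)

# the pipe supplies x as the first argument of sqrt(), then of sum()
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))
identical(piped, nested)

# any other arguments stay where they are:
# this is the same call as round(c(3.14159, 2.71828), 2)
rounded <- c(3.14159, 2.71828) %>% round(2)
rounded
```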

=======================================================
```{r}
summarise(
  group_by(
    filter(
      mutate(
        mtcars,
        disp_cyl = disp/cyl
      ),
      disp_cyl > 30,
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
)
```
***
```{r}
mtcars %>%
  mutate(disp_cyl = disp/cyl) %>%
  filter(disp_cyl>30, mpg>23) %>%
  group_by(cyl, gear) %>%
  summarise(
    avg_d_cyl = mean(disp_cyl),
    min_d_cyl = min(disp_cyl)
  )
```

That's not where it ends...
==========================
type: section


==========================
Let's load up a bigger data set
```{r}
library(hflights) # to load the flights data set
as_tibble(hflights) # tbl_df() is deprecated in current dplyr
```


==========================
```{r, eval = FALSE}
hflights %>%
  mutate(ArrEarly = ArrDelay < 0) %>%
  filter(DepDelay > 60, Distance > 200) %>%
  mutate()

```
***
```{r}

```


What is this black magic??
==========================
type: section

Questions?
========================================================
type: prompt