@datalove
Created September 9, 2015 22:54
R preso for WARG
```{r, echo = FALSE}
library(knitr)
library(plyr)
library(data.table)
library(dplyr)
gears <- mtcars$gear
mtcars <- mtcars[,1:6]
mtcars$gear <- gears
```
Welcome to dplyr
========================================================
author: Tommy M O'Dell (tommy.odell@gmail.com)
date: September 10th, 2015
transition: rotate
transition-speed: fast
width: 1440
height: 900
dplyr?
========================================================
Frictionless manipulation of data frames in R
***
Package goals:
1. simple interface
2. good performance
3. same interface for many 'backends'
<!-- Notes:
- the ordering of the points is relevant (Hadley will trade off performance for a clean interface)
- familiar with plyr? Think of dplyr as plyr specialised for data frames
-->
Data manipulation in base R
========================================================
type: section
Filter the rows of a data frame
========================================================
Keep rows where mpg > 30 and cyl >= 4
```{r}
mtcars[mtcars$mpg > 30 & mtcars$cyl >= 4,]
```
Add or modify columns
========================================================
Create a new column for displacement per cylinder
```{r}
mtcars$disp_cyl <- mtcars$disp / mtcars$cyl
head(mtcars)
```
Sort a data frame by its values
========================================================
```{r}
mt <- mtcars[order(mtcars$mpg, mtcars$cyl),]
head(mt)
```
Select columns from a data frame
========================================================
Method 1 - Numerical indices
```{r}
head(mtcars[,1:4])
```
Select columns from a data frame
========================================================
title: false
Method 2 - Named indices
```{r}
head(mtcars[,c('mpg','cyl','disp','gear')])
```
Summarise (aggregate) a data frame
========================================================
```{r}
aggregate(mpg ~ cyl + gear, data = mtcars, mean)
```
What about CRAN?
========================================================
type: section
========================================================
Using `ddply` from **plyr** to get the mean mpg per cylinder and gear
```{r}
ddply(
  mtcars,          # data frame
  .(cyl, gear),    # grouping columns
  summarise,       # function applied to each group
  mpg = mean(mpg)  # aggregation to compute per group
)
```
========================================================
Using `data.table` to get the mean mpg per cylinder and gear
```{r}
dtcars <- as.data.table(mtcars)
dtcars[, mean(mpg), by = list(cyl,gear)]
```
If it ain't broke?
========================================================
type: section
Are any of these both **readable** and **fast**?
The dplyr promise
========================================================
99% of data manipulation can be described with 6 key operations ('verbs')
1. **filter**: filter the rows of a data frame
2. **mutate**: modify or create new columns
3. **group by**: set grouping variables
4. **summarise**: aggregate a data frame
5. **arrange**: sort the rows of a data frame
6. **select**: select a set of columns
```{r, echo = FALSE}
mtcars <- tbl_df(mtcars)
```
filter
========================================================
```{r}
filter(mtcars, mpg > 30, cyl >= 4)
```
mutate
========================================================
(modify or create columns)
Same result as the base R version
```{r}
mutate(mtcars, disp_cyl = disp/cyl)
```
========================================================
Multiple columns in one call
```{r}
mutate(
  mtcars,
  disp_cyl = disp/cyl,
  kw = hp * 0.746  # 1 hp is roughly 0.746 kW
)
```
========================================================
Can even refer to newly created columns immediately...
```{r}
mutate(
  mtcars,
  disp_cyl = disp/cyl,
  k_watt = hp * 0.746,   # 1 hp is roughly 0.746 kW
  watts = k_watt * 1000  # uses the column created on the line above
)
```
Group by and Summarise
========================================================
```{r}
mtcars <- group_by(mtcars, cyl, gear)
mtcars
```
========================================================
```{r}
summarise(mtcars, mpg = mean(mpg)) # uses the grouping set previously
```
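A minimal sketch of what "uses the grouping set previously" means in practice. It assumes a reasonably recent dplyr (for `group_vars()`) and works on a fresh copy of the built-in data via `datasets::mtcars`; the names `by_cyl_gear` and `means` are illustrative only. With the default behaviour, each `summarise` call drops the last grouping variable from the result.

```{r}
library(dplyr)

# Group a fresh copy of the built-in data by cyl and gear
by_cyl_gear <- group_by(datasets::mtcars, cyl, gear)
group_vars(by_cyl_gear)     # both grouping variables are recorded

# summarise() computes one row per group and drops the last grouping level
means <- summarise(by_cyl_gear, mpg = mean(mpg))
group_vars(means)           # only the first grouping variable remains

# ungroup() removes whatever grouping is left
group_vars(ungroup(means))
```

This is why the grouped `mtcars` needs an `ungroup` before it is used as a plain table again.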
========================================================
Multiple aggregations in one call
```{r}
summarise(
  mtcars,
  mpg = mean(mpg),
  hp = mean(hp)
)
```
arrange
========================================================
```{r}
arrange(mtcars, mpg, cyl)
```
========================================================
```{r}
arrange(mtcars, mpg, desc(cyl)) # descending!
```
select
========================================================
```{r, echo = FALSE}
mtcars <- ungroup(mtcars)
```
```{r}
select(mtcars, 1:4)
```
========================================================
```{r}
select(mtcars, mpg, cyl, disp, gear)
```
========================================================
```{r}
select(mtcars, mpg:hp)
```
========================================================
```{r}
select(mtcars, contains('a'))
```
========================================================
```{r}
select(mtcars, starts_with('d'))
```
========================================================
```{r}
select(mtcars, -starts_with('d'))
```
Putting it all together
========================================================
type: section
========================================================
Let's say during our exploratory analysis we want to create a new column, filter on that new column,
then get the mean of that new column for each cylinder and gear
```{r}
mt <- mutate(mtcars, disp_cyl = disp/cyl)
mt <- filter(mt, disp_cyl > 30, mpg > 23)
mt <- group_by(mt, cyl, gear)
```
```{r}
summarise(mt, avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))
```
That's a lot of repetition, and we now have an extra variable **mt** sitting around taking up space...
========================================================
If we just want to print the answer without intermediary variables...
```{r}
summarise(group_by(filter(mutate(mtcars, disp_cyl = disp/cyl), disp_cyl > 30, mpg > 23), cyl, gear), avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))
```
(Say what???!)
========================================================
We can make that a bit easier to read... but not great
```{r}
summarise(
  group_by(
    filter(
      mutate(mtcars, disp_cyl = disp/cyl),
      disp_cyl > 30,
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
)
```
(Butt ugly!)
Ceci n'est pas une pipe
=======================================================
type: section
=======================================================
Our last example was barely readable. What can we do? **Pipes** to the rescue!
* Introduced by the **magrittr** package and adopted by **dplyr** around the same time
* Inspired by Unix pipes and F#'s pipe operator
A pipe ('`%>%`') takes the left-hand side and passes it to the right-hand side as the first argument.
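A small sketch of that rule, using only `head`, `sqrt` and `sum` from base R plus the pipe that dplyr re-exports; the names `piped` and `nested` are illustrative.

```{r}
library(dplyr)  # makes %>% available

# x %>% f(y) is evaluated as f(x, y)
piped  <- mtcars %>% head(3)
nested <- head(mtcars, 3)
identical(piped, nested)

# In a chain, each result becomes the first argument of the next call,
# so this is the same as sum(sqrt(c(1, 4, 9)))
c(1, 4, 9) %>% sqrt() %>% sum()
```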
=======================================================
```{r}
summarise(
  group_by(
    filter(
      mutate(
        mtcars,
        disp_cyl = disp/cyl
      ),
      disp_cyl > 30,
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
)
```
***
```{r}
mtcars %>%
  mutate(disp_cyl = disp/cyl) %>%
  filter(disp_cyl > 30, mpg > 23) %>%
  group_by(cyl, gear) %>%
  summarise(
    avg_d_cyl = mean(disp_cyl),
    min_d_cyl = min(disp_cyl)
  )
```
That's not where it ends...
==========================
type: section
==========================
Let's load up a bigger data set
```{r}
library(hflights) # to load the flights data set
tbl_df(hflights)
```
==========================
```{r, eval = FALSE}
hflights %>%
mutate(ArrEarly = ArrDelay < 0) %>%
filter(DepDelay > 60, Distance > 200) %>%
mutate()
```
***
```{r}
```
What is this black magic??
==========================
type: section
Questions?
========================================================
type: prompt