This demo is meant to give you a flavor of dplyr's functions for data manipulation in R.
After this demo, you may consider using dplyr to make your data analysis easier, faster, and more fun.
If you want to get started with dplyr quickly, download (and print) the reference PDF below, and keep it next to you as you work in R.
In my opinion, the PDF cheat sheet contains about 75% of all data-related functions that you may need to use. It is THAT good.
If you need more, just read on!
I will be referencing this PDF file throughout. Here is a backup link.
You can probably guess what the following sample dplyr code is doing.
NB: iris is a dataset of flower data, which is included in base R.
iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width)) %>%
  arrange(avg)
Comment on advantages/disadvantages of these two chunks (they do the same thing):
paste0("(", gsub(".", "_", tolower(names(iris)), fixed=TRUE), ")")
vs:
names(iris) %>%
  tolower() %>%
  gsub(".", "_", ., fixed=TRUE) %>%
  paste0("(", ., ")")
Notice how the pipe %>% passes the left-hand side as the first argument of the function on the right-hand side. It improves readability and reduces nested parentheses. These are all equivalent:
tolower(names(iris))
names(iris) %>% tolower()
iris %>% names() %>% tolower()
See the PDF, page 1, bottom left.
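If you want to convince yourself that the three forms really are the same, here is a minimal sketch using identical(). It assumes dplyr is installed (dplyr re-exports %>% from magrittr); the variable names are made up for illustration.

```r
library(dplyr)  # provides %>% (re-exported from magrittr)

nested  <- tolower(names(iris))
piped   <- names(iris) %>% tolower()
chained <- iris %>% names() %>% tolower()

# All three produce the same character vector
identical(nested, piped)    # TRUE
identical(piped, chained)   # TRUE

# The dot (.) marks where the left-hand side goes
# when it should not be the first argument
underscored <- names(iris) %>% gsub(".", "_", ., fixed = TRUE)
underscored
```

The dot placeholder is what makes the gsub() step in the earlier chunk work: without it, the column names would be passed as the pattern instead of the text to search.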
You may follow along with some sample data: download the CSV file from here (right-click on "Raw" and choose "Download linked file..." or similar) and read it with the code below.
We will convert the data frame to a "tibble" so it doesn't print the whole thing if you accidentally type df.
library(dplyr)
df <- read.csv("inpatient_small.csv") %>% as_tibble()
df %>% glimpse()
df %>% select(Provider.Id, Total.Discharges)
df %>% select(-starts_with("DRG"))
df %>% arrange(Provider.State)
df %>% arrange(desc(Provider.State))
df %>% filter(DRG.code==233)
df %>% summarise(total_avg_payments=sum(Average.Total.Payments))
df %>% mutate(thousands=Average.Total.Payments/1000)
Details
- select() picks columns, and filter() picks rows.
- summarise() collapses data to (fewer) summary rows.
- mutate() just makes new variables (preserving the number of rows; if you also group_by(), the new values are computed within each group).
- See the PDF.
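To see how each verb changes the shape of the data, here is a small sketch on mtcars (a dataset that ships with base R, used here only so the example runs without the CSV). The column name mpg_vs_group is made up for illustration.

```r
library(dplyr)

cars <- as_tibble(mtcars)   # 32 rows, 11 columns

# select() keeps columns; the number of rows is unchanged
cars %>% select(mpg, cyl) %>% dim()                                  # 32  2

# filter() keeps rows that match a condition; columns are unchanged
cars %>% filter(cyl == 4) %>% dim()                                  # 11 11

# summarise() collapses to one row, or one row per group
cars %>% summarise(avg_mpg = mean(mpg)) %>% dim()                    #  1  1
cars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg)) %>% dim()  #  3  2

# mutate() adds a column and keeps every row; with group_by(),
# mean(mpg) is computed separately within each cyl group
grouped <- cars %>%
  group_by(cyl) %>%
  mutate(mpg_vs_group = mpg - mean(mpg))
dim(grouped)                                                         # 32 12
```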
df %>%
  group_by(DRG.code) %>%
  summarise(providers=n(), total_payments=sum(Average.Total.Payments))
df %>%
  group_by(Provider.Id) %>%
  mutate(thousands=Average.Total.Payments/1000)
df %>% dim()
n_distinct(df$Provider.Id)
## or:
df$Provider.Id %>% n_distinct()
df %>% count(Provider.Id)
df %>% distinct(Provider.State, Provider.Id)
## subsample data
df %>% sample_n(10)
df %>% sample_frac(0.2)
## bin data into buckets with ntile()
df %>%
  select(DRG.code, Provider.Id, Average.Total.Payments) %>%
  group_by(DRG.code) %>%
  mutate(payment_quartile = ntile(Average.Total.Payments, 4))
## combine multiple operations into easy to read code chunks:
df %>%
  filter(DRG.code==247) %>%
  group_by(Provider.State) %>%
  summarise(min_discharges=min(Total.Discharges),
            median_discharges=median(Total.Discharges),
            max_discharges=max(Total.Discharges)) %>%
  arrange(desc(median_discharges))
## also check out lead() and lag() [see PDF, page 2, middle]
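Since lead() and lag() are only mentioned in passing, here is a minimal sketch of what they do. The visits data is a hypothetical toy table, not part of the inpatient CSV, and the column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data (not from the inpatient CSV)
visits <- tibble(
  day   = 1:5,
  count = c(10, 12, 9, 15, 14)
)

shifted <- visits %>%
  arrange(day) %>%                   # lag()/lead() depend on row order
  mutate(
    previous  = lag(count),          # value from the row before (NA in the first row)
    following = lead(count),         # value from the row after (NA in the last row)
    change    = count - lag(count)   # difference from the previous day
  )
shifted
```

Note that loading dplyr masks stats::lag(), so the lag() used here is dplyr's version.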
(Separate)
Reshape data between wide and long formats with gather() (wide to long) and spread() (long to wide) from tidyr.
library(tidyr)
## usage: df %>% gather(money, value, 6:7)
## Here's an advanced reshape to produce a neat summary table
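## across() (dplyr 1.0.0 or later) replaces the deprecated summarise_each(funs(...));
## length() gives the row count, as n() did
df %>%
  select(starts_with("Average")) %>%
  summarise(across(everything(),
                   list(n = length, mean = mean, sd = sd, min = min, max = max))) %>%
  gather(variable, value) %>%
  separate(variable, c("var", "stat"), sep="\\_") %>%
  spread(var, value)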
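If the reshape steps feel abstract, here is a minimal round trip on a hypothetical toy table (the data and column names are made up for illustration). It also shows pivot_longer() and pivot_wider(), which newer tidyr releases (1.0.0 and later) offer as replacements for gather() and spread().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in wide format: one column per measure
wide <- tibble(
  id       = c(1, 2),
  charges  = c(100, 200),
  payments = c(80, 150)
)

# gather(): wide -> long, one row per id/measure pair
long <- wide %>% gather(measure, value, charges, payments)

# spread(): long -> wide again
wide_again <- long %>% spread(measure, value)

# The same round trip with the newer tidyr functions
long2 <- wide %>%
  pivot_longer(c(charges, payments), names_to = "measure", values_to = "value")
wide2 <- long2 %>%
  pivot_wider(names_from = measure, values_from = value)

long
wide_again
```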
Here is the official dplyr website.
More recently, you can load library(tidyverse) instead of loading dplyr, tidyr, ggplot2, etc. individually.
This is what I think of dplyr.
Quick link for this page: bit.ly/dplyr-demo
Leave comments/thoughts below!