@plpxsk
Last active June 4, 2018 20:48
This demo will give you a flavor of `dplyr` http://bit.ly/dplyr-demo

A dplyr demo

This demo is meant to give you a flavor of dplyr's functions for data manipulation in R.

After this demo, you may consider using dplyr to make your data analysis easier, faster, and more fun.

Quick Start

If you want to get started with dplyr quickly, download (and print) the reference PDF below, and keep it next to you as you work in R.

In my opinion, the PDF cheat sheet contains about 75% of all data-related functions that you may need to use. It is THAT good.

If you need more, just read on!

Reference PDF

I will be referencing this PDF file throughout. Here is a backup link.

Motivation

You can probably guess what the following sample dplyr code is doing.

NB: iris is a dataset of flower measurements that ships with R.

iris %>%
	group_by(Species) %>%
	summarise(avg = mean(Sepal.Width)) %>%
	arrange(avg)

Comment on advantages/disadvantages of these two chunks (they do the same thing):

paste0("(", gsub(".", "_", tolower(names(iris)), fixed=TRUE), ")")

vs:

names(iris) %>%
	tolower() %>%
	gsub(".", "_", ., fixed=TRUE) %>%
	paste0("(", ., ")")

Note on the pipe

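Notice how the pipe %>% passes the value on its left-hand side as the first argument of the function on its right-hand side (use a dot, as in the chunk above, to place it somewhere else). This improves readability and reduces nested parentheses. These are all equivalent: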

tolower(names(iris))

names(iris) %>% tolower()

iris %>% names() %>% tolower()

See the PDF, page 1, bottom left.
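
As a small sketch of where the pipe puts its left-hand side, the snippet below uses the built-in iris data. The variable names nested, piped and dotted are made up for this illustration.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# By default the left-hand side becomes the FIRST argument:
nested <- head(names(iris), 2)
piped  <- names(iris) %>% head(2)
identical(nested, piped)

# A dot sends the left-hand side to another position.
# Here it becomes the third argument (x) of gsub():
dotted <- names(iris) %>% gsub(".", "_", ., fixed = TRUE)
dotted
```

When the dot appears as a direct argument, the pipe does not also insert the left-hand side as the first argument.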

Concepts

You may follow along with some sample data. Download the CSV file from here (right-click on "Raw" and choose "Download linked file..." or similar).

Read it with the code below. The data frame is converted to a "tibble" so that R doesn't print the whole thing if you accidentally type df.

library(dplyr)

df <- read.csv("inpatient_small.csv") %>% as_tibble()

Basic operations

library(dplyr)

df %>% glimpse()

df %>% select(Provider.Id, Total.Discharges)
df %>% select(-starts_with("DRG"))

df %>% arrange(Provider.State)
df %>% arrange(desc(Provider.State))

df %>% filter(DRG.code==233)

df %>% summarise(total_avg_payments=sum(Average.Total.Payments))

df %>% mutate(thousands=Average.Total.Payments/1000)

Details

  • select() columns, and filter() rows
  • summarise() collapses data to (fewer) summary rows.
  • mutate() adds new variables and keeps the same number of rows (after group_by(), values are computed within each group)
  • See the PDF.
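
To see how each verb changes the shape of the data, here is a minimal sketch on the built-in mtcars data, so it runs without the CSV file. The object names (mt, picked, kept, added, overall) and the wt_lb column are made up for this illustration.

```r
library(dplyr)

mt <- as_tibble(mtcars)  # 32 rows, 11 columns

picked  <- mt %>% select(mpg, cyl)               # fewer columns, same rows
kept    <- mt %>% filter(cyl == 4)               # fewer rows, same columns
added   <- mt %>% mutate(wt_lb = wt * 1000)      # one extra column, same rows
overall <- mt %>% summarise(avg_mpg = mean(mpg)) # collapses to a single row

dim(picked)
dim(kept)
dim(added)
dim(overall)
```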

Group-by combinations

df %>%
	group_by(DRG.code) %>%
	summarise(providers=n(), total_payments=sum(Average.Total.Payments))

df %>%
	group_by(Provider.Id) %>%
	mutate(thousands=Average.Total.Payments/1000)
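
The difference between a grouped summarise() and a grouped mutate() is easier to see when the new column depends on the group. The following is a sketch on the built-in iris data; the column names flowers, avg_width and width_vs_species_avg are made up for this illustration.

```r
library(dplyr)

# summarise(): one row per group
per_species <- iris %>%
    group_by(Species) %>%
    summarise(flowers = n(), avg_width = mean(Sepal.Width))

# mutate(): every row is kept; mean() is evaluated within each group
centered <- iris %>%
    group_by(Species) %>%
    mutate(width_vs_species_avg = Sepal.Width - mean(Sepal.Width)) %>%
    ungroup()

nrow(per_species)  # one row per species
nrow(centered)     # same number of rows as iris
```

Calling ungroup() at the end is a common habit, so that later steps do not silently keep working per group.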

Some recipes

df %>% dim()

n_distinct(df$Provider.Id)
## or:
df$Provider.Id %>% n_distinct()

df %>% count(Provider.Id)

df %>% distinct(Provider.State, Provider.Id)
    
## subsample data 
df %>% sample_n(10)
df %>% sample_frac(0.2)

## bin data into buckets with ntile()
df %>%
    select(DRG.code, Provider.Id, Average.Total.Payments) %>%
    group_by(DRG.code) %>%
    mutate(payment_quartile = ntile(Average.Total.Payments, 4))

## combine multiple operations into easy to read code chunks:
df %>%
	filter(DRG.code==247) %>%
	group_by(Provider.State) %>%
	summarise(min_discharges=min(Total.Discharges),
	          median_discharges=median(Total.Discharges),
	          max_discharges=max(Total.Discharges)) %>%
	arrange(desc(median_discharges))

## also check out lead() and lag() [see PDF, page 2, middle]
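
Since lead() and lag() are only mentioned in passing, here is a minimal sketch of what they do. The vector x and the columns day, value and change are made up for this illustration.

```r
library(dplyr)

x <- c(10, 20, 40, 80)

dplyr::lag(x)       # previous value: NA 10 20 40
dplyr::lead(x)      # next value:     20 40 80 NA
x - dplyr::lag(x)   # change from the previous value: NA 10 20 40

# The same idea inside mutate():
changes <- tibble(day = 1:4, value = x) %>%
    mutate(change = value - dplyr::lag(value))
changes
```

The dplyr:: prefix is optional after library(dplyr), but it makes clear that this is not stats::lag(), which behaves differently.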

Project Demo

(Separate)

Reshape data from long to wide with tidyr

Reshape data between wide and long formats with gather() (wide to long) and spread() (long to wide) from tidyr.

See this cookbook for more.

library(tidyr)

## usage: df %>% gather(money, value, 6:7) 

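## Here's an advanced reshape to produce a neat summary table
## (summarise_each() and funs() are deprecated in current dplyr;
##  across() needs dplyr 1.0.0 or later and names columns <column>_<stat>)
df %>% 
    select(starts_with("Average")) %>% 
    summarise(across(everything(),
                     list(n = length, mean = mean, sd = sd, min = min, max = max))) %>%
    gather(variable, value)  %>%
    separate(variable, c("var", "stat"), sep="_") %>%
    spread(var, value)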
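
For a simpler picture of the two directions, here is a sketch of a round trip on the built-in iris data. The object names (wide, long, wide_again) and the key/value names measurement and value are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# A tiny wide table: one row per species, one column per measurement
wide <- iris %>%
    group_by(Species) %>%
    summarise(Sepal.Width = mean(Sepal.Width),
              Petal.Width = mean(Petal.Width))

# gather(): wide -> long (one row per species/measurement pair)
long <- wide %>% gather(measurement, value, -Species)

# spread(): long -> wide again (columns come back in alphabetical order)
wide_again <- long %>% spread(measurement, value)

dim(wide)
dim(long)
dim(wide_again)
```

In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(), which still work.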

Final notes

Here is the official dplyr website

You may also consider loading library(tidyverse) instead of loading dplyr, tidyr, ggplot2, etc. individually.

This is what I think of dplyr.

Quick link for this page: bit.ly/dplyr-demo

Leave comments/thoughts below!
