plpxsk/dplyr-demo.md

## dplyr-demo.md

A dplyr demo

This demo is meant to give you a flavor of dplyr's functions for data
manipulation in R.
After this demo, you may consider using dplyr to make your data analysis
easier, faster, and more fun.
Quick Start

If you want to get started with dplyr quickly, download (and print) the reference PDF below, and keep it next to you
as you work in R.
In my opinion, the PDF cheat sheet contains about 75% of all data-related functions that you may
need to use. It is THAT good.
If you need more, just read on!
Reference PDF

I will be referencing this PDF file throughout. Here is a backup link.
Motivation

You can probably guess what the following sample dplyr code is doing.
NB: iris is a dataset of flower measurements that ships with R (in the datasets package).
iris %>%
	group_by(Species) %>%
	summarise(avg = mean(Sepal.Width)) %>%
	arrange(avg)

Comment on advantages/disadvantages of these two chunks (they do the same
thing):
paste0("(", gsub(".", "_", tolower(names(iris)), fixed=TRUE), ")")

vs:
names(iris) %>%
	tolower() %>%
	gsub(".", "_", ., fixed=TRUE) %>%
	paste0("(", ., ")")

Note on the pipe

Notice how the pipe %>% passes the item on its left-hand side to the function
on its right-hand side, as the first argument by default (use . to place it
elsewhere). This improves readability and reduces nested parentheses. These are
all equivalent:
tolower(names(iris))

names(iris) %>% tolower()

iris %>% names() %>% tolower()

See the PDF, page 1, bottom left.
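To make the pipe's behavior concrete, here is a minimal sketch using the built-in iris data. It assumes the magrittr pipe that dplyr re-exports; the variable names (nested, piped, clean_nested, clean_piped) are only illustrative.

```r
library(dplyr)  ## dplyr re-exports the magrittr pipe %>%

## by default, the left-hand side becomes the FIRST argument on the right
nested <- tolower(names(iris))
piped  <- names(iris) %>% tolower()
identical(nested, piped)              # TRUE

## use the dot placeholder when the value belongs in another argument position
clean_nested <- gsub(".", "_", tolower(names(iris)), fixed = TRUE)
clean_piped  <- names(iris) %>%
  tolower() %>%
  gsub(".", "_", ., fixed = TRUE)
identical(clean_nested, clean_piped)  # TRUE
```

In the gsub() call the piped value is the third argument (the text to search), which is why the dot is needed there.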
Concepts

You may follow along with some sample data: download the CSV file from here (right-click on "Raw" and choose "Download linked file..." or similar).
We will convert the data frame to a "tibble" so it doesn't print the whole
thing if you accidentally type df. Read the file with:
library(dplyr)

df <- read.csv("inpatient_small.csv") %>% as_tibble()

Basic operations

library(dplyr)

df %>% glimpse()

df %>% select(Provider.Id, Total.Discharges)
df %>% select(-starts_with("DRG"))

df %>% arrange(Provider.State)
df %>% arrange(desc(Provider.State))

df %>% filter(DRG.code==233)

df %>% summarise(total_avg_payments=sum(Average.Total.Payments))

df %>% mutate(thousands=Average.Total.Payments/1000)

Details

select() picks columns, and filter() picks rows.
summarise() collapses data to (fewer) summary rows.
mutate() makes new variables and keeps the number of rows unchanged; with group_by(), the new values are computed within each group.
See the PDF.
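One way to see the difference between these verbs is to check how each one changes the shape of the data. The sketch below uses the built-in iris data rather than the inpatient CSV, so it runs without any download; the object names (sel, flt, smr, mut) are just for illustration.

```r
library(dplyr)

dim(iris)   # 150 5

## select() keeps columns, so only the column count changes
sel <- iris %>% select(Species, Sepal.Width)
dim(sel)    # 150 2

## filter() keeps the rows that match a condition
flt <- iris %>% filter(Species == "setosa")
dim(flt)    # 50 5

## summarise() collapses the data to a single summary row
smr <- iris %>% summarise(avg = mean(Sepal.Width))
dim(smr)    # 1 1

## mutate() adds a column and keeps every row
mut <- iris %>% mutate(ratio = Sepal.Length / Sepal.Width)
dim(mut)    # 150 6
```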

Group-by combinations

df %>%
	group_by(DRG.code) %>%
	summarise(providers=n(), total_payments=sum(Average.Total.Payments))

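## note: this calculation is row-wise, so grouping does not change the result;
## group_by() matters for mutate() when you use aggregates such as mean() or sum()
df %>%
	group_by(Provider.Id) %>%
	mutate(thousands=Average.Total.Payments/1000)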
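The two group-by patterns above behave differently, and a small sketch on the built-in iris data may help show how. The column names flowers, avg_width, and width_vs_group are made up for this example.

```r
library(dplyr)

## grouped summarise(): one row per group
by_species <- iris %>%
  group_by(Species) %>%
  summarise(flowers = n(), avg_width = mean(Sepal.Width))
nrow(by_species)  # 3

## grouped mutate(): the row count is unchanged, and mean() is
## computed within each species rather than over the whole column
centered <- iris %>%
  group_by(Species) %>%
  mutate(width_vs_group = Sepal.Width - mean(Sepal.Width)) %>%
  ungroup()
nrow(centered)    # 150
```

Calling ungroup() at the end is a common habit so that later steps do not silently keep operating per group.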

Some recipes

df %>% dim()

n_distinct(df$Provider.Id)
## or:
df$Provider.Id %>% n_distinct()

df %>% count(Provider.Id)

df %>% distinct(Provider.State, Provider.Id)
    
## subsample data 
df %>% sample_n(10)
df %>% sample_frac(0.2)

## bin data into buckets with ntile()
df %>%
    select(DRG.code, Provider.Id, Average.Total.Payments) %>%
    group_by(DRG.code) %>%
    mutate(payment_quartile = ntile(Average.Total.Payments, 4))

## combine multiple operations into easy to read code chunks:
df %>%
	filter(DRG.code==247) %>%
	group_by(Provider.State) %>%
	summarise(min_discharges=min(Total.Discharges),
			  median_discharges=median(Total.Discharges),
			  max_discharges=max(Total.Discharges)) %>%
	arrange(desc(median_discharges))

## also check out lead() and lag() [see PDF, page 2, middle]
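Since lead() and lag() are only mentioned in passing, here is a rough sketch of what they do. The visits data below is a hypothetical toy table, not part of the inpatient CSV, and the column names are invented for the example.

```r
library(dplyr)

## hypothetical toy data: one row per month
visits <- tibble(
  month      = 1:5,
  discharges = c(10, 12, 9, 15, 15)
)

trend <- visits %>%
  arrange(month) %>%
  mutate(previous  = lag(discharges),    # value from the row before
         following = lead(discharges),   # value from the row after
         change    = discharges - lag(discharges))
trend
```

The first row has no previous value and the last row has no following value, so those cells are NA.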

Project Demo

(Separate)
Reshape data between wide and long with tidyr

Reshape data with gather() (wide to long) and spread() (long to wide) from tidyr.
See this cookbook for more.
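As a minimal illustration of the two directions, the sketch below reshapes a tiny table and then reverses the operation. The wide table and its columns (provider, charges, payments) are hypothetical stand-ins, not columns from the inpatient CSV.

```r
library(dplyr)
library(tidyr)

## hypothetical toy data in "wide" form: one column per measure
wide <- tibble(
  provider = c("A", "B"),
  charges  = c(100, 200),
  payments = c(80, 150)
)

## gather(): wide -> long, one row per provider/measure pair
long <- wide %>% gather(money, value, charges, payments)
long

## spread(): long -> wide again
back <- long %>% spread(money, value)
back
```

Here money holds the former column names and value holds the numbers, matching the argument order in the usage line of the original code.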
library(tidyr)

## usage: df %>% gather(money, value, 6:7) 

## Here's an advanced reshape to produce a neat summary table
df %>% 
    select(starts_with("Average")) %>% 
    summarise_each(funs(n(), mean, sd, min, max))  %>%
    gather(variable, value)  %>%
    separate(variable, c("var", "stat"), sep="\\_") %>%
    spread(var, value)
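In newer releases of dplyr, summarise_each() and funs() are deprecated, so the chunk above may warn or fail depending on your installed version. A possible equivalent is sketched below, assuming dplyr 1.0.0 or later (for across()) and tidyr 1.0.0 or later (for pivot_longer() and pivot_wider()). It uses the built-in iris data so it runs on its own, and it swaps n() for length(.x), which gives the same row count here.

```r
library(dplyr)
library(tidyr)

stats_table <- iris %>%
  select(starts_with("Sepal")) %>%
  ## across() names the results "<column>_<function>", e.g. Sepal.Width_mean
  summarise(across(everything(),
                   list(n = ~ length(.x), mean = mean, sd = sd,
                        min = min, max = max))) %>%
  ## split each name at the underscore into variable and statistic
  pivot_longer(everything(),
               names_to = c("var", "stat"),
               names_sep = "_",
               values_to = "value") %>%
  ## one column per variable, one row per statistic
  pivot_wider(names_from = var, values_from = value)

stats_table
```

Splitting on the underscore only works because the column names themselves contain dots rather than underscores, which is also true of the inpatient column names used in this demo.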

Final notes

Here is the official dplyr website
Since the tidyverse 1.0.0 release (September 2016), you may consider loading
library(tidyverse) instead of loading dplyr, tidyr, ggplot2, etc. individually.

See https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/

This is what I think of dplyr.
Quick link for this page: bit.ly/dplyr-demo
Leave comments/thoughts below!