mpettis/R-user-group_tidyr-talk.Rmd

## R-user-group_tidyr-talk.Rmd
---
title: "R-user-group_tidyr-talk"
author: "Matt Pettis"
date: "November 22, 2017"
always_allow_html: yes
output:
  html_document:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Summary
This is a demo of some of the features of the `tidyr` package.  The intent of this is to give some practical examples of how and why you would use the functionality of `tidyr`.


# Introduction

I would like to introduce my own `Rule 34 of data analysis`

> If there is a perverse way of storing data, someone has done it, and you will have to deal with it one day.

When people collect data, they usually think about the easiest or most natural way to group and record data.  This is often not compatible with the easiest way to analyze data.  This is one of the reasons that a lot of data science isn't about modeling, but about data cleaning and wrangling.

In all of our professional lives, we are faced with decisions about where we put forth our effort in terms of learning tools and skills.  Is it worth investing in skill `X` or technology `Y`?  Will I need to use it more than once or a handful of times?  Will be in demand, or even around, in 5 years?  Our time is finite, and we are rightfully paranoid of painting ourselves into a corner by acquiring skills or technology that make us unemployable in the future.

If you intend to work with data for the near to mid-range future, it is a safe bet that learning how to manipulate and reshape data is a skill well worth learning, and if staying in `R`, learning `tidyr` is a technology worth investing in.


# A Simple Example

Charting is an important part of data analysis.  Depending on what graphing package you are using, your data may have to be shaped differently.  We will talk about data being in `wide` or `long` format.  Intuitively, data in `wide` format has a different column for each feature of an observation, and there is usually one observation per row.  In `long` format, one column contains the name of the feature being recorded in the row, and another column to record the value.  In this case, an observation can be spread across multiple rows.

Here, let's create 3 simple sine waves to plot.

```{r}
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(xts))
suppressPackageStartupMessages(library(dygraphs))
suppressPackageStartupMessages(library(ggplot2))

## Parameters
n_cycles <- 3 # Number of cycles
tau <- 25 # period -- or samples per cycle
phase <- 0 # Phase offset
ampl <- 10 # sine amplitude.  1/2 of peak-to-valley amplitude.
lvl <- 300 # Mean level of sin wave

pts <- 1:(n_cycles * tau)

## The dataframe
df_wide <- data_frame(
  DateTime = ymd_hms("2017-07-01T15:00:00") + (pts - 1)*60 # Initial time column
  , X1 = ampl * sin(2 * pi * pts / tau - phase) + lvl
  , X2 = ampl * sin(2 * pi * pts / tau - (phase + pi/2)) + lvl - 30
  , X3 = ampl * sin(2 * pi * pts / tau - (phase + pi  )) + lvl - 60)
```


First, let's use `dygraphs`, which requires *wide* data.  `dygraphs` is an excellent package for charting time series that you can interact with:
```{r}
# https://stackoverflow.com/questions/11890600/time-zone-period-apply-in-xts-using-r
Sys.setenv(TZ=Sys.timezone())

## Also requires it to be a timeseries-like object
# xts_df <- df_wide %>%
#   as.data.frame() %>% # xts doesn't like class data_frame
#   {xts(.[,-1], order.by = .[,1], tzone=Sys.timezone())}

xts_df <- as.data.frame(df_wide)
xts_df <- xts(xts_df[,-1], order.by = xts_df[,1], tzone=Sys.timezone())

head(xts_df)
tail(xts_df)

# dygraph(xts_df, main = "Our df timeseries") %>%
#   dyRangeSelector(height = 20, strokeColor = "") %>%
#   dyLegend(show = "follow")
dyLegend(dyRangeSelector(dygraph(xts_df, main = "Our df timeseries")
                         , height = 20, strokeColor = "")
         , show = "follow")
```

<br/>

However, `ggplot` likes *long* data when plotting multiple series.  Often you want to use both charting packages on the same data, but you can't use the same data format, and have to reshape it.  Here, we reshape it a hard way, which is not robust against arbitrary changes to the number of columns to chart:
```{r}
## Make a 'long' dataframe from the 'wide' dataframe.
df_long <- bind_rows(
  data.frame(DateTime = df_wide$DateTime, key="X1", value=df_wide$X1, stringsAsFactors = FALSE)
  , data.frame(DateTime = df_wide$DateTime, key="X2", value=df_wide$X2, stringsAsFactors = FALSE)
  , data.frame(DateTime = df_wide$DateTime, key="X3", value=df_wide$X3, stringsAsFactors = FALSE)
)

## Look at the dataframe
head(df_long)
tail(df_long)

## Plot it
p <- ggplot(df_long, aes(x=DateTime, y=value, group=key, color=key)) +
  geom_line() +
  ggtitle("Our df timeseries")
print(p)
```


What if we started with a *long* dataframe and needed to put it into a *wide* format?

```{r}
df_wide2 <- merge(
  setNames(df_long[df_long$key == "X1",c("DateTime", "value")], c("DateTime", "X1"))
  , setNames(df_long[df_long$key == "X2",c("DateTime", "value")], c("DateTime", "X2"))
  , by = "DateTime")

df_wide2 <- merge(
  df_wide2
  , setNames(df_long[df_long$key == "X3",c("DateTime", "value")], c("DateTime", "X3"))
  , by = "DateTime")

head(df_wide2)
```

![](tape-measure-fail.gif)
![](bread-doorstop.gif)

That's just plain painful.  Granted, there are better algorithmic ways to do this in base `R`, but they are not much better in terms of conciseness, robustness, readability, maintainability, etc.  Unless you use a package with functions specifically targeted to do this.  Like `tidyr`.


# Enter tidyr

## Typical use case
`tidyr` is a package written by Hadley Wickham that, among other things, helps you easily go back and forth between *wide* and *long* data formats.  Without further ado, let's just use it:


```{r}
## Long to wide
head(df_long)

df_wide_tidy <- spread(df_long, key, value)
head(df_wide_tidy)


## Wide to long
head(df_wide)

df_long_tidy <- gather(df_wide, key, value, X1:X3)
head(df_long_tidy)


## Alternate:
df_long_tidy <- gather(df_wide, key, value, -DateTime)
head(df_long_tidy)

## Alternate:
df_long_tidy <- gather(df_wide, key, value, c("X1", "X2", "X3"))
head(df_long_tidy)
```

We use two complementary functions here: `spread` and `gather`.  And they do things like they sound.

`spread` will 'spread' data across columns.  So it takes a long dataset and 'spreads' it to a wide data set.  Above, in the `spread` function, the second argument (`key`) is the name of the column whose cell entries contain the strings that should become the new 'wide' dataframe column names.  The third argument (`value`) is the name of the column that contains the values that will go into the corresponding cell under the correct column.

`gather` goes the opposite direction: it identifies a set of columns (the last argument: `X1:X3`), and then identifies a column to store the original column names in (here, the `key` column will store the column names `X1`, `X2`, `X3`), and `value` will store the value under those columns.


### Common errors

If, after you pick the columns that contain the key and value and spread them, the remaining columns don't have unique values for each row, you get an error.  After spreading, your remaining columns must have unique combinations of values for each of the rows.

```{r}
dat <- data.frame(x=c("A", "B", "B", "C", "A", "B", "B", "C"), y=c("a", "b", "b", "c", "a", "b", "b", "c"), key=c("X1", "X1", "X1", "X1", "X2", "X2", "X2", "X2"), value=1:8, stringsAsFactors = FALSE)
print(dat)

## Oops, you can't spread because the 'x' and 'y' columns do not have unique combinations to form single rows
tryCatch(spread(dat, key, value), error=print)
```


### A trick to fill in missing values

In the code below, we have attributes of two people.  They both have a `band` attribute, but Brian is missing a `drink` feature, and Matt is missing a `color` feature.  I'd like my long dataframe to be aware that these features are missing for each person.  I can leverage `spread` and `gather` to get this:

```{r}
dat <- data.frame(person=c("Brian", "Brian", "Matt", "Matt"), feature=c("band", "color", "band", "drink"), entry=c("Echo and the Bunnymen", "fleshtone", "ABBA", "White Russian"), stringsAsFactors = FALSE)
dat

## By making it wide, each person gets one column for the union of features
dat_wide <- spread(dat, feature, entry)
dat_wide

## If we now gather this back together, those missing values appear in our list
dat_long <- gather(dat_wide, feature, entry, -person)
dat_long
```

More recent versions `tidyr` have explicit functions to deal with this (see documentation for function `complete`), but before that, this trick was what I used.


## Splitting a single column of values to multiple columns

Another case in data tidying is when you want data that is crammed into one column to be split across multiple columns.  Functions to aide that are here:

```{r}
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df

separate(df, x, c("A", "B"))
```


## Making multiple rows by splitting values in the columns

This is a lot like the previous example, but now we want each of those entries in a cell to split into multiple rows.  Here's a quick example from the documentation:


```{r}
df <- data.frame(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6"),
  stringsAsFactors = FALSE
)
df

separate_rows(df, y, z, convert = TRUE)
```


# References

* _Tidy Data_, the Hadley Wickham paper that lays out his philosophy about these things, data formats, and his first two packages that did this work (`reshape` and `reshape2`) : http://vita.had.co.nz/papers/tidy-data.pdf
* Main documentation page for the `tidyr` package: http://tidyr.tidyverse.org/
* 38 animated gifs of people dramatically failing for the purpose of infomercials: https://www.buzzfeed.com/julianbrand/40-gifs-of-stupid-infomercial-people-6eof?utm_term=.pkxql4k0n
	---
	title: "R-user-group_tidyr-talk"
	author: "Matt Pettis"
	date: "November 22, 2017"
	always_allow_html: yes
	output:
	html_document:
	toc: true
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	```


	# Summary
	This is a demo of some of the features of the `tidyr` package. The intent of this is to give some practical examples of how and why you would use the functionality of `tidyr`.





	# Introduction

	I would like to introduce my own `Rule 34 of data analysis`

	> If there is a perverse way of storing data, someone has done it, and you will have to deal with it one day.

	When people collect data, they usually think about the easiest or most natural way to group and record data. This is often not compatible with the easiest way to analyze data. This is one of the reasons that a lot of data science isn't about modeling, but about data cleaning and wrangling.

	In all of our professional lives, we are faced with decisions about where we put forth our effort in terms of learning tools and skills. Is it worth investing in skill `X` or technology `Y`? Will I need to use it more than once or a handful of times? Will be in demand, or even around, in 5 years? Our time is finite, and we are rightfully paranoid of painting ourselves into a corner by acquiring skills or technology that make us unemployable in the future.

	If you intend to work with data for the near to mid-range future, it is a safe bet that learning how to manipulate and reshape data is a skill well worth learning, and if staying in `R`, learning `tidyr` is a technology worth investing in.





	# A Simple Example

	Charting is an important part of data analysis. Depending on what graphing package you are using, your data may have to be shaped differently. We will talk about data being in `wide` or `long` format. Intuitively, data in `wide` format has a different column for each feature of an observation, and there is usually one observation per row. In `long` format, one column contains the name of the feature being recorded in the row, and another column to record the value. In this case, an observation can be spread across multiple rows.

	Here, let's create 3 simple sine waves to plot.

	```{r}
	suppressPackageStartupMessages(library(tidyr))
	suppressPackageStartupMessages(library(dplyr))
	suppressPackageStartupMessages(library(lubridate))
	suppressPackageStartupMessages(library(stringr))
	suppressPackageStartupMessages(library(xts))
	suppressPackageStartupMessages(library(dygraphs))
	suppressPackageStartupMessages(library(ggplot2))

	## Parameters
	n_cycles <- 3 # Number of cycles
	tau <- 25 # period -- or samples per cycle
	phase <- 0 # Phase offset
	ampl <- 10 # sine amplitude. 1/2 of peak-to-valley amplitude.
	lvl <- 300 # Mean level of sin wave

	pts <- 1:(n_cycles * tau)

	## The dataframe
	df_wide <- data_frame(
	DateTime = ymd_hms("2017-07-01T15:00:00") + (pts - 1)*60 # Initial time column
	, X1 = ampl * sin(2 * pi * pts / tau - phase) + lvl
	, X2 = ampl * sin(2 * pi * pts / tau - (phase + pi/2)) + lvl - 30
	, X3 = ampl * sin(2 * pi * pts / tau - (phase + pi )) + lvl - 60)
	```


	First, let's use `dygraphs`, which requires wide data. `dygraphs` is an excellent package for charting time series that you can interact with:
	```{r}
	# https://stackoverflow.com/questions/11890600/time-zone-period-apply-in-xts-using-r
	Sys.setenv(TZ=Sys.timezone())

	## Also requires it to be a timeseries-like object
	# xts_df <- df_wide %>%
	# as.data.frame() %>% # xts doesn't like class data_frame
	# {xts(.[,-1], order.by = .[,1], tzone=Sys.timezone())}

	xts_df <- as.data.frame(df_wide)
	xts_df <- xts(xts_df[,-1], order.by = xts_df[,1], tzone=Sys.timezone())

	head(xts_df)
	tail(xts_df)

	# dygraph(xts_df, main = "Our df timeseries") %>%
	# dyRangeSelector(height = 20, strokeColor = "") %>%
	# dyLegend(show = "follow")
	dyLegend(dyRangeSelector(dygraph(xts_df, main = "Our df timeseries")
	, height = 20, strokeColor = "")
	, show = "follow")
	```

	<br/>

	However, `ggplot` likes long data when plotting multiple series. Often you want to use both charting packages on the same data, but you can't use the same data format, and have to reshape it. Here, we reshape it a hard way, which is not robust against arbitrary changes to the number of columns to chart:
	```{r}
	## Make a 'long' dataframe from the 'wide' dataframe.
	df_long <- bind_rows(
	data.frame(DateTime = df_wide$DateTime, key="X1", value=df_wide$X1, stringsAsFactors = FALSE)
	, data.frame(DateTime = df_wide$DateTime, key="X2", value=df_wide$X2, stringsAsFactors = FALSE)
	, data.frame(DateTime = df_wide$DateTime, key="X3", value=df_wide$X3, stringsAsFactors = FALSE)
	)

	## Look at the dataframe
	head(df_long)
	tail(df_long)

	## Plot it
	p <- ggplot(df_long, aes(x=DateTime, y=value, group=key, color=key)) +
	geom_line() +
	ggtitle("Our df timeseries")
	print(p)
	```


	What if we started with a long dataframe and needed to put it into a wide format?

	```{r}
	df_wide2 <- merge(
	setNames(df_long[df_long$key == "X1",c("DateTime", "value")], c("DateTime", "X1"))
	, setNames(df_long[df_long$key == "X2",c("DateTime", "value")], c("DateTime", "X2"))
	, by = "DateTime")

	df_wide2 <- merge(
	df_wide2
	, setNames(df_long[df_long$key == "X3",c("DateTime", "value")], c("DateTime", "X3"))
	, by = "DateTime")

	head(df_wide2)
	```

	![](tape-measure-fail.gif)
	![](bread-doorstop.gif)

	That's just plain painful. Granted, there are better algorithmic ways to do this in base `R`, but they are not much better in terms of conciseness, robustness, readability, maintainability, etc. Unless you use a package with functions specifically targeted to do this. Like `tidyr`.



	# Enter tidyr

	## Typical use case
	`tidyr` is a package written by Hadley Wickham that, among other things, helps you easily go back and forth between wide and long data formats. Without further ado, let's just use it:


	```{r}
	## Long to wide
	head(df_long)

	df_wide_tidy <- spread(df_long, key, value)
	head(df_wide_tidy)



	## Wide to long
	head(df_wide)

	df_long_tidy <- gather(df_wide, key, value, X1:X3)
	head(df_long_tidy)


	## Alternate:
	df_long_tidy <- gather(df_wide, key, value, -DateTime)
	head(df_long_tidy)

	## Alternate:
	df_long_tidy <- gather(df_wide, key, value, c("X1", "X2", "X3"))
	head(df_long_tidy)
	```

	We use two complementary functions here: `spread` and `gather`. And they do things like they sound.

	`spread` will 'spread' data across columns. So it takes a long dataset and 'spreads' it to a wide data set. Above, in the `spread` function, the second argument (`key`) is the name of the column whose cell entries contain the strings that should become the new 'wide' dataframe column names. The third argument (`value`) is the name of the column that contains the values that will go into the corresponding cell under the correct column.

	`gather` goes the opposite direction: it identifies a set of columns (the last argument: `X1:X3`), and then identifies a column to store the original column names in (here, the `key` column will store the column names `X1`, `X2`, `X3`), and `value` will store the value under those columns.



	### Common errors

	If, after you pick the columns that contain the key and value and spread them, the remaining columns don't have unique values for each row, you get an error. After spreading, your remaining columns must have unique combinations of values for each of the rows.

	```{r}
	dat <- data.frame(x=c("A", "B", "B", "C", "A", "B", "B", "C"), y=c("a", "b", "b", "c", "a", "b", "b", "c"), key=c("X1", "X1", "X1", "X1", "X2", "X2", "X2", "X2"), value=1:8, stringsAsFactors = FALSE)
	print(dat)

	## Oops, you can't spread because the 'x' and 'y' columns do not have unique combinations to form single rows
	tryCatch(spread(dat, key, value), error=print)
	```




	### A trick to fill in missing values

	In the code below, we have attributes of two people. They both have a `band` attribute, but Brian is missing a `drink` feature, and Matt is missing a `color` feature. I'd like my long dataframe to be aware that these features are missing for each person. I can leverage `spread` and `gather` to get this:

	```{r}
	dat <- data.frame(person=c("Brian", "Brian", "Matt", "Matt"), feature=c("band", "color", "band", "drink"), entry=c("Echo and the Bunnymen", "fleshtone", "ABBA", "White Russian"), stringsAsFactors = FALSE)
	dat

	## By making it wide, each person gets one column for the union of features
	dat_wide <- spread(dat, feature, entry)
	dat_wide

	## If we now gather this back together, those missing values appear in our list
	dat_long <- gather(dat_wide, feature, entry, -person)
	dat_long
	```

	More recent versions `tidyr` have explicit functions to deal with this (see documentation for function `complete`), but before that, this trick was what I used.


	## Splitting a single column of values to multiple columns

	Another case in data tidying is when you want data that is crammed into one column to be split across multiple columns. Functions to aide that are here:

	```{r}
	df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
	df

	separate(df, x, c("A", "B"))
	```


	## Making multiple rows by splitting values in the columns

	This is a lot like the previous example, but now we want each of those entries in a cell to split into multiple rows. Here's a quick example from the documentation:


	```{r}
	df <- data.frame(
	x = 1:3,
	y = c("a", "d,e,f", "g,h"),
	z = c("1", "2,3,4", "5,6"),
	stringsAsFactors = FALSE
	)
	df

	separate_rows(df, y, z, convert = TRUE)
	```



	# References

	* _Tidy Data_, the Hadley Wickham paper that lays out his philosophy about these things, data formats, and his first two packages that did this work (`reshape` and `reshape2`) : http://vita.had.co.nz/papers/tidy-data.pdf
	* Main documentation page for the `tidyr` package: http://tidyr.tidyverse.org/
	* 38 animated gifs of people dramatically failing for the purpose of infomercials: https://www.buzzfeed.com/julianbrand/40-gifs-of-stupid-infomercial-people-6eof?utm_term=.pkxql4k0n