mpettis/spread-group-percent.Rmd

## spread-group-percent.Rmd
---
title: "spread-group-percent"
author: "Matt Pettis"
date: "March 22, 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Summary

A common pattern is to want to take relative percentages of values with a group.  Let's look at an example dataset:

```{r results='hide', message=FALSE, warning=FALSE}
library(tidyverse)

df <- tribble(
  ~ oid , ~category , ~v ,
  111   , "low"     , 1  ,
  111   , "med"     , 2  ,
  111   , "high"    , 3  ,
  222   , "low"     , 4  ,
  222   , "med"     , 5  ,
  222   , "high"    , 6)

df %>% print()
```

What we'd like to do is:

> For each group, calculate the percent that the value for `category` comprises for each group on it's `value` column.  So, for instance, for `oid` == 111, the percentatge for `cateogory == 'low'` should be $1 / (1 + 2 + 3) * 100$, which is $16.67\%$.

## Way one: pivot and calculate.

One natural way to thingk about this is to want to do the calculations row-wise.
That is, if `low`, `med`, and `high` were columns, this would be an easy
calculation... for each row, you just divide each column by the sum of the
values in the `low`, `med`, and `high` columns, and multiply by 100.

Here's how you can do that:

```{r}
dft__ <- df %>%
  group_by(oid) %>%
  spread(key = category, value = v)
dft__ %>% print()

dft__ <- dft__ %>%
  mutate(  low_pct  = low  / (low + med + high) * 100
         , med_pct  = med  / (low + med + high) * 100
         , high_pct = high / (low + med + high) * 100)
dft__ %>% print()
```

This works.  Some observations about this:

* To do this, you have to know *explicitly* the values in the `category` column
when computing the percentatges, as those values became the column names, and
those column names are used in the percentage calculation.

* There is a lot of redundancy in the formula.  We effectively repeat the
calculation, with a small modification for each individual percentage.


## Way 2: Calculate in long form, pivot at end when needed.

It can be inconvenient to have to name all of the `category` levels as columns:

* There may be a lot of distinct values in that column.  In this case, that means a lot of repeated code.
* The `category` values might have spaces and chararcters that make them not nice for use as column names.

So it would be nice if we could operate on the original dataframe and calculate
percentages without pivoting them wide first.  We can do this as follows.

```{r}
dft__ <- df %>%
  group_by(oid) %>%
  mutate(ttl = sum(v), pct = v / ttl * 100) %>%
  ungroup()
dft__ %>% print()
```

Note that:

* The new `pct` column has the same values you want, just like in the pivoted version.

* You didn't have to reference `low`, `med`, and `high` explicitly to do the
computation... you just summed over the appropriate column, and in the
appropriate group.

* You didn't have to think up new names for the new columns either.

* You can use `mutate()` with `group_by()` to add a column that has the same value wanted for the whole group -- in this case, the total of `v`.


### But what if you want the percentage values pivoted wide?

Well, we can do that from this form of the table, like so:

```{r}
dft__ %>%
  select(-v, -ttl) %>%
  spread(key = category, value = pct)
```

If you want some systematic naming convention, I like suffixing with the
operation I did on the original values, like `pct` for percent, and I can do
that as follows:

```{r}
dft__ %>%
  select(-v, -ttl) %>%
  mutate(category = sprintf("%s_pct", category)) %>%
  spread(key = category, value = pct)
```

Again some notes:

* Using the `mutate()` call, I transformed `category` to add a suffix of `_pct`
to the column names.  This is a consistent naming convention, and best of all, I
don't have to make up column names by hand!

* I had to drop the original values
to make this pivot correctly.  You may want to pivot the original `v` column as
well for this output, but I will leave that for another notebook for now.
	---
	title: "spread-group-percent"
	author: "Matt Pettis"
	date: "March 22, 2019"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	```

	# Summary

	A common pattern is to want to take relative percentages of values with a group. Let's look at an example dataset:

	```{r results='hide', message=FALSE, warning=FALSE}
	library(tidyverse)

	df <- tribble(
	~ oid , ~category , ~v ,
	111 , "low" , 1 ,
	111 , "med" , 2 ,
	111 , "high" , 3 ,
	222 , "low" , 4 ,
	222 , "med" , 5 ,
	222 , "high" , 6)

	df %>% print()
	```

	What we'd like to do is:

	> For each group, calculate the percent that the value for `category` comprises for each group on it's `value` column. So, for instance, for `oid` == 111, the percentatge for `cateogory == 'low'` should be $1 / (1 + 2 + 3) * 100$, which is $16.67\%$.

	## Way one: pivot and calculate.

	One natural way to thingk about this is to want to do the calculations row-wise.
	That is, if `low`, `med`, and `high` were columns, this would be an easy
	calculation... for each row, you just divide each column by the sum of the
	values in the `low`, `med`, and `high` columns, and multiply by 100.

	Here's how you can do that:

	```{r}
	dft__ <- df %>%
	group_by(oid) %>%
	spread(key = category, value = v)
	dft__ %>% print()

	dft__ <- dft__ %>%
	mutate( low_pct = low / (low + med + high) * 100
	, med_pct = med / (low + med + high) * 100
	, high_pct = high / (low + med + high) * 100)
	dft__ %>% print()
	```

	This works. Some observations about this:

	* To do this, you have to know explicitly the values in the `category` column
	when computing the percentatges, as those values became the column names, and
	those column names are used in the percentage calculation.

	* There is a lot of redundancy in the formula. We effectively repeat the
	calculation, with a small modification for each individual percentage.




	## Way 2: Calculate in long form, pivot at end when needed.

	It can be inconvenient to have to name all of the `category` levels as columns:

	* There may be a lot of distinct values in that column. In this case, that means a lot of repeated code.
	* The `category` values might have spaces and chararcters that make them not nice for use as column names.

	So it would be nice if we could operate on the original dataframe and calculate
	percentages without pivoting them wide first. We can do this as follows.

	```{r}
	dft__ <- df %>%
	group_by(oid) %>%
	mutate(ttl = sum(v), pct = v / ttl * 100) %>%
	ungroup()
	dft__ %>% print()
	```

	Note that:

	* The new `pct` column has the same values you want, just like in the pivoted version.

	* You didn't have to reference `low`, `med`, and `high` explicitly to do the
	computation... you just summed over the appropriate column, and in the
	appropriate group.

	* You didn't have to think up new names for the new columns either.

	* You can use `mutate()` with `group_by()` to add a column that has the same value wanted for the whole group -- in this case, the total of `v`.


	### But what if you want the percentage values pivoted wide?

	Well, we can do that from this form of the table, like so:

	```{r}
	dft__ %>%
	select(-v, -ttl) %>%
	spread(key = category, value = pct)
	```

	If you want some systematic naming convention, I like suffixing with the
	operation I did on the original values, like `pct` for percent, and I can do
	that as follows:

	```{r}
	dft__ %>%
	select(-v, -ttl) %>%
	mutate(category = sprintf("%s_pct", category)) %>%
	spread(key = category, value = pct)
	```

	Again some notes:

	* Using the `mutate()` call, I transformed `category` to add a suffix of `_pct`
	to the column names. This is a consistent naming convention, and best of all, I
	don't have to make up column names by hand!

	* I had to drop the original values
	to make this pivot correctly. You may want to pivot the original `v` column as
	well for this output, but I will leave that for another notebook for now.