Last active
March 22, 2019 20:22
-
-
Save mpettis/51acf9b5fb5d9846bf36669bdebff852 to your computer and use it in GitHub Desktop.
dplyr spread group percent
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "spread-group-percent" | |
author: "Matt Pettis" | |
date: "March 22, 2019" | |
output: html_document | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE) | |
``` | |
# Summary | |
A common pattern is to want to take relative percentages of values with a group. Let's look at an example dataset: | |
```{r results='hide', message=FALSE, warning=FALSE} | |
library(tidyverse) | |
df <- tribble( | |
~ oid , ~category , ~v , | |
111 , "low" , 1 , | |
111 , "med" , 2 , | |
111 , "high" , 3 , | |
222 , "low" , 4 , | |
222 , "med" , 5 , | |
222 , "high" , 6) | |
df %>% print() | |
``` | |
What we'd like to do is: | |
> For each group, calculate the percent that the value for `category` comprises for each group on it's `value` column. So, for instance, for `oid` == 111, the percentatge for `cateogory == 'low'` should be $1 / (1 + 2 + 3) * 100$, which is $16.67\%$. | |
## Way one: pivot and calculate. | |
One natural way to thingk about this is to want to do the calculations row-wise. | |
That is, if `low`, `med`, and `high` were columns, this would be an easy | |
calculation... for each row, you just divide each column by the sum of the | |
values in the `low`, `med`, and `high` columns, and multiply by 100. | |
Here's how you can do that: | |
```{r} | |
dft__ <- df %>% | |
group_by(oid) %>% | |
spread(key = category, value = v) | |
dft__ %>% print() | |
dft__ <- dft__ %>% | |
mutate( low_pct = low / (low + med + high) * 100 | |
, med_pct = med / (low + med + high) * 100 | |
, high_pct = high / (low + med + high) * 100) | |
dft__ %>% print() | |
``` | |
This works. Some observations about this: | |
* To do this, you have to know *explicitly* the values in the `category` column | |
when computing the percentatges, as those values became the column names, and | |
those column names are used in the percentage calculation. | |
* There is a lot of redundancy in the formula. We effectively repeat the | |
calculation, with a small modification for each individual percentage. | |
## Way 2: Calculate in long form, pivot at end when needed. | |
It can be inconvenient to have to name all of the `category` levels as columns: | |
* There may be a lot of distinct values in that column. In this case, that means a lot of repeated code. | |
* The `category` values might have spaces and chararcters that make them not nice for use as column names. | |
So it would be nice if we could operate on the original dataframe and calculate | |
percentages without pivoting them wide first. We can do this as follows. | |
```{r} | |
dft__ <- df %>% | |
group_by(oid) %>% | |
mutate(ttl = sum(v), pct = v / ttl * 100) %>% | |
ungroup() | |
dft__ %>% print() | |
``` | |
Note that: | |
* The new `pct` column has the same values you want, just like in the pivoted version. | |
* You didn't have to reference `low`, `med`, and `high` explicitly to do the | |
computation... you just summed over the appropriate column, and in the | |
appropriate group. | |
* You didn't have to think up new names for the new columns either. | |
* You can use `mutate()` with `group_by()` to add a column that has the same value wanted for the whole group -- in this case, the total of `v`. | |
### But what if you want the percentage values pivoted wide? | |
Well, we can do that from this form of the table, like so: | |
```{r} | |
dft__ %>% | |
select(-v, -ttl) %>% | |
spread(key = category, value = pct) | |
``` | |
If you want some systematic naming convention, I like suffixing with the | |
operation I did on the original values, like `pct` for percent, and I can do | |
that as follows: | |
```{r} | |
dft__ %>% | |
select(-v, -ttl) %>% | |
mutate(category = sprintf("%s_pct", category)) %>% | |
spread(key = category, value = pct) | |
``` | |
Again some notes: | |
* Using the `mutate()` call, I transformed `category` to add a suffix of `_pct` | |
to the column names. This is a consistent naming convention, and best of all, I | |
don't have to make up column names by hand! | |
* I had to drop the original values | |
to make this pivot correctly. You may want to pivot the original `v` column as | |
well for this output, but I will leave that for another notebook for now. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment