Skip to content

Instantly share code, notes, and snippets.

@mpettis
Last active March 22, 2019 20:22
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mpettis/51acf9b5fb5d9846bf36669bdebff852 to your computer and use it in GitHub Desktop.
Save mpettis/51acf9b5fb5d9846bf36669bdebff852 to your computer and use it in GitHub Desktop.
dplyr spread group percent
---
title: "spread-group-percent"
author: "Matt Pettis"
date: "March 22, 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Summary
A common pattern is to want to take relative percentages of values with a group. Let's look at an example dataset:
```{r results='hide', message=FALSE, warning=FALSE}
library(tidyverse)
df <- tribble(
~ oid , ~category , ~v ,
111 , "low" , 1 ,
111 , "med" , 2 ,
111 , "high" , 3 ,
222 , "low" , 4 ,
222 , "med" , 5 ,
222 , "high" , 6)
df %>% print()
```
What we'd like to do is:
> For each group, calculate the percent that the value for `category` comprises for each group on it's `value` column. So, for instance, for `oid` == 111, the percentatge for `cateogory == 'low'` should be $1 / (1 + 2 + 3) * 100$, which is $16.67\%$.
## Way one: pivot and calculate.
One natural way to thingk about this is to want to do the calculations row-wise.
That is, if `low`, `med`, and `high` were columns, this would be an easy
calculation... for each row, you just divide each column by the sum of the
values in the `low`, `med`, and `high` columns, and multiply by 100.
Here's how you can do that:
```{r}
dft__ <- df %>%
group_by(oid) %>%
spread(key = category, value = v)
dft__ %>% print()
dft__ <- dft__ %>%
mutate( low_pct = low / (low + med + high) * 100
, med_pct = med / (low + med + high) * 100
, high_pct = high / (low + med + high) * 100)
dft__ %>% print()
```
This works. Some observations about this:
* To do this, you have to know *explicitly* the values in the `category` column
when computing the percentatges, as those values became the column names, and
those column names are used in the percentage calculation.
* There is a lot of redundancy in the formula. We effectively repeat the
calculation, with a small modification for each individual percentage.
## Way 2: Calculate in long form, pivot at end when needed.
It can be inconvenient to have to name all of the `category` levels as columns:
* There may be a lot of distinct values in that column. In this case, that means a lot of repeated code.
* The `category` values might have spaces and chararcters that make them not nice for use as column names.
So it would be nice if we could operate on the original dataframe and calculate
percentages without pivoting them wide first. We can do this as follows.
```{r}
dft__ <- df %>%
group_by(oid) %>%
mutate(ttl = sum(v), pct = v / ttl * 100) %>%
ungroup()
dft__ %>% print()
```
Note that:
* The new `pct` column has the same values you want, just like in the pivoted version.
* You didn't have to reference `low`, `med`, and `high` explicitly to do the
computation... you just summed over the appropriate column, and in the
appropriate group.
* You didn't have to think up new names for the new columns either.
* You can use `mutate()` with `group_by()` to add a column that has the same value wanted for the whole group -- in this case, the total of `v`.
### But what if you want the percentage values pivoted wide?
Well, we can do that from this form of the table, like so:
```{r}
dft__ %>%
select(-v, -ttl) %>%
spread(key = category, value = pct)
```
If you want some systematic naming convention, I like suffixing with the
operation I did on the original values, like `pct` for percent, and I can do
that as follows:
```{r}
dft__ %>%
select(-v, -ttl) %>%
mutate(category = sprintf("%s_pct", category)) %>%
spread(key = category, value = pct)
```
Again some notes:
* Using the `mutate()` call, I transformed `category` to add a suffix of `_pct`
to the column names. This is a consistent naming convention, and best of all, I
don't have to make up column names by hand!
* I had to drop the original values
to make this pivot correctly. You may want to pivot the original `v` column as
well for this output, but I will leave that for another notebook for now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment