andrewheiss/pivoting-rank-extra-NAs.md

## pivoting-rank-extra-NAs.md

      
    Raw
  

              pivoting-rank-extra-NAs.md
            
          
    library(tidyverse)
library(gapminder)

gapminder_ranks_wrong <- gapminder %>% 
  filter(year %in% c(1952, 2007)) %>% 
  group_by(year) %>% 
  mutate(ranking = rank(-pop)) %>% 
  ungroup() %>%
  select(country, continent, year, pop, ranking) %>% 
  pivot_wider(names_from = year, 
              names_prefix = "rank_", 
              values_from = ranking) 

gapminder_ranks_wrong
#> # A tibble: 284 × 5
#>    country     continent      pop rank_1952 rank_2007
#>    <fct>       <fct>        <int>     <dbl>     <dbl>
#>  1 Afghanistan Asia       8425333        44        NA
#>  2 Afghanistan Asia      31889923        NA        36
#>  3 Albania     Europe     1282697       111        NA
#>  4 Albania     Europe     3600523        NA       116
#>  5 Algeria     Africa     9279525        35        NA
#>  6 Algeria     Africa    33333216        NA        35
#>  7 Angola      Africa     4232095        69        NA
#>  8 Angola      Africa    12420476        NA        64
#>  9 Argentina   Americas  17876956        25        NA
#> 10 Argentina   Americas  40301927        NA        29
#> # … with 274 more rows
In this case, the population column is still in the dataset before using pivot_wider(), so R doesn’t know how to spread it across 1952 and 2007
You can see this if you look at the data. The first row is Afghanistan 1952, the second is Afghanistan 2007, the third is Albania 1952, the fourth is Albania 2007, and son on. The rank column spread to two new columns as expected (rank_1952 and rank_2007), but population had no where to go so R kept it as a single column and repeated each of the countries twice
You can fix it two different ways. The easiest would be to get rid of the population column (or any column that you’re not spreading out to new year-based columns) before pivot_wider():
gapminder_ranks_right <- gapminder %>% 
  filter(year %in% c(1952, 2007)) %>% 
  group_by(year) %>% 
  mutate(ranking = rank(-pop)) %>% 
  ungroup() %>%
  # This line is different. We *don't* include the pop column
  select(country, continent, year, ranking) %>% 
  pivot_wider(names_from = year, 
              names_prefix = "rank_", 
              values_from = ranking) 
gapminder_ranks_right
#> # A tibble: 142 × 4
#>    country     continent rank_1952 rank_2007
#>    <fct>       <fct>         <dbl>     <dbl>
#>  1 Afghanistan Asia             44        36
#>  2 Albania     Europe          111       116
#>  3 Algeria     Africa           35        35
#>  4 Angola      Africa           69        64
#>  5 Argentina   Americas         25        29
#>  6 Australia   Oceania          40        49
#>  7 Austria     Europe           51        86
#>  8 Bahrain     Asia            140       137
#>  9 Bangladesh  Asia             10         7
#> 10 Belgium     Europe           39        72
#> # … with 132 more rows
Now that there’s no population colum, there’s no repetition of countries. Afghanistan has just one row; Albania has just one row; and so on.
Alternatively if you still want the populaion, you can pivot both the ranking and population columns
gapminder_ranks_and_pop <- gapminder %>% 
  filter(year %in% c(1952, 2007)) %>% 
  group_by(year) %>% 
  mutate(ranking = rank(-pop)) %>% 
  ungroup() %>%
  # Keep both pop and ranking
  select(country, continent, year, pop, ranking) %>% 
  # Spread out both ranking and pop
  pivot_wider(names_from = year, 
              values_from = c(ranking, pop))
gapminder_ranks_and_pop
#> # A tibble: 142 × 6
#>    country     continent ranking_1952 ranking_2007 pop_1952  pop_2007
#>    <fct>       <fct>            <dbl>        <dbl>    <int>     <int>
#>  1 Afghanistan Asia                44           36  8425333  31889923
#>  2 Albania     Europe             111          116  1282697   3600523
#>  3 Algeria     Africa              35           35  9279525  33333216
#>  4 Angola      Africa              69           64  4232095  12420476
#>  5 Argentina   Americas            25           29 17876956  40301927
#>  6 Australia   Oceania             40           49  8691212  20434176
#>  7 Austria     Europe              51           86  6927772   8199783
#>  8 Bahrain     Asia               140          137   120447    708573
#>  9 Bangladesh  Asia                10            7 46886859 150448339
#> 10 Belgium     Europe              39           72  8730405  10392226
#> # … with 132 more rows
That way both ranking and pop get year-specific columns, and each country gets just one row