library(tidyverse)
library(gapminder)
gapminder_ranks_wrong <- gapminder %>%
filter(year %in% c(1952, 2007)) %>%
group_by(year) %>%
mutate(ranking = rank(-pop)) %>%
ungroup() %>%
select(country, continent, year, pop, ranking) %>%
pivot_wider(names_from = year,
names_prefix = "rank_",
values_from = ranking)
gapminder_ranks_wrong
#> # A tibble: 284 × 5
#> country continent pop rank_1952 rank_2007
#> <fct> <fct> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 8425333 44 NA
#> 2 Afghanistan Asia 31889923 NA 36
#> 3 Albania Europe 1282697 111 NA
#> 4 Albania Europe 3600523 NA 116
#> 5 Algeria Africa 9279525 35 NA
#> 6 Algeria Africa 33333216 NA 35
#> 7 Angola Africa 4232095 69 NA
#> 8 Angola Africa 12420476 NA 64
#> 9 Argentina Americas 17876956 25 NA
#> 10 Argentina Americas 40301927 NA 29
#> # … with 274 more rows
In this case, the population column is still in the dataset before using pivot_wider(), so R doesn’t know how to spread it across 1952 and 2007
You can see this if you look at the data. The first row is Afghanistan 1952, the second is Afghanistan 2007, the third is Albania 1952, the fourth is Albania 2007, and son on. The rank column spread to two new columns as expected (rank_1952 and rank_2007), but population had no where to go so R kept it as a single column and repeated each of the countries twice
You can fix it two different ways. The easiest would be to get rid of the population column (or any column that you’re not spreading out to new year-based columns) before pivot_wider():
gapminder_ranks_right <- gapminder %>%
filter(year %in% c(1952, 2007)) %>%
group_by(year) %>%
mutate(ranking = rank(-pop)) %>%
ungroup() %>%
# This line is different. We *don't* include the pop column
select(country, continent, year, ranking) %>%
pivot_wider(names_from = year,
names_prefix = "rank_",
values_from = ranking)
gapminder_ranks_right
#> # A tibble: 142 × 4
#> country continent rank_1952 rank_2007
#> <fct> <fct> <dbl> <dbl>
#> 1 Afghanistan Asia 44 36
#> 2 Albania Europe 111 116
#> 3 Algeria Africa 35 35
#> 4 Angola Africa 69 64
#> 5 Argentina Americas 25 29
#> 6 Australia Oceania 40 49
#> 7 Austria Europe 51 86
#> 8 Bahrain Asia 140 137
#> 9 Bangladesh Asia 10 7
#> 10 Belgium Europe 39 72
#> # … with 132 more rows
Now that there’s no population colum, there’s no repetition of countries. Afghanistan has just one row; Albania has just one row; and so on.
Alternatively if you still want the populaion, you can pivot both the ranking and population columns
gapminder_ranks_and_pop <- gapminder %>%
filter(year %in% c(1952, 2007)) %>%
group_by(year) %>%
mutate(ranking = rank(-pop)) %>%
ungroup() %>%
# Keep both pop and ranking
select(country, continent, year, pop, ranking) %>%
# Spread out both ranking and pop
pivot_wider(names_from = year,
values_from = c(ranking, pop))
gapminder_ranks_and_pop
#> # A tibble: 142 × 6
#> country continent ranking_1952 ranking_2007 pop_1952 pop_2007
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Afghanistan Asia 44 36 8425333 31889923
#> 2 Albania Europe 111 116 1282697 3600523
#> 3 Algeria Africa 35 35 9279525 33333216
#> 4 Angola Africa 69 64 4232095 12420476
#> 5 Argentina Americas 25 29 17876956 40301927
#> 6 Australia Oceania 40 49 8691212 20434176
#> 7 Austria Europe 51 86 6927772 8199783
#> 8 Bahrain Asia 140 137 120447 708573
#> 9 Bangladesh Asia 10 7 46886859 150448339
#> 10 Belgium Europe 39 72 8730405 10392226
#> # … with 132 more rows
That way both ranking and pop get year-specific columns, and each country gets just one row