Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Answer key for tidyverse transform data exercises
---
title: "Transform Data"
subtitle: "hands-on examples, with answers"
output: html_notebook
---
<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. -->
```{r setup}
library(tidyverse)
library(gapminder)
```
# Transform Data 1: manipulate cases
## gapminder
Take a quick look at the data in the {gapminder} package.
```{r}
gapminder
```
## Your Turn 1.1: filter() rows
See if you can use the logical operators to manipulate our code below to show:
The data for Canada
```{r}
# example
filter(gapminder, country == "New Zealand")
# answer
filter(gapminder, country == "Canada")
```
All data for countries in Oceania
```{r}
# example
filter(gapminder, country == "New Zealand")
# answer
filter(gapminder, continent == "Oceania")
```
Rows where the life expectancy is greater than 82
```{r}
# example
filter(gapminder, country == "New Zealand")
# answer
filter(gapminder, lifeExp > 82)
```
## Your Turn 1.2: multiple criteria
Use Boolean operators to alter the code below to return only the rows that contain:
* United States before 1970
```{r}
# example
gapminder %>% # Note use of `%>%` pipe symbol
filter(country == "New Zealand", year > 2000)
# answer
gapminder %>%
filter(country == "United States", year < 1970)
# alternative answer
gapminder %>%
filter(country == "United States" & year < 1970) # Note use of `&` rather than comma
```
* Countries where life expectancy in 2007 is below 50 or over 75
```{r}
# answer
gapminder %>%
filter(year == 2007, lifeExp < 50 | lifeExp > 75) # Note that you have to state `lifeExp` twice, as
# there are two comparisons.
# What happens if you replace the comma after 2007 with `&`?
# - Why do you think this might be happening?
```
* Create a list object and return records for any of "New Zealand", "Canada" or "United States"
```{r}
# answer
gapminder %>%
filter(country %in% c("New Zealand", "Canada", "United States"))
# alternate answer
country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
gapminder %>%
filter(country %in% country_list) # filter uses that object
```
* Return records for all continents except Europe
```{r}
# answer
gapminder %>%
filter(continent != "Europe")
# answer with a double-check
gapminder %>%
filter(continent != "Europe") %>%
distinct(continent) # the `distinct()` function returns the list of all the
# continent names after the filter is applied
```
## Your Turn 1.3: sorting with `arrange()`
Find the records with the smallest population.
```{r}
#answer
gapminder %>%
arrange(pop)
```
Find the records with the largest GDP per capita.
```{r}
#answer
gapminder %>% arrange(desc(gdpPercap))
#answer
arrange(gapminder, desc(gdpPercap))
#answer
gapminder %>% arrange(-gdpPercap)
```
## Your Turn 1.4: creating new variables
Calculate a new variable with total GDP
```{r, error = TRUE}
# answer
gapminder %>%
mutate(totalGDP = pop * gdpPercap)
```
Create a summary table with the population and GDP by continent for the year 1952
```{r}
# answer
gapminder %>%
filter(year == 1952) %>% # filter to get the records we need
mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
group_by(continent) %>% # this defines the grouping category
summarize(pop = sum(pop), #
totalGDP = sum(totalGDP)) #
```
## Your Turn 1.5
Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record.
(Hint: use "Transformation" cheatsheet, you want to offset elements by one)
Extra challenge: Why isn't this quite the 'life expectency five years ago'?
```{r}
# answer
gapminder %>%
mutate(prev_lifeExp = lag(lifeExp))
# challenge hint:
# - note that there is an `NA` in the first row, since there's no prior row to pull from
# - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania
```
## Take aways
* Extract cases with `filter()`
* Filter using a list with `%in%`
* Make new variables, with `mutate()`
* Connect operations with `%>%`
***
# Transform Data 2: summarize
<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->
```{r}
library(tidyverse)
library(gapminder)
```
## gapminder
Take a quick look at the data in the {gapminder} package.
```{r}
gapminder
```
## Your Turn 2.1
Use summarise() to compute three statistics about the data:
* The first (minimum) year in the dataset
* The last (maximum) year in the dataset
* The number of unique countries
```{r}
# answer
gapminder %>%
summarize(year_min = min(year),
year_max = max(year),
n_countries = n_distinct(country))
```
## Your Turn 2.2
Extract the rows for African countries in 2007.
Then find:
1. The number of unique countries
2. The median life expectancy of African countries as a group
```{r}
# answer
gapminder %>%
filter(continent == "Africa", year == 2007) %>%
summarise(n_countries = n_distinct(country),
med_life_exp = median(lifeExp))
```
## Your Turn 2.3
Find the median life expectancy by continent in 2007.
```{r}
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(med_life_exp = median(lifeExp))
```
## Your Turn 2.4
Find the median life expectancy by continent for each year in the dataframe.
```{r}
gapminder %>%
group_by(continent, year) %>%
summarize(median_lifeExp = median(lifeExp))
```
## Your Turn 2.5
Brainstorm with your neighbor the sequence of operations to find: the country with biggest jump in life expectancy (between any two consecutive records) for each continent.
## Your Turn 2.6
Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.
```{r}
# One of many solutions
gapminder %>%
group_by(country) %>%
mutate(prev_lifeExp = lag(lifeExp),
jump = lifeExp - prev_lifeExp) %>%
arrange(desc(jump))
```
## Take aways
* Make tables of summaries with `summarise()`
* Do groupwise operations with `group_by()`
***
# Transform Data 3: pivot
<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->
```{r}
library(tidyverse)
# data packages
library(gapminder)
```
## Your Turn 3.1
A pivot table crosstab using the {mpg} data package of automobile fuel economy.
First, a look at the source table.
```{r}
mpg
```
Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class:
```{r}
mpg %>%
group_by(class) %>%
summarise(displ_mean = mean(displ))
```
Now do the same with vehicle class and number of cylinders:
```{r}
# answer
class_by_cyl <- mpg %>% # note that an ojbect `class_by_cyl` is created
group_by(class, cyl) %>%
summarise(displ_mean = mean(displ))
class_by_cyl # in order to print the object, just name it
```
What happens if you reverse the order of the grouping variables?
Create a pivot (crosstab) table with class by number of cylinders:
```{r}
# answer
# pivot table (wide)
class_by_cyl_pivot <- class_by_cyl %>%
pivot_wider(names_from = cyl, values_from = displ_mean)
class_by_cyl_pivot
```
Now, unpivot it back to the original structure:
```{r}
# and back to longer...
displ_class_by_cyl <- class_by_cyl_pivot %>%
pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")
displ_class_by_cyl
```
What do you notice about the structure of the unpivoted table?
## Your Turn 3.2
Making some summary tables with the {gapminder} data.
Create a table where:
* object name `lifeExp_by_continent` that has
* the median life expectency for each continent for every year in the dataframe,
* where the continent is the column and the year is the row.
(This is a restructuring of the table from 2.4)
```{r}
# answer
lifeExp_by_continent <- gapminder %>%
group_by(continent, year) %>%
summarize(median_lifeExp = median(lifeExp)) %>%
pivot_wider(names_from = continent, values_from = median_lifeExp)
lifeExp_by_continent
```
## Your Turn 3.3
Create a table that shows:
* GDP per capita for
* the countries Canada, United States, and Mexico are the columns, and
* the years after 1980 are the rows
```{r}
gapminder %>%
filter(year > 1980,
country %in% c("Canada", "United States", "Mexico")) %>%
select(country, year, gdpPercap) %>% # select() is covered in section 4
pivot_wider(names_from = country, values_from = gdpPercap)
```
## Take aways
* Restructure data tabulations with `pivot_wider()` and `pivot_longer`
***
# Transform data 4: select()
<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->
```{r}
library(tidyverse)
# data packages
library(gapminder)
```
## Your Turn 4.1
{gapminder} again
```{r}
gapminder
```
Using `select()` to reduce the number of variables:
```{r}
# answer
gapminder %>%
select(country, year, lifeExp)
```
You can use the minus sign to drop variable--that is, to define ones to _not_ select.
In this example, note we can use `c()` to create a list:
```{r}
# answer
gapminder %>%
select(-(c(continent, pop, gdpPercap)))
# alternate answer
gapminder %>%
select(-continent, -pop, -gdpPercap)
```
You can define a range of columns to select by using the `:`
```{r}
# answer
gapminder %>%
select(country, year:pop)
```
## Take aways
* Restructure data frames with `select()` and `pull()`
* Reinforce other data transformation functions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.