MonkmanMH/transform_data_solutions.Rmd

## transform_data_solutions.Rmd
---
title: "Transform Data"
subtitle: "hands-on examples, with answers"
output: html_notebook
---

<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. -->

```{r setup}
library(tidyverse)
library(gapminder)

```

# Transform Data 1: manipulate cases


## gapminder

Take a quick look at the data in the {gapminder} package.

```{r}
gapminder
```

## Your Turn 1.1: filter() rows

See if you can use the logical operators to manipulate our code below to show:

The data for Canada

```{r}
# example
filter(gapminder, country == "New Zealand")

# answer
filter(gapminder, country == "Canada")

```

All data for countries in Oceania

```{r}
# example
filter(gapminder, country == "New Zealand")

# answer
filter(gapminder, continent == "Oceania")
```

Rows where the life expectancy is greater than 82
```{r}
# example
filter(gapminder, country == "New Zealand")

# answer
filter(gapminder, lifeExp > 82)
```


## Your Turn 1.2: multiple criteria

Use Boolean operators to alter the code below to return only the rows that contain:

* United States before 1970

```{r}
# example
gapminder %>%                                       # Note use of `%>%` pipe symbol
  filter(country == "New Zealand", year > 2000)

# answer
gapminder %>%
  filter(country == "United States", year < 1970)

# alternative answer
gapminder %>%
  filter(country == "United States" & year < 1970)  # Note use of `&` rather than comma

```

*  Countries where life expectancy in 2007 is below 50 or over 75

```{r}
# answer
gapminder %>%
  filter(year == 2007, lifeExp < 50 | lifeExp > 75) # Note that you have to state `lifeExp` twice, as
                                                    # there are two comparisons.
                                                    # What happens if you replace the comma after 2007 with `&`?
                                                    # - Why do you think this might be happening?


```


* Create a list object and return records for any of "New Zealand", "Canada" or "United States"


```{r}
# answer
gapminder %>%
  filter(country %in% c("New Zealand", "Canada", "United States"))

# alternate answer
country_list <- c("New Zealand", "Canada", "United States")   # create object that is list of countries
gapminder %>%
  filter(country %in% country_list)                           # filter uses that object


```

* Return records for all continents except Europe


```{r}
# answer
gapminder %>%
  filter(continent != "Europe")

# answer with a double-check
gapminder %>%
  filter(continent != "Europe") %>%
  distinct(continent)                 # the `distinct()` function returns the list of all the
                                      # continent names after the filter is applied

```

## Your Turn 1.3: sorting with `arrange()`

Find the records with the smallest population.
```{r}
#answer
gapminder %>%
  arrange(pop)

```

Find the records with the largest GDP per capita.
```{r}
#answer
gapminder %>% arrange(desc(gdpPercap))

#answer
arrange(gapminder, desc(gdpPercap))

#answer
gapminder %>% arrange(-gdpPercap)


```


## Your Turn 1.4: creating new variables

Calculate a new variable with total GDP

```{r, error = TRUE}
# answer
gapminder %>%
  mutate(totalGDP = pop * gdpPercap)


```

Create a summary table with the population and GDP by continent for the year 1952

```{r}
# answer
gapminder %>%
  filter(year == 1952) %>%               # filter to get the records we need
  mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
  group_by(continent) %>%                # this defines the grouping category
  summarize(pop = sum(pop),              #
            totalGDP = sum(totalGDP))    #
```


## Your Turn 1.5

Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record.

(Hint: use "Transformation" cheatsheet, you want to offset elements by one)

Extra challenge: Why isn't this quite the 'life expectency five years ago'?

```{r}
# answer
gapminder %>%
  mutate(prev_lifeExp = lag(lifeExp))

# challenge hint:
# - note that there is an `NA` in the first row, since there's no prior row to pull from
# - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania
```


## Take aways

* Extract cases with `filter()`
* Filter using a list with `%in%`
* Make new variables, with `mutate()`
* Connect operations with `%>%`

***

# Transform Data 2: summarize

<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->


```{r}
library(tidyverse)
library(gapminder)

```

## gapminder

Take a quick look at the data in the {gapminder} package.

```{r}
gapminder
```


## Your Turn 2.1

Use summarise() to compute three statistics about the data:

* The first (minimum) year in the dataset
* The last (maximum) year in the dataset
* The number of unique countries

```{r}
# answer
gapminder %>%
  summarize(year_min = min(year),
            year_max = max(year),
            n_countries = n_distinct(country))

```

## Your Turn 2.2

Extract the rows for African countries in 2007.

Then find:

1. The number of unique countries
2. The median life expectancy of African countries as a group

```{r}
# answer
gapminder %>%
  filter(continent == "Africa", year == 2007) %>%
  summarise(n_countries = n_distinct(country),
    med_life_exp = median(lifeExp))


```


## Your Turn 2.3

Find the median life expectancy by continent in 2007.

```{r}
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(med_life_exp = median(lifeExp))

```


## Your Turn 2.4

Find the median life expectancy by continent for each year in the dataframe.

```{r}

gapminder %>%
  group_by(continent, year) %>%
  summarize(median_lifeExp = median(lifeExp))

```
## Your Turn 2.5

Brainstorm with your neighbor the sequence of operations to find:  the country with biggest jump in life expectancy  (between any two consecutive records) for each continent.

## Your Turn 2.6

Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

```{r}
# One of many solutions
gapminder %>%
  group_by(country) %>%
  mutate(prev_lifeExp = lag(lifeExp),
         jump = lifeExp - prev_lifeExp) %>%
  arrange(desc(jump))


```


## Take aways

* Make tables of summaries with `summarise()`
* Do groupwise operations with `group_by()`

***


# Transform Data 3: pivot

<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


```{r}
library(tidyverse)
# data packages
library(gapminder)

```


## Your Turn 3.1

A pivot table crosstab using the {mpg} data package of automobile fuel economy.

First, a look at the source table.

```{r}
mpg
```

Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class:

```{r}
mpg %>%
  group_by(class) %>%
  summarise(displ_mean = mean(displ))

```

Now do the same with vehicle class and number of cylinders:

```{r}
# answer
class_by_cyl <- mpg %>%                 # note that an ojbect `class_by_cyl` is created
  group_by(class, cyl) %>%
  summarise(displ_mean = mean(displ))

class_by_cyl                            # in order to print the object, just name it

```


What happens if you reverse the order of the grouping variables?


Create a pivot (crosstab) table with class by number of cylinders:


```{r}
# answer
# pivot table (wide)

class_by_cyl_pivot <- class_by_cyl %>%
  pivot_wider(names_from = cyl, values_from = displ_mean)

class_by_cyl_pivot


```

Now, unpivot it back to the original structure:

```{r}
# and back to longer...

displ_class_by_cyl <- class_by_cyl_pivot %>%
  pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

displ_class_by_cyl

```

What do you notice about the structure of the unpivoted table?


## Your Turn 3.2

Making some summary tables with the {gapminder} data.

Create a table where:

* object name `lifeExp_by_continent` that has
* the median life expectency for each continent for every year in the dataframe,
* where the continent is the column and the year is the row.

(This is a restructuring of the table from 2.4)

```{r}
# answer

lifeExp_by_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  pivot_wider(names_from = continent, values_from = median_lifeExp)

lifeExp_by_continent

```

## Your Turn 3.3

Create a table that shows:

* GDP per capita for
* the countries Canada, United States, and Mexico are the columns, and
* the years after 1980 are the rows

```{r}

gapminder %>%
  filter(year > 1980,
         country %in% c("Canada", "United States", "Mexico")) %>%
  select(country, year, gdpPercap) %>%        # select() is covered in section 4
  pivot_wider(names_from = country, values_from = gdpPercap)

```


## Take aways

* Restructure data tabulations with `pivot_wider()` and `pivot_longer`


***

# Transform data 4: select()

<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


```{r}
library(tidyverse)
# data packages
library(gapminder)

```


## Your Turn 4.1

{gapminder} again

```{r}
gapminder
```

Using `select()` to reduce the number of variables:

```{r}
# answer
gapminder %>%
  select(country, year, lifeExp)

```


You can use the minus sign to drop variable--that is, to define ones to _not_ select.

In this example, note we can use `c()` to create a list:


```{r}
# answer
gapminder %>%
  select(-(c(continent, pop, gdpPercap)))

# alternate answer
gapminder %>%
  select(-continent, -pop, -gdpPercap)

```


You can define a range of columns to select by using the `:`

```{r}
# answer
gapminder %>%
  select(country, year:pop)
```


## Take aways

* Restructure data frames with `select()` and `pull()`
* Reinforce other data transformation functions
	---
	title: "Transform Data"
	subtitle: "hands-on examples, with answers"
	output: html_notebook
	---

	<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. -->

	```{r setup}
	library(tidyverse)
	library(gapminder)

	```

	# Transform Data 1: manipulate cases


	## gapminder

	Take a quick look at the data in the {gapminder} package.

	```{r}
	gapminder
	```

	## Your Turn 1.1: filter() rows

	See if you can use the logical operators to manipulate our code below to show:

	The data for Canada

	```{r}
	# example
	filter(gapminder, country == "New Zealand")

	# answer
	filter(gapminder, country == "Canada")

	```

	All data for countries in Oceania

	```{r}
	# example
	filter(gapminder, country == "New Zealand")

	# answer
	filter(gapminder, continent == "Oceania")
	```

	Rows where the life expectancy is greater than 82
	```{r}
	# example
	filter(gapminder, country == "New Zealand")

	# answer
	filter(gapminder, lifeExp > 82)
	```


	## Your Turn 1.2: multiple criteria

	Use Boolean operators to alter the code below to return only the rows that contain:

	* United States before 1970

	```{r}
	# example
	gapminder %>% # Note use of `%>%` pipe symbol
	filter(country == "New Zealand", year > 2000)

	# answer
	gapminder %>%
	filter(country == "United States", year < 1970)

	# alternative answer
	gapminder %>%
	filter(country == "United States" & year < 1970) # Note use of `&` rather than comma

	```

	* Countries where life expectancy in 2007 is below 50 or over 75

	```{r}
	# answer
	gapminder %>%
	filter(year == 2007, lifeExp < 50 \| lifeExp > 75) # Note that you have to state `lifeExp` twice, as
	# there are two comparisons.
	# What happens if you replace the comma after 2007 with `&`?
	# - Why do you think this might be happening?



	```



	* Create a list object and return records for any of "New Zealand", "Canada" or "United States"


	```{r}
	# answer
	gapminder %>%
	filter(country %in% c("New Zealand", "Canada", "United States"))

	# alternate answer
	country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
	gapminder %>%
	filter(country %in% country_list) # filter uses that object


	```

	* Return records for all continents except Europe


	```{r}
	# answer
	gapminder %>%
	filter(continent != "Europe")

	# answer with a double-check
	gapminder %>%
	filter(continent != "Europe") %>%
	distinct(continent) # the `distinct()` function returns the list of all the
	# continent names after the filter is applied

	```

	## Your Turn 1.3: sorting with `arrange()`

	Find the records with the smallest population.
	```{r}
	#answer
	gapminder %>%
	arrange(pop)

	```

	Find the records with the largest GDP per capita.
	```{r}
	#answer
	gapminder %>% arrange(desc(gdpPercap))

	#answer
	arrange(gapminder, desc(gdpPercap))

	#answer
	gapminder %>% arrange(-gdpPercap)


	```


	## Your Turn 1.4: creating new variables

	Calculate a new variable with total GDP

	```{r, error = TRUE}
	# answer
	gapminder %>%
	mutate(totalGDP = pop * gdpPercap)


	```

	Create a summary table with the population and GDP by continent for the year 1952

	```{r}
	# answer
	gapminder %>%
	filter(year == 1952) %>% # filter to get the records we need
	mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
	group_by(continent) %>% # this defines the grouping category
	summarize(pop = sum(pop), #
	totalGDP = sum(totalGDP)) #
	```




	## Your Turn 1.5

	Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record.

	(Hint: use "Transformation" cheatsheet, you want to offset elements by one)

	Extra challenge: Why isn't this quite the 'life expectency five years ago'?

	```{r}
	# answer
	gapminder %>%
	mutate(prev_lifeExp = lag(lifeExp))

	# challenge hint:
	# - note that there is an `NA` in the first row, since there's no prior row to pull from
	# - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania
	```


	## Take aways

	* Extract cases with `filter()`
	* Filter using a list with `%in%`
	* Make new variables, with `mutate()`
	* Connect operations with `%>%`

	***

	# Transform Data 2: summarize

	<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->


	```{r}
	library(tidyverse)
	library(gapminder)

	```

	## gapminder

	Take a quick look at the data in the {gapminder} package.

	```{r}
	gapminder
	```



	## Your Turn 2.1

	Use summarise() to compute three statistics about the data:

	* The first (minimum) year in the dataset
	* The last (maximum) year in the dataset
	* The number of unique countries

	```{r}
	# answer
	gapminder %>%
	summarize(year_min = min(year),
	year_max = max(year),
	n_countries = n_distinct(country))

	```

	## Your Turn 2.2

	Extract the rows for African countries in 2007.

	Then find:

	1. The number of unique countries
	2. The median life expectancy of African countries as a group

	```{r}
	# answer
	gapminder %>%
	filter(continent == "Africa", year == 2007) %>%
	summarise(n_countries = n_distinct(country),
	med_life_exp = median(lifeExp))


	```


	## Your Turn 2.3

	Find the median life expectancy by continent in 2007.

	```{r}
	gapminder %>%
	filter(year == 2007) %>%
	group_by(continent) %>%
	summarise(med_life_exp = median(lifeExp))

	```


	## Your Turn 2.4

	Find the median life expectancy by continent for each year in the dataframe.

	```{r}

	gapminder %>%
	group_by(continent, year) %>%
	summarize(median_lifeExp = median(lifeExp))

	```
	## Your Turn 2.5

	Brainstorm with your neighbor the sequence of operations to find: the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

	## Your Turn 2.6

	Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

	```{r}
	# One of many solutions
	gapminder %>%
	group_by(country) %>%
	mutate(prev_lifeExp = lag(lifeExp),
	jump = lifeExp - prev_lifeExp) %>%
	arrange(desc(jump))


	```


	## Take aways

	* Make tables of summaries with `summarise()`
	* Do groupwise operations with `group_by()`

	***



	# Transform Data 3: pivot

	<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


	```{r}
	library(tidyverse)
	# data packages
	library(gapminder)

	```


	## Your Turn 3.1

	A pivot table crosstab using the {mpg} data package of automobile fuel economy.

	First, a look at the source table.

	```{r}
	mpg
	```

	Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class:

	```{r}
	mpg %>%
	group_by(class) %>%
	summarise(displ_mean = mean(displ))

	```

	Now do the same with vehicle class and number of cylinders:

	```{r}
	# answer
	class_by_cyl <- mpg %>% # note that an ojbect `class_by_cyl` is created
	group_by(class, cyl) %>%
	summarise(displ_mean = mean(displ))

	class_by_cyl # in order to print the object, just name it

	```


	What happens if you reverse the order of the grouping variables?



	Create a pivot (crosstab) table with class by number of cylinders:


	```{r}
	# answer
	# pivot table (wide)

	class_by_cyl_pivot <- class_by_cyl %>%
	pivot_wider(names_from = cyl, values_from = displ_mean)

	class_by_cyl_pivot


	```

	Now, unpivot it back to the original structure:

	```{r}
	# and back to longer...

	displ_class_by_cyl <- class_by_cyl_pivot %>%
	pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

	displ_class_by_cyl

	```

	What do you notice about the structure of the unpivoted table?



	## Your Turn 3.2

	Making some summary tables with the {gapminder} data.

	Create a table where:

	* object name `lifeExp_by_continent` that has
	* the median life expectency for each continent for every year in the dataframe,
	* where the continent is the column and the year is the row.

	(This is a restructuring of the table from 2.4)

	```{r}
	# answer

	lifeExp_by_continent <- gapminder %>%
	group_by(continent, year) %>%
	summarize(median_lifeExp = median(lifeExp)) %>%
	pivot_wider(names_from = continent, values_from = median_lifeExp)

	lifeExp_by_continent

	```

	## Your Turn 3.3

	Create a table that shows:

	* GDP per capita for
	* the countries Canada, United States, and Mexico are the columns, and
	* the years after 1980 are the rows

	```{r}

	gapminder %>%
	filter(year > 1980,
	country %in% c("Canada", "United States", "Mexico")) %>%
	select(country, year, gdpPercap) %>% # select() is covered in section 4
	pivot_wider(names_from = country, values_from = gdpPercap)

	```



	## Take aways

	* Restructure data tabulations with `pivot_wider()` and `pivot_longer`


	***

	# Transform data 4: select()

	<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


	```{r}
	library(tidyverse)
	# data packages
	library(gapminder)

	```


	## Your Turn 4.1

	{gapminder} again

	```{r}
	gapminder
	```

	Using `select()` to reduce the number of variables:

	```{r}
	# answer
	gapminder %>%
	select(country, year, lifeExp)

	```



	You can use the minus sign to drop variable--that is, to define ones to _not_ select.

	In this example, note we can use `c()` to create a list:


	```{r}
	# answer
	gapminder %>%
	select(-(c(continent, pop, gdpPercap)))

	# alternate answer
	gapminder %>%
	select(-continent, -pop, -gdpPercap)

	```



	You can define a range of columns to select by using the `:`

	```{r}
	# answer
	gapminder %>%
	select(country, year:pop)
	```


	## Take aways

	* Restructure data frames with `select()` and `pull()`
	* Reinforce other data transformation functions