@brshallo
Created May 24, 2019 16:46
Solutions to chapter 25 of R for Data Science. Shared as example for Rstudio-git-Avecto problem
---
title: "Many models"
date: "`r paste('Last updated: ', format(Sys.time(), '%Y-%m-%d'))`"
author: "Bryan Shalloway"
output:
  github_document:
    toc: true
    toc_depth: 3
  html_document:
    toc: true
    toc_depth: 3
---
*Make sure the following packages are installed:*
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(ggplot2)
library(ggbeeswarm)
library(tidyverse)
library(gapminder)
library(modelr)
library(broom)
library(nycflights13)
library(babynames)
library(nasaweather)
library(lubridate)
```
# ch. 25: Many models
* `nest` creates a list-column named `data` by default. Each element is a dataframe containing all non-grouping columns and all rows for that group
```{r, eval = FALSE}
iris %>%
  group_by(Species) %>%
  nest()
```
* `unnest` unnests a list-column in your dataframe.
Notes on `unnest` behavior:
* if there are multiple rows in the list-col being unnested, the existing row values will be duplicated
```{r, eval = FALSE}
tibble(x = 1:100) %>%
  mutate(test1 = list(c(1, 2))) %>%
  unnest(test1)
# notice duplicated x values: 1,1,2,2,...100,100

# Notice that in situations like the one below, we get two new columns;
# however, the row count does not change, i.e. there are no duplicates created
# (this is similar to the structure when using broom::glance)
tibble(x = 1:100) %>%
  mutate(test1 = list(tibble(a = 1, b = 2))) %>%
  unnest(test1)
```
* if there are multiple list-cols, specify the column to unnest, or the default behavior will be to unnest all of them (when possible)
* when unnesting a single column while multiple list-cols exist, the default behavior is: if unnesting causes no change in row number, keep the other list-cols; if it changes the row number, drop the other list-cols. To override the former, use `.drop = TRUE`; to override the latter, use `.drop = FALSE`. ^[Note that using `.drop = FALSE` in the latter case creates replicated rows for the list-col values]
```{r, eval = FALSE}
tibble(x = 1:100) %>%
  mutate(test1 = list(c(1, 2)),
         test2 = list(c(3, 4))) %>%
  unnest(test1) # use .drop = TRUE to drop the test2 column

# or to unnest both
tibble(x = 1:100) %>%
  mutate(test1 = list(c(1, 2)),
         test2 = list(c(3, 4))) %>%
  unnest()
```
* when unnesting multiple columns, all must be the same length or you will get an error, e.g. below would fail:
```{r, error = TRUE, eval = FALSE}
tibble(x = 1:100) %>%
  mutate(test1 = list(c(1)),
         test2 = list(c(2, 3))) %>%
  unnest()
```
* To unnest multiple list-cols with different row-lengths, use multiple `unnest` statements, e.g. below would work:
```{r, error = TRUE, eval = FALSE}
tibble(x = 1:100) %>%
  mutate(test1 = list(c(1)),
         test2 = list(c(2, 3))) %>%
  unnest(test1) %>%
  unnest(test2)
```
* Method for nesting individual vectors: `group_by` `%>%` `summarise`, e.g.:
```{r}
iris %>%
  group_by(Species) %>%
  summarise_all(list)
```
* the above has the advantage of producing atomic vectors, rather than dataframes, as the types inside the list-columns
* `broom::glance` takes a model as input and outputs a one-row tibble with a column for each of several model evaluation statistics (note that these metrics are geared towards evaluating fit on the training data)
* `broom::tidy` creates a tibble with columns `term`, `estimate`, `std.error`, `statistic` (t-statistic) and `p.value`. A new row is created for each `term`, e.g. intercept, x1, x2, etc.
* `ggtitle()`, alternative to `labs(title = "type title here")`
* see [25.4.5] number 3 for a useful way of wrapping certain functions in `list` functions to take advantage of the list-col format
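As a quick illustration of `glance` vs. `tidy` (a minimal sketch on the built-in `mtcars` data rather than the chapter's examples):

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

# glance: one row of model-level statistics (r.squared, AIC, etc.)
broom::glance(fit)

# tidy: one row per term (here: the intercept and wt)
broom::tidy(fit)
```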
## 25.2: gapminder
The set-up example Hadley goes through is important; below is a slightly altered copy of his example.
__Nested Data__
```{r}
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()
```
__List-columns__
```{r}
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}
```
We want to apply this function to every data frame; the data frames are in a list-column, so we use `purrr::map`:
```{r}
by_country2 <- by_country %>%
  mutate(model = purrr::map(data, country_model))
```
The advantage of keeping things in the dataframe is that when you filter or rearrange rows, everything stays in sync, as do any new summary values you add.
```{r}
by_country2 %>%
  arrange(continent, country)

by_country2 %>%
  mutate(summaries = purrr::map(model, summary)) %>%
  mutate(r_squared = purrr::map2_dbl(model, data, rsquare))
```
__Unnesting__: add another list-column with the residuals included and then unnest
```{r}
by_country3 <- by_country2 %>%
  mutate(resids = purrr::map2(data, model, add_residuals))
```
```{r}
resids <- by_country3 %>%
  unnest(resids)

resids
```
### 25.2.5
1. A linear trend seems to be slightly too simple for the overall trend. Can you do better with a quadratic polynomial? How can you interpret the coefficients of the quadratic? (Hint you might want to transform `year` so that it has mean zero.)
*Create functions*
```{r}
# function to center value
center_value <- function(df){
  df %>%
    mutate(year_cent = year - mean(year))
}

# this function allows me to input any text to "var" to customize the inputs
# to the model; defaults are a linear and quadratic term for year (centered)
lm_quad_2 <- function(df, var = "year_cent + I(year_cent^2)"){
  lm(as.formula(paste("lifeExp ~ ", var)), data = df)
}
```
*Create dataframe with evaluation metrics*
```{r}
by_country3_quad <- by_country3 %>%
  mutate(
    # create centered data
    data_cent = purrr::map(data, center_value),
    # create quadratic models
    mod_quad = purrr::map(data_cent, lm_quad_2),
    # get model evaluation stats from original model
    glance_mod = purrr::map(model, broom::glance),
    # get model evaluation stats from quadratic model
    glance_quad = purrr::map(mod_quad, broom::glance))
```
*Create plots*
```{r}
by_country3_quad %>%
  unnest(glance_mod, glance_quad, .sep = "_", .drop = TRUE) %>%
  gather(glance_mod_r.squared, glance_quad_r.squared,
         key = "order", value = "r.squared") %>%
  ggplot(aes(x = continent, y = r.squared, colour = continent)) +
  geom_boxplot() +
  facet_wrap(~order)
```
* The quadratic trend seems to do better, as indicated by the distribution of the R^2 values being closer to one. The improvement seems especially pronounced for African countries.

Let's verify this by looking at the percentage-point improvement in R^2 in the chart below.
```{r}
by_country3_quad %>%
  mutate(quad_coefs = map(mod_quad, broom::tidy)) %>%
  unnest(glance_mod, .sep = "_") %>%
  unnest(glance_quad) %>%
  mutate(bad_fit = glance_mod_r.squared < 0.25,
         R.squ_ppt_increase = r.squared - glance_mod_r.squared) %>%
  ggplot(aes(x = continent, y = R.squ_ppt_increase)) +
  # geom_quasirandom(aes(alpha = bad_fit), colour = "black") +
  geom_boxplot(alpha = 0.1, colour = "dark grey") +
  geom_quasirandom(aes(colour = continent)) +
  labs(title = "Percentage point (PPT) improvement in R squared value",
       subtitle = "(When adding a quadratic term to the linear regression model)")
```
*View predictions from linear model with quadratic term*
(of countries where linear trend did not capture relationship)
```{r}
bad_fit <- by_country3 %>%
  mutate(glance = purrr::map(model, broom::glance)) %>%
  unnest(glance, .drop = TRUE) %>%
  filter(r.squared < 0.25)

# solve with a semi_join against bad_fit
by_country3_quad %>%
  semi_join(bad_fit, by = "country") %>%
  mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
  unnest(data_preds) %>%
  ggplot(aes(x = year, group = country)) +
  geom_point(aes(y = lifeExp, colour = country)) +
  geom_line(aes(y = pred, colour = country)) +
  facet_wrap(~country) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
* while the quadratic model fits the data better than the linear one does, I wouldn't say it fits well
* the trends generally show consistent rates of improvement followed by a sudden drop-off associated with some event, so an intervention variable may be a more appropriate way to model this pattern
*Quadratic model parameters*
```{r}
by_country3_quad %>%
  mutate(quad_coefs = map(mod_quad, broom::tidy)) %>%
  unnest(glance_mod, .sep = "_") %>%
  unnest(glance_quad) %>%
  unnest(quad_coefs) %>%
  mutate(bad_fit = glance_mod_r.squared < 0.25) %>%
  ggplot(aes(x = continent, y = estimate, alpha = bad_fit)) +
  geom_boxplot(alpha = 0.1, colour = "dark grey") +
  geom_quasirandom(aes(colour = continent)) +
  facet_wrap(~term, scales = "free") +
  labs(caption = "Note that 'bad fit' represents a bad fit on the initial model \nthat did not contain a quadratic term") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
* The quadratic term (in a linear model trained with the x-value centered at its mean, as in this dataset) has a few important notes related to interpretation:
    * If the coefficient is positive the curve will be convex; if it is negative, concave (i.e. smile vs. frown shape)
    * The coefficient is half the rate at which the slope of the `lifeExp`-`year` relationship changes for each one-unit move of `year` away from its mean (the second derivative of the fitted curve is twice the coefficient)
    * Hence if the coefficient is near 0, the slope of the relationship between `lifeExp` and `year` stays roughly constant as you move in either direction from the mean of `year`
To better understand this, let's look at a specific example. Excluding Rwanda, Botswana was the `country` the linear model without the quadratic term performed the worst on. We'll use it as our example for interpreting the coefficients.
*Plots of predicted and actual values for life expectancy in Botswana by year*
```{r}
by_country3_quad %>%
  filter(country == "Botswana") %>%
  mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
  unnest(data_preds) %>%
  ggplot(aes(x = year, group = country)) +
  geom_point(aes(y = lifeExp)) +
  geom_line(aes(y = pred, colour = "prediction")) +
  labs(title = "Data and quadratic trend of predictions for Botswana")
```
*(note that the mean year, on which `year_cent` is centered, is 1979.5)*
In the model for Botswana, the coefficients are:

* Intercept: ~ 59.81
* year (centered): ~ 0.0607
* year (centered)^2: ~ -0.0175

Hence for every one year we move away from the central year (1979.5), the slope of the relationship between `year` and `lifeExp` decreases by *~0.035*.
Below I show this graphically by plotting lines tangent to the model's output.
```{r}
botswana_coefs <- by_country3_quad %>%
  filter(country == "Botswana") %>%
  with(map(mod_quad, coef)) %>%
  flatten_dbl()
```
Helper functions to find tangent points
```{r}
find_slope <- function(x){
  2 * botswana_coefs[[3]] * x + botswana_coefs[[2]]
}

find_y1 <- function(x){
  botswana_coefs[[3]] * (x^2) + botswana_coefs[[2]] * x + botswana_coefs[[1]]
}

find_intercept <- function(x, y, m){
  y - x * m
}

tangent_lines <- tibble(x1 = seq(-20, 20, 10)) %>%
  mutate(slope = find_slope(x1),
         y1 = find_y1(x1),
         intercept = find_intercept(x1, y1, slope),
         slope_change = x1 * 2 * botswana_coefs[[3]]) %>%
  select(slope, intercept, everything())
```
```{r}
by_country3_quad %>%
  filter(country == "Botswana") %>%
  mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
  unnest(data_preds) %>%
  ggplot(aes(x = year_cent)) +
  geom_line(aes(x = year_cent, y = pred), colour = "red") +
  # geom_line(aes(y = pred, colour = "prediction")) +
  # labs(title = "Data and 2nd order model predictions for Botswana") +
  geom_abline(intercept = tangent_lines[[2]][[1]], slope = tangent_lines[[1]][[1]]) +
  geom_abline(intercept = tangent_lines[[2]][[2]], slope = tangent_lines[[1]][[2]]) +
  geom_abline(intercept = tangent_lines[[2]][[3]], slope = tangent_lines[[1]][[3]]) +
  geom_abline(intercept = tangent_lines[[2]][[4]], slope = tangent_lines[[1]][[4]]) +
  geom_abline(intercept = tangent_lines[[2]][[5]], slope = tangent_lines[[1]][[5]]) +
  ylim(c(40, 70)) +
  coord_fixed()

# the same plot, passing the tangent lines in as data instead
by_country3_quad %>%
  filter(country == "Botswana") %>%
  mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
  unnest(data_preds) %>%
  ggplot(aes(x = year_cent)) +
  geom_line(aes(x = year_cent, y = pred), colour = "red") +
  geom_abline(aes(intercept = intercept, slope = slope),
              data = tangent_lines) +
  coord_fixed()
```
Below is the relevant output in a table.

* `x1`: the change in x value from 1979.5
* `slope`: slope of the tangent line at the particular `x1` value
* `slope_diff_central`: the amount the slope differs from the slope of the tangent line at the central year
```{r}
select(tangent_lines, x1, slope, slope_diff_central = slope_change)
```
* notice that for every 10-year increase in `x1`, the slope of the tangent line decreases by 0.35. Had we looked at just one year, the change would have been 0.035, corresponding to 2 multiplied by the coefficient on the quadratic term of our model.
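As a quick arithmetic check (using the ~ -0.0175 quadratic coefficient reported above): the derivative of `a + b*x + c*x^2` is `b + 2*c*x`, so the tangent slope changes by `2*c` per one-unit change in `x`:

```r
c_quad <- -0.0175  # quadratic coefficient from the Botswana model above

2 * c_quad       # change in tangent slope per 1 year: -0.035
2 * c_quad * 10  # change in tangent slope per 10 years: -0.35
```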
1. Explore other methods for visualising the distribution of $R^2$ per continent. You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods.
*visualisations of linear model*
```{r}
by_country3_quad %>%
  unnest(glance_mod) %>%
  ggplot(aes(x = continent, y = r.squared, colour = continent)) +
  geom_boxplot(alpha = 0.1, colour = "dark grey") +
  ggbeeswarm::geom_quasirandom()

by_country3_quad %>%
  unnest(glance_mod) %>%
  ggplot(aes(x = continent, y = r.squared, colour = continent)) +
  geom_boxplot(alpha = 0.1, colour = "dark grey") +
  geom_jitter()

by_country3_quad %>%
  unnest(glance_mod) %>%
  ggplot(aes(x = continent, y = r.squared, colour = continent)) +
  geom_boxplot(alpha = 0.1, colour = "dark grey") +
  ggbeeswarm::geom_beeswarm()
```
* I like `geom_quasirandom` best as an overlay on the boxplot; it keeps things centered and doesn't have the gravitational-pull effect that makes `geom_beeswarm` become a little misaligned
1. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. It's possible to avoid this join if we use `unnest()` instead of `unnest(.drop = TRUE)`. How?
```{r}
# first filter by r.squared and then unnest
by_country3_quad %>%
  mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
  unnest(glance_mod) %>%
  mutate(bad_fit = r.squared < 0.25) %>%
  filter(bad_fit) %>%
  unnest(data_preds) %>%
  ggplot(aes(x = year, group = country)) +
  geom_point(aes(y = lifeExp, colour = country)) +
  geom_line(aes(y = pred, colour = country)) +
  facet_wrap(~country) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## 25.4: Creating list-columns
### 25.4.5
1. List all the functions that you can think of that take an atomic vector and return a list.
* `stringr::str_extract_all` + other `stringr` functions
(however, the functions below can also take types that are not atomic and are probably not what the question is looking for)
* `list`
* `tibble`
* `map` / `lapply`
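For example, `str_extract_all` maps an atomic character vector to a list, one element per input string (a minimal sketch):

```r
library(stringr)

out <- str_extract_all(c("a1b22", "c3"), "[0-9]+")
out  # a list of character vectors: list(c("1", "22"), "3")
```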
1. Brainstorm useful summary functions that, like `quantile()`, return multiple values.
* `summary`
* `range`
* ...
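Each of these returns multiple values per call, which is why they would need to be wrapped in `list()` inside `summarise()` (a minimal sketch on `mtcars`):

```r
range(mtcars$mpg)    # length-2 vector: min and max
fivenum(mtcars$mpg)  # Tukey's five-number summary: length 5
```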
1. What's missing in the following data frame? How does `quantile()` return that missing piece? Why isn't that helpful here?
```{r}
mtcars %>%
  group_by(cyl) %>%
  summarise(q = list(quantile(mpg))) %>%
  unnest()
```
* we need to capture the probabilities of the quantiles to make this useful...
```{r}
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)

mtcars %>%
  group_by(cyl) %>%
  summarise(p = list(probs), q = list(quantile(mpg, probs))) %>%
  unnest()
```
* see [quantile example] for a related method that captures the names of the quantiles (rather than requiring the user to manually input a vector of probabilities)
1. What does this code do? Why might it be useful?
```{r, eval = FALSE}
mtcars %>%
  select(1:3) %>%
  group_by(cyl) %>%
  summarise_all(funs(list))
```
* It turns each column into an atomic vector nested by the particular `cyl` value. It differs from `nest` in that each column becomes its own list-column of atomic vectors, whereas `nest` would have created a single list-column of dataframes containing all the values. This could be useful for mapping over particular columns with `purrr`...
* e.g. let's say we want to find the number of unique items in each column for each grouping; we could do that like so:
```{r}
mtcars %>%
  group_by(cyl) %>%
  select(1:5) %>%
  summarise_all(funs(list)) %>%
  mutate_all(funs(unique = map_int(., ~length(unique(.x)))))

# we could also simply overwrite the values (rather than make new columns)
mtcars %>%
  group_by(cyl) %>%
  select(1:5) %>%
  summarise_all(funs(list)) %>%
  mutate_all(funs(map_int(., ~length(unique(.x)))))
```
## 25.5: Simplifying list-columns
### 25.5.3
1. Why might the `lengths()` function be useful for creating atomic
vector columns from list-columns?
If all you wanted was the length (not the number of unique items), you would still need a `map` function when using `length`:
```{r}
mtcars %>%
  group_by(cyl) %>%
  select(1:5) %>%
  summarise_all(funs(list)) %>%
  mutate_all(funs(map_int(., length)))
```
For this problem, using `lengths` avoids the need for a `map` function:
```{r}
mtcars %>%
  group_by(cyl) %>%
  select(1:5) %>%
  summarise_all(funs(list)) %>%
  mutate_all(lengths)
```
* is there a more helpful use case...?
1. List the most common types of vector found in a data frame. What makes lists different?
* the atomic types (logical, integer, double, character) and augmented vectors built on them (factors, dates) are all more common. Their elements share a single type, whereas a list's elements can each be of any type
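A minimal illustration of that difference: atomic vectors coerce their elements to a common type, while lists do not:

```r
x <- c(1, "a")          # atomic: the double 1 is coerced to character
y <- list(1, "a", 1:3)  # list: each element keeps its own type

typeof(x)  # "character"
typeof(y)  # "list"
```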
# Appendix
## models in lists
This is the more traditional way you might store models:
```{r}
models_countries <- purrr::map(by_country$data, country_model)
names(models_countries) <- by_country$country
models_countries[1:3]
```
## list-columns for sampling
Say you want to sample all the flights on 50 days of the year. List-cols can be helpful for generating a sample like this:
```{r}
flights %>%
  mutate(create_date = make_date(year, month, day)) %>%
  select(create_date, 5:8) %>%
  group_by(create_date) %>%
  nest() %>%
  sample_n(50) %>%
  unnest()
```
The alternative is a join, e.g.
```{r}
flights_samp <- flights %>%
  mutate(create_date = make_date(year, month, day)) %>%
  distinct(create_date) %>%
  sample_n(50)

flights %>%
  mutate(create_date = make_date(year, month, day)) %>%
  select(create_date, 5:8) %>%
  semi_join(flights_samp, by = "create_date")
```
* I find the `nest` - `unnest` method more elegant
* I've found the `semi_join` method seems to run faster on large dataframes
## 25.2.5.1
### Include cubic term
Let's look at this example if we had allowed year to be a 3rd order polynomial. We're really stretching our degrees of freedom in this case -- models like these might be unlikely to generalize well to other data.
```{r}
by_country3 %>%
  semi_join(bad_fit, by = "country") %>%
  mutate(
    # create centered data
    data_cent = purrr::map(data, center_value),
    # create cubic (3rd order) models
    mod_cubic = purrr::map(data_cent, lm_quad_2,
                           var = "year_cent + I(year_cent^2) + I(year_cent^3)"),
    # get predictions for 3rd order model
    data_cubic = purrr::map2(data_cent, mod_cubic, add_predictions)) %>%
  unnest(data_cubic) %>%
  ggplot(aes(x = year, group = country)) +
  geom_point(aes(y = lifeExp, colour = country)) +
  geom_line(aes(y = pred, colour = country)) +
  facet_wrap(~country) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
* interpretability of coefficients beyond the quadratic term becomes less straightforward to explain
## Multiple graphs in chunk
My workflow for pulling multiple graphs into a single output has typically been either to build the graphs separately and then pass each to `gridExtra::grid.arrange()`, or to use faceting as much as possible.
In 25.2, Hadley showed an example where he put all the graphs within a chunk into a single outputted figure by setting the code chunk options. The chunk below is that example.
```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.align='default'}
nz <- filter(gapminder, country == "New Zealand")

nz %>%
  ggplot(aes(year, lifeExp)) +
  geom_line() +
  ggtitle("Full data = ")

nz_mod <- lm(lifeExp ~ year, data = nz)

nz %>%
  add_predictions(nz_mod) %>%
  ggplot(aes(year, pred)) +
  geom_line() +
  ggtitle("Linear trend + ")

nz %>%
  add_residuals(nz_mod) %>%
  ggplot(aes(year, resid)) +
  geom_hline(yintercept = 0, colour = "white", size = 3) +
  geom_line() +
  ggtitle("Remaining pattern")
```
* still printing as 3 individual plots, how to fix?
## list(quantile()) examples
```{r}
prob_vals <- c(0, .25, .5, .75, 1)

iris %>%
  group_by(Species) %>%
  summarise(Petal.Length_q = list(quantile(Petal.Length))) %>%
  mutate(probs = list(prob_vals)) %>%
  unnest()
```
*Example for using quantile across range of columns*
*Also notice dynamic method for extracting names*
```{r}
iris %>%
  group_by(Species) %>%
  summarise_all(funs(list(quantile(., probs = prob_vals)))) %>%
  mutate(probs = map(Petal.Length, names)) %>%
  unnest()
```
## extracting row names
Don't know that I'd do this...
```{r}
quantile(1:100) %>%
  as_tibble() %>%
  rownames_to_column()
```
## `invoke_map` example (book)
I liked Hadley's example with `invoke_map` and wanted to save it:
```{r}
sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

sim %>%
  mutate(sims = invoke_map(f, params, n = 10))
```
## named list example (book)
I liked Hadley's example where you have a list of named vectors and need to iterate over both the names and the values, using `enframe` to facilitate this.
Below is the copied example and notes:
```{r}
x <- list(
  a = 1:5,
  b = 3:4,
  c = 5:6
)

df <- enframe(x)
df
```
The advantage of this structure is that it generalises in a straightforward way - names are useful if you have character vector of metadata, but don't help if you have other types of data, or multiple vectors.
Now if you want to iterate over names and values in parallel, you can use `map2()`:
```{r}
df %>%
  mutate(
    smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
  )
```