Skip to content

Instantly share code, notes, and snippets.

@jmclawson
Last active November 16, 2023 19:34
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jmclawson/50e0fbf83d24ba4c9dde0711f362dc45 to your computer and use it in GitHub Desktop.
Save jmclawson/50e0fbf83d24ba4c9dde0711f362dc45 to your computer and use it in GitHub Desktop.
Understanding and using linear models in R
---
title: "Understanding and using linear models in R"
subtitle: "A 10-slide guide to the dark side"
format:
revealjs:
slide-number: true
embed-resources: true
warning: false
message: false
theme: dark
---
## Case study: finding mass from height
The `starwars` data set is always useful. We're filtering out Jabba Desilijic Tiure who is *surprisingly* massive for their height. We're also filtering out anyone whose mass is missing.
```{r}
#| label: setup
#| echo: true
library(tidyverse)
starwars |>
filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |>
select(name, height, mass) |>
head()
```
##
```{r}
# black ggplot theme:
source("https://gist.githubusercontent.com/jmclawson/47c84f8f3d4e6e09edfd42ed112f4e03/raw/63fecc34d65a7bbd8c61e9b7aa0c1591ac25e40c/theme_black.R")
base_plot <- starwars |>
filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |>
lm(mass ~ height, data = _) |>
ggplot(aes(height, mass)) +
geom_point(color = "lightgray",
shape = 21,
fill = "black") +
labs(title = "Height and mass look related. That is, taller characters are also more massive.",
subtitle = "The <i>response</i> variable is typically shown on the Y-axis. We’re predicting <b>mass</b> from <b>height</b>.",
y = "mass (kg)",
x = "height (cm)") +
theme_black(base_size = 14, black = "#191919") +
theme(plot.title = ggtext::element_textbox(),
plot.caption = ggtext::element_textbox(),
plot.subtitle = ggtext::element_textbox(),
plot.title.position = "plot",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
base_plot
```
##
```{r}
color_res <- "#009E73"
color_pred <- "#CC79A7"
color_trend <- "#0072B2"
color_actual <- "#EEEEEE"
second_base <- base_plot +
geom_smooth(method = "lm",
se = FALSE,
color = color_trend) +
labs(title = paste0('<b>geom_smooth(method=\\"lm\\")</b> fits a linear <b style="color:',color_trend,';">trend line</b> confirming the relationship.'),
subtitle = "If we want to learn more details, we'll need to model the relationship of <b>height</b> to <b>mass</b>.")
second_base
```
##
```{r}
third_base <- second_base +
geom_segment(aes(xend = height,
yend = .fitted),
color = color_res) +
geom_point(color = "gray",
shape = 21,
fill = "black") +
geom_smooth(method = "lm",
formula = y ~ x,
se = FALSE,
color = color_trend) +
geom_point(aes(y = .fitted),
color = "white",
shape = 21,
fill = color_pred) +
labs(title = paste0('<b>lm(mass ~ height, data = starwars)</b> models relationship of <b>height</b> to <b>mass</b>.'),
subtitle = paste0('This model can show us values for the <b style="color:',color_pred,';">predicted fit</b> and the <b style="color:',color_res,';">residual distances</b>.')) +
scale_color_identity()
third_base
```
##
```{r}
annotations <- tibble(
label = c(
"The model's **.fitted** column<br>shows the predicted mass<br>for a given height",
"The **mass** column in **starwars**<br>shows the known mass for<br>a given height",
"The model's **.resid**<br>column measures<br>inaccuracy"),
color = c(color_pred, color_actual, color_res),
x = c(130, 160, 220),
y = c(95, 145,40))
home_base <- third_base +
annotate(geom = "segment",
color = color_res,
x = 220, xend = 195,
y = 40, yend = 60,
arrow = arrow(length = unit(0.1,"cm"))) +
annotate(geom = "segment",
color = color_res,
x = 220, xend = 198,
y = 40, yend = 74,
arrow = arrow(length = unit(0.1,"cm"))) +
annotate(geom = "segment",
color = color_actual,
x = 173, xend = 177,
y = 145, yend = 122.5,
arrow = arrow(length = unit(0.1,"cm"))) +
annotate(geom = "segment",
color = color_actual,
x = 173, xend = 188,
y = 145, yend = 116,
arrow = arrow(length = unit(0.1,"cm"))) +
annotate(geom = "segment",
color = color_pred,
x = 130, xend = 148,
y = 95, yend = 63,
arrow = arrow(length = unit(0.1,"cm"))) +
annotate(geom = "segment",
color = color_pred,
x = 130, xend = 157.2,
y = 95, yend = 70.4,
arrow = arrow(length = unit(0.1,"cm"))) +
ggtext::geom_richtext(
data = annotations,
aes(x = x, y = y, color = color, label = label),
fill = "#191919") +
labs(caption = paste0('A model’s effectiveness is expressed by its Root Mean Squared Error, <b>sqrt(var(<span style="color:',color_res,';">model$.resid</span>))</b> = <b>',round(sqrt(var(lm(mass ~ height, data = filter(starwars, name != "Jabba Desilijic Tiure", !is.na(mass)))$residuals)),1),'</b>'))
home_base +
labs(caption = NULL)
```
## `lm()` fits a linear model
Inside the function, a formula `y ~ x` will show that we want to deduce `y` from `x`:
```{r}
#| echo: true
#| output-location: fragment
the_model <- starwars |>
filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |>
lm(mass ~ height, data = _)
the_model
```
## Use **broom** to `tidy()`
```{r}
#| echo: true
library(broom)
tidy(the_model)
```
::: {.fragment}
Coefficients reveal the trend line's equation, *y = mx + a*
::: columns
::: {.column width="50%"}
* *m* = height = `r round(the_model$coefficients[[2]], 3)`; each cm adds this many kg
* At 100cm, a character's mass should be `r round(100 * the_model$coefficients[[2]] + the_model$coefficients[[1]], 1)`kg.
:::
::: {.column width="50%"}
* *a* = Intercept = `r round(the_model$coefficients[[1]], 1)`; or height if mass is 0kg
* At 200cm, a character's mass should be `r round(200 * the_model$coefficients[[2]] + the_model$coefficients[[1]], 1)`kg.
:::
:::
:::
## `augment()` adds two columns
```{r}
#| echo: true
starwars |>
select(name, height, mass) |>
head() |>
augment(x = the_model,
newdata = _)
```
* `.fitted` is the predicted mass
* `.resid` calculates the actual `mass` minus this prediction
## RMSE measures a model's accuracy
* RMSE stands for Root Mean Squared Error
* lower RMSE is better than higher
* calculate with `sqrt(var(the_model$residuals))`
```{r}
#| fig-align: center
third_base +
labs(
title = paste0('This model’s RMSE is <b>',round(sqrt(var(the_model$residuals)),2),'kg</b>.'),
caption = NULL,
subtitle = NULL)
```
@jmclawson
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment