Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save xxMrPHDxx/bcf3c5d14e3db7334bdf1001d931466c to your computer and use it in GitHub Desktop.
Save xxMrPHDxx/bcf3c5d14e3db7334bdf1001d931466c to your computer and use it in GitHub Desktop.
modeling-with-data-in-the-tidyverse
# Exploratory visualization of house size
## Instruction 1/2
```{r}
# Load packages
library(moderndive)
library(ggplot2)
# Plot the histogram
ggplot(house_prices, aes(x = sqft_living)) +
geom_histogram() +
labs(x = 'Size (sq.feet)', y = 'count')
```
## Instruction 2/2
- No skew
- sqft_living is right-skewed. (*)
- sqft_living is left-skewed.
# Log10 transformation of house size
```{r}
# Load packages
library(moderndive)
library(dplyr)
library(ggplot2)
# Add log10_size
house_prices_2 <- house_prices %>%
mutate(log10_size = log10(sqft_living))
# Plot the histogram
ggplot(house_prices_2, aes(x = log10_size)) +
geom_histogram() +
labs(x = "log10 size", y = "count")
```
# EDA of relationship of teaching & "beauty" scores
```{r}
# Plot the histogram
ggplot(evals, aes(x=bty_avg)) +
geom_histogram(binwidth=0.5) +
labs(x = "Beauty score", y = "count")
# Scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point() +
labs(x = "beauty score", y = "teaching score")
# Jitter plot
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_jitter() +
labs(x = "beauty score", y = "teaching score")
```
# Correlation between teaching and "beauty" scores
## Intruction 1/2
```{r}
# Compute correlation
evals %>%
summarize(correlation = cor(score, bty_avg))
```
## Intruction 2/2
- score and bty_avg are strongly negatively associated.
- score and bty_avg are weakly negatively associated.
- score and bty_avg are weakly positively associated. (*)
- score and bty_avg are strongly positively associated.
# EDA of relationship of house price and waterfront
```{r}
# View the structure of log10_price and waterfront
house_prices %>%
select(log10_price, waterfront) %>%
glimpse
# Plot
ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
geom_boxplot() +
labs(x = "waterfront", y = "log10 price")
```
# Predicting house price with waterfront
```{r}
# Calculate stats
house_prices %>%
group_by(waterfront) %>%
summarize(mean_log10_price = mean(log10_price), n = n())
# Prediction of price for houses with view
10^(6.12)
# Prediction of price for houses without view
10^(5.66)
```
# Plotting a "best-fitting" regression line
```{r}
# Load packages
library(ggplot2)
library(dplyr)
library(moderndive)
# Plot
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point() +
labs(x = "beauty score", y = "score") +
geom_smooth(method = 'lm', se = F)
```
# Fitting a regression with a numerical x
## Instruction 1/3
```{r}
# Load package
library(moderndive)
# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Output content
model_score_2
```
## Instruction 2/3
```{r}
# Load package
library(moderndive)
# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Output regression table
get_regression_table(model_score_2)
```
## Instruction 3/3
- For every person who has a beauty score of one, their teaching score will be 0.0670.
- For every increase of one in beauty score, you should observe an associated increase of on average 0.0670 units in teaching score. (*)
- Less "beautiful" instructors tend to get higher teaching evaluation scores.
# Making predictions using "beauty score"
```{r}
# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 0.067 * 5
y_hat
# Compute residual y - y_hat
4.7 - y_hat
```
# Computing fitted/predicted values & residuals
## Instruction 1/3
```{r}
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Get regression table
get_regression_table(model_score_2)
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2)
```
## Instruction 2/3
```{r}
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Get regression table
get_regression_table(model_score_2)
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
```
## Instruction 3/3
```{r}
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Get regression table
get_regression_table(model_score_2)
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
mutate(residual_2 = score - score_hat)
```
# EDA of relationship of score and rank
```{r}
ggplot(evals, aes(x=rank, y=score)) +
geom_boxplot() +
labs(x = "rank", y = "score")
evals %>%
group_by(rank) %>%
summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
```
# Fitting a regression with a categorical x
```{r}
# Fit regression model
model_score_4 <- lm(score ~ rank, data = evals)
# Get regression table
get_regression_table(model_score_4)
# teaching mean
teaching_mean <- 4.28
# tenure track mean
tenure_track_mean <- teaching_mean - 0.13
# tenured mean
tenured_mean <- teaching_mean - 0.145
```
# Making predictions using rank
- A good prediction of their score would be 4.28 - 0.145 = 4.135.
- A good prediction of their score would be -0.145.
- There is no information in the table that can aid your prediction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment