Created
July 22, 2019 04:34
-
-
Save xxMrPHDxx/bcf3c5d14e3db7334bdf1001d931466c to your computer and use it in GitHub Desktop.
modeling-with-data-in-the-tidyverse
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Exploratory visualization of house size | |
## Instruction 1/2 | |
```{r} | |
# Load packages | |
library(moderndive) | |
library(ggplot2) | |
# Plot the histogram | |
ggplot(house_prices, aes(x = sqft_living)) + | |
geom_histogram() + | |
labs(x = 'Size (sq.feet)', y = 'count') | |
``` | |
## Instruction 2/2 | |
- No skew | |
- sqft_living is right-skewed. (*) | |
- sqft_living is left-skewed. | |
# Log10 transformation of house size | |
```{r} | |
# Load packages | |
library(moderndive) | |
library(dplyr) | |
library(ggplot2) | |
# Add log10_size | |
house_prices_2 <- house_prices %>% | |
mutate(log10_size = log10(sqft_living)) | |
# Plot the histogram | |
ggplot(house_prices_2, aes(x = log10_size)) + | |
geom_histogram() + | |
labs(x = "log10 size", y = "count") | |
``` | |
# EDA of relationship of teaching & "beauty" scores | |
```{r} | |
# Plot the histogram | |
ggplot(evals, aes(x=bty_avg)) + | |
geom_histogram(binwidth=0.5) + | |
labs(x = "Beauty score", y = "count") | |
# Scatterplot | |
ggplot(evals, aes(x = bty_avg, y = score)) + | |
geom_point() + | |
labs(x = "beauty score", y = "teaching score") | |
# Jitter plot | |
ggplot(evals, aes(x = bty_avg, y = score)) + | |
geom_jitter() + | |
labs(x = "beauty score", y = "teaching score") | |
``` | |
# Correlation between teaching and "beauty" scores | |
## Intruction 1/2 | |
```{r} | |
# Compute correlation | |
evals %>% | |
summarize(correlation = cor(score, bty_avg)) | |
``` | |
## Intruction 2/2 | |
- score and bty_avg are strongly negatively associated. | |
- score and bty_avg are weakly negatively associated. | |
- score and bty_avg are weakly positively associated. (*) | |
- score and bty_avg are strongly positively associated. | |
# EDA of relationship of house price and waterfront | |
```{r} | |
# View the structure of log10_price and waterfront | |
house_prices %>% | |
select(log10_price, waterfront) %>% | |
glimpse | |
# Plot | |
ggplot(house_prices, aes(x = waterfront, y = log10_price)) + | |
geom_boxplot() + | |
labs(x = "waterfront", y = "log10 price") | |
``` | |
# Predicting house price with waterfront | |
```{r} | |
# Calculate stats | |
house_prices %>% | |
group_by(waterfront) %>% | |
summarize(mean_log10_price = mean(log10_price), n = n()) | |
# Prediction of price for houses with view | |
10^(6.12) | |
# Prediction of price for houses without view | |
10^(5.66) | |
``` | |
# Plotting a "best-fitting" regression line | |
```{r} | |
# Load packages | |
library(ggplot2) | |
library(dplyr) | |
library(moderndive) | |
# Plot | |
ggplot(evals, aes(x = bty_avg, y = score)) + | |
geom_point() + | |
labs(x = "beauty score", y = "score") + | |
geom_smooth(method = 'lm', se = F) | |
``` | |
# Fitting a regression with a numerical x | |
## Instruction 1/3 | |
```{r} | |
# Load package | |
library(moderndive) | |
# Fit model | |
model_score_2 <- lm(score ~ bty_avg, data = evals) | |
# Output content | |
model_score_2 | |
``` | |
## Instruction 2/3 | |
```{r} | |
# Load package | |
library(moderndive) | |
# Fit model | |
model_score_2 <- lm(score ~ bty_avg, data = evals) | |
# Output regression table | |
get_regression_table(model_score_2) | |
``` | |
## Instruction 3/3 | |
- For every person who has a beauty score of one, their teaching score will be 0.0670. | |
- For every increase of one in beauty score, you should observe an associated increase of on average 0.0670 units in teaching score. (*) | |
- Less "beautiful" instructors tend to get higher teaching evaluation scores. | |
# Making predictions using "beauty score" | |
```{r} | |
# Use fitted intercept and slope to get a prediction | |
y_hat <- 3.88 + 0.067 * 5 | |
y_hat | |
# Compute residual y - y_hat | |
4.7 - y_hat | |
``` | |
# Computing fitted/predicted values & residuals | |
## Instruction 1/3 | |
```{r} | |
# Fit regression model | |
model_score_2 <- lm(score ~ bty_avg, data = evals) | |
# Get regression table | |
get_regression_table(model_score_2) | |
# Get all fitted/predicted values and residuals | |
get_regression_points(model_score_2) | |
``` | |
## Instruction 2/3 | |
```{r} | |
# Fit regression model | |
model_score_2 <- lm(score ~ bty_avg, data = evals) | |
# Get regression table | |
get_regression_table(model_score_2) | |
# Get all fitted/predicted values and residuals | |
get_regression_points(model_score_2) %>% | |
mutate(score_hat_2 = 3.88 + 0.067 * bty_avg) | |
``` | |
## Instruction 3/3 | |
```{r} | |
# Fit regression model | |
model_score_2 <- lm(score ~ bty_avg, data = evals) | |
# Get regression table | |
get_regression_table(model_score_2) | |
# Get all fitted/predicted values and residuals | |
get_regression_points(model_score_2) %>% | |
mutate(residual_2 = score - score_hat) | |
``` | |
# EDA of relationship of score and rank | |
```{r} | |
ggplot(evals, aes(x=rank, y=score)) + | |
geom_boxplot() + | |
labs(x = "rank", y = "score") | |
evals %>% | |
group_by(rank) %>% | |
summarize(n = n(), mean_score = mean(score), sd_score = sd(score)) | |
``` | |
# Fitting a regression with a categorical x | |
```{r} | |
# Fit regression model | |
model_score_4 <- lm(score ~ rank, data = evals) | |
# Get regression table | |
get_regression_table(model_score_4) | |
# teaching mean | |
teaching_mean <- 4.28 | |
# tenure track mean | |
tenure_track_mean <- teaching_mean - 0.13 | |
# tenured mean | |
tenured_mean <- teaching_mean - 0.145 | |
``` | |
# Making predictions using rank | |
- A good prediction of their score would be 4.28 - 0.145 = 4.135. | |
- A good prediction of their score would be -0.145. | |
- There is no information in the table that can aid your prediction. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment