xxMrPHDxx/modeling-with-data-in-the-tidyverse.Rmd

## modeling-with-data-in-the-tidyverse.Rmd
# Exploratory visualization of house size

## Instruction 1/2
```{r}
# Load packages
library(moderndive)
library(ggplot2)

# Plot the histogram
ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = 'Size (sq.feet)', y = 'count')
```

## Instruction 2/2
- No skew
- sqft_living is right-skewed. (*)
- sqft_living is left-skewed.

# Log10 transformation of house size
```{r}
# Load packages
library(moderndive)
library(dplyr)
library(ggplot2)

# Add log10_size
house_prices_2 <- house_prices %>%
  mutate(log10_size = log10(sqft_living))

# Plot the histogram
ggplot(house_prices_2, aes(x = log10_size)) +
  geom_histogram() +
  labs(x = "log10 size", y = "count")
```

# EDA of relationship of teaching & "beauty" scores
```{r}
# Plot the histogram
ggplot(evals, aes(x=bty_avg)) +
  geom_histogram(binwidth=0.5) +
  labs(x = "Beauty score", y = "count")

# Scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "teaching score")

# Jitter plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  labs(x = "beauty score", y = "teaching score")
```

# Correlation between teaching and "beauty" scores

## Intruction 1/2
```{r}
# Compute correlation
evals %>%
  summarize(correlation = cor(score, bty_avg))
```

## Intruction 2/2
- score and bty_avg are strongly negatively associated.
- score and bty_avg are weakly negatively associated.
- score and bty_avg are weakly positively associated. (*)
- score and bty_avg are strongly positively associated.

# EDA of relationship of house price and waterfront
```{r}
# View the structure of log10_price and waterfront
house_prices %>%
  select(log10_price, waterfront) %>%
  glimpse

# Plot
ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
  geom_boxplot() +
  labs(x = "waterfront", y = "log10 price")
```

# Predicting house price with waterfront
```{r}
# Calculate stats
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price), n = n())

# Prediction of price for houses with view
10^(6.12)

# Prediction of price for houses without view
10^(5.66)
```

# Plotting a "best-fitting" regression line
```{r}
# Load packages
library(ggplot2)
library(dplyr)
library(moderndive)

# Plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "score") +
  geom_smooth(method = 'lm', se = F)
```

# Fitting a regression with a numerical x

## Instruction 1/3
```{r}
# Load package
library(moderndive)

# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Output content
model_score_2
```

## Instruction 2/3
```{r}
# Load package
library(moderndive)

# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Output regression table
get_regression_table(model_score_2)
```

## Instruction 3/3
- For every person who has a beauty score of one, their teaching score will be 0.0670.
- For every increase of one in beauty score, you should observe an associated increase of on average 0.0670 units in teaching score. (*)
- Less "beautiful" instructors tend to get higher teaching evaluation scores.

# Making predictions using "beauty score"
```{r}
# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 0.067 * 5
y_hat

# Compute residual y - y_hat
4.7 - y_hat
```

# Computing fitted/predicted values & residuals

## Instruction 1/3
 ```{r}
 # Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Get regression table
get_regression_table(model_score_2)

# Get all fitted/predicted values and residuals
get_regression_points(model_score_2)
 ```

## Instruction 2/3
 ```{r}
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Get regression table
get_regression_table(model_score_2)

# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
  mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
 ```

## Instruction 3/3
 ```{r}
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Get regression table
get_regression_table(model_score_2)

# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
  mutate(residual_2 = score - score_hat)
```

# EDA of relationship of score and rank
```{r}
ggplot(evals, aes(x=rank, y=score)) +
  geom_boxplot() +
  labs(x = "rank", y = "score")

evals %>%
  group_by(rank) %>%
  summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
```

# Fitting a regression with a categorical x
```{r}
# Fit regression model
model_score_4 <- lm(score ~ rank, data = evals)

# Get regression table
get_regression_table(model_score_4)

# teaching mean
teaching_mean <- 4.28

# tenure track mean
tenure_track_mean <- teaching_mean - 0.13

# tenured mean
tenured_mean <- teaching_mean - 0.145
```

# Making predictions using rank
- A good prediction of their score would be 4.28 - 0.145 = 4.135.
- A good prediction of their score would be -0.145.
- There is no information in the table that can aid your prediction.
	# Exploratory visualization of house size

	## Instruction 1/2
	```{r}
	# Load packages
	library(moderndive)
	library(ggplot2)

	# Plot the histogram
	ggplot(house_prices, aes(x = sqft_living)) +
	geom_histogram() +
	labs(x = 'Size (sq.feet)', y = 'count')
	```

	## Instruction 2/2
	- No skew
	- sqft_living is right-skewed. (*)
	- sqft_living is left-skewed.

	# Log10 transformation of house size
	```{r}
	# Load packages
	library(moderndive)
	library(dplyr)
	library(ggplot2)

	# Add log10_size
	house_prices_2 <- house_prices %>%
	mutate(log10_size = log10(sqft_living))

	# Plot the histogram
	ggplot(house_prices_2, aes(x = log10_size)) +
	geom_histogram() +
	labs(x = "log10 size", y = "count")
	```

	# EDA of relationship of teaching & "beauty" scores
	```{r}
	# Plot the histogram
	ggplot(evals, aes(x=bty_avg)) +
	geom_histogram(binwidth=0.5) +
	labs(x = "Beauty score", y = "count")

	# Scatterplot
	ggplot(evals, aes(x = bty_avg, y = score)) +
	geom_point() +
	labs(x = "beauty score", y = "teaching score")

	# Jitter plot
	ggplot(evals, aes(x = bty_avg, y = score)) +
	geom_jitter() +
	labs(x = "beauty score", y = "teaching score")
	```

	# Correlation between teaching and "beauty" scores

	## Intruction 1/2
	```{r}
	# Compute correlation
	evals %>%
	summarize(correlation = cor(score, bty_avg))
	```

	## Intruction 2/2
	- score and bty_avg are strongly negatively associated.
	- score and bty_avg are weakly negatively associated.
	- score and bty_avg are weakly positively associated. (*)
	- score and bty_avg are strongly positively associated.

	# EDA of relationship of house price and waterfront
	```{r}
	# View the structure of log10_price and waterfront
	house_prices %>%
	select(log10_price, waterfront) %>%
	glimpse

	# Plot
	ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
	geom_boxplot() +
	labs(x = "waterfront", y = "log10 price")
	```

	# Predicting house price with waterfront
	```{r}
	# Calculate stats
	house_prices %>%
	group_by(waterfront) %>%
	summarize(mean_log10_price = mean(log10_price), n = n())

	# Prediction of price for houses with view
	10^(6.12)

	# Prediction of price for houses without view
	10^(5.66)
	```

	# Plotting a "best-fitting" regression line
	```{r}
	# Load packages
	library(ggplot2)
	library(dplyr)
	library(moderndive)

	# Plot
	ggplot(evals, aes(x = bty_avg, y = score)) +
	geom_point() +
	labs(x = "beauty score", y = "score") +
	geom_smooth(method = 'lm', se = F)
	```

	# Fitting a regression with a numerical x

	## Instruction 1/3
	```{r}
	# Load package
	library(moderndive)

	# Fit model
	model_score_2 <- lm(score ~ bty_avg, data = evals)

	# Output content
	model_score_2
	```

	## Instruction 2/3
	```{r}
	# Load package
	library(moderndive)

	# Fit model
	model_score_2 <- lm(score ~ bty_avg, data = evals)

	# Output regression table
	get_regression_table(model_score_2)
	```

	## Instruction 3/3
	- For every person who has a beauty score of one, their teaching score will be 0.0670.
	- For every increase of one in beauty score, you should observe an associated increase of on average 0.0670 units in teaching score. (*)
	- Less "beautiful" instructors tend to get higher teaching evaluation scores.

	# Making predictions using "beauty score"
	```{r}
	# Use fitted intercept and slope to get a prediction
	y_hat <- 3.88 + 0.067 * 5
	y_hat

	# Compute residual y - y_hat
	4.7 - y_hat
	```

	# Computing fitted/predicted values & residuals

	## Instruction 1/3
	```{r}
	# Fit regression model
	model_score_2 <- lm(score ~ bty_avg, data = evals)

	# Get regression table
	get_regression_table(model_score_2)

	# Get all fitted/predicted values and residuals
	get_regression_points(model_score_2)
	```

	## Instruction 2/3
	```{r}
	# Fit regression model
	model_score_2 <- lm(score ~ bty_avg, data = evals)

	# Get regression table
	get_regression_table(model_score_2)

	# Get all fitted/predicted values and residuals
	get_regression_points(model_score_2) %>%
	mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
	```

	## Instruction 3/3
	```{r}
	# Fit regression model
	model_score_2 <- lm(score ~ bty_avg, data = evals)

	# Get regression table
	get_regression_table(model_score_2)

	# Get all fitted/predicted values and residuals
	get_regression_points(model_score_2) %>%
	mutate(residual_2 = score - score_hat)
	```

	# EDA of relationship of score and rank
	```{r}
	ggplot(evals, aes(x=rank, y=score)) +
	geom_boxplot() +
	labs(x = "rank", y = "score")

	evals %>%
	group_by(rank) %>%
	summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
	```

	# Fitting a regression with a categorical x
	```{r}
	# Fit regression model
	model_score_4 <- lm(score ~ rank, data = evals)

	# Get regression table
	get_regression_table(model_score_4)

	# teaching mean
	teaching_mean <- 4.28

	# tenure track mean
	tenure_track_mean <- teaching_mean - 0.13

	# tenured mean
	tenured_mean <- teaching_mean - 0.145
	```

	# Making predictions using rank
	- A good prediction of their score would be 4.28 - 0.145 = 4.135.
	- A good prediction of their score would be -0.145.
	- There is no information in the table that can aid your prediction.