## This long coding example shows you how to obtain
## confidence intervals for logistic regression.
## The appendix at the very bottom also shows you how to obtain
## something analogous to prediction intervals
## for a logistic regression.
## Here's a high-level summary of the basic procedure, step-by-step:
## Step 1: Run the desired logistic regression, including any desired interactions.
##
## Step 2: Predict the LINEAR LOGIT (not the predicted probabilities)
##         using predict().
##         Make sure your predict() function call includes the argument
##         se.fit = TRUE ------ this will give you an "se.fit" component.
##
## Step 3: Take your linear logit predictions, and add and subtract 1.96*se.fit.
##         Your observation-specific predictions and se.fit numbers come
##         from the predict() output (from Step 2 above). For each observation,
##         store the 2 numbers you obtained -- one from adding 1.96*se.fit, and
##         the other from subtracting 1.96*se.fit.
##
## Step 4: Take these 2 numbers (for each observation), which represent that
##         observation's confidence interval bounds, and convert them into
##         predicted probabilities by using the logistic function:
##         exp(logit) / (1 + exp(logit))
##
## Now you have bounds on predicted probabilities for each observation, and
## you can create tables & data visualizations communicating conf intervals,
## because conf intervals for logistic regression live on the predicted
## probability scale. Details below.
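## A compact sketch of Steps 1-4 in one place (hypothetical names: outcome
## "y", predictor "x", data frame "dat"); the Lalonde walkthrough below does
## exactly this with real data, so the sketch is left commented out:
# fit <- glm(y ~ x, data = dat, family = binomial)         # Step 1
# prd <- predict(fit, newdata = dat, se.fit = TRUE)        # Step 2 (logit scale)
# upper_logit <- prd$fit + 1.96 * prd$se.fit               # Step 3
# lower_logit <- prd$fit - 1.96 * prd$se.fit
# upper_prob <- exp(upper_logit) / (1 + exp(upper_logit))  # Step 4 (probability scale)
# lower_prob <- exp(lower_logit) / (1 + exp(lower_logit))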
# we'll use the Lalonde data for this illustrative example
library(Matching)
data(lalonde)
# let's run a logistic regression model that predicts
# marital status (married = 1 or 0)
# using information from "re75" (real earnings in 1975)
model_1 <- glm(married ~ re75, data=lalonde, family=binomial)
#### by the way, if you had wanted to include an interaction term, you would:
#### (1) create the variable and include it in your dataframe, e.g., if you
####     wanted to interact education and re75 for some reason, here is the
####     code: lalonde$interact <- lalonde$re75*lalonde$educ
#### (2) incorporate the interaction term into your regression:
####     reg1_interact <- glm(married ~ re75 + educ + interact,
####                          data = lalonde,
####                          family = binomial)
#### (3) everything else is as described in the more generic process w/out
####     the interaction term, below -- when you use predict(),
####     just make sure your newdata has a column called "interact"
####     so the predict() function can find it. Of course, if you are simply
####     making predictions for the data that fit the model (as below),
####     then the necessary variables will already be in the newdata
####     data set. Below, we continue the example w/out any interaction terms:
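#### (A side note, not part of the recipe above: R's formula interface can
#### also build the interaction for you, so the manual "interact" column is
#### optional -- "re75 * educ" expands to re75 + educ + re75:educ, and
#### predict() then only needs "re75" and "educ" columns in newdata.
#### The name "reg1_interact_alt" below is just illustrative.)
# reg1_interact_alt <- glm(married ~ re75 * educ,
#                          data = lalonde,
#                          family = binomial)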
# let's get the predicted values out, BUT... for a conf interval, we want the
# LOGIT predicted values (values produced by the linear logit model), NOT the
# predicted values we would ORDINARILY obtain from a logistic regression (which
# are the predicted probabilities).
# If you forgot what the linear logit is: the equation for predicted probabilities is
# exp(linear logit) / (1 + exp(linear logit))
# Check the ISLR textbook if you want to see the formula nicely written.
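# (Side note: base R's plogis() computes exactly this logistic function,
# exp(x) / (1 + exp(x)), so it could stand in for the hand-typed formula below.)
# all.equal(plogis(1.3), exp(1.3) / (1 + exp(1.3)))  # TRUE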
# in the predict() call below, se.fit = TRUE will enable you to get the
# standard error of every predicted value, which we will use shortly.
# let's get the predicted logit numbers, each with its measure of uncertainty
# (its standard error). We'll do this for our original data set (which you
# can think of as the training set, because you used it to fit your model).
pred_dat <- predict(model_1, newdata = lalonde, se.fit=TRUE)
# the pred_dat object you created in the line above^^ will have 3 components:
names(pred_dat)
# [1] "fit" "se.fit" "residual.scale"
# We are only interested in the first two: the "fit" and the "se.fit":
# pred_dat$fit is the predicted value on the linear (logit) scale, for that observation.
# sometimes this is called a "fitted value". To get the predicted probabilities
# (if I wanted them now, which I don't) I would type:
# predict(model_1, type = "response")
# pred_dat$se.fit is the standard error of that specific fitted value.
# Now it's time to create a data visualization showing the conf interval
# of the dependent variable for this prediction problem.
# Since we are predicting the probability of "married == 1",
# the y axis on our figure will be the probability of being married.
# plot the data:
# the x-axis will be earnings, y-axis will be the probability that married == 1
plot(x = lalonde$re75, y = lalonde$married, pch = 16,
     xlab = "real $ earnings in 1975",
     ylab = "probability of being married",
     main = "Do 1975 earnings predict marriage?",
     type = "n") # the type = "n" means it draws the axes but does not plot
                 # the points. I don't plot the points now b/c I want to show
                 # predicted probabilities, not actual 1s and 0s in the data.
                 # And I have not yet created any predicted probabilities.
# now add points to the figure, showing predicted probabilities, applying the
# exp / (1 + exp) formula to get probabilities from our linear logit predictions:
points(x = lalonde$re75, y = exp(pred_dat$fit)/(1 + exp(pred_dat$fit)),
       col = "blue", pch = 16) # pch = 16 makes the points solid-colored circles
# (in the preceding line, instead of using the formula exp / (1 + exp), we
# could have estimated & plotted predicted probabilities this way instead):
# predicted_probabilities <- predict(model_1, type = "response")
# now add more points showing confidence intervals, using the "se.fit" results,
# and knowing that the boundaries of confidence intervals (on the logit scale)
# are basically the predicted values PLUS & MINUS 1.96*standard error, i.e., se.fit
upper_conf_logit <- pred_dat$fit + 1.96*(pred_dat$se.fit)
lower_conf_logit <- pred_dat$fit - 1.96*(pred_dat$se.fit)
# now we convert these bounds (on the logit scale) to the probability scale
# by applying the logistic function:
upper_conf_prob <- exp(upper_conf_logit)/(1 + exp(upper_conf_logit))
lower_conf_prob <- exp(lower_conf_logit)/(1 + exp(lower_conf_logit))
# we plot these points, which are on the probability scale:
points(x = lalonde$re75, upper_conf_prob, col = "red", pch = 16, cex = 0.5)
points(x = lalonde$re75, lower_conf_prob, col = "red", pch = 16, cex = 0.5)
# APPENDIX: a few additional comments (see "A", "B", and "C" below):
# A: Covering more values in the confidence interval, making it prettier
# B: Multiplying by 1.96: this is approximate, and we can make it more precise
# C: How to create something analogous to PREDICTION intervals (1s and 0s)
# A. The confidence interval in the dataviz is only showing results for the
# re75 values that are in our data set. So the conf interval looks a bit spotty.
# If we want an interval that looks more like a solid line, then we can
# calculate a confidence interval for each x-axis number 0, 1, 2, 3, etc.,
# from the smallest x-value (0) to the largest x-value (25142.2):
# > max(lalonde$re75)
# [1] 25142.2
re75 <- 1:25143 # create a new object called re75 (b/c "re75" = "x" variable).
                # this "re75" is not in the lalonde dataset.
                # lalonde$re75 is still there, unchanged.
# I then make re75 a dataframe called "data_new": the predict() function needs
# a "newdata" input with the x-variable(s) in my regression model (i.e., re75)...
data_new <- data.frame(re75)
# I run "predict()" again, with my new "newdata" that has more x-vals (1:25143)
pred2 <- predict(model_1, newdata = data_new, se.fit=TRUE)
# I re-create the conf intervals and plot them:
upper_conf_logit <- pred2$fit + 1.96*(pred2$se.fit)
upper_conf_prob <- exp(upper_conf_logit)/(1 + exp(upper_conf_logit))
lower_conf_logit <- pred2$fit - 1.96*(pred2$se.fit)
lower_conf_prob <- exp(lower_conf_logit)/(1 + exp(lower_conf_logit))
# notice that below, I am not using x = lalonde$re75 (as above),
# but I am using the newly-created re75, which has many more x-vals.
points(x = re75, upper_conf_prob, col = "red", pch = 16, cex = 0.5)
points(x = re75, lower_conf_prob, col = "red", pch = 16, cex = 0.5)
# B. Above we simply use "1.96" as our multiplier to get conf intervals,
# but the exact multiplier number will depend upon our sample size.
# To get the exact number, you need to run a qt() function,
# and you need the number of "degrees of freedom" (abbreviation "df"),
# which is simply calculated as: n - k - 1,
# where n is the sample size and k is the number of predictors.
# e.g., for this exercise, the sample size was 445 and k = 1, so df = 443.
qt(0.975, df = 443)
# [1] 1.965333 -- this is more accurate than 1.96 (but it's so close who cares.)
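# The same multiplier, computed generically from objects defined above
# (the names n_obs and k_preds are just illustrative):
n_obs <- nrow(lalonde)                # 445 observations
k_preds <- length(coef(model_1)) - 1  # 1 predictor (re75)
qt(0.975, df = n_obs - k_preds - 1)   # same 1.965333 as above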
# C. The above discussion produced confidence intervals. To produce
# prediction intervals for logistic regression is a little strange, because
# all the predictions are 1s and 0s. There is a way to do something like it:
# Unfortunately, it requires using a new library, the "arm" library, and the
# sim() function within that library. It's a little complicated.
# I show this procedure below:
install.packages("arm")
library(arm)
sim_output <- sim(model_1, n.sims = 100) # the 100 is my arbitrary choice
# using sim(), we can obtain simulated regression coefficients (our model had
# 2 coefficients --- one for the intercept and one for the re75 variable).
# These simulated regression coefficients can be used to obtain 100 different
# predicted PROBABILITY values with a for-loop.
# below, I will demonstrate this for the very first observation in lalonde
# only -- this observation had an re75 equal to 0.
# > lalonde$re75[1]
# [1] 0
# See below:
# first we create an empty storage vector, that we will fill in the for-loop:
predicted_probabilities <- c(NA)
for (i in 1:100) # 100 because we had arbitrarily created 100 simulated values
{
  linear_logit <- # intercept * 1 + coefficient * lalonde$re75[1]
    sum(sim_output@coef[i,] * c(1, lalonde$re75[1])) # intercept multiplied by 1
  predicted_probabilities[i] <- exp(linear_logit) / (1 + exp(linear_logit))
}
# Now we have 100 simulated predicted probabilities.
# We use each of these simulated probabilities to actually flip coins associated
# with these probabilities and get predicted BINARY outcomes.
# E.g., if the predicted probability was 50%, it's analogous to a fair coin.
# If the probability was 70%, then it would be analogous to a coin flip that
# favors heads. We keep track of all 100 results, and then we plot them.
# To do this, we use a for-loop again, creating a storage vector and filling it,
# by running the sample() function for binary outcomes (1, 0) and using
# the sample() function's "prob" argument to specify the coin-flip probability.
# We're going to get one predicted outcome for each predicted probability.
predicted_outcomes <- c(NA)
for(i in 1:100)
{
  predicted_outcomes[i] <-
    sample(c(1,0), # the binary outcomes
           size = 1, # we only get 1 predicted outcome per predicted probability
           prob = c(predicted_probabilities[i],    # prob of a 1
                    1 - predicted_probabilities[i]) # prob of a 0
           )
}
# Now we have 100 predicted 1s and 0s. How many 1s and 0s occurred?
# table(predicted_outcomes)
# predicted_outcomes
#  0  1
# 88 12
# Remember that all of these 100 predicted 1/0 outcomes are for the very
# first lalonde observation -- if we wanted a prediction interval for values of
# re75 other than 0, or e.g., for every re75 value in the original lalonde data,
# then we would repeat this process for each and every lalonde observation.
# Now you can add these points to the plot in a clever way that communicates
# that the incidence of predicted 0s is much greater than the incidence of
# predicted 1s:
points(x = jitter(rep(lalonde$re75[1], 100), factor = 10000),
       y = jitter(predicted_outcomes, factor = 0.1),
       col = "wheat", pch = 1)
# This is what you do for the first observation in the lalonde data set.
# Create a for-loop that performs the procedure above for each and every
# observation, one after another (a sketch of such a loop follows below), and
# you will get 2 lines representing upper and lower bounds on predicted binary
# outcomes, with the thickness of the lines roughly indicating the relative
# frequency of 1s and 0s.
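# A sketch of that loop (my own illustration of the procedure just described,
# reusing the sim_output object created by sim() above). It repeats the
# single-observation simulation for every row of lalonde and plots the jittered
# 1/0 predictions as it goes. I use a fixed jitter width of +/- 200 on the
# x-axis, roughly what factor = 10000 produced above when re75 was 0:
for (obs in 1:nrow(lalonde))
{
  probs_obs <- rep(NA, 100)
  outcomes_obs <- rep(NA, 100)
  for (i in 1:100)
  {
    linear_logit_i <- sum(sim_output@coef[i,] * c(1, lalonde$re75[obs]))
    probs_obs[i] <- exp(linear_logit_i) / (1 + exp(linear_logit_i))
    outcomes_obs[i] <- sample(c(1,0), size = 1,
                              prob = c(probs_obs[i], 1 - probs_obs[i]))
  }
  points(x = jitter(rep(lalonde$re75[obs], 100), amount = 200),
         y = jitter(outcomes_obs, factor = 0.1),
         col = "wheat", pch = 1)
}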