## This long coding example shows you how to obtain
## confidence intervals for logistic regression.
## The appendix at the very bottom also shows you how to obtain
## something analogous to prediction intervals
## for a logistic regression.
## Here's a high-level summary of the basic procedure, step-by-step:
## Step 1: Run the desired logistic regression, including any desired interactions.
##
## Step 2: Predict the LINEAR LOGIT (not the predicted probabilities)
##         using predict().
##         Make sure your predict() function call includes the argument
##         se.fit = TRUE ------ this will give you an "se.fit" component.
##
## Step 3: Take your linear logit predictions, and add and subtract 1.96*se.fit.
##         Your observation-specific predictions and se.fit numbers come
##         from the predict() output (from Step 2 above). For each observation,
##         store the 2 numbers you obtained -- one from adding 1.96*se.fit, and
##         the other from subtracting 1.96*se.fit.
##
## Step 4: Take these 2 numbers (for each observation), which represent that
##         observation's confidence interval bounds, and convert them into
##         predicted probabilities by using the logistic function:
##         exp(logit) / (1 + exp(logit))
##
## Now you have bounds on predicted probabilities for each observation, and
## you can create tables & data visualizations communicating conf intervals,
## because conf intervals for logistic regression live on the predicted
## probability scale. Details below.
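## A compact sketch of Steps 1-4 in one place (hypothetical names: outcome
## "y", predictor "x", data frame "dat"); the Lalonde walkthrough below does
## exactly this with real data, so the sketch is left commented out:
# fit <- glm(y ~ x, data = dat, family = binomial)         # Step 1
# prd <- predict(fit, newdata = dat, se.fit = TRUE)        # Step 2 (logit scale)
# upper_logit <- prd$fit + 1.96 * prd$se.fit               # Step 3
# lower_logit <- prd$fit - 1.96 * prd$se.fit
# upper_prob <- exp(upper_logit) / (1 + exp(upper_logit))  # Step 4 (probability scale)
# lower_prob <- exp(lower_logit) / (1 + exp(lower_logit))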
# we'll use the Lalonde data for this illustrative example
library(Matching)
data(lalonde)
# let's run a logistic regression model that predicts
# marital status (married = 1 or 0)
# using information from "re75" (real earnings in 1975)
model_1 <- glm(married ~ re75, data=lalonde, family=binomial)
#### by the way, if you had wanted to include an interaction term, you would:
#### (1) create the variable and include it in your dataframe, e.g., if you
####     wanted to interact education and re75 for some reason, here is the
####     code: lalonde$interact <- lalonde$re75*lalonde$educ
#### (2) incorporate the interaction term into your regression:
####     reg1_interact <- glm(married ~ re75 + educ + interact,
####                          data = lalonde,
####                          family = binomial)
#### (3) everything else is as described in the more generic process w/out
####     the interaction term, below -- when you use predict(),
####     just make sure your newdata has a column called "interact"
####     so the predict() function can find it. Of course, if you are simply
####     making predictions for the data that fit the model (as below),
####     then the necessary variables will already be in the newdata
####     data set. Below, we continue the example w/out any interaction terms:
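#### (A side note, not part of the recipe above: R's formula interface can
#### also build the interaction for you, so the manual "interact" column is
#### optional -- "re75 * educ" expands to re75 + educ + re75:educ, and
#### predict() then only needs "re75" and "educ" columns in newdata.
#### The name "reg1_interact_alt" below is just illustrative.)
# reg1_interact_alt <- glm(married ~ re75 * educ,
#                          data = lalonde,
#                          family = binomial)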
# let's get the predicted values out, BUT... for a conf interval, we want the
# LOGIT predicted values (values produced by the linear logit model), NOT the
# predicted values we would ORDINARILY obtain from a logistic regression (which
# are the predicted probabilities).
# If you forgot what the linear logit is: the equation for predicted probabilities is
# exp(linear logit) / (1 + exp(linear logit))
# Check the ISLR textbook if you want to see the formula nicely written.
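# (Side note: base R's plogis() computes exactly this logistic function,
# exp(x) / (1 + exp(x)), so it could stand in for the hand-typed formula below.)
# all.equal(plogis(1.3), exp(1.3) / (1 + exp(1.3)))  # TRUE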
# in the predict() call below, se.fit = TRUE will enable you to get the
# standard error of every predicted value, which we will use shortly.
# let's get the predicted logit numbers, each with its measure of uncertainty
# (its standard error). We'll do this for our original data set (which you
# can think of as the training set, because you used it to fit your model).
pred_dat <- predict(model_1, newdata = lalonde, se.fit=TRUE)
# the pred_dat object you created in the line above^^ will have 3 components:
names(pred_dat)
# [1] "fit" "se.fit" "residual.scale"
# We are only interested in the first two: the "fit" and the "se.fit":
# pred_dat$fit is the predicted value on the linear (logit) scale, for that observation.
# sometimes this is called a "fitted value". To get the predicted probabilities
# (if I wanted them now, which I don't) I would type:
# predict(model_1, type = "response")
# pred_dat$se.fit is the standard error of that specific fitted value.
# Now it's time to create a data visualization showing the conf interval
# of the dependent variable for this prediction problem.
# Since we are predicting the probability of "married == 1",
# the y axis on our figure will be the probability of being married.
# plot the data:
# the x-axis will be earnings, y-axis will be the probability that married == 1
plot(x = lalonde$re75, y = lalonde$married, pch = 16,
     xlab = "real $ earnings in 1975",
     ylab = "probability of being married",
     main = "Do 1975 earnings predict marriage?",
     type = "n") # the type = "n" means it draws the axes but does not plot
                 # the points. I don't plot the points now b/c I want to show
                 # predicted probabilities, not actual 1s and 0s in the data.
                 # And I have not yet created any predicted probabilities.
# now add points to the figure, showing predicted probabilities, applying the
# exp / (1 + exp) formula to get probabilities from our linear logit predictions:
points(x = lalonde$re75, y = exp(pred_dat$fit)/(1 + exp(pred_dat$fit)),
       col = "blue", pch = 16) # pch = 16 makes the points solid-colored circles
# (in the preceding line, instead of using the formula exp / (1 + exp), we
# could have estimated & plotted predicted probabilities this way instead):
# predicted_probabilities <- predict(model_1, type = "response")
# now add more points showing confidence intervals, using the "se.fit" results,
# and knowing that the boundaries of confidence intervals (on the logit scale)
# are basically the predicted values PLUS & MINUS 1.96*standard error, i.e., se.fit
upper_conf_logit <- pred_dat$fit + 1.96*(pred_dat$se.fit)
lower_conf_logit <- pred_dat$fit - 1.96*(pred_dat$se.fit)
# now we convert these bounds (on the logit scale) to the probability scale
# by applying the logistic function:
upper_conf_prob <- exp(upper_conf_logit)/(1 + exp(upper_conf_logit))
lower_conf_prob <- exp(lower_conf_logit)/(1 + exp(lower_conf_logit))
# we plot these points, which are on the probability scale:
points(x = lalonde$re75, upper_conf_prob, col = "red", pch = 16, cex = 0.5)
points(x = lalonde$re75, lower_conf_prob, col = "red", pch = 16, cex = 0.5)
# APPENDIX: a few additional comments (see "A", "B", and "C" below):
# A: Covering more values in the confidence interval, making it prettier
# B: Multiplying by 1.96: this is approximate, and we can make it more precise
# C: How to create something analogous to PREDICTION intervals (1s and 0s)
# A. The confidence interval in the dataviz is only showing results for the
# re75 values that are in our data set. So the conf interval looks a bit spotty.
# If we want an interval that looks more like a solid line, then we can
# calculate a confidence interval for each x-axis number 0, 1, 2, 3, etc.,
# from the smallest x-value (0) to the largest x-value (25142.2):
# > max(lalonde$re75)
# [1] 25142.2
re75 <- 1:25143 # create a new object called re75 (b/c "re75" = "x" variable).
                # this "re75" is not in the lalonde dataset.
                # lalonde$re75 is still there, unchanged.
# I then make re75 a dataframe called "data_new": the predict() function needs
# a "newdata" input with the x-variable(s) in my regression model (i.e., re75)...
data_new <- data.frame(re75)
# I run "predict()" again, with my new "newdata" that has more x-vals (1:25143)
pred2 <- predict(model_1, newdata = data_new, se.fit=TRUE)
# I re-create the conf intervals and plot them:
upper_conf_logit <- pred2$fit + 1.96*(pred2$se.fit)
upper_conf_prob <- exp(upper_conf_logit)/(1 + exp(upper_conf_logit))
lower_conf_logit <- pred2$fit - 1.96*(pred2$se.fit)
lower_conf_prob <- exp(lower_conf_logit)/(1 + exp(lower_conf_logit))
# notice that below, I am not using x = lalonde$re75 (as above),
# but I am using the newly-created re75, which has many more x-vals.
points(x = re75, upper_conf_prob, col = "red", pch = 16, cex = 0.5)
points(x = re75, lower_conf_prob, col = "red", pch = 16, cex = 0.5)
# B. Above we simply use "1.96" as our multiplier to get conf intervals,
# but the exact multiplier number will depend upon our sample size.
# To get the exact number, you need to run a qt() function,
# and you need the number of "degrees of freedom" (abbreviation "df"),
# which is simply calculated as: n - k - 1,
# where n is the sample size and k is the number of predictors.
# e.g., for this exercise, the sample size was 445 and k = 1, so df = 443.
qt(0.975, df = 443)
# [1] 1.965333 -- this is more accurate than 1.96 (but it's so close who cares.)
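# The same multiplier, computed generically from objects defined above
# (the names n_obs and k_preds are just illustrative):
n_obs <- nrow(lalonde)                # 445 observations
k_preds <- length(coef(model_1)) - 1  # 1 predictor (re75)
qt(0.975, df = n_obs - k_preds - 1)   # same 1.965333 as above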
# C. The above discussion produced confidence intervals. To produce
# prediction intervals for logistic regression is a little strange, because
# all the predictions are 1s and 0s. There is a way to do something like it:
# Unfortunately, it requires using a new library, the "arm" library, and the
# sim() function within that library. It's a little complicated.
# I show this procedure below:
install.packages("arm")
library(arm)
sim_output <- sim(model_1, n.sims = 100) # the 100 is my arbitrary choice
# using sim(), we can obtain simulated regression coefficients (our model had
# 2 coefficients --- one for the intercept and one for the re75 variable).
# These simulated regression coefficients can be used to obtain 100 different
# predicted PROBABILITY values with a for-loop.
# below, I will demonstrate this for the very first observation in lalonde
# only -- this observation had an re75 equal to 0.
# > lalonde$re75[1]
# [1] 0
# See below:
# first we create an empty storage vector, that we will fill in the for-loop:
predicted_probabilities <- c(NA)
for (i in 1:100) # 100 because we had arbitrarily created 100 simulated values
{
  linear_logit <- # intercept * 1 + coefficient * lalonde$re75[1]
    sum(sim_output@coef[i,] * c(1, lalonde$re75[1])) # intercept multiplied by 1
  predicted_probabilities[i] <- exp(linear_logit) / (1 + exp(linear_logit))
}
# Now we have 100 simulated predicted probabilities.
# We use each of these simulated probabilities to actually flip coins associated
# with these probabilities and get predicted BINARY outcomes.
# E.g., if the predicted probability was 50%, it's analogous to a fair coin.
# If the probability was 70%, then it would be analogous to a coin flip that
# favors heads. We keep track of all 100 results, and then we plot them.
# To do this, we use a for-loop again, creating a storage vector and filling it,
# by running the sample() function for binary outcomes (1, 0) and using
# the sample() function's "prob" argument to specify the coin-flip probability.
# We're going to get one predicted outcome for each predicted probability.
predicted_outcomes <- c(NA)
for(i in 1:100)
{
  predicted_outcomes[i] <-
    sample(c(1,0), # the binary outcomes
           size = 1, # we only get 1 predicted outcome per predicted probability
           prob = c(predicted_probabilities[i],    # prob of a 1
                    1 - predicted_probabilities[i]) # prob of a 0
           )
}
# Now we have 100 predicted 1s and 0s. How many 1s and 0s occurred?
# table(predicted_outcomes)
# predicted_outcomes
#  0  1
# 88 12
# Remember that all of these 100 predicted 1/0 outcomes are for the very
# first lalonde observation -- if we wanted a prediction interval for values of
# re75 other than 0, or e.g., for every re75 value in the original lalonde data,
# then we would repeat this process for each and every lalonde observation.
# Now you can add these points to the plot in a clever way that communicates
# that the incidence of predicted 0s is much greater than the incidence of
# predicted 1s:
points(x = jitter(rep(lalonde$re75[1], 100), factor = 10000),
       y = jitter(predicted_outcomes, factor = 0.1),
       col = "wheat", pch = 1)
# This is what you do for the first observation in the lalonde data set.
# Create a for-loop that performs the procedure above for each and every
# observation, one after another (a sketch of such a loop follows below), and
# you will get 2 lines representing upper and lower bounds on predicted binary
# outcomes, with the thickness of the lines roughly indicating the relative
# frequency of 1s and 0s.
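# A sketch of that loop (my own illustration of the procedure just described,
# reusing the sim_output object created by sim() above). It repeats the
# single-observation simulation for every row of lalonde and plots the jittered
# 1/0 predictions as it goes. I use a fixed jitter width of +/- 200 on the
# x-axis, roughly what factor = 10000 produced above when re75 was 0:
for (obs in 1:nrow(lalonde))
{
  probs_obs <- rep(NA, 100)
  outcomes_obs <- rep(NA, 100)
  for (i in 1:100)
  {
    linear_logit_i <- sum(sim_output@coef[i,] * c(1, lalonde$re75[obs]))
    probs_obs[i] <- exp(linear_logit_i) / (1 + exp(linear_logit_i))
    outcomes_obs[i] <- sample(c(1,0), size = 1,
                              prob = c(probs_obs[i], 1 - probs_obs[i]))
  }
  points(x = jitter(rep(lalonde$re75[obs], 100), amount = 200),
         y = jitter(outcomes_obs, factor = 0.1),
         col = "wheat", pch = 1)
}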