## This long coding example shows you how to obtain
## confidence intervals for logistic regression.
## The appendix at the very bottom also shows you how to obtain
## something analogous to prediction intervals
## for a logistic regression.
## Here's a high-level summary of the basic procedure, step-by-step:
## Step 1: Run desired logistic regression, including any desired interactions
##
## Step 2: Predict the LINEAR LOGIT (not the predicted probabilities)
## using predict().
## Make sure your predict() function call includes the argument:
## se.fit = TRUE ------ this will give you an "se.fit" component in the output.
##
## Step 3: Take your linear logit predictions, and add and subtract 1.96*se.fit
## Your observation-specific predictions and se.fit numbers come
## from the predict() output (from Step 2 above). For each observation,
## store the 2 numbers you obtained -- one from adding 1.96*se.fit, and
## the other from subtracting 1.96*se.fit.
##
## Step 4: Take these 2 numbers (for each observation), that represent that
## observation's confidence interval bounds, and convert them into
## predicted probabilities by using the logistic function:
## exp(logit) / (1 + exp(logit))
##
## Now you have bounds on predicted probabilities for each observation, and
## you can create tables & data visualizations communicating conf intervals.
## (Conf intervals for logistic regression are expressed as predicted probabilities.)
## Details below.
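## Quick aside before the example (an illustrative sketch of my own, not part
## of the procedure itself): the Step-4 logistic function is a one-liner in R,
## and base R's plogis() computes exactly the same quantity:
inv_logit <- function(logit) exp(logit) / (1 + exp(logit))
inv_logit(0)  # 0.5
plogis(0)     # 0.5 -- base R's built-in logistic (inverse-logit) function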
# we'll use the Lalonde data for this illustrative example
library(Matching)
data(lalonde)
# let's run a logistic regression model that predicts
# marital status (married = 1 or 0)
# using information from "re75" (real earnings in 1975)
model_1 <- glm(married ~ re75, data=lalonde, family=binomial)
#### by the way, if you had wanted to include an interaction term, you would:
#### (1) create the variable and include it in your dataframe, e.g., if you
#### wanted to interact education and re75 for some reason, then here is
#### code: lalonde$interact <- lalonde$re75*lalonde$educ
#### (2) incorporate the interaction term into your regression:
#### reg1_interact <- glm(married ~ re75 + educ + interact,
#### data = lalonde,
#### family = binomial)
#### (3) everything else is as described in the more generic process w/out
#### the interaction term, below--when you use predict(),
#### just make sure your newdata has a column called "interact"
#### so the predict() function can find it. Of course, if you are simply
#### making predictions for the data used to fit the model (as below),
#### then the necessary variables will already be in the newdata
#### data set. Below, we continue the example w/out any interaction terms:
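#### To make that digression concrete, here is a minimal runnable sketch of
#### steps (1)-(3) (my own illustration; object names like "reg1_interact"
#### and "pred_interact" are just for this example):
lalonde$interact <- lalonde$re75 * lalonde$educ         # step (1)
reg1_interact <- glm(married ~ re75 + educ + interact,  # step (2)
                     data = lalonde, family = binomial)
pred_interact <- predict(reg1_interact, newdata = lalonde,  # step (3)
                         se.fit = TRUE)  # lalonde already has "interact"
#### (side note: the formula married ~ re75*educ would build the main effects
#### and the interaction for you, so newdata would only need re75 and educ)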
# let's get the predicted values out, BUT... for a conf interval, we want the
# LOGIT predicted values (values produced by the linear logit model), NOT the
# predicted values we would ORDINARILY obtain from a logistic regression (which
# are the predicted probabilities).
# If you forgot what the linear logit is: the equation for predicted
# probabilities is exp(linear logit) / (1 + exp(linear logit)).
# Check the ISLR textbook if you want to see the formula nicely written.
# In the predict() call below, se.fit = TRUE will enable you to get the
# standard error of every predicted value, which we will use afterwards.
# let's get the predicted logit numbers, each with its measure of uncertainty
# (its standard error). We'll do this for our original data set (which you
# can think of as the training set, because you used it to fit your model).
pred_dat <- predict(model_1, newdata = lalonde, se.fit=TRUE)
# the pred_dat object you created in the line above^^ will have 3 components:
names(pred_dat)
# [1] "fit" "se.fit" "residual.scale"
# We are only interested in the first two: the "fit" and the "se.fit":
# pred_dat$fit is the predicted value on the linear (logit) scale, for that
# observation; sometimes this is called a "fitted value". To get the predicted
# probabilities (if I wanted them now, which I don't) I would type:
# predict(model_1, type = "response")
# pred_dat$se.fit is the standard error of that specific fitted value.
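# (a quick illustrative peek at those two components, if you want to see them:)
head(pred_dat$fit)    # first few predicted logits (the fitted values)
head(pred_dat$se.fit) # ...and the standard error of each fitted value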
# Now it's time to create a data visualization showing the conf interval
# of the dependent variable for this prediction problem.
# Since we are predicting the probability of "married == 1",
# the y-axis on our figure will be the probability of being married.
# plot the data:
# the x-axis will be earnings, y-axis will be the probability that married == 1
plot(x = lalonde$re75, y = lalonde$married, pch = 16,
     xlab = "real $ earnings in 1975",
     ylab = "probability of being married",
     main = "Does 1975 earnings predict marriage in 1978?",
     type = "n") # type = "n" means it draws the axes but does not plot
# the points. I don't plot the points now b/c I want to show predicted
# probabilities, not actual 1s and 0s in the data, and I have not yet
# created any predicted probabilities.
# now add points to the figure, showing predicted probabilities, applying the
# exp / (1 + exp) formula to get probabilities from our linear logit predictions:
points(x = lalonde$re75, y = exp(pred_dat$fit)/(1 + exp(pred_dat$fit)),
       col = "blue", pch = 16) # pch = 16 makes the points solid-colored circles
# (in the preceding line, instead of using the formula exp / (1 + exp),
# we could have estimated & plotted the predicted probabilities this way:
# predicted_probabilities <- predict(model_1, type = "response") )
# now add more points showing confidence intervals, using the "se.fit" results,
# and knowing that the boundaries of confidence intervals (on the logit scale)
# are basically the predicted values PLUS & MINUS 1.96*stand error, i.e., se.fit
upper_conf_logit <- pred_dat$fit + 1.96*(pred_dat$se.fit)
lower_conf_logit <- pred_dat$fit - 1.96*(pred_dat$se.fit)
# now we convert these bounds (on the logit scale) to the probability scale
# by applying the logistic function:
upper_conf_prob <- exp(upper_conf_logit)/(1 + exp(upper_conf_logit))
lower_conf_prob <- exp(lower_conf_logit)/(1 + exp(lower_conf_logit))
# we plot these points, which are on the probability scale:
points(x = lalonde$re75, y = upper_conf_prob, col = "red", pch = 16, cex = 0.5)
points(x = lalonde$re75, y = lower_conf_prob, col = "red", pch = 16, cex = 0.5)
# APPENDIX: a few additional comments (see "A", "B", and "C" below):
# A: Covering more values in the confidence interval, making it prettier:
# B: Multiplying by 1.96: this is approximate, and we can make it more precise
# C: How to create something analogous to PREDICTION intervals (1s and 0s)
# A. The confidence interval in the dataviz is only showing results for the
# re75 values that are in our data set. So the conf interval looks a bit spotty.
# If we want an interval that looks more like a solid line, then we can
# calculate a confidence interval for each x-axis number 0, 1, 2, 3, etc.,
# from the smallest x-value (0) to just past the largest x-value (25142.2):
# > max(lalonde$re75)
# [1] 25142.2
re75 <- 0:25143 # create a new object called re75 (b/c "re75" is our "x" variable).
# this "re75" is not in the lalonde dataset.
# lalonde$re75 is still there, unchanged.
# I then make re75 a dataframe called "data_new": the predict() function needs
# a "newdata" input with the x-variable(s) in my regression model (i.e., re75)...
data_new <- data.frame(re75)
# I run "predict()" again, with my new "newdata" that has more x-vals (0:25143)
pred2 <- predict(model_1, newdata = data_new, se.fit=TRUE)
# I re-create the conf intervals and plot them:
upper_conf_logit <- pred2$fit + 1.96*(pred2$se.fit)
upper_conf_prob <- exp(upper_conf_logit)/(1 + exp(upper_conf_logit))
lower_conf_logit <- pred2$fit - 1.96*(pred2$se.fit)
lower_conf_prob <- exp(lower_conf_logit)/(1 + exp(lower_conf_logit))
# notice that below, I am not using x = lalonde$re75 (as above),
# but I am using the newly-created re75 which has many more x-vals.
points(x = re75, y = upper_conf_prob, col = "red", pch = 16, cex = 0.5)
points(x = re75, y = lower_conf_prob, col = "red", pch = 16, cex = 0.5)
# B. Above we simply use "1.96" as our multiplier to get conf intervals,
# but the exact multiplier number will depend upon our sample size.
# To get the exact number, you can run the qt() function,
# and you need the number of "degrees of freedom" (abbreviation "df"),
# which is simply calculated as: n - k - 1,
# where n is the sample size and k is the number of predictors.
# e.g., for this exercise, the sample size was 445 and k = 1. so df = 443.
qt(0.975, df = 443)
# [1] 1.965333 -- this is more accurate than 1.96 (but it's so close it hardly matters).
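# (For comparison -- my addition: inference for logistic regression is usually
# based on the normal distribution (summary() of a glm reports z-values), so
# the textbook z multiplier comes from qnorm():)
qnorm(0.975)
# [1] 1.959964 -- again, essentially indistinguishable from 1.96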
# C. The above discussion produced confidence intervals. Producing
# prediction intervals for logistic regression is a little strange, because
# all the predictions are 1s and 0s. But there is a way to do something like it:
# Unfortunately, it requires using a new library, the "arm" library, and the
# sim() function within that library. It's a little complicated.
# I show this procedure below:
install.packages("arm") # only needed once, if "arm" isn't installed already
library(arm)
sim_output <- sim(model_1, n.sims = 100) # the 100 is my arbitrary choice
# using sim(), we can obtain simulated regression coefficients (our model had
# 2 coefficients --- one for the intercept and one for the re75 variable).
# These simulated regression coefficients can be used to obtain 100 different
# predicted PROBABILITY values with a for-loop.
# below, I will demonstrate this for the very first observation in lalonde
# only--this observation had an re75 equal to 0.
# > lalonde$re75[1]
# [1] 0
# See below:
# first we create an empty storage vector, that we will fill in the for-loop:
predicted_probabilities <- rep(NA, 100) # preallocate 100 storage slots
for (i in 1:100) # 100 because we had arbitrarily created 100 simulated values
{
  # linear logit = intercept * 1 + coefficient * lalonde$re75[1]
  linear_logit <- sum(sim_output@coef[i,] * c(1, lalonde$re75[1]))
  predicted_probabilities[i] <- exp(linear_logit) / (1 + exp(linear_logit))
}
# Now we have 100 simulated predicted probabilities.
# We use each of these simulated probabilities to actually flip coins associated
# with these probabilities and get predicted BINARY outcomes.
# E.g., if the predicted probability was 50%, it's analogous to a fair coin.
# If the probability was 70%, then it would be analogous to a coin flip that
# favors heads. We keep track of all 100 results, and then we plot them.
# To do this, we use a for-loop again, creating a storage vector and filling it,
# by running the sample() function for binary outcomes (1, 0) and using
# the sample() function's "prob" argument to specify the coin-flip probability.
# We're going to get one predicted outcome for each predicted probability.
predicted_outcomes <- rep(NA, 100) # preallocate, as before
for (i in 1:100)
{
  predicted_outcomes[i] <-
    sample(c(1, 0), # the binary outcomes
           size = 1, # we only get 1 predicted outcome per predicted probability
           prob = c(predicted_probabilities[i],      # prob of a 1
                    1 - predicted_probabilities[i])) # prob of a 0
}
# Now we have 100 predicted 1s and 0s. How many of each occurred?
# table(predicted_outcomes)
# predicted_outcomes
# 0 1
# 88 12
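# (a quick sanity check, my addition: the share of simulated 1s should be
# in the same ballpark as the average simulated probability)
mean(predicted_outcomes)       # 0.12 in the run tabled above
mean(predicted_probabilities)  # should be roughly similar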
# Remember that all of these 100 predicted 1/0 outcomes are for the very
# first lalonde observation -- if we wanted a prediction interval for values of
# re75 other than 0, or e.g., for every re75 value in the original lalonde data,
# then we would repeat this process for each and every lalonde observation.
# Now you can add these points to the plot in a way that communicates
# that the incidence of predicted 0s is much greater than that of predicted 1s:
points(x = jitter(rep(lalonde$re75[1], 100), factor = 10000),
       y = jitter(predicted_outcomes, factor = 0.1),
       col = "wheat", pch = 1)
# This is what you do for the first observation in the lalonde data set.
# Create a for-loop that performs the procedure above for each and every
# observation, one after another (a rough sketch follows below), and you will
# get 2 bands representing upper and lower bounds on predicted binary outcomes,
# with the thickness of the bands roughly indicating the relative frequency
# of 1s and 0s.
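# Here is that full loop as a rough sketch (my own illustration of the
# procedure just described; rbinom() stands in for the sample() coin flips
# above, drawing one Bernoulli outcome per simulated probability):
for (j in 1:nrow(lalonde))
{
  sim_probs <- rep(NA, 100)
  for (i in 1:100) # one predicted probability per simulated coefficient draw
  {
    logit_ij <- sum(sim_output@coef[i,] * c(1, lalonde$re75[j]))
    sim_probs[i] <- exp(logit_ij) / (1 + exp(logit_ij))
  }
  outcomes_j <- rbinom(100, size = 1, prob = sim_probs) # 100 coin flips
  points(x = jitter(rep(lalonde$re75[j], 100), factor = 10000),
         y = jitter(outcomes_j, factor = 0.1),
         col = "wheat", pch = 1)
}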