Evaluating Logistic Regression Models in R
This post provides an overview of performing diagnostics and performance evaluation on logistic regression models in R. After training a statistical model, it's important to understand how well that model performed with regard to its accuracy and predictive power. The following content provides the background and theory needed to ensure that the right techniques are used for evaluating logistic regression models in R.
Logistic Regression Example
We will use the GermanCredit dataset in the caret package for this example. It contains 62 characteristics and 1000 observations, with a target variable (Class) that is already defined. The response variable is coded 0 for bad credit and 1 for good. It's always recommended to look at the coding of the response variable to ensure that it's a factor coded accurately with a 0/1 scheme or two factor levels in the right order. The first step is to partition the data into training and testing sets.
```
library(caret)
data(GermanCredit)
Train <- createDataPartition(GermanCredit$Class, p=0.6, list=FALSE)
training <- GermanCredit[ Train, ]
testing <- GermanCredit[ -Train, ]
```
Using the training dataset, which contains 600 observations, we will use logistic regression to model Class as a function of five predictors.
```
mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
                 CreditHistory.Critical, data=training, method="glm", family="binomial")
```
Bear in mind that the estimates from logistic regression characterize the relationship between each predictor and the response variable on a log-odds scale. For example, this model suggests that for every one-unit increase in Age, the log-odds of the consumer having good credit increases by 0.018. Because this isn't of much practical value, we'll usually want to use the exponential function to calculate the odds ratio for each predictor.
```
exp(coef(mod_fit$finalModel))
```
This informs us that for every one-unit increase in Age, the odds of having good credit increase by a factor of 1.01. In many cases, we want to use the model parameters to predict the value of the target variable in a completely new set of observations. That can be done with the predict function. Keep in mind that if the model was created using the glm function directly, you'll need to add type="response" to the predict call.
```
predict(mod_fit, newdata=testing)
predict(mod_fit, newdata=testing, type="prob")
```
Model Evaluation and Diagnostics
A logistic regression model has been built and the coefficients have been examined. However, some critical questions remain. Is the model any good? How well does the model fit the data? Which predictors are most important? Are the predictions accurate? The rest of this document will cover techniques for answering these questions and provide R code to conduct that analysis.
For the following sections, we will primarily work with logistic regression models created with the glm() function. While I prefer using the caret package, many functions in R work better with a glm object.
```
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical, data=training, family="binomial")
mod_fit_two <- glm(Class ~ Age + ForeignWorker, data=training, family="binomial")
```
Goodness of Fit
Likelihood Ratio Test
A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is assessed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well (i.e. the model will have a lower log-likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant. Given that H_0 holds that the reduced model is true, a p-value for the overall model fit statistic that is less than 0.05 would compel us to reject the null hypothesis. It would provide evidence against the reduced model in favor of the current model. The likelihood ratio test can be performed in R using the lrtest() function from the lmtest package or using the anova() function in base R.
Higher values of the LR test statistic lead to small p-values and provide evidence against the reduced model. However, when the test statistic is small and the p-value is greater than 0.05, we cannot reject H_0, which provides evidence in support of the reduced model. In such cases, we should consider testing a variety of separate models with different sets of predictors in order to determine the best model for the given dataset.
```
library(lmtest)
lrtest(mod_fit_one, mod_fit_two)
# the full model has a higher log-likelihood than the reduced model
# the p-value of the likelihood ratio test is less than 0.05
# this provides evidence against the null hypothesis that the reduced model is adequate
```
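As noted above, the same comparison can also be run with base R's anova() function; a minimal sketch, passing the reduced model first and then the full model:
```
# likelihood ratio (chi-square) test comparing the two nested models
anova(mod_fit_two, mod_fit_one, test="Chisq")
```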
Pseudo R2
Unlike linear regression with ordinary least squares estimation, there is no R^2 statistic that expresses the proportion of variance in the dependent variable explained by the predictors. However, there are a number of pseudo R^2 metrics that could be of value. Most notable is McFadden's R^2, which is defined as 1 - [ ln(L_M) / ln(L_0) ], where ln(L_M) is the log-likelihood value for the fitted model and ln(L_0) is the log-likelihood for the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1, with values closer to zero indicating that the model has no predictive power.
Values above 0.2 are generally considered satisfactory. If McFadden's R^2 is less than 0.2, the model does a poor job of explaining the target variable, and it's recommended that you try to find additional predictive variables to improve the overall fit of the model.
```
library(pscl)
pR2(mod_fit_one) # look for 'McFadden'
```
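For reference, McFadden's R^2 can also be computed directly from the log-likelihoods of the fitted and intercept-only models; a minimal sketch of the formula above (the name null_mod is just an illustrative choice):
```
# fit the intercept-only (null) model, then apply 1 - ln(L_M) / ln(L_0)
null_mod <- glm(Class ~ 1, data=training, family="binomial")
1 - as.numeric(logLik(mod_fit_one)) / as.numeric(logLik(null_mod))
```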
Hosmer-Lemeshow Test
Another approach to determining the goodness of fit is the Hosmer-Lemeshow test, which is computed on the data after the observations have been segmented into groups based on having similar predicted probabilities. It examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the data set using a Pearson chi-square test. Small values of the statistic with large p-values indicate a good fit to the data, while large values with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data; in the example below we would reject H_0.
If the p-value for the HL test statistic is greater than 0.05, we fail to reject the null hypothesis that there is no difference between the observed and model-predicted values. This would suggest that the model provides an adequate fit to the data. However, when the p-value is less than 0.05, the fit is poor and the influence of individual observations on the model fit should be examined.
```
library(MKmisc)
# both tests expect the observed response as a numeric 0/1 vector, so recode the factor
class_numeric <- as.numeric(training$Class) - 1   # Bad = 0, Good = 1
HLgof.test(fit = fitted(mod_fit_one), obs = class_numeric)
library(ResourceSelection)
hoslem.test(class_numeric, fitted(mod_fit_one), g=10)
```
Statistical Tests for Individual Predictors
Wald Test
A Wald test is used to evaluate the statistical significance of each coefficient in the model and is calculated by taking the ratio of the square of the regression coefficient to the square of its standard error. The idea is to test the hypothesis that the coefficient of an independent variable in the model is not significantly different from zero. If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model. One important point to note is that the Wald test is usually reported in most model outputs as the z-score.
If the p-value for the Wald statistic is below 0.05, it should compel us to reject the null hypothesis and keep the variable in the model. However, a p-value greater than 0.05 suggests that the explanatory variable can be omitted from the model.
```
library(survey)
regTermTest(mod_fit_one, "ForeignWorker")
regTermTest(mod_fit_one, "CreditHistory.Critical")
```
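Since the Wald test shows up in the standard model output as the z value, the same information can be read straight from the coefficient table; a minimal sketch:
```
# the z value column is the coefficient divided by its standard error;
# its square is the Wald statistic, and Pr(>|z|) is the corresponding p-value
summary(mod_fit_one)$coefficients
```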
Variable Importance
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic (the z-statistic for a glm) for each model parameter. This technique is utilized by the varImp function in the caret package for general and generalized linear models.
These variable importance tables inform us as to the strength of the effect associated with each explanatory variable in the model. One should carefully examine these tables and use them to determine whether there are predictors that could be dropped from the regression model.
```
varImp(mod_fit)
```
Validation of Predicted Values
Classification Rate
When developing models for prediction, the most critical metric is how well the model predicts the target variable on out-of-sample observations. The process involves using the model estimates to predict values on the testing set, and then comparing the predicted target variable to the observed value for each observation. In the example below, you'll notice that our model accurately predicted 67% of the observations in the testing set.
Whether a classification rate is good depends on a number of factors, including the quality of the data, business domain, modeling technique, and so forth. Therefore, one should consider all of these factors when looking at the classification rate and determining whether it’s ‘good enough.’
```
pred <- predict(mod_fit, newdata=testing)
accuracy <- table(pred, testing[,"Class"])   # cross-tabulate predictions against observed classes
sum(diag(accuracy))/sum(accuracy)            # overall classification rate
confusionMatrix(data=pred, reference=testing$Class)
```
ROC Curve (AUROC)
The receiver operating characteristic (ROC) curve is a measure of classifier performance. Using the proportion of positive data points that are correctly considered as positive and the proportion of negative data points that are mistakenly considered as positive, we generate a graphic that shows the trade-off between the rate at which you can correctly predict something and the rate at which you incorrectly predict something. Ultimately, we're concerned with the area under the ROC curve, or AUROC. That metric ranges from 0.50 to 1.00, and values above 0.80 indicate that the model does a great job of discriminating between the two categories which comprise our target variable. Bear in mind that ROC curves can examine both individual target-predictor pairings and overall target-model performance; an example of each is presented below.
If the area under the ROC curve is between 0.50 and 0.70, one should consider revisiting the individual predictors that are in the model and consider whether any other explanatory variables should be included.
```
library(pROC)
# Compute AUC for predicting Class with the variable CreditHistory.Critical
f1 = roc(Class ~ CreditHistory.Critical, data=training)
plot(f1, col="red")
library(ROCR)
# Compute AUC for predicting Class with the model
prob <- predict(mod_fit_one, newdata=testing, type="response")
pred <- prediction(prob, testing$Class)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
```
Other Diagnostics
Confidence Intervals
Once the logistic regression model has been developed and the odds ratios calculated, it's advised that we look at the confidence intervals for these estimates. Given that these parameters were estimated from a sample in order to explain phenomena in a larger population, it's reasonable to expect that the real value of the statistic in the population won't exactly match our model estimate and will instead fall within some range of values. Therefore, when evaluating statistical models, it's important to consider the confidence intervals for the predictor estimates and their odds ratios. With logistic regression, we also need to be cognizant of the underlying distributional assumptions and the scale of the interval.
A key question when assessing confidence intervals for the estimates is whether the values are significantly different from 0. This can help inform us as to whether an explanatory variable should be included in the model. There is no hard rule, however, and any decision to modify the predictor list should also account for other diagnostic results, domain expertise, and so forth. The important thing to remember is that confidence intervals should be examined for both the individual attribute parameters and the odds ratios. Furthermore, it is suggested that one compare both the Wald and profile likelihood CIs when evaluating the model.
```
confint(mod_fit_one)          # profile likelihood CIs for the coefficient estimates
confint.default(mod_fit_one)  # Wald CIs for the coefficient estimates
require(MASS)
exp(confint(mod_fit_one))     # CIs for the odds ratios
```
Correlation Coefficient
Before building a model and evaluating its results, we need to examine the correlations between the variables in the data set. The goal is to identify those variables which have a strong linear relationship, which is done by developing a correlation matrix that takes each continuous variable and finds the correlation coefficient for every pairing in the data set. The correlation coefficient is calculated using Pearson or Spearman measures, with values ranging from -1 (negative correlation) to 1 (positive correlation). When working with a large number of variables, the correlation matrix can be visualized using a correlation plot to make the evaluation easier. For non-continuous variables that are categorical or ordinal, techniques such as analysis of variance or chi-square tests can be used to assess the relationship between those variables. In such cases, we want to know if there is a statistically significant association between the two variables that we're testing.
Correlation coefficients range from -1 (negative correlation) to 1 (positive correlation). For each two-by-two correlation or entry in the correlation matrix, we can tell whether the covariance between the two variables is strongly positive, weakly positive, and so forth. As a general rule, values greater than 0.5 or less than -0.5 indicate a strong correlation, while values between +/- 0.30 and +/- 0.50 indicate a moderate relationship. When these thresholds are met, the user should examine those results to determine whether there is any covariance that could be contributing to multicollinearity in the model. Likewise, if we reject the null hypothesis of the chi-square test of independence or observe biserial estimates above +/- 0.30, the user should consider whether these results indicate that multicollinearity is present within the model.
```
cor(training$Age, training$ForeignWorker) # two-way correlation
df <- training[,c("Age","ForeignWorker","Property.RealEstate","Housing.Own","CreditHistory.Critical")]
cor(df) # correlation matrix
M <- cor(df)
library(corrplot)
corrplot(M) # visualization of correlation matrix
tbl = table(df$Housing.Own, df$CreditHistory.Critical)
chisq.test(tbl) # chi-square test
library(psych)
biserial(training$Age, training$CreditHistory.Critical) # biserial correlation between a
# continuous x variable and a dichotomous y variable
```
Variance Inflation Factor (Multicollinearity)
Multicollinearity refers to a situation in which an explanatory variable within a regression model is correlated with other predictors. The same tests we use to detect multicollinearity in multivariable linear regression models can also be used to detect whether multicollinearity exists within logistic regression, although slight modifications will have to be made in the interpretation of the statistic. The primary test for detecting multicollinearity in a set of explanatory variables is the variance inflation factor test. It determines how much of the inflation in the standard error could be caused by collinearity.
Variance inflation factors are calculated for each predictor in the regression model, as shown in the example below. While there is disagreement on the appropriate cutoff for identifying whether the model suffers from multicollinearity, the following rules provide a rough guideline.
If VIF > 1 and VIF < 2.5, then those explanatory variables are moderately correlated.
If VIF > 2.5, then those variables are highly correlated.
When variables show high correlation with one another, the user should consider ways to mitigate that. One solution could be to drop one of the explanatory variables. The user could also find ways to aggregate the correlated variables into a single index or new predictor. Lastly, the user could look for a proxy variable that is similar to the variables that suffer from multicollinearity but can still help with predicting the target variable.
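The original post doesn't include code for this section; a minimal sketch using the vif() function from the car package (one common implementation, assumed here) might look like this:
```
library(car)
# variance inflation factor for each predictor in the fitted logistic regression
vifs <- vif(mod_fit_one)
vifs
# flag predictors above the rough cutoff of 2.5 discussed above
vifs[vifs > 2.5]
```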
Influential Observations
After developing the model, the user will also want to examine whether certain observations in the data set had a significant impact on the estimates. Observations or data points which are significantly different from the others may distort the parameter estimates and could indicate data entry errors; such influential observations may also be worth examining further on their own. To examine whether an observation has a stronger impact on the model estimates than other observations, we first look at the residuals to identify observations that are not explained well by the model. We also want to look at leverage scores, which measure the potential impact of an individual case on the results. Points with high leverage have a potentially influential impact on the model. However, bear in mind that an observation can have high leverage but little influence.
Because a high leverage observation is not necessarily influential, the user must consider how much of an impact the high leverage observation has on the estimates of the model. High leverage does not automatically mean that an observation is influential, and low leverage doesn't necessarily mean that it's not. When you believe that a data point has both leverage and contributes negatively to the accuracy of the model, the user must consider what action to take. It's possible that there was a data entry error or some other reason which could compel us to throw out the observation. Another solution could be to impute that data point with a value which isn't as extreme, such as the average for that variable.
```
resid(mod_fit_one)
mod_fit_one$residuals
# plot Pearson residuals against an explanatory variable
plot(residuals(mod_fit_one, type="pearson") ~ training$Age, main="pearson residual vs age plot")
# examine residuals and leverages
rp = residuals(mod_fit_one, type="pearson") # Pearson residuals
rpstd = rstudent(mod_fit_one) # Studentized residuals
lev = influence(mod_fit_one)$hat # Case Leverages
cookd = cooks.distance(mod_fit_one) # Cook's Distances
fv <- fitted(mod_fit_one) # Fitted Values
cbind(training$Class,fv,rp,rpstd,lev,cookd)
```
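To go one step further, the Cook's distances computed above can be screened against a rough cutoff to pick out candidate influential observations; a minimal sketch, using the common (but by no means universal) 4/n rule of thumb:
```
# observations whose Cook's distance exceeds 4/n are worth a closer look
influential <- which(cookd > 4 / nrow(training))
training[influential, ]
```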
Conclusion
There you have it, a quick overview of how to evaluate logistic regression models. We outlined a variety of techniques that are used to assess goodness of fit, variable importance, and predictive accuracy. Furthermore, we covered other essential stages of the model building process, such as outlier detection and testing for multicollinearity. Clearly, it's not enough just to build a model and look at its R^2 value. Practitioners must examine a variety of things to ensure that they are building quality models that can generalize to a larger population.