Skip to content

Instantly share code, notes, and snippets.

@mick001
Last active October 1, 2023 10:22
Show Gist options
  • Star 32 You must be signed in to star a gist
  • Fork 26 You must be signed in to fork a gist
  • Save mick001/ac92e7c017aecff216fd to your computer and use it in GitHub Desktop.
Save mick001/ac92e7c017aecff216fd to your computer and use it in GitHub Desktop.
Logistic regression tutorial code. Full article available at http://datascienceplus.com/perform-logistic-regression-in-r/
# Load the raw training data and replace missing values with NA
training.data.raw <- read.csv('train.csv',header=T,na.strings=c(""))
# Output the number of missing values for each column
sapply(training.data.raw,function(x) sum(is.na(x)))
# Quick check for how many different values for each feature
sapply(training.data.raw, function(x) length(unique(x)))
# A visual way to check for missing data
library(Amelia)
missmap(training.data.raw, main = "Missing values vs observed")
# Subsetting the data
data <- subset(training.data.raw,select=c(2,3,5,6,7,8,10,12))
# Substitute the missing values with the average value
data$Age[is.na(data$Age)] <- mean(data$Age,na.rm=T)
# R should automatically code Embarked as a factor(). A factor is R's way of dealing with
# categorical variables
is.factor(data$Sex) # Returns TRUE
is.factor(data$Embarked) # Returns TRUE
# Check categorical variables encoding for better understanding of the fitted model
contrasts(data$Sex)
contrasts(data$Embarked)
# Remove rows (Embarked) with NAs
data <- data[!is.na(data$Embarked),]
rownames(data) <- NULL
# Train test splitting
train <- data[1:800,]
test <- data[801:889,]
# Model fitting
model <- glm(Survived ~.,family=binomial(link='logit'),data=train)
summary(model)
# Analysis of deviance
anova(model,test="Chisq")
# McFadden R^2
library(pscl)
pR2(model)
#-------------------------------------------------------------------------------
# MEASURING THE PREDICTIVE ABILITY OF THE MODEL
# If prob > 0.5 then 1, else 0. Threshold can be set for better results
fitted.results <- predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$Survived)
print(paste('Accuracy',1-misClasificError))
# Confusion matrix
library(caret)
confusionMatrix(data=fitted.results, reference=test$Survived)
library(ROCR)
# ROC and AUC
p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type="response")
pr <- prediction(p, test$Survived)
# TPR = sensitivity, FPR=specificity
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
@mohi-uddin
Copy link

sir,
i got the same accuracy as you did but
when i write code of confusion matrix it shows the following error in r

confusionMatrix(data=fitted.results, reference=test$Survived)

Error in confusionMatrix(data = fitted.results, reference = test$Survived) :
unused arguments (data = fitted.results, reference = test$Survived)

kindly reply

@EarlGlynn
Copy link

This worked for me:

library(caret)
confusionMatrix(data=fitted.results, reference=test$Survived)

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 51  9
         1  5 24
                                          
               Accuracy : 0.8427          
                 95% CI : (0.7502, 0.9112)
    No Information Rate : 0.6292          
    P-Value [Acc > NIR] : 8.248e-06       
                                          
                  Kappa : 0.6543          
 Mcnemar's Test P-Value : 0.4227          
                                          
            Sensitivity : 0.9107          
            Specificity : 0.7273          
         Pos Pred Value : 0.8500          
         Neg Pred Value : 0.8276          
             Prevalence : 0.6292          
         Detection Rate : 0.5730          
   Detection Prevalence : 0.6742          
      Balanced Accuracy : 0.8190          
                                          
       'Positive' Class : 0               

@harigovind-s-menon
Copy link

predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')

Why are you removing the first column?

@kmnhz
Copy link

kmnhz commented Feb 15, 2017

Did you check the accuracy level with test set provided by Kaggle?

@Coconuthack
Copy link

Thanks for the code and tutorial!

Copy link

ghost commented Nov 18, 2017

@ harigovind-s-menon
because he did not need the passenger id(the first column) in the analytics so he removed it

@riddhidutta
Copy link

#Please try the below codes for contrasts:
contrasts(as.factor(data$Sex))
contrasts(as.factor(data$Embarked)

@Arezoo-Bozorgmehr
Copy link

Hi,

is this code for logistic regression as statistical modell or for machine learning. I need it for statistical modell, because I did my work with Machine learning and I would to model my dataset with normale logistic regression to compare with 3 machine learning methods.
Can you help me please?
Best Regards
Arezoo

@oliai
Copy link

oliai commented Dec 12, 2019

Hello Arezoo,
The code is for logistic regression model and can be used as one of the simplest machine learning tools. There is a bit of information on this link you may find helpful: https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/ . The good thing is with this method you can get the accuracy to compare with other methods, vs. simple linear regression. You may also want to consider these models for comparative work: Decision Tree, KNN, SVM, ANN, and Naïve Bayes. Good luck to you!
Shahryar

@mory-moussa
Copy link

thanks

@shadrack-oo
Copy link

sir,
i got the same accuracy as you did but
when i write code of confusion matrix it shows the following error in r

confusionMatrix(data=fitted.results, reference=test$Survived)

Error in confusionMatrix(data = fitted.results, reference = test$Survived) :
unused arguments (data = fitted.results, reference = test$Survived)

kindly reply

Hi, i managed to get the error corrected by putting it in a "table": see below
confusionMatrix(table(data=fitted.results, reference=test$Survived))

@sagnik99rocks
Copy link

sir,
i got the same accuracy as you did but
when i write code of confusion matrix it shows the following error in r

confusionMatrix(data=fitted.results, reference=test$Survived)

Error in confusionMatrix(data = fitted.results, reference = test$Survived) :
unused arguments (data = fitted.results, reference = test$Survived)

kindly reply

Hi, i managed to get the error corrected by putting it in a "table": see below
confusionMatrix(table(data=fitted.results, reference=test$Survived))

yes, it is working thanks :)

@shadrack-oo
Copy link

shadrack-oo commented Oct 14, 2020 via email

@TeresaChek
Copy link

Pls help when i run the confusionMatrix, using table()) error msg i got was all arguments must have the same length. Pls help advise whats that problem

@ELHAIRECH
Copy link

Fix for the error (data and reference should be factors with the same levels.)in

confusionMatrix(data=fitted.results, reference=test$Survived)

do:

install.packages("e1071")
confusionMatrix(data=as.factor(fitted.results), reference=as.factor(test$Survived))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment