Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Logistic regression tutorial code. Full article available at http://datascienceplus.com/perform-logistic-regression-in-r/
# Load the raw training data and replace missing values with NA
training.data.raw <- read.csv('train.csv',header=T,na.strings=c(""))
# Output the number of missing values for each column
sapply(training.data.raw,function(x) sum(is.na(x)))
# Quick check for how many different values for each feature
sapply(training.data.raw, function(x) length(unique(x)))
# A visual way to check for missing data
library(Amelia)
missmap(training.data.raw, main = "Missing values vs observed")
# Subsetting the data
data <- subset(training.data.raw,select=c(2,3,5,6,7,8,10,12))
# Substitute the missing values with the average value
data$Age[is.na(data$Age)] <- mean(data$Age,na.rm=T)
# R should automatically code Embarked as a factor(). A factor is R's way of dealing with
# categorical variables
is.factor(data$Sex) # Returns TRUE
is.factor(data$Embarked) # Returns TRUE
# Check categorical variables encoding for better understanding of the fitted model
contrasts(data$Sex)
contrasts(data$Embarked)
# Remove rows (Embarked) with NAs
data <- data[!is.na(data$Embarked),]
rownames(data) <- NULL
# Train test splitting
train <- data[1:800,]
test <- data[801:889,]
# Model fitting
model <- glm(Survived ~.,family=binomial(link='logit'),data=train)
summary(model)
# Analysis of deviance
anova(model,test="Chisq")
# McFadden R^2
library(pscl)
pR2(model)
#-------------------------------------------------------------------------------
# MEASURING THE PREDICTIVE ABILITY OF THE MODEL
# If prob > 0.5 then 1, else 0. Threshold can be set for better results
fitted.results <- predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$Survived)
print(paste('Accuracy',1-misClasificError))
# Confusion matrix
library(caret)
confusionMatrix(data=fitted.results, reference=test$Survived)
library(ROCR)
# ROC and AUC
p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type="response")
pr <- prediction(p, test$Survived)
# TPR = sensitivity, FPR=specificity
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
@mohi-uddin

This comment has been minimized.

Copy link

mohi-uddin commented Nov 29, 2016

sir,
i got the same accuracy as you did but
when i write code of confusion matrix it shows the following error in r

confusionMatrix(data=fitted.results, reference=test$Survived)

Error in confusionMatrix(data = fitted.results, reference = test$Survived) :
unused arguments (data = fitted.results, reference = test$Survived)

kindly reply

@EarlGlynn

This comment has been minimized.

Copy link

EarlGlynn commented Dec 1, 2016

This worked for me:

library(caret)
confusionMatrix(data=fitted.results, reference=test$Survived)

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 51  9
         1  5 24
                                          
               Accuracy : 0.8427          
                 95% CI : (0.7502, 0.9112)
    No Information Rate : 0.6292          
    P-Value [Acc > NIR] : 8.248e-06       
                                          
                  Kappa : 0.6543          
 Mcnemar's Test P-Value : 0.4227          
                                          
            Sensitivity : 0.9107          
            Specificity : 0.7273          
         Pos Pred Value : 0.8500          
         Neg Pred Value : 0.8276          
             Prevalence : 0.6292          
         Detection Rate : 0.5730          
   Detection Prevalence : 0.6742          
      Balanced Accuracy : 0.8190          
                                          
       'Positive' Class : 0               

@harigovind-s-menon

This comment has been minimized.

Copy link

harigovind-s-menon commented Jan 31, 2017

predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')

Why are you removing the first column?

@kmnhz

This comment has been minimized.

Copy link

kmnhz commented Feb 15, 2017

Did you check the accuracy level with test set provided by Kaggle?

@Coconuthack

This comment has been minimized.

Copy link

Coconuthack commented Nov 4, 2017

Thanks for the code and tutorial!

@ghost

This comment has been minimized.

Copy link

ghost commented Nov 18, 2017

@ harigovind-s-menon
because he did not need the passenger id(the first column) in the analytics so he removed it

@riddhidutta

This comment has been minimized.

Copy link

riddhidutta commented Dec 5, 2017

#Please try the below codes for contrasts:
contrasts(as.factor(data$Sex))
contrasts(as.factor(data$Embarked)

@Arezoo-Bozorgmehr

This comment has been minimized.

Copy link

Arezoo-Bozorgmehr commented Nov 22, 2019

Hi,

is this code for logistic regression as statistical modell or for machine learning. I need it for statistical modell, because I did my work with Machine learning and I would to model my dataset with normale logistic regression to compare with 3 machine learning methods.
Can you help me please?
Best Regards
Arezoo

@oliai

This comment has been minimized.

Copy link

oliai commented Dec 12, 2019

Hello Arezoo,
The code is for logistic regression model and can be used as one of the simplest machine learning tools. There is a bit of information on this link you may find helpful: https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/ . The good thing is with this method you can get the accuracy to compare with other methods, vs. simple linear regression. You may also want to consider these models for comparative work: Decision Tree, KNN, SVM, ANN, and Naïve Bayes. Good luck to you!
Shahryar

@mory-moussa

This comment has been minimized.

Copy link

mory-moussa commented Jan 6, 2020

thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.