Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Logistic regression tutorial code. Full article available at http://datascienceplus.com/perform-logistic-regression-in-r/
# Load the raw training data and replace missing values with NA
training.data.raw <- read.csv('train.csv',header=T,na.strings=c(""))
# Output the number of missing values for each column
sapply(training.data.raw,function(x) sum(is.na(x)))
# Quick check for how many different values for each feature
sapply(training.data.raw, function(x) length(unique(x)))
# A visual way to check for missing data
library(Amelia)
missmap(training.data.raw, main = "Missing values vs observed")
# Subsetting the data
data <- subset(training.data.raw,select=c(2,3,5,6,7,8,10,12))
# Substitute the missing values with the average value
data$Age[is.na(data$Age)] <- mean(data$Age,na.rm=T)
# R should automatically code Embarked as a factor(). A factor is R's way of dealing with
# categorical variables
is.factor(data$Sex) # Returns TRUE
is.factor(data$Embarked) # Returns TRUE
# Check categorical variables encoding for better understanding of the fitted model
contrasts(data$Sex)
contrasts(data$Embarked)
# Remove rows (Embarked) with NAs
data <- data[!is.na(data$Embarked),]
rownames(data) <- NULL
# Train test splitting
train <- data[1:800,]
test <- data[801:889,]
# Model fitting
model <- glm(Survived ~.,family=binomial(link='logit'),data=train)
summary(model)
# Analysis of deviance
anova(model,test="Chisq")
# McFadden R^2
library(pscl)
pR2(model)
#-------------------------------------------------------------------------------
# MEASURING THE PREDICTIVE ABILITY OF THE MODEL
# If prob > 0.5 then 1, else 0. Threshold can be set for better results
fitted.results <- predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$Survived)
print(paste('Accuracy',1-misClasificError))
# Confusion matrix
library(caret)
confusionMatrix(data=fitted.results, reference=test$Survived)
library(ROCR)
# ROC and AUC
p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type="response")
pr <- prediction(p, test$Survived)
# TPR = sensitivity, FPR=specificity
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
@mohi-uddin

This comment has been minimized.

Copy link

commented Nov 29, 2016

sir,
i got the same accuracy as you did but
when i write code of confusion matrix it shows the following error in r

confusionMatrix(data=fitted.results, reference=test$Survived)

Error in confusionMatrix(data = fitted.results, reference = test$Survived) :
unused arguments (data = fitted.results, reference = test$Survived)

kindly reply

@EarlGlynn

This comment has been minimized.

Copy link

commented Dec 1, 2016

This worked for me:

library(caret)
confusionMatrix(data=fitted.results, reference=test$Survived)

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 51  9
         1  5 24
                                          
               Accuracy : 0.8427          
                 95% CI : (0.7502, 0.9112)
    No Information Rate : 0.6292          
    P-Value [Acc > NIR] : 8.248e-06       
                                          
                  Kappa : 0.6543          
 Mcnemar's Test P-Value : 0.4227          
                                          
            Sensitivity : 0.9107          
            Specificity : 0.7273          
         Pos Pred Value : 0.8500          
         Neg Pred Value : 0.8276          
             Prevalence : 0.6292          
         Detection Rate : 0.5730          
   Detection Prevalence : 0.6742          
      Balanced Accuracy : 0.8190          
                                          
       'Positive' Class : 0               

@harigovind-s-menon

This comment has been minimized.

Copy link

commented Jan 31, 2017

predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')

Why are you removing the first column?

@kmnhz

This comment has been minimized.

Copy link

commented Feb 15, 2017

Did you check the accuracy level with test set provided by Kaggle?

@Coconuthack

This comment has been minimized.

Copy link

commented Nov 4, 2017

Thanks for the code and tutorial!

@alaahesham

This comment has been minimized.

Copy link

commented Nov 18, 2017

@ harigovind-s-menon
because he did not need the passenger id(the first column) in the analytics so he removed it

@riddhidutta

This comment has been minimized.

Copy link

commented Dec 5, 2017

#Please try the below codes for contrasts:
contrasts(as.factor(data$Sex))
contrasts(as.factor(data$Embarked)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.