Data Mining HW
---
title: "PSTAT 131 HW1"
author: "Idris Rasheed and Faraz Farooq"
date: "April 12, 2017"
output: pdf_document
---
###1
(a) Looking up customers of a company according to their profitability.
No. This is a simple accounting calculation.
(b) Computing the total sales of a company.
No. We are just calculating a simple value, not extracting any information from total sales.
(c) Predicting the future stock price of a company using historical records.
Yes. This is a data mining task: it involves building a predictive model to forecast the future stock price from historical records.
(d) Sorting a student database based on student identification numbers.
No. This is a database query task.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. This is a probability calculation, since the dice are fair. If the dice were not fair, we could use data mining to estimate the outcome probabilities from repeated rolls.
(f) Extracting the frequencies of a sound wave.
No. This is a task pertaining to signal processing.
###Importing Algae Files and libraries
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(alr4)
library(outliers)
```
```{r}
algae <- read.table('algaeBloom.txt', header=F, dec='.',
                    col.names=c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4','oPO4',
                                'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
                    na.strings=c('XXXXXXX'))
attach(algae)
head(algae,3)
```
###2a)
```{r}
library(dplyr)
algae %>%
  group_by(season) %>%
  summarise(n = n())
```
So the number of observations for autumn, spring, summer, and winter are 40, 53, 45, and 62, respectively.
###2b)
```{r}
colMeans(algae[,5:11], na.rm = TRUE)
var(algae[5], na.rm = TRUE)
var(algae[6], na.rm =TRUE)
var(algae[7], na.rm =TRUE)
var(algae[8], na.rm =TRUE)
var(algae[9], na.rm =TRUE)
var(algae[10], na.rm =TRUE)
var(algae[11], na.rm = TRUE)
```
There are missing values: the calls fail without `na.rm = TRUE` because of the NA's. Interestingly, mnO2 has a larger mean than NO3 (9.11 > 3.28), but NO3 has the larger variance (14.26 > 5.71). The means and variances may also be affected by the number of missing observations.
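For reference, the seven `var()` calls above can be collapsed into one line; a minimal sketch:

```{r}
# compact equivalent of the individual variance calls
sapply(algae[, 5:11], var, na.rm = TRUE)
```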
###2c
```{r}
#Column Medians
median(algae$mnO2, na.rm = TRUE)
median(algae$Cl, na.rm = TRUE)
median(algae$NO3, na.rm = TRUE)
median(algae$NH4, na.rm = TRUE)
median(algae$oPO4, na.rm = TRUE)
median(algae$PO4, na.rm = TRUE)
median(algae$Chla, na.rm = TRUE)
```
```{r}
#Column Median Absolute Deviation
mad(algae$mnO2, na.rm = TRUE)
mad(algae$Cl, na.rm = TRUE)
mad(algae$NO3, na.rm = TRUE)
mad(algae$NH4, na.rm = TRUE)
mad(algae$oPO4, na.rm = TRUE)
mad(algae$PO4, na.rm = TRUE)
mad(algae$Chla, na.rm = TRUE)
```
The means and variances are noticeably larger than the corresponding medians and MADs for these chemicals. This suggests the data are very spread out, with large values inflating the mean and variance, which makes a normal distribution a poor fit.
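The medians and MADs above can likewise be computed in two calls; a minimal sketch of the same summaries:

```{r}
# compact equivalents of the individual median() and mad() calls
sapply(algae[, 5:11], median, na.rm = TRUE)
sapply(algae[, 5:11], mad, na.rm = TRUE)
```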
###3a-b)
```{r}
hist(algae$mxPH, freq = FALSE, main = "Histogram of mxPH", col = "red") # It looks a bit skewed to the right.
lines(density(na.omit(algae$mxPH)), col="green")
lines(rug(na.omit(algae$mxPH)))
```
The distribution is slightly skewed to the right.
###3c)
```{r}
boxplot(algae$a1 ~ algae$size, main = "A conditional BoxPlot of Algae a1")
```
###3d)
```{r}
boxplot(algae$NO3 ~ algae$size, main = "A conditional BoxPlot of Algae NO3")
boxplot(algae$NH4 ~ algae$size, main = "A conditional BoxPlot of Algae NH4")
outlierTest(lm(algae$NO3 ~ algae$size))
outlierTest(lm(algae$NH4 ~algae$size))
```
There are outliers for both variables: the boxplots show points more than 1.5 IQR beyond the quartiles. Because it is hard to tell exactly how many from the plots, we ran an outlier test on each; both tests flag outliers. Only one observation is flagged for both variables, so we would only consider the 153rd observation an outlier.
###3e
From 2c):
NO3: mean 3.28, variance 14.26, median 2.67, MAD 2.17
NH4: mean 501.3, variance 3851585, median 103.16, MAD 111.675
Outliers are observations that lie far from the bulk of the data. The mean and variance are computed from every observation, including the outliers, so they are sensitive to them: they have weak resistance to outliers and are not robust estimators. The median and MAD, on the other hand, are much less sensitive to outliers, which makes them more robust estimators of location and spread. This explains why the means and variances of NO3 and NH4 differ so much from their respective medians and MADs: the means and variances are inflated by outliers, and the medians and MADs summarize these chemicals better.
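A tiny illustration of this sensitivity, using made-up numbers rather than the assignment data: a single extreme value moves the mean and variance far more than the median and MAD.

```{r}
# one outlier (1000) dominates the mean and variance but barely moves
# the median and MAD
x <- c(1, 2, 3, 4, 1000)
c(mean = mean(x), var = var(x), median = median(x), mad = mad(x))
```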
---
title: "Homework 2 PSTAT 131"
author: "Idris Rasheed and Faraz Farooq"
date: "__Due on April 25th, 2017 at 11:59 pm__"
graphics: yes
geometry: margin=0.75in
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE,
cache=TRUE,
fig.width=5,
fig.height=5,
fig.align='center')
indent1 = ' '
indent2 = paste(rep(indent1, 2), collapse='')
solcode = TRUE
r = function(x, digits=2){ round(x, digits=digits) }
```
###1.
**Image compression with PCA**
A bitmap image can be read in via the following command:
```{r readbitmap,indent=indent1}
#install.packages('bmp')
library(bmp)
library(dplyr)
img = read.bmp('image1.bmp')
img = t(img[ncol(img):1,]) # fix orientation
img = img - mean(img) # subtract overall mean
```
Plot the image in grayscale:
```{r plot-bitmap,indent=indent1}
gs = grey((0:255)/255)
image(img, asp=1, col=gs, yaxs='r', xaxt='n', yaxt='n')
```
###1a)
```{r}
# look up documentation for the plotting arguments used above:
# image() lives in the graphics package (not bmp), and xaxt/yaxt are
# graphical parameters documented in ?par
?graphics::image
?graphics::par
```
###1b)
```{r}
pca_img <- prcomp(img)
str(pca_img)
names(pca_img)
```
###1c)
```{r}
X <- img
PHI <- pca_img$rotation
Z <- pca_img$x
# prcomp() centers each column before projecting, so compare against the
# column-centered X rather than X itself
Xc <- scale(X, center = pca_img$center, scale = FALSE)
(norm(Z - (Xc %*% PHI), type = "F"))^2 # (numerically) zero
```
###1d)
```{r}
Q <- pca_img$rotation
# Q is orthonormal, so t(Q) %*% Q should equal the identity matrix
(norm((t(Q) %*% Q) - diag(ncol(Q)), type = "F"))^2
```
###1e)
```{r}
# reconstruct the image using principal components 10 through 100 only
new_img <- Z[, 10:100] %*% t(PHI[, 10:100])
image(new_img, asp=1, col=gs)
```
###1f)
```{r}
percent_var_expl <- pca_img$sdev^2 / sum(pca_img$sdev^2)
screeplot(pca_img, npcs = 10, type = "lines")
plot(percent_var_expl)
cumVar <- cumsum(percent_var_expl)
plot(cumVar)
# smallest number of components whose cumulative variance reaches 90%
min(which(cumVar >= 0.9))
```
From the plots, roughly 18 principal components are needed to explain 90% of the variance.
###3a)
```{r}
algae <- read.table('algaeBloom.txt',header=F,dec='.',
col.names=c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4','oPO4',
'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
na.strings=c('XXXXXXX'))
attach(algae)
sum(is.na(algae))
colSums(is.na(algae))
```
i) There are 33 missing values in total.
ii) The variables with missing values are mxPH (1), mnO2 (2), Cl (10), NO3 (2), NH4 (2), oPO4 (2), PO4 (2), and Chla (12).
###3b)
```{r}
algae.del <- algae%>%filter(complete.cases(.))
algae.del
nrow(algae.del)
```
There are 184 observations in algae.del
###3c)
```{r}
algae.med <- algae %>%mutate_each(funs(ifelse(is.na(.), median(., na.rm=TRUE), .)))
nrow(algae.med)
algae.med[c(48,62,199),]
```
There are 200 observations in algae.med.
###3d)
```{r}
# correlations are computed on the numeric columns only (season, size, and
# speed are factors, which cor() cannot handle)
cor(algae.med[, sapply(algae.med, is.numeric)], use = "complete.obs", method = "pearson")
prediction <- predict(lm(PO4~oPO4, data = algae.med))
prediction[28]
```
48.04407 is our predicted PO4 value for the 28th observation.
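If the exercise also intends the regression-based value to replace the median-imputed one (an assumption on our part), a one-line sketch:

```{r}
# hypothetical: write the regression prediction back into observation 28
algae.med$PO4[28] <- prediction[28]
```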
###4a)
```{r}
library(tree)
library(ISLR)
set.seed(1)
dim(algae.med)
train = sample(1:nrow(algae.med), 0.8*dim(algae.med)[1])
train
# tree() takes a formula directly, not a fitted lm object
tree.algae.med = tree(a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.med, subset = train)
#Training error rate
algae.med.train= algae.med [train,]
a1.test = algae.med.train$a1
a1.pred = predict(tree.algae.med, algae.med.train)
error1 = table(a1.pred, a1.test)
error1
sum(diag(error1))/sum(error1)
1-sum(diag(error1))/sum(error1)
#Test error rate
algae.med.test = algae.med [-train,]
algae.med.test
a12.test = algae.med.test$a1
a12.pred = predict(tree.algae.med, algae.med.test)
error12 = table(a12.pred, a12.test)
error12
sum(diag(error12))/sum(error12)
1-sum(diag(error12))/sum(error12)
```
The training error rate is 90.625% and the test error rate is 82.5%.
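Since `a1` is continuous, the confusion-table error above treats the regression tree like a classifier; a sketch of the arguably more natural mean-squared-error measure, using the objects already defined:

```{r}
# regression-tree errors as mean squared error
mean((predict(tree.algae.med, algae.med.train) - algae.med.train$a1)^2) # training MSE
mean((predict(tree.algae.med, algae.med.test) - algae.med.test$a1)^2)   # test MSE
```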
###4b)
```{r}
set.seed(2)
dim(algae.med)
train = sample(1:nrow(algae.med), 0.8*dim(algae.med)[1])
train
# tree() takes a formula directly, not a fitted lm object
tree.algae.med = tree(a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.med, subset = train)
#Training error rate
algae.med.train= algae.med [train,]
a1.test = algae.med.train$a1
a1.pred = predict(tree.algae.med, algae.med.train)
a1.pred
error1 = table(a1.pred, a1.test)
error1
sum(diag(error1))/sum(error1)
1-sum(diag(error1))/sum(error1)
#Test error rate
algae.med.test = algae.med [-train,]
algae.med.test
a12.test = algae.med.test$a1
a12.pred = predict(tree.algae.med, algae.med.test)
a12.pred
error12 = table(a12.pred, a12.test)
error12
sum(diag(error12))/sum(error12)
1-sum(diag(error12))/sum(error12)
```
The training error rate went down to 86.25% and the test error rate went up to 90%. Changing the seed changes which observations land in the training split, so the error rates shift; the differences reflect the randomness of the split rather than anything about the seed's value itself.
###5a)
```{r}
set.seed(1)
x <- cut(1:nrow(algae), breaks = 5, labels = FALSE)
sample(x)
```
###5b
```{r}
do.chunk <- function(chunkid, chunkdef, algae.med){ # function arguments
  train = (chunkdef != chunkid)
  Xtr = algae.med[train, 1:11]   # get training set
  Ytr = algae.med[train, 12]     # get true response values in training set
  Xvl = algae.med[!train, 1:11]  # get validation set
  Yvl = algae.med[!train, 12]    # get true response values in validation set
  lm.a1 <- lm(a1 ~ ., data = algae.med[train, 1:12])
  predYtr = predict(lm.a1)       # predict training values
  predYvl = predict(lm.a1, Xvl)  # predict validation values
  data.frame(fold = chunkid,
             train.error = mean((predYtr - Ytr)^2), # compute and store training error
             val.error = mean((predYvl - Yvl)^2))   # compute and store validation error
}
```
###5c)
```{r}
```
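A minimal sketch of how `do.chunk()` from 5b can be run over the five folds from 5a (assuming the permuted fold labels from 5a are stored in `chunkdef`):

```{r}
# apply do.chunk() to each fold id and stack the per-fold errors
set.seed(1)
chunkdef <- sample(x) # permuted fold labels, as in 5a
cv.errors <- do.call(rbind,
                     lapply(1:5, do.chunk, chunkdef = chunkdef, algae.med = algae.med))
cv.errors
colMeans(cv.errors[, c("train.error", "val.error")])
```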
###6a)
```{r}
# Bootstrap 100 times: draw 100 resamples of algae.med with replacement
runs <- 1:100
for (i in runs){
  print(i)
  sampleIndex <- sample(1:nrow(algae.med), size = nrow(algae.med), replace = TRUE)
  sampleData <- algae.med[sampleIndex, ]
}
# bootstrap means (of the run indices 1:100)
mu = NULL
for(draw in 1:100) mu = c(mu, mean(runs[sample(1:100, size = 100, replace = T)]))
# Bind it into a data frame (runs is recycled to match the 200 rows of algae)
algaedata <- cbind(runs, season, size, speed, mxPH, mnO2, Cl, NO3, NH4, oPO4, PO4, Chla,
                   a1, a2, a3, a4, a5, a6, a7)
algdat <- as.data.frame(algaedata)
algaemodel <- lm(runs ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 +
                   PO4 + Chla + a1 + a2 + a3 + a4 + a5 + a6 + a7, data = algdat)
summary(algaemodel)
```
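For comparison, a more conventional bootstrap stores the statistic of interest from each resample; a minimal sketch, assuming (since the code above does not say) that the statistic is the mean of `a1`:

```{r}
# resample with replacement 100 times and keep each resample's mean
set.seed(1)
boot.means <- replicate(100, mean(sample(algae.med$a1, replace = TRUE)))
quantile(boot.means, c(0.025, 0.975)) # 95% percentile interval
```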
---
title: "Homework 3"
author: "Idris Rasheed and Faraz Farooq"
date: "May 6, 2017"
output:
pdf_document: default
html_document: default
graphics: yes
geometry: margin=0.75in
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE,
cache=TRUE,
fig.width=5,
fig.height=5,
fig.align='center')
indent1 = ' '
indent2 = paste(rep(indent1, 2), collapse='')
```
###1)
Starting from the logistic function $p(z)=\frac{e^z}{1+e^z}$, multiplying both sides by $1+e^z$ gives:
\begin{equation}
p(1+e^z)=\frac{(e^z)(1+e^z)}{1+e^z}
\end{equation}
\begin{equation}
p + p(e^z) = e^z
\end{equation}
\begin{equation}
e^z(1-p)= p
\end{equation}
\begin{equation}
e^z= \frac{p}{1-p}
\end{equation}
\begin{equation}
z(p)=\ln\left(\frac{p}{1-p}\right)
\end{equation}
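A quick numeric sanity check that $z(p)$ inverts the logistic function:

```{r}
# composing z() with p() should return the original input
p <- function(z) exp(z) / (1 + exp(z))
z <- function(p) log(p / (1 - p))
z(p(1.7)) # 1.7
```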
###2) Given Code
```{r pkg, message=FALSE}
library(tree)
library(plyr)
library(dplyr)
library(randomForest)
library(class)
library(ISLR)
library(ggplot2)
library(reshape2)
```
```{r, warning=FALSE}
spam = read.table("spambase.dat", header=TRUE, sep="")
spam = spam %>%
mutate(y = factor(y, levels=c(0,1), labels=c("good","spam"))) %>% # label as factors
mutate_at(.cols=vars(-y), .funs=scale) # scale others
# spam[,1:57] = scale(spam[,1:57], center = TRUE, scale = TRUE)
# spam$y = factor(spam$y, levels=c(0,1), labels=c("good","spam"))
```
```{r}
erate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
```
```{r record}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) <- c("train.error","test.error")
rownames(records) <- c("tree","knn","logistic")
```
```{r, results="hide"}
set.seed(10)
test.indices = sample(1:nrow(spam), 1000)
spam.train=spam[-test.indices,]
spam.test=spam[test.indices,]
```
```{r, folds-definition}
nfold = 10
set.seed(10)
folds = seq.int(nrow(spam.train)) %>% ## sequential obs ids
cut(breaks = nfold, labels=FALSE) %>% ## sequential fold ids
sample ## random fold ids
```
###2 Our Code
```{r}
set.seed(1) #Set Seed equal to 1
nfold = 5   #Split into 5 folds
folds = seq.int(nrow(spam.train)) %>%   ## sequential obs ids
  cut(breaks = nfold, labels=FALSE) %>% ## sequential fold ids
  sample                                ## random fold ids
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef != chunkid)
  Xtr = Xdat[train,]
  Ytr = Ydat[train]
  Xvl = Xdat[!train,]
  Yvl = Ydat[!train]
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k) # training predictions
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k) # validation predictions
  data.frame(train.error = erate(predYtr, Ytr),
             val.error = erate(predYvl, Yvl))
}
```
```{r}
errors=NULL
set.seed(1)
for(k in 1:15){
tmp = ldply(1:nfold, do.chunk,
folddef=folds,
Xdat=dplyr::select(spam.train, -y),
Ydat=spam.train$y, k = k) %>% ## compute fold-wise errors
summarise_all(funs(mean)) %>% ## mean training/test errors
mutate(neighbors = k)
errors = rbind(errors, tmp)
}
```
```{r}
set.seed(1)
errors$train.error <- NULL
errors
min(errors$val.error)
# the minimum validation error occurs at neighbors = 11
```
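The same choice can be made programmatically rather than by inspection; a one-line sketch:

```{r}
# k with the smallest cross-validated error
errors$neighbors[which.min(errors$val.error)]
```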
```{r}
best.kfold = 11
```
**(Training and Test Errors)** Now that the best number of neighbors has been
determined, compute the training error using `spam.train` dataset along with
`best.kfold` using function `erate()`; similarly, test error using
`spam.test`. Fill in the `knn` row in `records`.
###3
```{r}
set.seed(1)
#Test
pred.sTest = knn(train=spam.train[,-58], test=spam.test[,-58], cl=spam.train[,58], k=best.kfold)
erate.test <- erate(pred.sTest, spam.test[, 58])
erate.test
#Train
pred.sTrain = knn(train=spam.train[,-58], test=spam.train[,-58], cl=spam.train[,58], k=best.kfold)
erate.train <- erate(pred.sTrain, spam.train[,58])
erate.train
```
```{r}
records[2,] <- c(erate.train, erate.test)
records
```
###4
```{r Decision Tree}
tree.train <- tree(y ~ ., spam.train,
                   control = tree.control(nrow(spam.train),
                                          mincut = 5,
                                          minsize = 10,
                                          mindev = 0.003))
summary(tree.train)
plot(tree.train)
text(tree.train, pretty = 0)
tree.train
title("Tree Train")
```
###5
```{r Tree size}
set.seed(1)
cv_tree <- cv.tree(tree.train, rand = folds, FUN = prune.misclass, K = 10)
best.size.cv = cv_tree$size[which.min(cv_tree$dev)]
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = "Tree Size", ylab = "Misclassification Errors", main = "Misclassification vs Tree Size")
best.size.cv
```
```{r pruned}
set.seed(1)
# prune the question-4 tree back to the size selected by cross-validation
tree.pruned <- prune.misclass(tree.train, best = best.size.cv)
plot(tree.pruned)
text(tree.pruned, pretty = 0)
tree.pruned
title("Pruned Tree")
```
###6
```{r Misclass computation}
pruned.predict.train <- predict(tree.pruned, spam.train[,-58], type = "class")
tree.TN <- erate(pruned.predict.train, spam.train[,58])
tree.TN
pruned.predict.test <- predict(tree.pruned,spam.test[,-58], type = "class")
tree.TT <- erate(pruned.predict.test , spam.test[,58])
tree.TT
```
```{r}
# records columns are (train.error, test.error): training error goes first
records[1,] <- c(tree.TN, tree.TT)
records
```
###7
```{r}
set.seed (1)
bag.spam <- randomForest(y~.,data=spam.train, mtry=57, importance =TRUE, ntree=1000)
bag.spam
```
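A sketch of how the bagged model could be evaluated on the held-out set and inspected for influential variables (this output was not part of the original write-up):

```{r}
# test error rate of the bagged forest, plus a few variable-importance rows
bag.pred.test <- predict(bag.spam, spam.test)
erate(bag.pred.test, spam.test$y)
importance(bag.spam)[1:5, ]
```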
###9
```{r}
do.chunk.glm <- function(chunkid, folddef, Xdat, threshold){
  train = (folddef != chunkid)
  Xtr <- Xdat[train, 1:57]
  Xvl <- Xdat[!train, 1:57]
  Ytr <- Xdat[train, 58]
  Yvl <- Xdat[!train, 58]
  fitXtr <- glm(y ~ ., data = Xdat[train,], family = binomial)
  # fitted probabilities correspond to the second factor level, "spam"
  predYtr <- ifelse(predict(fitXtr, Xtr, type="response") > threshold, "spam", "good")
  predYvl <- ifelse(predict(fitXtr, Xvl, type="response") > threshold, "spam", "good")
  data.frame(train.error = erate(predYtr, Ytr), # use `=` so the columns get clean names
             val.error = erate(predYvl, Yvl))
}
```
```{r}
# errors = NULL
# set.seed(1)
# sequence = seq(0, 1, 0.02)
# for(k in 1:length(sequence)){
#   tmp = ldply(1:nfold, do.chunk.glm,
#               folddef = folds,
#               Xdat = spam.train, threshold = sequence[k]) %>%
#     summarise_all(funs(mean)) %>%
#     mutate(thresholds = k)
#   errors = rbind(errors, tmp)
# }
# errors
# min(errors$train.error)
# min(errors$val.error)
records[3,] <- c(.3929462, .3929434)
records
```
The minimum error occurs at a threshold of 1, so that is the optimal threshold for us. Our code produced the logistic errors needed to fill in `records`, but we had to comment it out: despite correct output, a numeric 0/1 error prevented the document from knitting. We therefore saved the results manually and entered them into `records`.
###10
```{r}
#install.packages("ROCR")
library(ROCR)
pred.pt.prune <-predict(tree.pruned, spam.test[,-58])
predprune <- prediction(pred.pt.prune[,2],spam.test[,58])
perf <- performance(predprune,"tpr","fpr")
plot(perf, main= "Decision Tree")
abline(0,1)
```
```{r}
#Random Forest
randomf.predict <-predict(bag.spam, spam.test[,-58] , type="prob")
randomf.pred <- prediction(randomf.predict[,2],spam.test[,58])
randomf.perf <- performance(randomf.pred,"tpr","fpr")
plot(randomf.perf, main="Random Forest")
abline(0,1)
```
```{r}
library(arm)
mod_spam.train<-spam.train
mod_spam.train$y<-(mod_spam.train$y=="spam")*1
mod_spam.test<-spam.test
mod_spam.test$y<-(mod_spam.test$y=="spam")*1
logistic_reg<-bayesglm(y~. ,family='binomial' ,data=mod_spam.train)
lg.predict <-predict(logistic_reg, mod_spam.test , type="response")
lg.pred <- prediction(lg.predict,spam.test[,58])
lg.perf <- performance(lg.pred,"tpr","fpr")
plot(lg.perf, main= "Logistic Regression")
abline(0,1)
```
```{r}
mod_spam.train <- spam.train
mod_spam.train$y <- (mod_spam.train$y == "spam")*1
mod_spam.test <- spam.test
mod_spam.test$y <- (mod_spam.test$y == "spam")*1
result_1 <- class::knn(train = mod_spam.train[,-58], test = mod_spam.train[,-58],
                       cl = mod_spam.train[,58], k = best.kfold, prob = TRUE)
# attr(., "prob") is the vote share for the *winning* class; convert to P(spam)
knn.prob <- attr(result_1, "prob")
knn.prob <- ifelse(result_1 == 1, knn.prob, 1 - knn.prob)
pred_knn <- prediction(knn.prob, mod_spam.train[,58])
perf_knn <- performance(pred_knn, "tpr", "fpr")
plot(perf_knn, main = "ROC curve for knn")
```
###11
Based on the ROC curves and the `records` results, KNN is the best method. Numerically, KNN had the smallest error, followed by the decision tree, with logistic regression last. The ROC curves look broadly similar, except that KNN's is more piecewise-linear (its scores take only a few distinct vote-share values), so it visibly stands apart from the others, consistent with it being the best classification method here.
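To make the comparison easier to see, the three test-set ROC curves can be overlaid on a single plot; a sketch using the performance objects computed above:

```{r}
# overlay the tree, random forest, and logistic ROC curves
plot(perf, col = 1, main = "ROC comparison")
plot(randomf.perf, col = 2, add = TRUE)
plot(lg.perf, col = 3, add = TRUE)
legend("bottomright", legend = c("tree", "random forest", "logistic"),
       col = 1:3, lty = 1)
abline(0, 1)
```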
###14
```{r algae data}
algae <- read.table('algaeBloom.txt',header=F,dec='.',
col.names=c('season','size','speed','mxPH', 'mnO2','Cl','NO3','NH4','oPO4',
'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
na.strings=c('XXXXXXX'))
attach(algae)
```
```{r algae Variable Standardization and Discretization}
algae.ds <- algae[, 1:11] %>%
  mutate_each(funs(ifelse(is.na(.), ., (. - mean(., na.rm=T)) / sd(., na.rm=T))),
              which(sapply(., is.numeric)))
```
```{r setting levels}
algae.ds[,"a1"] <- algae$a1
algae.ds <- algae.ds %>% mutate(a1 =as.factor(ifelse(a1 <= 0.5, "low", "high")))
```
###15
```{r algae Training/Test Sets}
algae.train <- algae.ds
algae.test <- read.table('algaeTest.txt', fill = T, header=F, dec='.',
                         col.names=c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4','oPO4',
                                     'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
                         na.strings=c('XXXXXXX'))
#algae.train
#algae.test
```
The data sets are not printed here since doing so would make this PDF very long and tedious to read through.
###16
```{r}
set.seed(1)
# note: this overwrites algae.train (until now the training data frame) with the fitted tree
algae.train <- tree(a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,
                    algae.ds,
                    control = tree.control(nrow(algae.ds),
                                           mincut = 10,
                                           minsize = 22,
                                           mindev = 0.03))
summary(algae.train)
plot(algae.train)
text(algae.train, pretty = 0)
algae.train
title("Algae Train")
```
```{r}
set.seed(1)
cv.algae <- cv.tree(algae.train, FUN = prune.misclass, K = 10)
best.size.algae = cv.algae$size[which.min(cv.algae$dev)]
plot(cv.algae$size, cv.algae$dev, type = "b", xlab = "Algae Tree Size", ylab = "Misclassification Errors", main = "Misclassification vs Tree Size")
best.size.algae
```
```{r}
set.seed(1)
a1.tree <- tree(a1~., algae.ds)
# prune to the size selected by cross-validation above
algae.tree.pruned <- prune.misclass(a1.tree, best = best.size.algae)
plot(algae.tree.pruned)
text(algae.tree.pruned, pretty = 0)
algae.tree.pruned
title("Pruned Algae Tree")
```
```{r}
# algae.train was overwritten by the fitted tree above, so the original data
# (algae.ds) must be used here; the test response also needs the same
# discretization applied in question 14
# algae.pruned.predict.train <- predict(algae.tree.pruned, algae.ds[,-12], type = "class")
# algae.tree.TN <- erate(algae.pruned.predict.train, algae.ds[,12])
# algae.tree.TN
#
# algae.test.ds <- algae.test %>% mutate(a1 = as.factor(ifelse(a1 <= 0.5, "low", "high")))
# algae.pruned.predict.test <- predict(algae.tree.pruned, algae.test.ds, type = "class")
# algae.tree.TT <- erate(algae.pruned.predict.test, algae.test.ds$a1)
# algae.tree.TT
```
The erate computation was failing because `algae.train` had been overwritten by the fitted tree in question 16, so subsetting it as a data frame errored. A corrected version, which uses `algae.ds` and a discretized test response, is sketched above but left commented out.
---
title: "Homework 4"
author: "Idris Rasheed and Faraz Farooq, Spring 2017"
date: "__Due on May 31, 2017 at 11:59 pm__"
graphics: yes
geometry: margin=0.75in
output: pdf_document
---
```{r setup, include = FALSE}
library(plyr)
library(dplyr)
library(ggplot2)
library(ROCR)
library(NbClust)
library(cluster)
knitr::opts_chunk$set(echo=TRUE, cache=TRUE,
fig.width=6, fig.height=5,
fig.align='center')
## indents are for indenting r code as formatted text
## They may need to be adjusted depending on your OS
indent1 = ' '
indent2 = ' '
indent3 = ' '
runsoln = TRUE
library(class)
library(pander)
```
```{r out.width='50%', indent=indent1}
dat = read.table('nonlinear.csv') %>% mutate(Y=as.factor(Y))
ggplot(dat, aes(x=X1, y=X2)) +
geom_point(aes(color=Y), size=3) +
xlim(-5,5) + ylim(-5,5)
```
###1a)
```{r}
nonlinglm.fit <- glm(Y ~ X1 + X2, family= binomial, data= dat)
nonlinglm.fit
nonlin.training <- predict(nonlinglm.fit, type = "response")
round(nonlin.training, digits = 2)
#Fitting to a model to plot ROC Curve
pred.nonlin <- prediction(nonlin.training, dat$Y)
perf.nonlin <- performance(pred.nonlin, measure="tpr", x.measure="fpr")
plot(perf.nonlin, col=2, lwd=3, main="ROC curve")
abline(0,1)
# FPR
fpr = performance(pred.nonlin, "fpr")@y.values[[1]]
cutoff = performance(pred.nonlin, "fpr")@x.values[[1]]
# FNR
fnr = performance(pred.nonlin,"fnr")@y.values[[1]]
rate = as.data.frame(cbind(Cutoff=cutoff, FPR=fpr, FNR=fnr))
rate$distance = sqrt((rate[,2])^2+(rate[,3])^2)
index = which.min(rate$distance)
best = rate$Cutoff[index]
best
```
Our pthresh is .6823254
###1b)
```{r, out.width='50%', indent=indent2}
# grid of points over sample space
gr = expand.grid(X1=seq(-5, 5, by=0.1), # sample points in X1
X2=seq(-5, 5, by=0.1)) # sample points in X2
```
```{r}
# use probabilities (type="response") so the comparison matches the
# probability-scale threshold `best`
predprob1 <- predict(nonlinglm.fit, gr, type = "response")
grr <- gr
grr$Y <- factor(ifelse(predprob1 >= best, 1, 0))
ggplot(grr, aes(x=X1, y=X2)) +
  geom_point(aes(color=Y), size=3) +
  xlim(-5,5) + ylim(-5,5)
```
###1c)
```{r}
polylinglm.fit <- glm(Y ~ poly(X1,2) + poly(X2,2) + X1:X2, family = binomial, data = dat)
summary(polylinglm.fit)
polylin.training <- predict(polylinglm.fit, type = "response")
round(polylin.training, digits = 2)
#Fitting a model to plot the ROC curve
pred.polylin <- prediction(polylin.training, dat$Y)
perf.polylin <- performance(pred.polylin, measure="tpr", x.measure="fpr")
plot(perf.polylin, col=2, lwd=3, main="ROC curve")
abline(0,1)
# FPR
fpr = performance(pred.polylin, "fpr")@y.values[[1]]
cutoff = performance(pred.polylin, "fpr")@x.values[[1]]
# FNR
fnr = performance(pred.polylin, "fnr")@y.values[[1]]
rate = as.data.frame(cbind(Cutoff=cutoff, FPR=fpr, FNR=fnr))
rate$distance = sqrt((rate[,2])^2 + (rate[,3])^2)
index = which.min(rate$distance)
best1 = rate$Cutoff[index]
best1 #.728744
# classify the grid with probabilities, matching the probability-scale threshold
predprob2 <- predict(polylinglm.fit, gr, type = "response")
grr2 <- gr
grr2$Y <- factor(ifelse(predprob2 >= best1, 1, 0))
ggplot(grr2, aes(x=X1, y=X2)) +
  geom_point(aes(color=Y), size=3) +
  xlim(-5,5) + ylim(-5,5)
```
We get the warning because the model is overfitting, and our threshold went up as well: with the quadratic and interaction terms we are fitting a far more flexible model. This is a better model because increasing the order lets the decision boundary capture complex shapes while still accommodating a simple hypothesis.
###1d)
```{r}
# inside a formula, X1^2 is interpreted as just X1, so the powers must be
# wrapped in I() to actually enter as polynomial terms
polyglm5fit <- glm(Y ~ X1 + X2 + I(X1^2) + I(X2^2) + I(X1^3) + I(X2^3) +
                     I(X1^4) + I(X2^4) + I(X1^5) + I(X2^5),
                   family = binomial, data = dat)
polylin5.training <- predict(polyglm5fit, type = "response")
round(polylin5.training, digits = 2)
#Fitting a model to plot the ROC curve
pred.polylin5 <- prediction(polylin5.training, dat$Y)
perf.polylin5 <- performance(pred.polylin5, measure="tpr", x.measure="fpr")
plot(perf.polylin5, col=2, lwd=3, main="ROC curve")
abline(0,1)
# FPR
fpr = performance(pred.polylin5, "fpr")@y.values[[1]]
cutoff = performance(pred.polylin5, "fpr")@x.values[[1]]
# FNR
fnr = performance(pred.polylin5, "fnr")@y.values[[1]]
rate = as.data.frame(cbind(Cutoff=cutoff, FPR=fpr, FNR=fnr))
rate$distance = sqrt((rate[,2])^2 + (rate[,3])^2)
index = which.min(rate$distance)
best2 = rate$Cutoff[index]
best2
# classify the grid with probabilities, matching the probability-scale threshold
predprob3 <- predict(polyglm5fit, gr, type = "response")
grr3 <- gr
grr3$Y <- factor(ifelse(predprob3 >= best2, 1, 0))
ggplot(grr3, aes(x=X1, y=X2)) +
  geom_point(aes(color=Y), size=3) +
  xlim(-5,5) + ylim(-5,5)
```
This decision boundary seems to underfit; the dark points are not centered in the middle as they were for the second-order model. The boundary is cleaner, so this model cannot fit a complex pattern as well as the second-order or first-order cases. The best approach is the second-order model, since it can handle complex boundaries despite some overfitting. The third model also gave a similar pthresh of .683254.
```{r, indent=indent1}
library(ROCR)
data(Default, package="ISLR")
str(Default)
```
###2a)
```{r}
def.glm <- glm(default ~ balance, family = binomial, data =Default)
pred1 <- predict(def.glm, type = "response")
#round(pred1, digits = 2)
pred2 <- prediction(pred1, Default$default)
TP <- pred2@tp[[1]]
FP <- pred2@fp[[1]]
FN <- pred2@fn[[1]]
f1 <- (2*TP) / ((2*TP) + FP + FN)
TP[1:10]
FP[1:10]
FN[1:10]
f1[1:10]
```
```{r}
cutoff <- pred2@cutoffs[[1]]
plot(cutoff, f1, xlab= "Pthreshold", ylab= "f1", main= "F1 as a function of cutoffs")
```
```{r}
index1 = which.max(f1)
pthresh <- cutoff[index1]
pthresh
```
Our pthreshold is .3208755
###2c)
```{r}
pos <- filter(Default, default == "Yes")
neg <- filter(Default, default == "No")
head(pos, n=3)
head(neg, n=3)
```
```{r}
set.seed(1)
neg1 <- neg[sample(nrow(neg), 333), ]
# undersampled training set: all defaulters plus 333 sampled non-defaulters
under <- rbind(pos, neg1)
under.fit <- glm(default ~ balance, family = binomial, data = under)
summary(under.fit)
```
###2d)
Using the Default data, create a scatter plot showing default as a function of balance. Recode the variable to replicate the right panel of Figure 4.2: i.e., use 1 for Yes and 0 for No. Also, plot the estimated probabilities of default from the def.glm model, coloring each point by the class label determined by the threshold chosen with the F1 score. Create an analogous plot using the undersampled data and resulting model. Note that the threshold here would be 0.5, since we have not determined one using the F1 score.
```{r}
#Scatter Plot
plot(Default$balance, Default$default, main = "default vs balance", xlab = "balance", ylab="default")
```
```{r}
levels(Default$default)
levels(Default$default)[1]<-"0"
levels(Default$default)[2]<-"1"
Default$default <- factor(Default$default)
levels(Default$default)
```
###2e)
It depends on the type of problem. Consider a manufacturer studying which inputs (the x's) drive a significant improvement in an outcome y. If the underlying process makes the relationships among the x's complex, class imbalance compounds the difficulty, and he would be better off identifying the few x's that actually contribute to y; that approach still beats resampling.

With resampling he can go wrong in both directions. If he undersamples, he throws away majority-class observations that may carry important information, which distorts the decision boundary and produces false positives (negatives classified as positives). If he oversamples, he duplicates minority-class observations, so the algorithm sees the same instances many times and ends up overfitting. So he is better off studying the underlying process, focusing on the few x's that contribute to y, rather than trading lost information against false positives.