Data Mining HW
---
title: "PSTAT 131 HW1"
author: "Idris Rasheed and Faraz Farooq"
date: "April 12, 2017"
output: pdf_document
---
###1
(a) Looking up customers of a company according to their profitability.
No. This is a simple accounting calculation.
(b) Computing the total sales of a company.
No. We are just calculating a simple value, not extracting any information from total sales.
(c) Predicting the future stock price of a company using historical records.
Yes. This is a data mining task: it involves building a predictive model to forecast the future stock price from historical records.
(d) Sorting a student database based on student identification numbers.
No. This is a database query task.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. This is a probability calculation, since the dice are fair. If the dice were not fair, we could use data mining to estimate the outcome probabilities from repeated rolls.
(f) Extracting the frequencies of a sound wave.
No. This is a task pertaining to signal processing.
###Importing Algae Files and libraries
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(alr4)
library(outliers)
```
```{r}
algae <- read.table('algaeBloom.txt', header=F, dec='.',
                    col.names=c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4','oPO4',
                                'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
                    na.strings=c('XXXXXXX'))
attach(algae)
head(algae,3)
```
###2a)
```{r}
library(dplyr)
algae %>%
  group_by(season) %>%
  summarise(n = n())
```
So the number of observations for autumn, spring, summer, and winter are 40, 53, 45, and 62, respectively.
###2b)
```{r}
colMeans(algae[,5:11], na.rm = TRUE)
var(algae[5], na.rm = TRUE)
var(algae[6], na.rm =TRUE)
var(algae[7], na.rm =TRUE)
var(algae[8], na.rm =TRUE)
var(algae[9], na.rm =TRUE)
var(algae[10], na.rm =TRUE)
var(algae[11], na.rm = TRUE)
```
There are missing values: the calls fail without `na.rm = TRUE` because of the NA's. Interestingly, mnO2 has a larger mean than NO3 (9.11 > 3.28), but NO3 has the larger variance (14.26 > 5.71). The means and variances may also be affected by the number of missing observations.
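For reference, the seven `var()` calls above can be collapsed into one line; a minimal sketch:

```{r}
# compact equivalent of the individual variance calls
sapply(algae[, 5:11], var, na.rm = TRUE)
```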
###2c
```{r}
#Column Medians
median(algae$mnO2, na.rm = TRUE)
median(algae$Cl, na.rm = TRUE)
median(algae$NO3, na.rm = TRUE)
median(algae$NH4, na.rm = TRUE)
median(algae$oPO4, na.rm = TRUE)
median(algae$PO4, na.rm = TRUE)
median(algae$Chla, na.rm = TRUE)
```
```{r}
#Column Median Absolute Deviation
mad(algae$mnO2, na.rm = TRUE)
mad(algae$Cl, na.rm = TRUE)
mad(algae$NO3, na.rm = TRUE)
mad(algae$NH4, na.rm = TRUE)
mad(algae$oPO4, na.rm = TRUE)
mad(algae$PO4, na.rm = TRUE)
mad(algae$Chla, na.rm = TRUE)
```
The means and variances are noticeably larger than the corresponding medians and MADs for these chemicals. This suggests the data are very spread out, with large values inflating the mean and variance, which makes a normal distribution a poor fit.
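The medians and MADs above can likewise be computed in two calls; a minimal sketch of the same summaries:

```{r}
# compact equivalents of the individual median() and mad() calls
sapply(algae[, 5:11], median, na.rm = TRUE)
sapply(algae[, 5:11], mad, na.rm = TRUE)
```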
###3a-b)
```{r}
hist(algae$mxPH, freq = FALSE, main = "Histogram of mxPH", col = "red") # It looks a bit skewed to the right.
lines(density(na.omit(algae$mxPH)), col="green")
lines(rug(na.omit(algae$mxPH)))
```
The distribution is slightly skewed to the right.
###3c)
```{r}
boxplot(algae$a1 ~ algae$size, main = "A conditional BoxPlot of Algae a1")
```
###3d)
```{r}
boxplot(algae$NO3 ~ algae$size, main = "A conditional BoxPlot of Algae NO3")
boxplot(algae$NH4 ~ algae$size, main = "A conditional BoxPlot of Algae NH4")
outlierTest(lm(algae$NO3 ~ algae$size))
outlierTest(lm(algae$NH4 ~algae$size))
```
There are outliers for both variables: the boxplots show points more than 1.5 IQR beyond the quartiles. Because it is hard to tell exactly how many from the plots, we ran an outlier test on each; both tests flag outliers. Only one observation is flagged for both variables, so we would only consider the 153rd observation an outlier.
###3e
From 2c):
NO3: mean 3.28, variance 14.26, median 2.67, MAD 2.17
NH4: mean 501.3, variance 3851585, median 103.16, MAD 111.675
Outliers are observations that lie far from the bulk of the data. The mean and variance are computed from every observation, including the outliers, so they are sensitive to them: they have weak resistance to outliers and are not robust estimators. The median and MAD, on the other hand, are much less sensitive to outliers, which makes them more robust estimators of location and spread. This explains why the means and variances of NO3 and NH4 differ so much from their respective medians and MADs: the means and variances are inflated by outliers, and the medians and MADs summarize these chemicals better.
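A tiny illustration of this sensitivity, using made-up numbers rather than the assignment data: a single extreme value moves the mean and variance far more than the median and MAD.

```{r}
# one outlier (1000) dominates the mean and variance but barely moves
# the median and MAD
x <- c(1, 2, 3, 4, 1000)
c(mean = mean(x), var = var(x), median = median(x), mad = mad(x))
```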
---
title: "Homework 2 PSTAT 131"
author: "Idris Rasheed and Faraz Farooq"
date: "__Due on April 25th, 2017 at 11:59 pm__"
graphics: yes
geometry: margin=0.75in
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE,
cache=TRUE,
fig.width=5,
fig.height=5,
fig.align='center')
indent1 = ' '
indent2 = paste(rep(indent1, 2), collapse='')
solcode = TRUE
r = function(x, digits=2){ round(x, digits=digits) }
```
###1.
**Image compression with PCA**
A bitmap image can be read in via the following command:
```{r readbitmap,indent=indent1}
#install.packages('bmp')
library(bmp)
library(dplyr)
img = read.bmp('image1.bmp')
img = t(img[ncol(img):1,]) # fix orientation
img = img - mean(img) # subtract overall mean
```
Plot the image in grayscale:
```{r plot-bitmap,indent=indent1}
gs = grey((0:255)/255)
image(img, asp=1, col=gs, yaxs='r', xaxt='n', yaxt='n')
```
###1a)
```{r}
# look up documentation for the plotting arguments used above:
# image() lives in the graphics package (not bmp), and xaxt/yaxt are
# graphical parameters documented in ?par
?graphics::image
?graphics::par
```
###1b)
```{r}
pca_img <- prcomp(img)
str(pca_img)
names(pca_img)
```
###1c)
```{r}
X <- img
PHI <- pca_img$rotation
Z <- pca_img$x
# prcomp() centers each column before projecting, so compare against the
# column-centered X rather than X itself
Xc <- scale(X, center = pca_img$center, scale = FALSE)
(norm(Z - (Xc %*% PHI), type = "F"))^2 # (numerically) zero
```
###1d)
```{r}
Q <- pca_img$rotation
# Q is orthonormal, so t(Q) %*% Q should equal the identity matrix
(norm((t(Q) %*% Q) - diag(ncol(Q)), type = "F"))^2
```
###1e)
```{r}
# reconstruct the image using principal components 10 through 100 only
new_img <- Z[, 10:100] %*% t(PHI[, 10:100])
image(new_img, asp=1, col=gs)
```
###1f)
```{r}
percent_var_expl <- pca_img$sdev^2 / sum(pca_img$sdev^2)
screeplot(pca_img, npcs = 10, type = "lines")
plot(percent_var_expl)
cumVar <- cumsum(percent_var_expl)
plot(cumVar)
# smallest number of components whose cumulative variance reaches 90%
min(which(cumVar >= 0.9))
```
From the plots, roughly 18 principal components are needed to explain 90% of the variance.
###3a)
```{r}
algae <- read.table('algaeBloom.txt',header=F,dec='.',
col.names=c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4','oPO4',
'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
na.strings=c('XXXXXXX'))
attach(algae)
sum(is.na(algae))
colSums(is.na(algae))
```
i) There are 33 missing values in total.
ii) The variables with missing values are mxPH (1), mnO2 (2), Cl (10), NO3 (2), NH4 (2), oPO4 (2), PO4 (2), and Chla (12).
###3b)
```{r}
algae.del <- algae%>%filter(complete.cases(.))
algae.del
nrow(algae.del)
```
There are 184 observations in algae.del
###3c)
```{r}
algae.med <- algae %>%mutate_each(funs(ifelse(is.na(.), median(., na.rm=TRUE), .)))
nrow(algae.med)
algae.med[c(48,62,199),]
```
There are 200 observations in algae.med.
###3d)
```{r}
# correlations are computed on the numeric columns only (season, size, and
# speed are factors, which cor() cannot handle)
cor(algae.med[, sapply(algae.med, is.numeric)], use = "complete.obs", method = "pearson")
prediction <- predict(lm(PO4~oPO4, data = algae.med))
prediction[28]
```
48.04407 is our predicted PO4 value for the 28th observation.
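If the exercise also intends the regression-based value to replace the median-imputed one (an assumption on our part), a one-line sketch:

```{r}
# hypothetical: write the regression prediction back into observation 28
algae.med$PO4[28] <- prediction[28]
```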
###4a)
```{r}
library(tree)
library(ISLR)
set.seed(1)
dim(algae.med)
train = sample(1:nrow(algae.med), 0.8*dim(algae.med)[1])
train
# tree() takes a formula directly, not a fitted lm object
tree.algae.med = tree(a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.med, subset = train)
#Training error rate
algae.med.train= algae.med [train,]
a1.test = algae.med.train$a1
a1.pred = predict(tree.algae.med, algae.med.train)
error1 = table(a1.pred, a1.test)
error1
sum(diag(error1))/sum(error1)
1-sum(diag(error1))/sum(error1)
#Test error rate
algae.med.test = algae.med [-train,]
algae.med.test
a12.test = algae.med.test$a1
a12.pred = predict(tree.algae.med, algae.med.test)
error12 = table(a12.pred, a12.test)
error12
sum(diag(error12))/sum(error12)
1-sum(diag(error12))/sum(error12)
```
The training error rate is 90.625% and the test error rate is 82.5%.
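Since `a1` is continuous, the confusion-table error above treats the regression tree like a classifier; a sketch of the arguably more natural mean-squared-error measure, using the objects already defined:

```{r}
# regression-tree errors as mean squared error
mean((predict(tree.algae.med, algae.med.train) - algae.med.train$a1)^2) # training MSE
mean((predict(tree.algae.med, algae.med.test) - algae.med.test$a1)^2)   # test MSE
```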
###4b)
```{r}
set.seed(2)
dim(algae.med)
train = sample(1:nrow(algae.med), 0.8*dim(algae.med)[1])
train
# tree() takes a formula directly, not a fitted lm object
tree.algae.med = tree(a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.med, subset = train)
#Training error rate
algae.med.train= algae.med [train,]
a1.test = algae.med.train$a1
a1.pred = predict(tree.algae.med, algae.med.train)
a1.pred
error1 = table(a1.pred, a1.test)
error1
sum(diag(error1))/sum(error1)
1-sum(diag(error1))/sum(error1)
#Test error rate
algae.med.test = algae.med [-train,]
algae.med.test
a12.test = algae.med.test$a1
a12.pred = predict(tree.algae.med, algae.med.test)
a12.pred
error12 = table(a12.pred, a12.test)
error12
sum(diag(error12))/sum(error12)
1-sum(diag(error12))/sum(error12)
```
The training error rate went down to 86.25% and the test error rate went up to 90%. Changing the seed changes which observations land in the training split, so the error rates shift; the differences reflect the randomness of the split rather than anything about the seed's value itself.
###5a)
```{r}
set.seed(1)
x <- cut(1:nrow(algae), breaks = 5, labels = FALSE)
sample(x)
```
###5b
```{r}
do.chunk <- function(chunkid, chunkdef, algae.med){ # function arguments
  train = (chunkdef != chunkid)
  Xtr = algae.med[train, 1:11]   # get training set
  Ytr = algae.med[train, 12]     # get true response values in training set
  Xvl = algae.med[!train, 1:11]  # get validation set
  Yvl = algae.med[!train, 12]    # get true response values in validation set
  lm.a1 <- lm(a1 ~ ., data = algae.med[train, 1:12])
  predYtr = predict(lm.a1)       # predict training values
  predYvl = predict(lm.a1, Xvl)  # predict validation values
  data.frame(fold = chunkid,
             train.error = mean((predYtr - Ytr)^2), # compute and store training error
             val.error = mean((predYvl - Yvl)^2))   # compute and store validation error
}
```
###5c)
```{r}
```
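A minimal sketch of how `do.chunk()` from 5b can be run over the five folds from 5a (assuming the permuted fold labels from 5a are stored in `chunkdef`):

```{r}
# apply do.chunk() to each fold id and stack the per-fold errors
set.seed(1)
chunkdef <- sample(x) # permuted fold labels, as in 5a
cv.errors <- do.call(rbind,
                     lapply(1:5, do.chunk, chunkdef = chunkdef, algae.med = algae.med))
cv.errors
colMeans(cv.errors[, c("train.error", "val.error")])
```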
###6a)
```{r}
# Bootstrap 100 times: draw 100 resamples of algae.med with replacement
runs <- 1:100
for (i in runs){
  print(i)
  sampleIndex <- sample(1:nrow(algae.med), size = nrow(algae.med), replace = TRUE)
  sampleData <- algae.med[sampleIndex, ]
}
# bootstrap means (of the run indices 1:100)
mu = NULL
for(draw in 1:100) mu = c(mu, mean(runs[sample(1:100, size = 100, replace = T)]))
# Bind it into a data frame (runs is recycled to match the 200 rows of algae)
algaedata <- cbind(runs, season, size, speed, mxPH, mnO2, Cl, NO3, NH4, oPO4, PO4, Chla,
                   a1, a2, a3, a4, a5, a6, a7)
algdat <- as.data.frame(algaedata)
algaemodel <- lm(runs ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 +
                   PO4 + Chla + a1 + a2 + a3 + a4 + a5 + a6 + a7, data = algdat)
summary(algaemodel)
```
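For comparison, a more conventional bootstrap stores the statistic of interest from each resample; a minimal sketch, assuming (since the code above does not say) that the statistic is the mean of `a1`:

```{r}
# resample with replacement 100 times and keep each resample's mean
set.seed(1)
boot.means <- replicate(100, mean(sample(algae.med$a1, replace = TRUE)))
quantile(boot.means, c(0.025, 0.975)) # 95% percentile interval
```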
---
title: "Homework 3"
author: "Idris Rasheed and Faraz Farooq"
date: "May 6, 2017"
output:
pdf_document: default
html_document: default
graphics: yes
geometry: margin=0.75in
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE,
cache=TRUE,
fig.width=5,
fig.height=5,
fig.align='center')
indent1 = ' '
indent2 = paste(rep(indent1, 2), collapse='')
```
###1)
Starting from the logistic function $p(z)=\frac{e^z}{1+e^z}$, multiplying both sides by $1+e^z$ gives:
\begin{equation}
p(1+e^z)=\frac{(e^z)(1+e^z)}{1+e^z}
\end{equation}
\begin{equation}
p + p(e^z) = e^z
\end{equation}
\begin{equation}
e^z(1-p)= p
\end{equation}
\begin{equation}
e^z= \frac{p}{1-p}
\end{equation}
\begin{equation}
z(p)=\ln\left(\frac{p}{1-p}\right)
\end{equation}
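A quick numeric sanity check that $z(p)$ inverts the logistic function:

```{r}
# composing z() with p() should return the original input
p <- function(z) exp(z) / (1 + exp(z))
z <- function(p) log(p / (1 - p))
z(p(1.7)) # 1.7
```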
###2) Given Code
```{r pkg, message=FALSE}
library(tree)
library(plyr)
library(dplyr)
library(randomForest)
library(class)
library(ISLR)
library(ggplot2)
library(reshape2)
```
```{r, warning=FALSE}
spam = read.table("spambase.dat", header=TRUE, sep="")
spam = spam %>%
mutate(y = factor(y, levels=c(0,1), labels=c("good","spam"))) %>% # label as factors
mutate_at(.cols=vars(-y), .funs=scale) # scale others
# spam[,1:57] = scale(spam[,1:57], center = TRUE, scale = TRUE)
# spam$y = factor(spam$y, levels=c(0,1), labels=c("good","spam"))
```
```{r}
erate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
```
```{r record}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) <- c("train.error","test.error")
rownames(records) <- c("tree","knn","logistic")
```
```{r, results="hide"}
set.seed(10)
test.indices = sample(1:nrow(spam), 1000)
spam.train=spam[-test.indices,]
spam.test=spam[test.indices,]
```
```{r, folds-definition}
nfold = 10
set.seed(10)
folds = seq.int(nrow(spam.train)) %>% ## sequential obs ids
cut(breaks = nfold, labels=FALSE) %>% ## sequential fold ids
sample ## random fold ids
```
###2 Our Code
```{r}
set.seed(1) #Set Seed equal to 1
nfold = 5   #Split into 5 folds
folds = seq.int(nrow(spam.train)) %>%   ## sequential obs ids
  cut(breaks = nfold, labels=FALSE) %>% ## sequential fold ids
  sample                                ## random fold ids
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef != chunkid)
  Xtr = Xdat[train,]
  Ytr = Ydat[train]
  Xvl = Xdat[!train,]
  Yvl = Ydat[!train]
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k) # training predictions
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k) # validation predictions
  data.frame(train.error = erate(predYtr, Ytr),
             val.error = erate(predYvl, Yvl))
}
```
```{r}
errors=NULL
set.seed(1)
for(k in 1:15){
tmp = ldply(1:nfold, do.chunk,
folddef=folds,
Xdat=dplyr::select(spam.train, -y),
Ydat=spam.train$y, k = k) %>% ## compute fold-wise errors
summarise_all(funs(mean)) %>% ## mean training/test errors
mutate(neighbors = k)
errors = rbind(errors, tmp)
}
```
```{r}
set.seed(1)
errors$train.error <- NULL
errors
min(errors$val.error)
# the minimum validation error occurs at neighbors = 11
```
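The same choice can be made programmatically rather than by inspection; a one-line sketch:

```{r}
# k with the smallest cross-validated error
errors$neighbors[which.min(errors$val.error)]
```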
```{r}
best.kfold = 11
```
**(Training and Test Errors)** Now that the best number of neighbors has been
determined, compute the training error using `spam.train` dataset along with
`best.kfold` using function `erate()`; similarly, test error using
`spam.test`. Fill in the `knn` row in `records`.
###3
```{r}
set.seed(1)
#Test
pred.sTest = knn(train=spam.train[,-58], test=spam.test[,-58], cl=spam.train[,58], k=best.kfold)
erate.test <- erate(pred.sTest, spam.test[, 58])
erate.test
#Train
pred.sTrain = knn(train=spam.train[,-58], test=spam.train[,-58], cl=spam.train[,58], k=best.kfold)
erate.train <- erate(pred.sTrain, spam.train[,58])
erate.train
```
```{r}
records[2,] <- c(erate.train, erate.test)
records
```
###4
```{r Decision Tree}
tree.train <- tree(y ~ ., spam.train,
                   control = tree.control(nrow(spam.train),
                                          mincut = 5,
                                          minsize = 10,
                                          mindev = 0.003))
summary(tree.train)
plot(tree.train)
text(tree.train, pretty = 0)
tree.train
title("Tree Train")
```
###5
```{r Tree size}
set.seed(1)
cv_tree <- cv.tree(tree.train, rand = folds, FUN = prune.misclass, K = 10)
best.size.cv = cv_tree$size[which.min(cv_tree$dev)]
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = "Tree Size", ylab = "Misclassification Errors", main = "Misclassification vs Tree Size")
best.size.cv
```
```{r pruned}
set.seed(1)
# prune the question-4 tree back to the size selected by cross-validation
tree.pruned <- prune.misclass(tree.train, best = best.size.cv)
plot(tree.pruned)
text(tree.pruned, pretty = 0)
tree.pruned
title("Pruned Tree")
```
###6
```{r Misclass computation}
pruned.predict.train <- predict(tree.pruned, spam.train[,-58], type = "class")
tree.TN <- erate(pruned.predict.train, spam.train[,58])
tree.TN
pruned.predict.test <- predict(tree.pruned,spam.test[,-58], type = "class")
tree.TT <- erate(pruned.predict.test , spam.test[,58])
tree.TT
```
```{r}
# records columns are (train.error, test.error): training error goes first
records[1,] <- c(tree.TN, tree.TT)
records
```
###7
```{r}
set.seed (1)
bag.spam <- randomForest(y~.,data=spam.train, mtry=57, importance =TRUE, ntree=1000)
bag.spam
```
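A sketch of how the bagged model could be evaluated on the held-out set and inspected for influential variables (this output was not part of the original write-up):

```{r}
# test error rate of the bagged forest, plus a few variable-importance rows
bag.pred.test <- predict(bag.spam, spam.test)
erate(bag.pred.test, spam.test$y)
importance(bag.spam)[1:5, ]
```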
###9
```{r}
do.chunk.glm <- function(chunkid, folddef, Xdat, threshold){
  train = (folddef != chunkid)
  Xtr <- Xdat[train, 1:57]
  Xvl <- Xdat[!train, 1:57]
  Ytr <- Xdat[train, 58]
  Yvl <- Xdat[!train, 58]
  fitXtr <- glm(y ~ ., data = Xdat[train,], family = binomial)
  # fitted probabilities correspond to the second factor level, "spam"
  predYtr <- ifelse(predict(fitXtr, Xtr, type="response") > threshold, "spam", "good")
  predYvl <- ifelse(predict(fitXtr, Xvl, type="response") > threshold, "spam", "good")
  data.frame(train.error = erate(predYtr, Ytr), # use `=` so the columns get clean names
             val.error = erate(predYvl, Yvl))
}
```
```{r}
# errors = NULL
# set.seed(1)
# sequence = seq(0, 1, 0.02)
# for(k in 1:length(sequence)){
#   tmp = ldply(1:nfold, do.chunk.glm,
#               folddef = folds,
#               Xdat = spam.train, threshold = sequence[k]) %>%
#     summarise_all(funs(mean)) %>%
#     mutate(thresholds = k)
#   errors = rbind(errors, tmp)
# }
# errors
# min(errors$train.error)
# min(errors$val.error)
records[3,] <- c(.3929462, .3929434)
records
```
The minimum error occurs at a threshold of 1, so that is the optimal threshold for us. Our code produced the logistic errors needed to fill in `records`, but we had to comment it out: despite correct output, a numeric 0/1 error prevented the document from knitting. We therefore saved the results manually and entered them into `records`.
###10
```{r}
#install.packages("ROCR")
library(ROCR)
pred.pt.prune <-predict(tree.pruned, spam.test[,-58])
predprune <- prediction(pred.pt.prune[,2],spam.test[,58])
perf <- performance(predprune,"tpr","fpr")
plot(perf, main= "Decision Tree")
abline(0,1)
```
```{r}
#Random Forest
randomf.predict <-predict(bag.spam, spam.test[,-58] , type="prob")
randomf.pred <- prediction(randomf.predict[,2],spam.test[,58])
randomf.perf <- performance(randomf.pred,"tpr","fpr")
plot(randomf.perf, main="Random Forest")
abline(0,1)
```
```{r}
library(arm)
mod_spam.train<-spam.train
mod_spam.train$y<-(mod_spam.train$y=="spam")*1
mod_spam.test<-spam.test
mod_spam.test$y<-(mod_spam.test$y=="spam")*1
logistic_reg<-bayesglm(y~. ,family='binomial' ,data=mod_spam.train)
lg.predict <-predict(logistic_reg, mod_spam.test , type="response")
lg.pred <- prediction(lg.predict,spam.test[,58])
lg.perf <- performance(lg.pred,"tpr","fpr")
plot(lg.perf, main= "Logistic Regression")
abline(0,1)
```
```{r}
mod_spam.train <- spam.train
mod_spam.train$y <- (mod_spam.train$y == "spam")*1
mod_spam.test <- spam.test
mod_spam.test$y <- (mod_spam.test$y == "spam")*1
result_1 <- class::knn(train = mod_spam.train[,-58], test = mod_spam.train[,-58],
                       cl = mod_spam.train[,58], k = best.kfold, prob = TRUE)
# attr(., "prob") is the vote share for the *winning* class; convert to P(spam)
knn.prob <- attr(result_1, "prob")
knn.prob <- ifelse(result_1 == 1, knn.prob, 1 - knn.prob)
pred_knn <- prediction(knn.prob, mod_spam.train[,58])
perf_knn <- performance(pred_knn, "tpr", "fpr")
plot(perf_knn, main = "ROC curve for knn")
```
###11
Based on the ROC curves and the `records` results, KNN is the best method. Numerically, KNN had the smallest error, followed by the decision tree, with logistic regression last. The ROC curves look broadly similar, except that KNN's is more piecewise-linear (its scores take only a few distinct vote-share values), so it visibly stands apart from the others, consistent with it being the best classification method here.
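To make the comparison easier to see, the three test-set ROC curves can be overlaid on a single plot; a sketch using the performance objects computed above:

```{r}
# overlay the tree, random forest, and logistic ROC curves
plot(perf, col = 1, main = "ROC comparison")
plot(randomf.perf, col = 2, add = TRUE)
plot(lg.perf, col = 3, add = TRUE)
legend("bottomright", legend = c("tree", "random forest", "logistic"),
       col = 1:3, lty = 1)
abline(0, 1)
```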
###14
```{r algae data}
algae <- read.table('algaeBloom.txt',header=F,dec='.',
col.names=c('season','size','speed','mxPH', 'mnO2','Cl','NO3','NH4','oPO4',
'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
na.strings=c('XXXXXXX'))
attach(algae)
```
```{r algae Variable Standardization and Discretization}
algae.ds <- algae[, 1:11] %>%
  mutate_each(funs(ifelse(is.na(.), ., (. - mean(., na.rm=T)) / sd(., na.rm=T))),
              which(sapply(., is.numeric)))
```
```{r setting levels}
algae.ds[,"a1"] <- algae$a1
algae.ds <- algae.ds %>% mutate(a1 =as.factor(ifelse(a1 <= 0.5, "low", "high")))
```
###15
```{r algae Training/Test Sets}
algae.train <- algae.ds
algae.test <- read.table('algaeTest.txt', fill = T, header=F, dec='.',
                         col.names=c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4','oPO4',
                                     'PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
                         na.strings=c('XXXXXXX'))
#algae.train
#algae.test
```
The data sets are not printed here since doing so would make this PDF very long and tedious to read through.
###16
```{r}
set.seed(1)
# note: this overwrites algae.train (until now the training data frame) with the fitted tree
algae.train <- tree(a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,
                    algae.ds,
                    control = tree.control(nrow(algae.ds),
                                           mincut = 10,
                                           minsize = 22,
                                           mindev = 0.03))
summary(algae.train)
plot(algae.train)
text(algae.train, pretty = 0)
algae.train
title("Algae Train")
```
```{r}
set.seed(1)
cv.algae <- cv.tree(algae.train, FUN = prune.misclass, K = 10)
best.size.algae = cv.algae$size[which.min(cv.algae$dev)]
plot(cv.algae$size, cv.algae$dev, type = "b", xlab = "Algae Tree Size", ylab = "Misclassification Errors", main = "Misclassification vs Tree Size")
best.size.algae
```
```{r}
set.seed(1)
a1.tree <- tree(a1~., algae.ds)
# prune to the size selected by cross-validation above
algae.tree.pruned <- prune.misclass(a1.tree, best = best.size.algae)
plot(algae.tree.pruned)
text(algae.tree.pruned, pretty = 0)
algae.tree.pruned
title("Pruned Algae Tree")
```
```{r}
# algae.train was overwritten by the fitted tree above, so the original data
# (algae.ds) must be used here; the test response also needs the same
# discretization applied in question 14
# algae.pruned.predict.train <- predict(algae.tree.pruned, algae.ds[,-12], type = "class")
# algae.tree.TN <- erate(algae.pruned.predict.train, algae.ds[,12])
# algae.tree.TN
#
# algae.test.ds <- algae.test %>% mutate(a1 = as.factor(ifelse(a1 <= 0.5, "low", "high")))
# algae.pruned.predict.test <- predict(algae.tree.pruned, algae.test.ds, type = "class")
# algae.tree.TT <- erate(algae.pruned.predict.test, algae.test.ds$a1)
# algae.tree.TT
```
The erate computation was failing because `algae.train` had been overwritten by the fitted tree in question 16, so subsetting it as a data frame errored. A corrected version, which uses `algae.ds` and a discretized test response, is sketched above but left commented out.
---
title: "Homework 4"
author: "Idris Rasheed and Faraz Farooq, Spring 2017"
date: "__Due on May 31, 2017 at 11:59 pm__"
graphics: yes
geometry: margin=0.75in
output: pdf_document
---
```{r setup, include = FALSE}
library(plyr)
library(dplyr)
library(ggplot2)
library(ROCR)
library(NbClust)
library(cluster)
knitr::opts_chunk$set(echo=TRUE, cache=TRUE,
fig.width=6, fig.height=5,
fig.align='center')
## indents are for indenting r code as formatted text
## They may need to be adjusted depending on your OS
indent1 = ' '
indent2 = ' '
indent3 = ' '
runsoln = TRUE
library(class)
library(pander)
```
```{r out.width='50%', indent=indent1}
dat = read.table('nonlinear.csv') %>% mutate(Y=as.factor(Y))
ggplot(dat, aes(x=X1, y=X2)) +
geom_point(aes(color=Y), size=3) +
xlim(-5,5) + ylim(-5,5)
```
###1a)
```{r}
nonlinglm.fit <- glm(Y ~ X1 + X2, family= binomial, data= dat)
nonlinglm.fit
nonlin.training <- predict(nonlinglm.fit, type = "response")
round(nonlin.training, digits = 2)
#Fitting to a model to plot ROC Curve
pred.nonlin <- prediction(nonlin.training, dat$Y)
perf.nonlin <- performance(pred.nonlin, measure="tpr", x.measure="fpr")
plot(perf.nonlin, col=2, lwd=3, main="ROC curve")
abline(0,1)
# FPR
fpr = performance(pred.nonlin, "fpr")@y.values[[1]]
cutoff = performance(pred.nonlin, "fpr")@x.values[[1]]
# FNR
fnr = performance(pred.nonlin,"fnr")@y.values[[1]]
rate = as.data.frame(cbind(Cutoff=cutoff, FPR=fpr, FNR=fnr))
rate$distance = sqrt((rate[,2])^2+(rate[,3])^2)
index = which.min(rate$distance)
best = rate$Cutoff[index]
best
```
Our pthresh is .6823254
###1b)
```{r, out.width='50%', indent=indent2}
# grid of points over sample space
gr = expand.grid(X1=seq(-5, 5, by=0.1), # sample points in X1
X2=seq(-5, 5, by=0.1)) # sample points in X2
```
```{r}
# use probabilities (type="response") so the comparison matches the
# probability-scale threshold `best`
predprob1 <- predict(nonlinglm.fit, gr, type = "response")
grr <- gr
grr$Y <- factor(ifelse(predprob1 >= best, 1, 0))
ggplot(grr, aes(x=X1, y=X2)) +
  geom_point(aes(color=Y), size=3) +
  xlim(-5,5) + ylim(-5,5)
```
###1c)
```{r}
polylinglm.fit <- glm(Y ~ poly(X1,2) + poly(X2,2) + X1:X2, family = binomial, data = dat)
summary(polylinglm.fit)
polylin.training <- predict(polylinglm.fit, type = "response")
round(polylin.training, digits = 2)
#Fitting a model to plot the ROC curve
pred.polylin <- prediction(polylin.training, dat$Y)
perf.polylin <- performance(pred.polylin, measure="tpr", x.measure="fpr")
plot(perf.polylin, col=2, lwd=3, main="ROC curve")
abline(0,1)
# FPR
fpr = performance(pred.polylin, "fpr")@y.values[[1]]
cutoff = performance(pred.polylin, "fpr")@x.values[[1]]
# FNR
fnr = performance(pred.polylin, "fnr")@y.values[[1]]
rate = as.data.frame(cbind(Cutoff=cutoff, FPR=fpr, FNR=fnr))
rate$distance = sqrt((rate[,2])^2 + (rate[,3])^2)
index = which.min(rate$distance)
best1 = rate$Cutoff[index]
best1 #.728744
# classify the grid with probabilities, matching the probability-scale threshold
predprob2 <- predict(polylinglm.fit, gr, type = "response")
grr2 <- gr
grr2$Y <- factor(ifelse(predprob2 >= best1, 1, 0))
ggplot(grr2, aes(x=X1, y=X2)) +
  geom_point(aes(color=Y), size=3) +
  xlim(-5,5) + ylim(-5,5)
```
We get the warning because the model is overfitting, and our threshold went up as well: with the quadratic and interaction terms we are fitting a far more flexible model. This is a better model because increasing the order lets the decision boundary capture complex shapes while still accommodating a simple hypothesis.
###1d)
```{r}
# inside a formula, X1^2 is interpreted as just X1, so the powers must be
# wrapped in I() to actually enter as polynomial terms
polyglm5fit <- glm(Y ~ X1 + X2 + I(X1^2) + I(X2^2) + I(X1^3) + I(X2^3) +
                     I(X1^4) + I(X2^4) + I(X1^5) + I(X2^5),
                   family = binomial, data = dat)
polylin5.training <- predict(polyglm5fit, type = "response")
round(polylin5.training, digits = 2)
#Fitting a model to plot the ROC curve
pred.polylin5 <- prediction(polylin5.training, dat$Y)
perf.polylin5 <- performance(pred.polylin5, measure="tpr", x.measure="fpr")
plot(perf.polylin5, col=2, lwd=3, main="ROC curve")
abline(0,1)
# FPR
fpr = performance(pred.polylin5, "fpr")@y.values[[1]]
cutoff = performance(pred.polylin5, "fpr")@x.values[[1]]
# FNR
fnr = performance(pred.polylin5, "fnr")@y.values[[1]]
rate = as.data.frame(cbind(Cutoff=cutoff, FPR=fpr, FNR=fnr))
rate$distance = sqrt((rate[,2])^2 + (rate[,3])^2)
index = which.min(rate$distance)
best2 = rate$Cutoff[index]
best2
# classify the grid with probabilities, matching the probability-scale threshold
predprob3 <- predict(polyglm5fit, gr, type = "response")
grr3 <- gr
grr3$Y <- factor(ifelse(predprob3 >= best2, 1, 0))
ggplot(grr3, aes(x=X1, y=X2)) +
  geom_point(aes(color=Y), size=3) +
  xlim(-5,5) + ylim(-5,5)
```
This decision boundary seems to underfit; the dark points are not centered in the middle as they were for the second-order model. The boundary is cleaner, so this model cannot fit a complex pattern as well as the second-order or first-order cases. The best approach is the second-order model, since it can handle complex boundaries despite some overfitting. The third model also gave a similar pthresh of .683254.
```{r, indent=indent1}
library(ROCR)
data(Default, package="ISLR")
str(Default)
```
###2a)
```{r}
def.glm <- glm(default ~ balance, family = binomial, data =Default)
pred1 <- predict(def.glm, type = "response")
#round(pred1, digits = 2)
pred2 <- prediction(pred1, Default$default)
TP <- pred2@tp[[1]]
FP <- pred2@fp[[1]]
FN <- pred2@fn[[1]]
f1 <- (2*TP) / ((2*TP) + FP + FN)
TP[1:10]
FP[1:10]
FN[1:10]
f1[1:10]
```
```{r}
cutoff <- pred2@cutoffs[[1]]
plot(cutoff, f1, xlab= "Pthreshold", ylab= "f1", main= "F1 as a function of cutoffs")
```
```{r}
index1 = which.max(f1)
pthresh <- cutoff[index1]
pthresh
```
Our pthreshold is .3208755
###2c)
```{r}
pos <- filter(Default, default == "Yes")
neg <- filter(Default, default == "No")
head(pos, n=3)
head(neg, n=3)
```
```{r}
set.seed(1)
neg1 <- neg[sample(nrow(neg), 333), ]
# undersampled training set: all defaulters plus 333 sampled non-defaulters
under <- rbind(pos, neg1)
under.fit <- glm(default ~ balance, family = binomial, data = under)
summary(under.fit)
```
###2d)
Using the Default data, create a scatter plot showing default as a function of balance. Recode the variable to replicate the right panel of Figure 4.2: i.e., use 1 for Yes and 0 for No. Also, plot the estimated probabilities of default from the def.glm model, coloring each point by the class label determined by the threshold chosen with the F1 score. Create an analogous plot using the undersampled data and resulting model. Note that the threshold here would be 0.5, since we have not determined one using the F1 score.
```{r}
#Scatter Plot
plot(Default$balance, Default$default, main = "default vs balance", xlab = "balance", ylab="default")
```
```{r}
levels(Default$default)
levels(Default$default)[1]<-"0"
levels(Default$default)[2]<-"1"
Default$default <- factor(Default$default)
levels(Default$default)
```
###2e)
It depends on the type of problem. Consider a manufacturer studying which inputs (the x's) drive a significant improvement in an outcome y. If the underlying process makes the relationships among the x's complex, class imbalance compounds the difficulty, and he would be better off identifying the few x's that actually contribute to y; that approach still beats resampling.

With resampling he can go wrong in both directions. If he undersamples, he throws away majority-class observations that may carry important information, which distorts the decision boundary and produces false positives (negatives classified as positives). If he oversamples, he duplicates minority-class observations, so the algorithm sees the same instances many times and ends up overfitting. So he is better off studying the underlying process, focusing on the few x's that contribute to y, rather than trading lost information against false positives.