<!DOCTYPE html>
<html>
<head>
<title>AI Bytes - Classification</title>
<meta charset="utf-8">
<style>
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body {
font-family: 'Droid Serif';
}
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: 400;
margin-bottom: 0;
}
.remark-slide-content h1 { font-size: 3em; }
.remark-slide-content h2 { font-size: 2em; }
.remark-slide-content h3 { font-size: 1.6em; }
.footnote {
position: absolute;
bottom: 3em;
font-size: 0.6em;
}
.header {
position: absolute;
top: 1em;
right: 1em;
font-size: 0.6em;
}
li p { line-height: 1.25em; }
.red { color: #fa0000; }
.large { font-size: 2em; }
a, a > code {
color: rgb(249, 38, 114);
text-decoration: none;
}
code {
border-radius: 5px;
}
.boxed {
background: #000;
color: #fff;
padding: 1em;
width: 90%;
border-radius: 5px;
}
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
.remark-code-line-highlighted { background-color: #373832; }
.pull-left {
float: left;
width: 47%;
}
.pull-right {
float: right;
width: 47%;
}
.pull-right ~ p {
clear: both;
}
#slideshow .slide .content code {
font-size: 0.8em;
}
#slideshow .slide .content pre code {
font-size: 0.9em;
padding: 15px;
}
.inverse {
background: #272822;
color: #777872;
text-shadow: 0 0 20px #333;
}
.inverse h1, .inverse h2 {
color: #f3f3f3;
line-height: 0.8em;
}
.foot {
vertical-align: super;
font-size: 0.8em;
color: darkblue;
}
/* Slide-specific styling */
#slide-inverse .footnote {
bottom: 12px;
left: 20px;
}
#slide-how .slides {
font-size: 0.9em;
position: absolute;
top: 151px;
right: 140px;
}
#slide-how .slides h3 {
margin-top: 0.2em;
}
#slide-how .slides .first, #slide-how .slides .second {
padding: 1px 20px;
height: 90px;
width: 120px;
-moz-box-shadow: 0 0 10px #777;
-webkit-box-shadow: 0 0 10px #777;
box-shadow: 0 0 10px #777;
}
#slide-how .slides .first {
background: #fff;
position: absolute;
top: 20%;
left: 20%;
z-index: 1;
}
#slide-how .slides .second {
position: relative;
background: #fff;
z-index: 0;
}
/* Two-column layout */
.left-column {
color: #777;
width: 20%;
height: 92%;
float: left;
}
.left-column h2:last-of-type, .left-column h3:last-child {
color: #000;
}
.right-column {
width: 75%;
float: right;
padding-top: 1em;
}
/* Table formatting */
table { border-collapse: collapse }
th, td {
padding: 5px;
border: 1px solid gray;
}
/* Image formatting */
.image-25 img {
width: 25%;
}
.image-30 img {
width: 30%;
}
.image-40 img {
width: 40%;
}
.image-50 img {
width: 50%;
}
.image-60 img {
width: 60%;
}
.image-70 img {
width: 70%;
}
.image-80 img {
width: 80%;
}
.image-90 img {
width: 90%;
}
.image-100 img {
width: 100%;
}
.caption {
font-size: .8em;
}
/* Top Nav*/
.top-column {
color: #777;
font-weight: bold;
border-bottom: 1px solid #777;
margin-bottom: 1em;
}
.top-column ul li {
display:inline;
padding-right: 1em
}
.top-column ul li:last-of-type {
color: #000;
display:inline
}
/*Blockquotes*/
blockquote {
padding-left: 1em;
border-left: 5px solid #5996d0;
}
</style>
</head>
<body>
<textarea id="source">
class: center, middle
.header[
<small>Created with [remark](https://github.com/gnab/remark)</small>
]
# AI Bytes
## Classification
<hr>
Dileep Patchigolla
May 5, 2020
---
# About me
Dileep Patchigolla
- Lead Engineer @ Search
- With Target since July 2018
- MS, Computational Data Analytics - Georgia Tech
- B.Tech, Mechanical Engineering - IIT Madras
---
# Agenda
- What is Classification?
- Statistical approach: Logistic Regression
- Geometric approach: SVM
- Rules based approach: Decision Trees
- Rules based approach, mixing models: Random Forest
---
## Classification
Binary classification:
_Yes / No Problems_
- Should a credit card application be approved?
- Is this transaction genuine?
- Does the mail belong to spam?
- Is this customer male or female?
<br>
<br>
<br>
--
_But what if there are more than two classes?_
Multi-class classification:
- What marketing campaign should I send to a customer?
- Which category does a query belong to?
- Who is the person in an image?
---
## IRIS Dataset
- Features
- Sepal length & width, Petal length & width
- Label
- Setosa, Versicolor, Virginica
.center.image-50[![](iris_sample.png)]
--
<br>
### Binary Classification:
- Let's consider two classes: Versicolor and Virginica
--
- Call one class 0 and the other 1, typically based on which case we consider _positive_. Here the choice is arbitrary
- Come up with a function `$f(x)$` which gives the value of `$y$`, for a given `$x$`
- Model the likelihood of a datapoint belonging to a class. i.e.,
`$\qquad\qquad\qquad\qquad\qquad\qquad P(Y==1|X)$`
---
# Linear Regression - Quick Review
.center.image-50[![](linear_regression.png)]
<br>
`$\qquad\qquad\qquad\qquad\qquad\qquad y = w_{0} + w_{1} x_{1}$`
- `$y$` is continuous and its range can be very wide.
- Fit a straight line on the features, to get an estimate of `$y$`
---
# Logistic Regression
`$f(x) = w_0 + w_{1}x_{1}$`
- Similar to Linear Regression (linear formulation)
- But `$y$` is not continuous. Takes only two values, 0 and 1
--
- Turn the discrete problem into a continuous one: predict `$P(Y==1|X)$`, i.e., the _probability_ that `$y=1$`
--
- Need a transformation that always bounds `$f(x)$` between 0 and 1
---
# Sigmoid function
`$P(Y==1|X) = \frac{1}{1+ e^{-f(x)}} = \sigma(f(x))$`
.center.image-50[![](sigmoid.png)]
- Bounded between 0 and 1
- Asymptotically gets closer to 0 and 1
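
A minimal NumPy sketch of the function (the `sigmoid` helper name is just for illustration):

    import numpy as np

    def sigmoid(z):
        # maps any real-valued score into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.00005, 0.5, 0.99995]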
---
# Logit function
- So what does the linear equation `$f(x)$` mean?
--
Let's denote `$P(Y==1|X)$` by `$p$`
`$\qquad\qquad\qquad p = \frac{1}{1+ e^{-f(x)}}$`
--
`$\qquad\qquad\qquad \frac{1}{p} = 1+ e^{-f(x)}$`
--
`$\qquad\qquad\qquad \frac{1-p}{p} = e^{-f(x)}$`
--
`$\qquad\qquad\qquad \frac{p}{1-p} = e^{f(x)}$`
--
`$\qquad\qquad\qquad log(\frac{p}{1-p}) = f(x)$`
- `$log(\frac{p}{1-p})$` is called the logit function: the log of the odds
- odds = `$\frac{\text{\#yes}}{\text{\#no}}$`
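
A quick numerical check of the derivation (1.7 is an arbitrary value of `$f(x)$`): the logit undoes the sigmoid.

    import numpy as np

    z = 1.7                        # some value of f(x)
    p = 1.0 / (1.0 + np.exp(-z))   # sigmoid
    print(np.log(p / (1 - p)))     # recovers 1.7, the log-odds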
---
# Weights
- `$f(x)$` can take several variables. i.e., `$f(x) = w_0 + w_{1}x_{1} + . . . w_{m}x_{m}$`
- In our example, `$x_{1} = $` Sepal length, `$x_{2} = $` Sepal width etc.
- But what about `$w_{0}, w_{1},...w_{m}$`?
- Use the available data and their labels.
- Create a loss function and try to minimize that
---
# Log Loss
<br/><br/>
`$ \qquad\qquad\qquad\qquad \hat{y} = P(Y==1|X) = \frac{1}{1+ e^{-(w_0 + w_{1}x_{1} + . . . w_{m}x_{m})}} $`
<br/><br/>
`$ \qquad\qquad\qquad \qquad L = -\frac{1}{n} \Sigma_{i=1}^{n} \enspace y_{i} log(\hat{y_{i}}) + (1-y_{i}) log(1-\hat{y_{i}})$`
- a.k.a. Binary Cross Entropy. Minimize this value
- If `$y_{i} = 0$`, the first part of the loss is zero and the second part is active. Loss = `$-log(1-\hat{y}_{i})$`
--
- If `$\hat{y}_{i}$` is close to 0 (_correct prediction_), then `$1-\hat{y}_{i}$` is close to 1 and `$-log(1-\hat{y}_{i})$` gets close to 0. The loss contribution is near zero
--
- If `$\hat{y}_{i}$` is close to 1 (_wrong prediction_), then `$1-\hat{y}_{i}$` is close to 0 and `$-log(1-\hat{y}_{i})$` blows up towards `$+\infty$`. The loss contribution is huge
--
- If `$y_{i} = 1$`, the roles reverse. Loss = `$-log(\hat{y}_{i})$`
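
A minimal sketch of the loss in NumPy (the `log_loss_manual` name is illustrative; `sklearn.metrics.log_loss` computes the same quantity):

    import numpy as np

    def log_loss_manual(y, y_hat, eps=1e-12):
        # binary cross entropy, averaged over the n samples
        y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    y, y_hat = np.array([1, 0, 1, 0]), np.array([0.9, 0.1, 0.8, 0.3])
    print(log_loss_manual(y, y_hat))  # small, since the predictions are mostly right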
---
# Gradient Descent
- Initialize weights randomly
- Iterate until convergence:
- Compute the Loss
- Compute Gradient, and update weights
--
- Fun fact:
- `$\frac{\partial L}{\partial w} = -\frac{1}{n}\Sigma_{i=1}^{n} x_{i}(y_{i} - \hat{y}_{i})$`
- `$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1-\sigma(x))$`.foot[1]
.footnote[
[1] https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x
]
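
A minimal gradient-descent sketch for logistic regression (the learning rate and iteration count are arbitrary, untuned choices):

    import numpy as np

    def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
        # X: (n, m) feature matrix, y: (n,) labels in {0, 1}
        X = np.column_stack([np.ones(len(X)), X])   # prepend 1s so w[0] acts as w0
        w = np.random.randn(X.shape[1]) * 0.01      # initialize weights randomly
        for _ in range(n_iter):
            y_hat = 1.0 / (1.0 + np.exp(-X @ w))    # sigmoid(f(x))
            grad = -(X.T @ (y - y_hat)) / len(y)    # the gradient from above
            w -= lr * grad                          # update step
        return w

On the petal features used in the next slide, this lands close to scikit-learn's coefficients (up to the L2 regularization scikit-learn applies by default).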
---
# Example
.center.image-50[![](iris_nonseparable.png)]
- `$x_{1}$` = petal length and `$x_{2}$` = petal width
- `$y$` = 1 for Versicolor and 0 for Virginica
---
# Example (Cont.)

    from sklearn.linear_model import LogisticRegression
    import numpy as np

    logistic = LogisticRegression(solver="lbfgs")
    logistic.fit(iris_data.iloc[:, 2:4], iris_data["Y"])
    print(np.round(logistic.intercept_, 2)[0], np.round(logistic.coef_, 2)[0])
- `$w_{0} = 17.55$`, `$w_{1} = -2.78$`, `$w_{2} = -2.39$`
--
- `$\hat{y} = \frac{1}{1+ e^{-(w_0 + w_{1}x_{1} + w_{2}x_{2})}} $`
--
- But `$\hat{y}$` is not 0 or 1. How do we convert `$\hat{y}$` to a yes/no decision?
--
- Identify a threshold for cutoff
.center.image-40[![](yhat_dist.png)]
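
One way to turn the probabilities into decisions, as a sketch (the 0.5 cutoff is only a starting point; the metrics slides that follow show how to choose it):

    # second column of predict_proba is P(Y==1|X)
    y_hat = logistic.predict_proba(iris_data.iloc[:, 2:4])[:, 1]
    y_pred = (y_hat >= 0.5).astype(int)   # 1 above the threshold, else 0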
---
# Metrics
For a given threshold:
### Confusion Matrix:
.center.image-40[![](confusion_matrix_def.jpeg)]
- Accuracy: `$\frac{\text{True Positive + True Negative}}{\text{Total Count}}$`
- Precision: `$\frac{\text{True Positive}}{\text{True Positive + False Positive}}$`
- Recall/True Positive Rate: `$\frac{\text{True Positive}}{\text{True Positive + False Negative}}$`
- F1-Score: `$\frac{2}{Precision^{-1} + Recall^{-1}}$`
- False Positive Rate: `$\frac{\text{False Positive}}{\text{False Positive + True Negative}}$`
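
All of these can be read off scikit-learn; a minimal sketch, assuming `y_pred` holds the thresholded decisions from the previous slide:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = iris_data["Y"]
    print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
    print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))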
---
# Metrics
For varying threshold:
ROC Curve
.center.image-40[![](roc_definition.png)]
F1 Score at varying threshold
.center.image-40[![](f1_range.png)]
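
A sketch of the inputs behind both plots, assuming `y_true` and the probabilities `y_hat` from the earlier slides:

    import numpy as np
    from sklearn.metrics import roc_curve, f1_score

    fpr, tpr, thresholds = roc_curve(y_true, y_hat)   # one (FPR, TPR) point per threshold
    f1s = [f1_score(y_true, (y_hat >= t).astype(int)) for t in np.arange(0.05, 1.0, 0.05)]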
---
# Connection with Neural Nets
<br><br><br>
.center.middle.image-40[![](logistic_nn.png)]
- Logistic Regression is a Neural Network with no hidden layers: a single output unit with sigmoid activation
---
# Logistic Regression
### Review
- `$\hat{y} = \frac{1}{1+ e^{-(w_0 + w_{1}x_{1} + w_{2}x_{2})}} $`
- Returns the probability of y = 1
- Need to identify a threshold by checking for precision and recall (best f1-score is one way)
- A Neural Network with no hidden layers: one output neuron with sigmoid activation
- In case of multi-class classification, can be modeled as multiple one-vs-rest models
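
A sketch of the one-vs-rest idea on all three classes (scikit-learn can also handle multi-class directly; the explicit wrapper just makes the idea visible, and the `Species` column name is an assumption about the dataframe):

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    # one binary "this class vs. everything else" model per class
    ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
    ovr.fit(iris_data.iloc[:, 0:4], iris_data["Species"])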
---
layout: false
class: center, middle
# SVM
---
# A Geometric Approach
.center.image-50[![](linear_separable.png)]
- How to separate them?
---
# A Geometric Approach
.center.image-50[![](perceptron_result.png)]
- Which line is better?
---
# A Geometric Approach
.center.image-50[![](maximum_margin_iris.png)]
- How about this?
---
# Maximum Margin Classifier
Linearly separable data
.center.image-40[![](maximum_margin_classifier.png)]
- The thick center line is called the decision boundary
- Both broken lines are equidistant from the decision boundary; the distance is called the Margin (M)
- Choose the center line such that M is maximum
- Some points always touch the margin. These are called _support vectors_
---
# Maximum Margin Classifier
Linearly separable data
.center.image-40[![](maximum_margin_classifier.png)]
- The dotted lines take the form `$x^{T}\beta + \beta_{0} = 1$` and `$x^{T}\beta + \beta_{0} = -1$`
- Dist. between parallel lines of form `$b_{0} + w_{1}x_{1} + w_{2}x_{2} = 0$` and `$b_{1} + w_{1}x_{1} + w_{2}x_{2} = 0$` is `$\frac{|b_{0} - b_{1}|}{\sqrt{w_{1}^{2} + w_{2}^{2}}}$`
- Therefore, M = `$\frac{1}{||\beta||}$`. Maximize M = Minimize `$||\beta||$`
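
A sketch of reading M off a fitted linear SVM (a large `C` approximates the hard-margin case, and the two features and label follow the later example; `coef_` holds `$\beta$` when the kernel is linear):

    import numpy as np
    from sklearn.svm import SVC

    svm_hard = SVC(kernel="linear", C=1e6)
    svm_hard.fit(iris_data.iloc[:, 0:2], iris_data["Y"])
    margin = 1.0 / np.linalg.norm(svm_hard.coef_)   # M = 1 / ||beta||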
---
# Maximum Margin Classifier
<br>
Objective Function:
.center[Minimize `$||\beta||$`]
s.t. `$\qquad\qquad\qquad\qquad\qquad y_{i} (x_{i}^{T}\beta + \beta_{0}) \geq 1 \quad \forall i=1,2,....N$`
<br><br><br>
- We denote `$y$` as +1/-1
- sign`$(x_{i}^{T}\beta + \beta_{0})$` = class of `$x_{i}$` for every `$i$`
--
- So, what if the data is not separable linearly?
---
# Support Vector Classifier
Not linearly separable
.center.image-40[![](svc.png)]
- An error term `$\xi$` is introduced. It measures how far a point lies beyond its margin, on the wrong side.
---
# Objective Function
Minimize `$\frac{1}{2}||\beta||^{2} + C\enspace \Sigma_{i=1}^{n} \xi_{i}$`
s.t. `$\qquad\qquad\qquad \xi_{i} \geq 0, y_{i} (x_{i}^{T}\beta + \beta_{0}) \geq 1 - \xi_{i}\quad \forall i=1,2,....N$`
- C: Cost parameter. Higher C, low tolerance for errors. Lower C, high tolerance for errors.
--
.center.image-60[![](svm_C.png)]
---
# Solution
`$\qquad\qquad\qquad\qquad\qquad\qquad \hat{\beta}=\Sigma_{i=1}^{N} \hat{\alpha_{i}}y_{i}x_{i}$`.foot[1]
- `$\hat{\beta}$` depends only on the Support vector points.
- Predicted Class = `$Sign[x^{T}\beta + \beta_{0}]$` = `$Sign[\Sigma_{i=1}^{N} \hat{\alpha_{i}}y_{i}x_{i}^{T}x + \beta_{0}]$`
- We don't even need `$\beta$` to get predictions. We just need `$\alpha_{i}$` for every support vector.
.footnote[
[1] Elements of Statistical Learning, p 420-421 ]
--
`$\qquad\qquad\qquad\qquad max L_{D} = \Sigma_{i=1}^{N}\alpha_{i} - \frac{1}{2}\Sigma_{i=1}^{N}\Sigma_{\hat{i}=1}^{N}\alpha_{i}\alpha_{\hat{i}}y_{i}y_{\hat{i}}x^{T}_{i}x_{\hat{i}} $`
- `$\alpha$` values depend only on dot products between pairs of training data points
- Predictions depend only on dot product between support vectors and new data point
- Dot products are central to SVM
---
# Non linear separation?
<br><br>
.center.image-40[![](non_linear_separation.png)]
### Linear classifier doesn't look like a good option.
--
Project data to higher dimensions, and classify there!
.center.image-40[![](projection1.png)]
Can go to more dimensions as well.
.footnote[https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/]
---
# Kernels
- `$[x, x^{2}]$` is our new projection space
- We only need dot products, so in the projected space we want `$x_{1}x_{2} + x_{1}^{2}x_{2}^{2}$`
- Define `$K(x_{1}, x_{2}) = (x_{1}x_{2} + r)^{d}$`
- For `$r=\frac{1}{2}$` and `$d=2$`, `$K(x_{1}, x_{2})$` becomes `$x_{1}x_{2} + x_{1}^{2}x_{2}^{2} + \frac{1}{4}$`, or dot product between `$(x_{1}, x_{1}^{2}, \frac{1}{2})$` and `$(x_{2}, x_{2}^{2}, \frac{1}{2})$`
- The function `$K(x_{1}, x_{2})$` gets us all information without projecting into higher dimensions
- Any function `$K$` can act as a kernel as long as it satisfies the following property: `$K(x_{1}, x_{2}) = < f(x_{1}), f(x_{2}) > $` for some mapping `$f$`
- The kernel above is called the Polynomial Kernel. `$r$` and `$d$` are hyperparameters
- And this is called **Kernel Trick!!!**
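
A quick numerical check of the trick for this polynomial kernel in one dimension (plain NumPy; the helper names are illustrative):

    import numpy as np

    def poly_kernel(x1, x2, r=0.5, d=2):
        return (x1 * x2 + r) ** d              # evaluated in the original 1-D space

    def project(x):
        return np.array([x, x ** 2, 0.5])      # explicit projection (x, x^2, 1/2)

    x1, x2 = 1.3, -0.7
    print(poly_kernel(x1, x2))                 # same number...
    print(project(x1) @ project(x2))           # ...as the dot product after projecting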
---
# RBF Kernel
- What if we want to expand to infinite dimensions? i.e., `$x, x^{2}, x^{3}, x^{4}, .... $`
--
- `$K(x_{1}, x_{2}) = e^{-\gamma ||x_{1} - x_{2}||^{2}}$`
--
- From Taylor series, `$e^{x} = 1 + x + \frac{x^{2}}{2!} + ... = \Sigma_{n=0}^{\infty} \frac{x^{n}}{n!}$`
--
- Let `$\gamma = \frac{1}{2}$`
`$\qquad e^{-\frac{1}{2} ||x_{1} - x_{2}||^{2}} = e^{\frac{-1}{2} (x_{1}^{2} + x_{2}^{2})} e^{< x_{1}, x_{2} >} $`
`$\qquad e^{\frac{-1}{2} (x_{1}^{2} + x_{2}^{2})} (1 + x_{1}x_{2} + \frac{x_{1}^{2}x_{2}^{2}}{2!} + ....)$`
`$\qquad e^{\frac{-1}{2} (x_{1}^{2} + x_{2}^{2})} < (1 , x_{1} , \frac{x_{1}^{2}}{\sqrt{2!}} ,....), (1 , x_{2} , \frac{x_{2}^{2}}{\sqrt{2!}}, ....) >$`
Let `$e^{\frac{-x_{1}^{2}}{2}} = s_{1}, e^{\frac{-x_{2}^{2}}{2}} = s_{2}$`
`$\qquad < (s_{1} , s_{1}x_{1} , s_{1}\frac{x_{1}^{2}}{\sqrt{2!}} ,....), (s_{2} , s_{2}x_{2} , s_{2}\frac{x_{2}^{2}}{\sqrt{2!}}, ....) >$`
- Therefore, the RBF Kernel computes the dot product in an infinite-dimensional feature space
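
The same kind of check for the RBF kernel: the closed form against a truncated version of the infinite expansion (a sketch; ten Taylor terms are plenty here):

    import numpy as np
    from math import factorial

    def rbf(x1, x2, gamma=0.5):
        return np.exp(-gamma * (x1 - x2) ** 2)          # closed form, gamma = 1/2

    def rbf_truncated(x1, x2, terms=10):
        s1, s2 = np.exp(-x1 ** 2 / 2), np.exp(-x2 ** 2 / 2)
        phi1 = s1 * np.array([x1 ** n / np.sqrt(factorial(n)) for n in range(terms)])
        phi2 = s2 * np.array([x2 ** n / np.sqrt(factorial(n)) for n in range(terms)])
        return phi1 @ phi2                               # dot product in the truncated feature space

    print(rbf(0.8, -0.3), rbf_truncated(0.8, -0.3))      # nearly identical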
---
# Example

    from sklearn.svm import SVC

    svm_lin = SVC(kernel="linear")
    svm_lin.fit(iris_data.iloc[:, 0:2], iris_data["Y"])
    svm_rbf = SVC(C=100000, kernel="rbf", gamma="auto")
    svm_rbf.fit(iris_data.iloc[:, 0:2], iris_data["Y"])
.center.image-60[![](iris_svm_rbf.png)]
--
Careful not to overfit!
---
layout: false
class: center, middle
# Decision Trees
---
# Decision Trees
Example
.center.image-40[![](decision_rules.png)]
- Multiple classes at once!
---
# Decision Trees
.center.image-60[![](decision_tree_nomenclature.png)]
But how do we decide where to split?
---
# Entropy and Gini-Index
- Node `$m$`, class `$k$`
- Gini Index: a measure of impurity. The higher the Gini Index, the more _mixed_ the data; the lower, the more _pure_.
`$\qquad\qquad\qquad\qquad\qquad\qquad \Sigma_{k=1}^{K}\hat{p_{mk}}(1-\hat{p_{mk}})$`
Ex: Setosa: 50, Versicolor: 50, Virginica: 50. `$p_{1} = p_{2} = p_{3} = \frac{1}{3}$`
Gini = `$\frac{1}{3} * \frac{2}{3} + \frac{1}{3} * \frac{2}{3} + \frac{1}{3} * \frac{2}{3} = \frac{2}{3} $`
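
The same worked example in a couple of lines (the `gini` helper is only illustrative):

    import numpy as np

    def gini(counts):
        p = np.asarray(counts) / np.sum(counts)   # class proportions at the node
        return np.sum(p * (1 - p))

    print(gini([50, 50, 50]))   # 0.667, as above
    print(gini([50, 0, 0]))     # 0.0 -- a pure node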
---
.image-70[![](decision_tree.png)]
---
class: center, middle
# Thank You
---
</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js">
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS_HTML&delayStartupUntil=configured" type="text/javascript"></script>
<script type="text/javascript">
var slideshow = remark.create({
highlightStyle: 'monokai',
highlightLanguage: 'remark',
countIncrementalSlides: false,
highlightLines: true
});
// Setup MathJax
MathJax.Hub.Config({
tex2jax: {
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'],
inlineMath: [ ['$','$'], ["\\(","\\)"] ]
}
});
MathJax.Hub.Configured();
</script>
</body>
</html>