vivekkr12/logistic_regression.ipynb

## logistic_regression.ipynb
{
  "cells": [
    {
      "cell_type": "markdown",
      "source": "\n# Regularised Logistic regression\n\n*Binary classification using negative log likelihood loss with $L_2$ regularization penalty*\n\n---\n* [Implementation in Python](../pymlalgo/regression/regularised_logistic_regression.py)\n* [Demo](../demo/regularised_logistic_regression_demo.ipynb)\n---\n\n### Symbols and Conventions\nRefer to [Symbols and Conventions](symbols_and_conventions.ipynb) for details. In summary:\n* $n$ is the number of training examples\n* $d$ is the number of features in each training example (a.k.a the dimension of the training example)\n* $X$ is the features matrix of shape $n \\times d$\n* $Y$ is the labels matrix of shape $n \\times 1$\n* $y_i$ is the label of each training example, $y_i \\in \\{-1, 1\\}$\n* $W$ is the weight matrix of shape $d \\times 1$\n\n### Loss and Cost Function\nThe goal is to predict $$P(y_i | x_i, W)$$\nIn this binary classification problem $y_i \\in \\{-1, 1\\}$ and $P(y_i \u003d 1)$ and $P(y_i \u003d -1)$ are mutually exclusive events. As such:\n$$P(y_i  \u003d 1| x_i, W) + P(y_i  \u003d -1| x_i, W) \u003d 1$$\n$$\\implies P(y_i \u003d -1| x_i, W) \u003d 1 - P(y_i \u003d 1|x_i,W)$$\n\nLet\u0027s assume that $P(y_i \u003d 1| x_i, W) \u003d p_i \\implies P(y_i \u003d -1|x_i, W) \u003d 1-p_i$\n\nThe logistic regression is a generalized form of linear regression where the log of odds ratio is a linear function of $x_i$ and $W$\n$$x_iW \u003d log(\\frac{p_i}{1 - p_i}) \\implies p_i \u003d \\frac{1}{1 + e^{-x_iW}}$$\nand, $$1 - p_i \u003d 1 - \\frac{1}{1 + e^{-x_iW}} \u003d \\frac{1}{1 + e^{x_iW}}$$\n\nIn a general form, $p_i \u003d \\sigma(x_iW)$ and $1 - p_i \u003d 1 - \\sigma(x_iW) \u003d \\sigma(-x_iW)$where $\\sigma$ is called the sigmoid function or the logistic function or the expit function.\n\nWhen, $y_i \u003d 1, P(y_i \u003d 1 | x_i, W) \u003d p_i \u003d \\frac{1}{1 + e^{-y_ix_iW}} \u003d \\sigma(y_ix_iW)$  \nand when, $y_i \u003d -1, P(y_i \u003d -1 | x_i, W) \u003d 1 - p_i \u003d \\frac{1}{1 + e^{-y_ix_iW}} \u003d \\sigma(y_ix_iW)$  \n\nNow, the likelihood can be written in the compact form,\n$$P(y_i|x_i,W) \u003d \\sigma(y_ix_iW) \u003d \\frac{1}{1 + e^{-y_ix_iW}}$$\n\nThe optimization problem can either maximize the likelihood or minimize the negative likelihood. Since the maximum/negative minimum of a function and log of that function would be at the same values, the optimization problem can maximize the log or minimize the negative log of the function. Given that gradient descent finds the minimum of a function, the loss function is chosen such that optimization problem would minimize the negative of log likelihood:\n\n$$\\mathcal{L}(W)\u003d -log(P(y_i|x_i,W)) \u003d -log(\\frac{1}{1 + e^{-y_ix_iW}}) \u003d log(1 + e^{-y_ix_iW})$$\n\nThe cost function minimizes the overall negative log likelihood:\n$$F(W) \u003d -\\frac{1}{n}log(\\prod_{i\u003d1}^{n}P(y_i|x_i,W)) + \\lambda ||W||_2^2$$\n$$\u003d\\frac{1}{n}\\sum_{i\u003d1}^{n}-log(P(y_i|x_i,W)) + \\lambda ||W||_2^2$$\n$$\u003d\\frac{1}{n}\\sum_{i\u003d1}^{n}log(1 + e^{-y_ix_iW}) + \\lambda ||W||_2^2$$\n\nwhere $\\lambda$ is the regularization coefficient (a hyper parameter).\n\nThe goal is to find the value of $W$ for which $F(W)$ is minimum.\n\n### Gradient Derivation\n\nFor $d \u003d 1$ and $n \u003d 1$:  \n$$F(W) \u003d \\frac{1}{1}\\sum_{i\u003d1}^{1}log(1 + e^{-y_ix_iW}) + \\lambda ||W||_2^2$$\n$$F(W) \u003d log(1 + e^{-yxW}) + \\lambda W^2$$ \ntaking the derivative w.r.t $W$ using the chain rule,\n$$\\nabla F(W) \u003d -yxe^{-yxW}\\frac{1}{1 + e^{-yxW}} + 2\\lambda W$$\n$$\\nabla F(W) \u003d \\frac{-yx}{1 + e^{yxW}}+ 2\\lambda W$$\n\nFor $d \u003e 1$ and $n \u003e 1$  \n$$F(W) \u003d \\frac{1}{n}\\sum_{i\u003d1}^{n}log(1 + e^{-y_ix_iW}) + \\lambda ||W||_2^2$$\n\nAs a reminder, $y_i$ is a scalar and $x_i$ is a vector of length $d$, shape $1 \\times d$\n\nUsing the linearity of derivatives,  the results obtained from calculation for $n \u003d 1 \\text{ and } d \u003d 1$, and the identity $\\frac{\\partial}{\\partial{X}} A^TX \u003d \\frac{\\partial}{\\partial{X}} X^TA \u003d A$:\n\n$$\\nabla F(W) \u003d \\frac{1}{n}\\sum_{i\u003d1}^{n}-y_ix_i^T\\frac{1}{1 + e^{y_ix_iW}}+ 2\\lambda W$$\n\nLet\u0027s assume a diagonal matrix $P$ of shape $n \\times n$ whose each diagonal element is $1 - p_i \u003d 1 - \\sigma(y_ix_iW) \u003d \\frac{1}{1 + e^{y_ix_iW}}$   \nUsing $P$, derivative of the cost function can be written in the matrix form as:\n$$\\nabla F(W) \u003d -\\frac{1}{n}X^TPY$$\n\n## Alternate Convention: $y \\in \\{0, 1\\}$\n### Loss and Cost Function\nif $y_i \\in \\{0, 1\\}$, the compact form of the likelihood can be written as \n$$P(y_i|x_i,W) \u003d \\sigma(x_iW)^{y_i}(1 - \\sigma(x_iW))^{1 - y_i}$$\nThus, the loss function is written as\n$$\\mathcal{L}(W) \u003d -y_i log(\\sigma(x_iW)) - (1 - y_i)log(1 - \\sigma(x_iW))$$\n$$\u003dy_ilog(1 + e^{-x_iW}) + (1 - y_i)log(1 + e^{x_iW})$$\n\nThe cost function would be:\n$$F(W) \u003d -\\frac{1}{n}\\sum_{i\u003d1}^{n}y_i log(\\sigma(x_iW)) + (1 - y_i)log(1 - \\sigma(x_iW))$$\n\n### Gradient Derivation\nDifferentiating first term the using chain rule and the identity $\\frac{\\partial}{\\partial{X}} A^TX \u003d \\frac{\\partial}{\\partial{X}} X^TA \u003d A$\n$$\\frac{\\partial}{\\partial W} y_ilog(1 + e^{-x_iW})$$\n$$\u003d-x_i^Ty_i\\frac{e^{-x_iW}}{1 + e^{-x_iW}}$$\n$$\u003d-x_i^Ty_i\\frac{1}{1 + e^{x_iW}}$$\n$$\u003d-x_i^Ty_i(1 - \\sigma(x_iW))$$\n\nDifferentiating the second term using the same properties as the first term\n$$\\frac{\\partial}{\\partial W} (1 - y_i)log(1 + e^{x_iW})$$\n$$\u003dx_i^T(1 - y_i)\\frac{e^{x_iW}}{1 + e^{x_iW}}$$\n$$\u003dx_i^T(1 - y_i)\\frac{1}{1 + e^{-x_iW}}$$\n$$\u003dx_i^T(1 - y_i)\\sigma(x_iW)$$\n\nCollecting both the terms together,\n$$-x_i^Ty_i(1 - \\sigma(x_iW)) + x_i^T(1 - y_i)\\sigma(x_iW)$$\n$$\u003d-x_i^Ty_i + x_i^Ty_i\\sigma(x_iW) + x_i^T\\sigma(x_iW) - x_i^Ty_i\\sigma(x_iW)$$\n$$\u003d-x_i^Ty_i + x_i^T\\sigma(x_iW)$$\n$$\u003dx_i^T(\\sigma(x_iW) - y_i)$$\n\nConverting to matrix form and adding the term $\\frac{1}{n}$\n$$\\nabla F(W) \u003d\\frac{1}{n} X^T (\\sigma(XW) - Y)$$\n\n\n",
      "metadata": {
        "pycharm": {
          "metadata": false
        }
      }
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.8"
    },
    "stem_cell": {
      "cell_type": "raw",
      "source": "",
      "metadata": {
        "pycharm": {
          "metadata": false
        }
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"source": "\n# Regularised Logistic regression\n\nBinary classification using negative log likelihood loss with $L_2$ regularization penalty\n\n---\n* [Implementation in Python](../pymlalgo/regression/regularised_logistic_regression.py)\n* [Demo](../demo/regularised_logistic_regression_demo.ipynb)\n---\n\n### Symbols and Conventions\nRefer to [Symbols and Conventions](symbols_and_conventions.ipynb) for details. In summary:\n* $n$ is the number of training examples\n* $d$ is the number of features in each training example (a.k.a the dimension of the training example)\n* $X$ is the features matrix of shape $n \\times d$\n* $Y$ is the labels matrix of shape $n \\times 1$\n* $y_i$ is the label of each training example, $y_i \\in \\{-1, 1\\}$\n* $W$ is the weight matrix of shape $d \\times 1$\n\n### Loss and Cost Function\nThe goal is to predict $$P(y_i \| x_i, W)$$\nIn this binary classification problem $y_i \\in \\{-1, 1\\}$ and $P(y_i \u003d 1)$ and $P(y_i \u003d -1)$ are mutually exclusive events. As such:\n$$P(y_i \u003d 1\| x_i, W) + P(y_i \u003d -1\| x_i, W) \u003d 1$$\n$$\\implies P(y_i \u003d -1\| x_i, W) \u003d 1 - P(y_i \u003d 1\|x_i,W)$$\n\nLet\u0027s assume that $P(y_i \u003d 1\| x_i, W) \u003d p_i \\implies P(y_i \u003d -1\|x_i, W) \u003d 1-p_i$\n\nThe logistic regression is a generalized form of linear regression where the log of odds ratio is a linear function of $x_i$ and $W$\n$$x_iW \u003d log(\\frac{p_i}{1 - p_i}) \\implies p_i \u003d \\frac{1}{1 + e^{-x_iW}}$$\nand, $$1 - p_i \u003d 1 - \\frac{1}{1 + e^{-x_iW}} \u003d \\frac{1}{1 + e^{x_iW}}$$\n\nIn a general form, $p_i \u003d \\sigma(x_iW)$ and $1 - p_i \u003d 1 - \\sigma(x_iW) \u003d \\sigma(-x_iW)$where $\\sigma$ is called the sigmoid function or the logistic function or the expit function.\n\nWhen, $y_i \u003d 1, P(y_i \u003d 1 \| x_i, W) \u003d p_i \u003d \\frac{1}{1 + e^{-y_ix_iW}} \u003d \\sigma(y_ix_iW)$ \nand when, $y_i \u003d -1, P(y_i \u003d -1 \| x_i, W) \u003d 1 - p_i \u003d \\frac{1}{1 + e^{-y_ix_iW}} \u003d \\sigma(y_ix_iW)$ \n\nNow, the likelihood can be written in the compact form,\n$$P(y_i\|x_i,W) \u003d \\sigma(y_ix_iW) \u003d \\frac{1}{1 + e^{-y_ix_iW}}$$\n\nThe optimization problem can either maximize the likelihood or minimize the negative likelihood. Since the maximum/negative minimum of a function and log of that function would be at the same values, the optimization problem can maximize the log or minimize the negative log of the function. Given that gradient descent finds the minimum of a function, the loss function is chosen such that optimization problem would minimize the negative of log likelihood:\n\n$$\\mathcal{L}(W)\u003d -log(P(y_i\|x_i,W)) \u003d -log(\\frac{1}{1 + e^{-y_ix_iW}}) \u003d log(1 + e^{-y_ix_iW})$$\n\nThe cost function minimizes the overall negative log likelihood:\n$$F(W) \u003d -\\frac{1}{n}log(\\prod_{i\u003d1}^{n}P(y_i\|x_i,W)) + \\lambda \|\|W\|\|_2^2$$\n$$\u003d\\frac{1}{n}\\sum_{i\u003d1}^{n}-log(P(y_i\|x_i,W)) + \\lambda \|\|W\|\|_2^2$$\n$$\u003d\\frac{1}{n}\\sum_{i\u003d1}^{n}log(1 + e^{-y_ix_iW}) + \\lambda \|\|W\|\|_2^2$$\n\nwhere $\\lambda$ is the regularization coefficient (a hyper parameter).\n\nThe goal is to find the value of $W$ for which $F(W)$ is minimum.\n\n### Gradient Derivation\n\nFor $d \u003d 1$ and $n \u003d 1$: \n$$F(W) \u003d \\frac{1}{1}\\sum_{i\u003d1}^{1}log(1 + e^{-y_ix_iW}) + \\lambda \|\|W\|\|_2^2$$\n$$F(W) \u003d log(1 + e^{-yxW}) + \\lambda W^2$$ \ntaking the derivative w.r.t $W$ using the chain rule,\n$$\\nabla F(W) \u003d -yxe^{-yxW}\\frac{1}{1 + e^{-yxW}} + 2\\lambda W$$\n$$\\nabla F(W) \u003d \\frac{-yx}{1 + e^{yxW}}+ 2\\lambda W$$\n\nFor $d \u003e 1$ and $n \u003e 1$ \n$$F(W) \u003d \\frac{1}{n}\\sum_{i\u003d1}^{n}log(1 + e^{-y_ix_iW}) + \\lambda \|\|W\|\|_2^2$$\n\nAs a reminder, $y_i$ is a scalar and $x_i$ is a vector of length $d$, shape $1 \\times d$\n\nUsing the linearity of derivatives, the results obtained from calculation for $n \u003d 1 \\text{ and } d \u003d 1$, and the identity $\\frac{\\partial}{\\partial{X}} A^TX \u003d \\frac{\\partial}{\\partial{X}} X^TA \u003d A$:\n\n$$\\nabla F(W) \u003d \\frac{1}{n}\\sum_{i\u003d1}^{n}-y_ix_i^T\\frac{1}{1 + e^{y_ix_iW}}+ 2\\lambda W$$\n\nLet\u0027s assume a diagonal matrix $P$ of shape $n \\times n$ whose each diagonal element is $1 - p_i \u003d 1 - \\sigma(y_ix_iW) \u003d \\frac{1}{1 + e^{y_ix_iW}}$ \nUsing $P$, derivative of the cost function can be written in the matrix form as:\n$$\\nabla F(W) \u003d -\\frac{1}{n}X^TPY$$\n\n## Alternate Convention: $y \\in \\{0, 1\\}$\n### Loss and Cost Function\nif $y_i \\in \\{0, 1\\}$, the compact form of the likelihood can be written as \n$$P(y_i\|x_i,W) \u003d \\sigma(x_iW)^{y_i}(1 - \\sigma(x_iW))^{1 - y_i}$$\nThus, the loss function is written as\n$$\\mathcal{L}(W) \u003d -y_i log(\\sigma(x_iW)) - (1 - y_i)log(1 - \\sigma(x_iW))$$\n$$\u003dy_ilog(1 + e^{-x_iW}) + (1 - y_i)log(1 + e^{x_iW})$$\n\nThe cost function would be:\n$$F(W) \u003d -\\frac{1}{n}\\sum_{i\u003d1}^{n}y_i log(\\sigma(x_iW)) + (1 - y_i)log(1 - \\sigma(x_iW))$$\n\n### Gradient Derivation\nDifferentiating first term the using chain rule and the identity $\\frac{\\partial}{\\partial{X}} A^TX \u003d \\frac{\\partial}{\\partial{X}} X^TA \u003d A$\n$$\\frac{\\partial}{\\partial W} y_ilog(1 + e^{-x_iW})$$\n$$\u003d-x_i^Ty_i\\frac{e^{-x_iW}}{1 + e^{-x_iW}}$$\n$$\u003d-x_i^Ty_i\\frac{1}{1 + e^{x_iW}}$$\n$$\u003d-x_i^Ty_i(1 - \\sigma(x_iW))$$\n\nDifferentiating the second term using the same properties as the first term\n$$\\frac{\\partial}{\\partial W} (1 - y_i)log(1 + e^{x_iW})$$\n$$\u003dx_i^T(1 - y_i)\\frac{e^{x_iW}}{1 + e^{x_iW}}$$\n$$\u003dx_i^T(1 - y_i)\\frac{1}{1 + e^{-x_iW}}$$\n$$\u003dx_i^T(1 - y_i)\\sigma(x_iW)$$\n\nCollecting both the terms together,\n$$-x_i^Ty_i(1 - \\sigma(x_iW)) + x_i^T(1 - y_i)\\sigma(x_iW)$$\n$$\u003d-x_i^Ty_i + x_i^Ty_i\\sigma(x_iW) + x_i^T\\sigma(x_iW) - x_i^Ty_i\\sigma(x_iW)$$\n$$\u003d-x_i^Ty_i + x_i^T\\sigma(x_iW)$$\n$$\u003dx_i^T(\\sigma(x_iW) - y_i)$$\n\nConverting to matrix form and adding the term $\\frac{1}{n}$\n$$\\nabla F(W) \u003d\\frac{1}{n} X^T (\\sigma(XW) - Y)$$\n\n\n",
	"metadata": {
	"pycharm": {
	"metadata": false
	}
	}
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.8"
	},
	"stem_cell": {
	"cell_type": "raw",
	"source": "",
	"metadata": {
	"pycharm": {
	"metadata": false
	}
	}
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}