{
"metadata": {
"name": "",
"signature": "sha256:ff2ba64e8608726e9a97fccf5616469ce28d1b77f799fdb20e1750b929f87895"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Probabilistic Estimation of Peptide Identification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bairen Yi, December 6, 2014"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Background: Peptide Identification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A peptide spetrum has essentially two components: \n",
"\n",
"> a range of observable mass-charge (M/Z) ratio\n",
"\n",
"and \n",
"\n",
"> a set of peaks (with intensity) occur at different M/Z\n",
"\n",
"Given the range of the $d$ possible discrete M/Z values, a spectrum $p$ could be embedded into $d$ dimensional space:\n",
"\n",
"$$ p \\in \\mathbb{R}^d $$\n",
"\n",
"The peptide identification problem could be (or already been) casted into a classification problem: \n",
"\n",
"> Given a query spectrum ($q$), classify $q$ into one of classes $r$ represented by all possible peptides ($R$).\n",
"\n",
"A naive searching algorithm would be collecting the most representative spetra ($P_j$) for each peptide class $r_j$, and then finding the nearest neighbour (NN) for $q$ against $P$. \n",
"\n",
"In fact, this approach has already been used in current peptide library searching approach, where the cosine similarity is used as a distance measure.\n",
"\n",
"$$ \\text{sim} (p, q) = \\dfrac{p^Tq}{\\|p\\| \\|q\\|}, p \\in \\mathbb{R}^d, q \\in \\mathbb{R}^d $$"
]
},
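{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (not part of the original pipeline), the cosine-similarity NN search above can be written in a few lines of NumPy; the library matrix `P` and query `q` below are hypothetical random data:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"def nn_search(P, q):\n",
"    # P: (m, d) matrix of representative spectra, one row per peptide class\n",
"    # q: (d,) query spectrum; returns the index j of the most similar class\n",
"    sims = P.dot(q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q))\n",
"    return int(np.argmax(sims))\n",
"\n",
"# toy example: 5 peptide classes over 100 discrete M/Z bins\n",
"rng = np.random.RandomState(0)\n",
"P = rng.rand(5, 100)\n",
"q = P[2] + 0.01 * rng.rand(100)  # noisy copy of class 2\n",
"print(nn_search(P, q))  # expected: 2"
],
"language": "python",
"metadata": {},
"outputs": []
},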
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Problem Formulation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the context, we can formulate the problem of probabilistic estimation of peptide identification as follows.\n",
"\n",
"> Given all peptide classes $R$, a set of known spetra $P_j$ that belong to $r_j \\in R$, and a query spetrum $q$, find out the probability that $q$ belongs to each $r_j \\in R$.\n",
"\n",
"It should not be, however, confused with the following:\n",
"\n",
"> Given a peptide class $r$, a representative spetrum $p_j$ that belongs to $r_j$, and a query spetrum $q$, find out the probability that $q$ belongs to $r$.\n",
"\n",
"That problem is deemed without any real solution, because what we are looking for is really a **probabilistic distribution** $\\Pr(q|r_j)$, and we need more than several samples finding out that distribution."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Multi-Class Classification with Massive Number of Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For two class problems, classification is relatively easy, as there is only one dicision boundary to be learned. \n",
"\n",
"A simple generalization from two-class problem to $m$-class problem is \"one-vs-rest\" fashion, where $m$ different classifiers are learned, one for each class. The parameter size would simply be $O(md)$, where $m$ is the number of classes and $d$ is the dimension of feature space.\n",
"\n",
"Alternatively, \"one-vs-one\" approach could be used, and we need to learn $m(m-1)/2$ classifier, and the number of parameters would be $O(m^2d)$.\n",
"\n",
"However, neither approach are applicable in the context of peptide identification, as the number of classes is massive (generally more than 10k, possibly 100k for samples taken from human protein). This will cause problems:\n",
"\n",
"1. one-vs-rest: there would be severe class imbalance for training each classifier. Generally one can't fit one peptide against the rest of 10k peptides as the number of spectra samples from latter would be of course 10k times than that of the former.\n",
"2. one-vs-one: the large $m$ makes $O(m^2)$ number of classifiers prohibited, as we will need to store more than $10,000^2$ models, which would also be computationally too expensive.\n",
"\n",
"Therefore, one must consider more sophisticated approach for large number of classes."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Multinomial Logistic Regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some classification algorithms are inheritly multi-class, such as Naive Bayes, Tree-based algorithms, or Nearest Neighbor as mentioned above. As all of them may be worth trying, the softmax function (a.k.a. Multinomial Logistic Regression, MLR) is the one that fits the computation needs well and is also similar to the current model.\n",
"\n",
"The formulation of MLR is as follows:\n",
"\n",
"Suppose $m$ is the number of peptide classes, $d$ is the number of dimensions for each spetrum, the corresponding parameters $\\beta \\in \\mathbb{R}^{m\\times d}$. \n",
"\n",
"Given the query spetrum $q \\in \\mathbb{R}^d$, we would like to predict the probability it belongs to each class $r_j \\in R$.\n",
"\n",
"(1)\n",
"$$ \\hat{y}_j = \\Pr(q | r_j, \\beta) = \\dfrac{\\exp(\\beta_j^T q)}{\\sum_k^m \\exp(\\beta_k^T q)}$$"
]
},
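{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal NumPy sketch of the softmax prediction in (1), computed with a max-shift for numerical stability; the parameter matrix `beta` below is a hypothetical placeholder:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"def softmax_predict(beta, q):\n",
"    # beta: (m, d) parameter matrix, one row beta_j per peptide class\n",
"    # q: (d,) query spectrum; returns an (m,) vector of class probabilities\n",
"    scores = beta.dot(q)    # beta_j^T q for each class j\n",
"    scores -= scores.max()  # shift: the softmax is shift-invariant\n",
"    exp_scores = np.exp(scores)\n",
"    return exp_scores / exp_scores.sum()\n",
"\n",
"# hypothetical sizes: m = 4 classes, d = 10 M/Z bins\n",
"rng = np.random.RandomState(0)\n",
"beta = rng.randn(4, 10)\n",
"q = rng.rand(10)\n",
"y_hat = softmax_predict(beta, q)\n",
"print(y_hat, y_hat.sum())  # probabilities summing to 1"
],
"language": "python",
"metadata": {},
"outputs": []
},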
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Relations between MLR and Current Cosine-Similarity Based Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we will show that, on prediction of $r_j$, the formulation above is just another form of NN embedded with cosine-similarity used by current approach.\n",
"\n",
"Given the representative spetrum $p_j$ for each $r_j \\in R$, the cosine-similarity searching could be formulated as finding $r_j$ with $j$ given by\n",
"\n",
"(2)\n",
"$$ \n",
"\\DeclareMathOperator*{\\argmax}{arg\\,max}\n",
"j = \\argmax \\limits_j \\text{sim} (p_j, q) = \\argmax \\limits_j \\dfrac{p_j^Tq}{\\|p_j\\| \\|q\\|} \n",
"$$\n",
"\n",
"Given probabilities by (1), one can give the maximum likelihood estimation of $r_j$ as finding\n",
"\n",
"(3)\n",
"$$ \\max \\limits_j \\Pr(q | r_j, \\beta) = \\max \\limits_j \\dfrac{\\exp(\\beta_j^T q)}{\\exp(\\sum_k^m \\beta_k^T q)} $$\n",
"\n",
"Taking $\\log$ on (3), we could get\n",
"\n",
"(4)\n",
"$$ \\max \\limits_j \\log \\dfrac{\\exp(\\beta_j^T q)}{\\exp(\\sum_j \\beta_j^T q)} = \\max \\limits_j \\beta_j^T q - \\log \\sum_k^m \\exp(\\beta_k^T p) $$\n",
"\n",
"And note that the last term in (3) is irrelavant to choice of $j$, so\n",
"\n",
"(5)\n",
"$$ \n",
"\\DeclareMathOperator*{\\argmax}{arg\\,max}\n",
"j = \\argmax \\limits_j \\log \\Pr(q | r_j, \\beta) = \\argmax \\limits_j \\beta_j^T q \n",
"$$\n",
"\n",
"One can easily see the cosine-similarity in (2) is just a normalized version of (5), and normalization of features is just a common pre-processing step for Logistic Regression."
]
},
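{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical check of this equivalence on hypothetical random spectra: with $\\beta_j = p_j / \\|p_j\\|$, the argmax of $\\beta_j^T q$ in (5) picks the same class as the cosine-similarity search in (2):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(1)\n",
"P = rng.rand(6, 50)  # 6 representative spectra over 50 M/Z bins\n",
"q = rng.rand(50)\n",
"\n",
"# cosine-similarity NN as in (2)\n",
"cos_j = np.argmax(P.dot(q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q)))\n",
"\n",
"# linear-score argmax as in (5), with beta_j = p_j / ||p_j||\n",
"beta = P / np.linalg.norm(P, axis=1, keepdims=True)\n",
"mlr_j = np.argmax(beta.dot(q))\n",
"\n",
"print(cos_j == mlr_j)  # True: both criteria pick the same class"
],
"language": "python",
"metadata": {},
"outputs": []
},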
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Model Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The crucial differece between current approach and MLR based approach, however, lies in the training process to find out $\\beta$.\n",
"\n",
"Given identified spectra $p \\in P_j$ that belongs to $r_j$, we would like to estimates $\\beta$ using Maximum-Likelihood Estimation (MLE), by maximizing the log-likelihood\n",
"\n",
"(6)\n",
"$$ \\max \\limits_{\\beta} \\sum_j \\sum_{p \\in P_j} \\log \\Pr(p | r, \\beta) = \\max \\limits_{\\beta} \\sum_j \\sum_{p \\in P_j} \\log \\dfrac{\\exp(\\beta_j^T p)}{\\sum_k^m \\exp(\\beta_k^T p)} $$\n",
"\n",
"Which would in fact requires a number of $p \\in P_j$ for each $r_j \\in R$ in training. \n",
"\n",
"For comparison, the representative spetrum $p_j$ used in (2) could be also obtained from a set of spetra like those in $P_j$, but the final $p_j$ is usually determined by just avaeraging, with possibly taking some extra de-noising steps (*consensus spetrum*)."
]
},
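{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how (6) might be optimized in practice (assuming scikit-learn is available; the training matrices `X` and `y` below are hypothetical random data), scikit-learn's `LogisticRegression` with the multinomial option maximizes this log-likelihood, up to its default regularization term:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# hypothetical training data: n spectra of d M/Z bins, labelled with m classes\n",
"rng = np.random.RandomState(0)\n",
"n, d, m = 300, 40, 5\n",
"X = rng.rand(n, d)             # identified spectra, one row per p in P_j\n",
"y = rng.randint(0, m, size=n)  # class index j of each spectrum\n",
"\n",
"# multinomial logistic regression: optimizes the log-likelihood in (6)\n",
"clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)\n",
"clf.fit(X, y)\n",
"\n",
"q = rng.rand(1, d)           # a query spectrum\n",
"print(clf.predict_proba(q))  # estimated Pr(r_j | q, beta) for each class"
],
"language": "python",
"metadata": {},
"outputs": []
},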
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Extension to Unidentified Spetra"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In typical multi-class classification setting, all the classes are known in advance. However, this is not true in peptide identification, as there could be some spectra representing peptides that simply do not appear in training dataset. This could be problematic for cosine-similarity based approach, under which a custom threshold has to be picked manually in a unjustified manner. \n",
"\n",
"However, given the formulation of MLR, the problem could be easily solved by adding a term to the denominator. \n",
"\n",
"The final output of the MLR model for query $q$ would be $\\hat{y} \\in [0, 1]^{m+1}$,\n",
"\n",
"(7)\n",
"\n",
"if $q$ belongs to $r_j \\in R$,\n",
"\n",
"$$ \\hat{y}_j = \\Pr(q | r_j, \\beta) = \\dfrac{\\exp(\\beta_j^T q)}{1 + \\sum_k^m \\exp(\\beta_k^T q)} $$\n",
"\n",
"otherwise, denote unknown spetra as $r_0$, \n",
"\n",
"$$ \\hat{y}_0 = \\Pr(q | r_0, \\beta) = \\dfrac{1}{1 + \\sum_k^m \\exp(\\beta_k^T q)} $$"
]
},
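{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of (7): the same softmax as before, with a fixed zero score prepended for the unknown class $r_0$ (so $\\exp(0) = 1$ appears in the denominator); `beta` and `q` are hypothetical as before:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"def softmax_with_unknown(beta, q):\n",
"    # beta: (m, d) parameters; q: (d,) query spectrum\n",
"    # prepend a score of 0 for the unknown class r_0, i.e. exp(0) = 1\n",
"    scores = np.concatenate(([0.0], beta.dot(q)))\n",
"    scores -= scores.max()  # numerical stability\n",
"    exp_scores = np.exp(scores)\n",
"    return exp_scores / exp_scores.sum()  # (m + 1,), index 0 is Pr(r_0 | q)\n",
"\n",
"rng = np.random.RandomState(0)\n",
"beta = rng.randn(4, 10)  # hypothetical parameters: m = 4 classes, d = 10 bins\n",
"q = rng.rand(10)\n",
"y_hat = softmax_with_unknown(beta, q)\n",
"print(y_hat[0], y_hat.sum())  # probability of the unknown class; total is 1"
],
"language": "python",
"metadata": {},
"outputs": []
},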
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Alternative Form of Objective Function in Training Step"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose the true value of $y \\in \\{0, 1\\}^{m+1}$ is encoded by\n",
"\n",
"(8) \n",
"\n",
"if $p$ belongs to $r_j$,\n",
"$$ y_j = 1 $$\n",
"if $p$ does not belong to any $r_j$,\n",
"$$ y_0 = 1 $$\n",
"otherwise\n",
"$$ y_j = 0 $$\n",
"\n",
"Suppose the number of training samples is $n$, $i$-th training sample as $y^{(i)}$, with the idea of minimizing the loss function, which is more common in machine learning community, the objective function in (6) could be written as\n",
"\n",
"(9) \n",
"$$ \\min \\limits_{\\beta} \\dfrac{1}{n} \\sum_i^n \\text{loss} (y^{(i)}, \\hat{y}^{(i)}) $$\n",
"\n",
"where the loss function (**log-loss**) is given as\n",
"\n",
"(10)\n",
"$$ \\text{loss} (y, \\hat{y}) = -\\sum_j^{m+1} y_j\\log\\hat{y}_j $$\n",
"\n",
"One may note the similarity between this function and the definition of entropy, which is exactly the reason why MLR is often referred to \"maximum entropy classifier\" as well."
]
},
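{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the log-loss in (10), evaluated on a hypothetical one-hot label:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"def log_loss(y, y_hat, eps=1e-12):\n",
"    # y: (m + 1,) one-hot true label; y_hat: (m + 1,) predicted probabilities\n",
"    # eps guards against log(0) when a predicted probability underflows\n",
"    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))\n",
"\n",
"# hypothetical example: true class is j = 2 out of m + 1 = 4 classes\n",
"y = np.array([0.0, 0.0, 1.0, 0.0])\n",
"y_hat = np.array([0.1, 0.2, 0.6, 0.1])\n",
"print(log_loss(y, y_hat))  # -log(0.6), roughly 0.511"
],
"language": "python",
"metadata": {},
"outputs": []
},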
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Lasso: L1-based Feature Selection and Sparse Encoding of Parameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the training step in (5) requires more samples than number of dimensions of feature space, which is difficult to achieve for each class $r$, as there are usually over thousands of observable discrete M/Z values, and number of different $r$ is quite large as well. For ill-posed regression problem (number of dimensions larger than sample size), one need to consider regularization technique to control the model complexity and avoid overfitting. \n",
"\n",
"The advantage of using generalized linear models is the availability of mature theories and techniques developed in statistics over decades. LASSO (Least Absolute Shrinkage and Selection Operator) is a technique that selects the most discriminant features over high dimensional feature space by L1-norm regularization of parameters, which has been used extensively in compressive sensing. [This page](http://scikit-learn.org/stable/auto_examples/applications/plot_tomography_l1_reconstruction.html) gives a good example on image reconstruction using LASSO to sparsify an image, with comparison to ridge regression using L2-norm regularization. \n",
"\n",
"> The advantage of using L1-penalty over L2-penalty is that the resulting parameters would become sparse, and correspondingly both time and space complexity would decrease.\n",
"\n",
"Then the training step (9) becomes\n",
"\n",
"(11)\n",
"$$ \\min \\limits_{\\beta} \\dfrac{1}{n} \\sum_i^n \\text{loss} (y^{(i)}, \\hat{y}^{(i)}) + \\lambda \\sum_j^{m+1} \\|\\beta_j\\|_1 $$\n",
"\n",
"Or,\n",
"\n",
"(12)\n",
"$$\n",
"\\min \\limits_{\\beta} \\dfrac{1}{n} \\sum_i^n \\sum_j^{m+1} -y^{(i)}_j\\log\\hat{y}^{(i)}_j + \\lambda \\sum_j^{m+1} \\|\\beta_j\\|_1\n",
"$$\n",
"\n",
"where $\\lambda \\ge 0$ is a hyper-parameter for L1-regularization."
]
}
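,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final sketch, the L1-regularized objective (12) can be optimized with scikit-learn's `LogisticRegression` using `penalty='l1'`; this assumes a scikit-learn version providing the `saga` solver, and the data is hypothetical as before. Note that `C` is the inverse regularization strength, playing the role of $1/\\lambda$:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# hypothetical training data, as before\n",
"rng = np.random.RandomState(0)\n",
"n, d, m = 300, 40, 5\n",
"X = rng.rand(n, d)\n",
"y = rng.randint(0, m, size=n)\n",
"\n",
"# L1-regularized multinomial logistic regression, i.e. objective (12)\n",
"clf = LogisticRegression(penalty='l1', solver='saga',\n",
"                         multi_class='multinomial', C=0.1, max_iter=5000)\n",
"clf.fit(X, y)\n",
"\n",
"# sparsity induced by the L1-penalty: fraction of exactly-zero parameters\n",
"print(np.mean(clf.coef_ == 0))"
],
"language": "python",
"metadata": {},
"outputs": []
}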
],
"metadata": {}
}
]
}