@chokkan
Last active December 19, 2017 18:35
Jupyter notebook for classification.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, a feature vector $x$ is represented by a mapping (dictionary) object `x` whose keys are feature names and whose values are feature values. In other words, features are indexed by arbitrary strings rather than by integers. For example,\n",
"\n",
"```\n",
"x = {}\n",
"x['darling'] = 1\n",
"x['photo'] = 1\n",
"x['attach'] = 1\n",
"```\n",
"\n",
"This representation is useful because the feature space for natural language is high-dimensional and sparse. If we define the feature space as occurrences of every word,\n",
"\n",
"* the dimensionality of the feature space ($m$) equals the total number of words in the language, which typically amounts to about one million.\n",
"* although a feature vector is an $m$-dimensional vector, most of its elements are zero; only the few elements corresponding to the words in a sentence have non-zero values.\n",
"\n",
"A binary label $y$ is either `+1` (positive) or `-1` (negative)."
]
},
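{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a sketch, not part of the lecture material), the sparse dictionary representation lets us compute an inner product by touching only the non-zero features, no matter how large $m$ is:\n",
"\n",
"```\n",
"w = {'darling': 1.5, 'viagra': 2.0, 'meeting': -1.0}\n",
"x = {'darling': 1, 'photo': 1, 'attach': 1}\n",
"score = sum(w.get(f, 0.) * v for f, v in x.items())  # 1.5\n",
"```\n",
"\n",
"Only the three features present in `x` are inspected; features absent from `w` or `x` contribute zero."
]
},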
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n",
"import functools\n",
"import math\n",
"import operator\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the example used in the lecture, $[(x_1, y_1), (x_2, y_2)]$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hi darling, my photo in attached file\n",
"x1 = {'@bias': 1, 'hi_darl':1, 'darl_my':1, 'my_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y1 = +1\n",
"\n",
"# Hi Mark, Kyoto photo in attached file\n",
"x2 = {'@bias': 1, 'hi_mark':1, 'mark_kyoto':1, 'kyoto_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y2 = -1"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1,\n",
" 'attach_file': 1,\n",
" 'darl_my': 1,\n",
" 'hi_darl': 1,\n",
" 'my_photo': 1,\n",
" 'photo_attach': 1}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perceptron"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
" \"\"\"Inner product, w \\cdot x.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the inner product, w \\cdot x.\n",
"\n",
" \"\"\"\n",
"\n",
" a = 0.\n",
" for f, v in x.items():\n",
" a += w.get(f, 0.) * v\n",
" return a\n",
"\n",
"def predict(w, x):\n",
" \"\"\"Predict the label of an instance.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the predicted label: +1 (positive) or -1 (negative).\n",
" \"\"\"\n",
" return +1 if dot_product(w, x) > 0 else -1 \n",
"\n",
"def update_perceptron(w, x, y):\n",
" \"\"\"Update the model with a training instance (x, y).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector of the training instance as a mapping object.\n",
" y: label of the training instance, -1 or +1.\n",
"\n",
" \"\"\"\n",
" yp = predict(w, x)\n",
" if yp * y < 0:\n",
" for f, v in x.items():\n",
" w[f] += y * v"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weight vector (model) is a dictionary object that automatically treats missing values as zero (`collections.defaultdict`). The initial model is empty (no features)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label for the instance $x_1$ (the prediction is incorrect: it should be $+1$ because $y_1 = +1$)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted label was negative because the score for the instance is $0$."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_1, y_1)$."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 1.0,\n",
" 'attach_file': 1.0,\n",
" 'darl_my': 1.0,\n",
" 'hi_darl': 1.0,\n",
" 'my_photo': 1.0,\n",
" 'photo_attach': 1.0})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_perceptron(w, x1, y1)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label for the instance $x_2$ (the prediction is incorrect: it should be $-1$ because $y_2 = -1$)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted label was positive because the score for the instance is $3$."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_2, y_2)$."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 0.0,\n",
" 'attach_file': 0.0,\n",
" 'darl_my': 1.0,\n",
" 'hi_darl': 1.0,\n",
" 'hi_mark': -1.0,\n",
" 'kyoto_photo': -1.0,\n",
" 'mark_kyoto': -1.0,\n",
" 'my_photo': 1.0,\n",
" 'photo_attach': 0.0})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_perceptron(w, x2, y2)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us predict labels for instances $x_1$ and $x_2$."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-3.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
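{
"cell_type": "markdown",
"metadata": {},
"source": [
"The manual updates above can be wrapped in a training loop that repeats over the data until an epoch finishes without a mistake — a minimal sketch using the functions defined earlier:\n",
"\n",
"```\n",
"w = collections.defaultdict(float)\n",
"for epoch in range(10):\n",
"    mistakes = 0\n",
"    for x, y in [(x1, y1), (x2, y2)]:\n",
"        if predict(w, x) != y:\n",
"            update_perceptron(w, x, y)\n",
"            mistakes += 1\n",
"    if mistakes == 0:\n",
"        break\n",
"```\n",
"\n",
"On this toy data the loop makes the same two updates as above and then stops, since the second epoch classifies both instances correctly."
]
},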
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
" \"\"\"Inner product, w \\cdot x.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the inner product, w \\cdot x.\n",
"\n",
" \"\"\"\n",
" a = 0.\n",
" for f, v in x.items():\n",
" a += w.get(f, 0.) * v\n",
" return a\n",
"\n",
"def probability(w, x):\n",
" \"\"\"Compute P(+1|x).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the probability of the instance x being classified as positive.\n",
"\n",
" \"\"\"\n",
" a = dot_product(w, x)\n",
" return 1. / (1 + math.exp(-a)) if -100. < a else 0. # guard against overflow in math.exp for very negative a\n",
"\n",
"def update_logress(w, x, y, eta=1.0):\n",
" \"\"\"Update the model with a training instance (x, y).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector of the training instance as a mapping object.\n",
" y: label of the training instance, -1 or +1.\n",
" eta: the learning rate for updating the model (default: 1.0).\n",
"\n",
" \"\"\"\n",
"\n",
" # Update the model (feature weights) with a training instance (x, y)\n",
" y = (y + 1) // 2 # convert {-1,1} to {0,1}\n",
" p = probability(w, x)\n",
" g = y - p\n",
" for f, v in x.items():\n",
" w[f] += eta * g * v"
]
},
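{
"cell_type": "markdown",
"metadata": {},
"source": [
"The update rule in `update_logress` is one step of stochastic gradient ascent on the log-likelihood. Writing $p = P(+1|x) = \\sigma(w \\cdot x)$ with $\\sigma(a) = 1/(1+e^{-a})$ and the target $y \\in \\{0, 1\\}$, the log-likelihood of one instance is\n",
"\n",
"$$\\ell(w) = y \\log p + (1 - y) \\log (1 - p).$$\n",
"\n",
"Because $\\sigma'(a) = \\sigma(a)(1 - \\sigma(a))$, the gradient with respect to each weight $w_f$ reduces to $(y - p)\\, x_f$, which is exactly the code's `w[f] += eta * g * v` with `g = y - p`."
]
},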
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weight vector (model) is a dictionary object that automatically treats missing values as zero (`collections.defaultdict`). The initial model is empty (no features)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute $P(+1|x_1)$ on the initial model."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.5"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probability (0.5) means that the model has no clue for classifying the instance $x_1$ (because the model is empty)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_1, y_1)$."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 0.5,\n",
" 'attach_file': 0.5,\n",
" 'darl_my': 0.5,\n",
" 'hi_darl': 0.5,\n",
" 'my_photo': 0.5,\n",
" 'photo_attach': 0.5})"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_logress(w, x1, y1)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weights for the features in the instance $x_1$ are set to $0.5$ based on the amount of the error, $(y - p) = (1 - 0.5) = 0.5$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute $P(+1|x_2)$ on the current model (which has been updated with $(x_1, y_1)$)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8175744761936437"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probability $P(+1|x_2)$ should ideally be zero (in other words, $P(-1|x_2) = 1 - P(+1|x_2)$ should be one) because $y_2 = -1$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_2, y_2)$."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': -0.31757447619364365,\n",
" 'attach_file': -0.31757447619364365,\n",
" 'darl_my': 0.5,\n",
" 'hi_darl': 0.5,\n",
" 'hi_mark': -0.8175744761936437,\n",
" 'kyoto_photo': -0.8175744761936437,\n",
" 'mark_kyoto': -0.8175744761936437,\n",
" 'my_photo': 0.5,\n",
" 'photo_attach': -0.31757447619364365})"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_logress(w, x2, y2)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The amount of the error for the instance $x_2$ is $(y - p) = (0 - 0.81757...) = -0.81757...$. We can interpret the feature weights as follows:\n",
"\n",
"* 0.5: the feature appears only in $x_1$\n",
"* -0.8...: the feature appears only in $x_2$\n",
"* -0.3...: the feature appears in both $x_1$ and $x_2$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us predict labels for instances $x_1$ and $x_2$."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.6335035042481402"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.032125669946444585"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both predictions look good, but the classifier leans negative because it received a larger error from $x_2$ than from $x_1$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us build a sentiment predictor (positive/negative) by using [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) distributed in [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preparing the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, download the dataset and extract the files from the tarball (`*.tar.gz`)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2015-11-20 13:35:13-- http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\n",
"Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 128.84.154.137\n",
"Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|128.84.154.137|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 487770 (476K) [application/x-gzip]\n",
"Saving to: ‘rt-polaritydata.tar.gz’\n",
"\n",
"100%[======================================>] 487,770 413KB/s in 1.2s \n",
"\n",
"2015-11-20 13:35:14 (413 KB/s) - ‘rt-polaritydata.tar.gz’ saved [487770/487770]\n",
"\n"
]
}
],
"source": [
"!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rt-polaritydata.README.1.0.txt\n",
"rt-polaritydata/rt-polarity.neg\n",
"rt-polaritydata/rt-polarity.pos\n"
]
}
],
"source": [
"!tar xvzf rt-polaritydata.tar.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the training instances in the tar-ball."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \r\n",
"the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . \r\n",
"effective but too-tepid biopic\r\n",
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \r\n",
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.pos"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"simplistic , silly and tedious . \r\n",
"it's so laddish and juvenile , only teenage boys could possibly find it funny . \r\n",
"exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \r\n",
"[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . \r\n",
"a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.neg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Merge the positive and negative instances after inserting '+1' at the beginning of each line in the positive data and '-1' at the beginning of each line in the negative data. Then shuffle the instances into a random order."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/+1 /g\" rt-polaritydata/rt-polarity.pos > positives.txt"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/-1 /g\" rt-polaritydata/rt-polarity.neg > negatives.txt"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sort -R positives.txt negatives.txt > data.txt"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+1 a pleasant enough comedy that should have found a summer place . \r\n",
"-1 the only thing in pauline and paulette that you haven't seen before is a scene featuring a football field-sized oriental rug crafted out of millions of vibrant flowers . \r\n",
"-1 the problematic characters and overly convenient plot twists foul up shum's good intentions . \r\n",
"-1 it will probably prove interesting to ram dass fans , but to others it may feel like a parody of the mellow , peace-and-love side of the '60s counterculture . \r\n",
"-1 if all of eight legged freaks was as entertaining as the final hour , i would have no problem giving it an unqualified recommendation . \r\n",
"+1 sweetly sexy , funny and touching . \r\n",
"-1 the film seems all but destined to pop up on a television screen in the background of a scene in a future quentin tarantino picture\r\n",
"+1 while not all that bad of a movie , it's nowhere near as good as the original . \r\n",
"+1 it remains to be seen whether statham can move beyond the crime-land action genre , but then again , who says he has to ? \r\n",
"-1 tom green and an ivy league college should never appear together on a marquee , especially when the payoff is an unschooled comedy like stealing harvard , which fails to keep 80 minutes from seeming like 800 . \r\n"
]
}
],
"source": [
"!head -n 10 data.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of positive and negative instances."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^+1' data.txt | wc -l"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^-1' data.txt | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementing a feature extractor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use a stop list distributed on the Web."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2015-11-20 13:35:17-- http://www.textfixer.com/resources/common-english-words.txt\n",
"Resolving www.textfixer.com (www.textfixer.com)... 216.172.104.5\n",
"Connecting to www.textfixer.com (www.textfixer.com)|216.172.104.5|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 551 [text/plain]\n",
"Saving to: ‘common-english-words.txt’\n",
"\n",
"100%[======================================>] 551 --.-K/s in 0s \n",
"\n",
"2015-11-20 13:35:17 (67.9 MB/s) - ‘common-english-words.txt’ saved [551/551]\n",
"\n"
]
}
],
"source": [
"!wget http://www.textfixer.com/resources/common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your"
]
}
],
"source": [
"!cat common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from stemming.porter2 import stem\n",
"\n",
"stoplist = set(open('common-english-words.txt').read().split(','))\n",
"\n",
"def is_non_stop(x):\n",
" return x not in stoplist\n",
"\n",
"def has_alnum(x):\n",
" return any((c.isalnum() for c in x))\n",
"\n",
"def feature(s):\n",
" \"\"\"Feature extractor (from a sequence of words).\n",
" \n",
" Args:\n",
" s: a list of words in a sentence.\n",
" Returns:\n",
" feature vector as a mapping object: feature -> value.\n",
" \n",
" \"\"\"\n",
" # Remove stop words (find words x \\in s where is_non_stop(x) is True)\n",
" x = list(filter(is_non_stop, s))\n",
" # Apply stemming (apply stem(i) for all i \\in x)\n",
" x = list(map(stem, x))\n",
" # Remove non alphanumeric words.\n",
" x = list(filter(has_alnum, x))\n",
" # Append the bias feature\n",
" x.append('@bias')\n",
" # Unigram features (the number of occurrences of each word)\n",
" return collections.Counter(x)\n",
"\n",
"def T2F(text):\n",
" \"\"\"Feature extractor (from a natural sentence).\n",
" \n",
" Args:\n",
" text: a sentence.\n",
" Returns:\n",
" feature vector as a mapping object: feature -> value.\n",
" \n",
" \"\"\"\n",
" return feature(text.lower().split(' '))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the feature extractor."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'@bias': 1, 'silli': 1, 'simplist': 1, 'tedious': 1})"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F('simplistic , silly and tedious .')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'@bias': 1,\n",
" 'boy': 1,\n",
" 'find': 1,\n",
" 'funni': 1,\n",
" 'it': 1,\n",
" 'juvenil': 1,\n",
" 'laddish': 1,\n",
" 'possibl': 1,\n",
" 'teenag': 1})"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F(\"it's so laddish and juvenile , only teenage boys could possibly find it funny . \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the instances in `data.txt` and store each instance in an `Instance` object."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class Instance:\n",
" def __init__(self, x, y, text):\n",
" self.x = x\n",
" self.y = y\n",
" self.text = text\n",
" def __repr__(self):\n",
" return repr((self.y, self.x))\n",
"\n",
"D = []\n",
"for line in open('data.txt'):\n",
" pos = line.find(' ')\n",
" if pos == -1:\n",
" continue\n",
" y = int(line[:pos])\n",
" x = T2F(line[pos+1:])\n",
" D.append(Instance(x, y, line))"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1, Counter({'summer': 1, 'comedi': 1, '@bias': 1, 'pleasant': 1, 'enough': 1, 'place': 1, 'found': 1}))"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with perceptron"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def training_with_perceptron(D, max_iterations=10):\n",
" \"\"\"Training a linear binary classifier with perceptron.\n",
" \n",
" Args:\n",
" D: training set, a list of Instance objects.\n",
" max_iterations: the number of iterations.\n",
" Returns:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
"\n",
" \"\"\"\n",
" w = collections.defaultdict(float)\n",
" for epoch in range(max_iterations):\n",
" random.shuffle(D) # This lazy implementation alters D.\n",
" for d in D:\n",
" update_perceptron(w, d.x, d.y)\n",
" return w"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"w = training_with_perceptron(D)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-8.0"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.items(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('appar', -12.0),\n",
" ('snake', -11.0),\n",
" ('well-intent', -11.0),\n",
" ('unless', -11.0),\n",
" ('schneider', -10.0),\n",
" ('prettiest', -10.0),\n",
" ('demm', -10.0),\n",
" ('incoher', -10.0),\n",
" ('ballist', -10.0),\n",
" ('purport', -9.0)]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('explod', 9.0),\n",
" ('glorious', 9.0),\n",
" ('frailti', 9.0),\n",
" ('resist', 9.0),\n",
" ('smith', 9.0),\n",
" ('confid', 10.0),\n",
" ('tape', 10.0),\n",
" ('optimist', 10.0),\n",
" ('refresh', 11.0),\n",
" ('engross', 13.0)]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with logistic regression"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def training_with_logistic_regression(D, max_iterations=10, eta0=0.25):\n",
" \"\"\"Training a linear binary classifier with logistic regression.\n",
" \n",
" Args:\n",
" D: training set, a list of Instance objects.\n",
" max_iterations: the number of iterations.\n",
" eta0: the initial learning rate.\n",
" Returns:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
"\n",
" \"\"\"\n",
" t = 0\n",
" T = len(D) * max_iterations\n",
" w = collections.defaultdict(float)\n",
" for epoch in range(max_iterations):\n",
" random.shuffle(D) # This lazy implementation alters D.\n",
" for d in D:\n",
" eta = eta0 * (1 - t / float(T + 1)) # decay the learning rate linearly over the updates\n",
" update_logress(w, d.x, d.y, eta)\n",
" t += 1\n",
" return w"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"w = training_with_logistic_regression(D)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0049367426713377415"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7515928142237298"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.items(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('bore', -3.8791828412417098),\n",
" ('unless', -3.7829369145348104),\n",
" ('appar', -3.758309568480576),\n",
" ('wast', -3.6960486949296003),\n",
" ('snake', -3.6184562166109298),\n",
" ('mediocr', -3.553742416416723),\n",
" ('routin', -3.4720750656289727),\n",
" (\"wasn't\", -3.42828223524663),\n",
" ('incoher', -3.393891755683168),\n",
" ('generic', -3.387698473383638)]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('optimist', 3.1985022632691846),\n",
" ('lane', 3.2043386515334706),\n",
" ('examin', 3.2555649003689764),\n",
" ('confid', 3.3621029766188024),\n",
" ('resist', 3.399846251977437),\n",
" ('unexpect', 3.468014876606892),\n",
" ('smarter', 3.7142599381159553),\n",
" ('glorious', 3.8256576447392465),\n",
" ('refresh', 4.131514135332804),\n",
" ('engross', 4.666381021792194)]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Closed evaluation"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def predict(w, d):\n",
" d.label = +1 if dot_product(w, d.x) > 0 else -1\n",
"\n",
"def predict_all(w, D):\n",
" for d in D:\n",
" predict(w, d)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predict_all(w, D)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0].label"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def num_correct_predictions(D):\n",
" return sum(1 for d in D if d.y == d.label)\n",
"\n",
"def num_true_positives(D):\n",
" return sum(1 for d in D if d.y == 1 and d.y == d.label)\n",
"\n",
"def num_gold_positives(D):\n",
" return sum(1 for d in D if d.y == 1)\n",
"\n",
"def num_predicted_positives(D):\n",
" return sum(1 for d in D if d.label == 1)\n",
" \n",
"def compute_accuracy(D):\n",
" return num_correct_predictions(D) / float(len(D))\n",
"\n",
   "def compute_precision(D):\n",
   "    # Guard against division by zero when nothing is predicted positive.\n",
   "    pp = num_predicted_positives(D)\n",
   "    return num_true_positives(D) / float(pp) if pp else 0.\n",
   "\n",
   "def compute_recall(D):\n",
   "    # Guard against division by zero when the data contain no positives.\n",
   "    gp = num_gold_positives(D)\n",
   "    return num_true_positives(D) / float(gp) if gp else 0.\n",
"\n",
"def compute_f1(D):\n",
" p = compute_precision(D)\n",
" r = compute_recall(D)\n",
" return 2 * p * r / (p + r) if 0 < p + r else 0."
]
},
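  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, the metrics above can be verified on a tiny hand-made example. The `Toy` class below is only a stand-in for the instance objects used in this notebook (anything with `y` and `label` attributes works) and is not part of the original code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "class Toy:\n",
    "    def __init__(self, y, label):\n",
    "        self.y = y          # gold label\n",
    "        self.label = label  # predicted label\n",
    "\n",
    "# Two true positives, one false positive, one false negative:\n",
    "toy = [Toy(1, 1), Toy(1, 1), Toy(-1, 1), Toy(1, -1)]\n",
    "# Expect precision = recall = F1 = 2/3.\n",
    "compute_precision(toy), compute_recall(toy), compute_f1(toy)"
   ]
  },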
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10344"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_correct_predictions(D)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10662"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(D)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9701744513224536"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9881207400194741"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9517914087413243"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_recall(D)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9696158991018535"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_f1(D)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross validation (open evaluation)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
   "# 10-fold cross validation: train on nine folds, predict on the held-out fold,\n",
   "# so that every instance in D ends up with a held-out prediction.\n",
   "N = 10\n",
   "for n in range(N):\n",
   "    train_set = [D[i] for i in range(len(D)) if i % N != n]\n",
   "    test_set = [D[i] for i in range(len(D)) if i % N == n]\n",
   "    w = training_with_logistic_regression(train_set)\n",
   "    predict_all(w, test_set)"
]
},
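  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every instance in `D` has now received its prediction from a model that never saw it during training, so the scores computed next are an open (held-out) evaluation. As an extra illustration (not in the original notebook), the same fold split can be reused to inspect the accuracy of each fold separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Accuracy on each of the ten held-out folds.\n",
    "[compute_accuracy([D[i] for i in range(len(D)) if i % N == n]) for n in range(N)]"
   ]
  },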
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7481710748452448"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(0.7451816160118606, 0.7542674920277621, 0.7496970261955812)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D), compute_recall(D), compute_f1(D)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}