@chokkan
Last active December 19, 2017 18:35
Jupyter notebook for classification.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, a feature vector $x$ is represented by a mapping (dictionary) object `x` whose keys are feature names and whose values are feature values. In other words, features are indexed by arbitrary strings rather than by integers. For example,\n",
"\n",
"```\n",
"x = {}\n",
"x['darling'] = 1\n",
"x['photo'] = 1\n",
"x['attach'] = 1\n",
"```\n",
"\n",
"This representation is useful because the feature space for natural language is high-dimensional and sparse. If we define the feature space as occurrences of every word,\n",
"\n",
"* the dimensionality of the feature space ($m$) equals the total number of words in the language, which typically amounts to about one million.\n",
"* although a feature vector is an $m$-dimensional vector, most of its elements are zero; only the few elements corresponding to the words in a sentence have non-zero values.\n",
"\n",
"A binary label $y$ is either `+1` (positive) or `-1` (negative)."
]
},
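{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a sketch, not part of the lecture material), the sparse dictionary representation lets us compute an inner product by touching only the non-zero features, no matter how large $m$ is:\n",
"\n",
"```\n",
"w = {'darling': 1.5, 'viagra': 2.0, 'meeting': -1.0}\n",
"x = {'darling': 1, 'photo': 1, 'attach': 1}\n",
"score = sum(w.get(f, 0.) * v for f, v in x.items())  # 1.5\n",
"```\n",
"\n",
"Only the three features present in `x` are inspected; features absent from `w` or `x` contribute zero."
]
},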
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n",
"import functools\n",
"import math\n",
"import operator\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the example used in the lecture, $[(x_1, y_1), (x_2, y_2)]$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hi darling, my photo in attached file\n",
"x1 = {'@bias': 1, 'hi_darl':1, 'darl_my':1, 'my_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y1 = +1\n",
"\n",
"# Hi Mark, Kyoto photo in attached file\n",
"x2 = {'@bias': 1, 'hi_mark':1, 'mark_kyoto':1, 'kyoto_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y2 = -1"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1,\n",
" 'attach_file': 1,\n",
" 'darl_my': 1,\n",
" 'hi_darl': 1,\n",
" 'my_photo': 1,\n",
" 'photo_attach': 1}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perceptron"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
" \"\"\"Inner product, w \\cdot x.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the inner product, w \\cdot x.\n",
"\n",
" \"\"\"\n",
"\n",
" a = 0.\n",
" for f, v in x.items():\n",
" a += w.get(f, 0.) * v\n",
" return a\n",
"\n",
"def predict(w, x):\n",
" \"\"\"Predict the label of an instance.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the predicted label: +1 (positive) or -1 (negative).\n",
" \"\"\"\n",
" return +1 if dot_product(w, x) > 0 else -1 \n",
"\n",
"def update_perceptron(w, x, y):\n",
" \"\"\"Update the model with a training instance (x, y).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector of the training instance as a mapping object.\n",
" y: label of the training instance, -1 or +1.\n",
"\n",
" \"\"\"\n",
" yp = predict(w, x)\n",
" if yp * y < 0:\n",
" for f, v in x.items():\n",
" w[f] += y * v"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weight vector (model) is a dictionary object that automatically treats missing values as zero (`collections.defaultdict`). The initial model is empty (no features)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label for the instance $x_1$ (the prediction is incorrect: it should be $+1$ because $y_1 = +1$)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted label was negative because the score for the instance is $0$."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_1, y_1)$."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 1.0,\n",
" 'attach_file': 1.0,\n",
" 'darl_my': 1.0,\n",
" 'hi_darl': 1.0,\n",
" 'my_photo': 1.0,\n",
" 'photo_attach': 1.0})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_perceptron(w, x1, y1)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label for the instance $x_2$ (the prediction is incorrect: it should be $-1$ because $y_2 = -1$)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted label was positive because the score for the instance is $3$."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_2, y_2)$."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 0.0,\n",
" 'attach_file': 0.0,\n",
" 'darl_my': 1.0,\n",
" 'hi_darl': 1.0,\n",
" 'hi_mark': -1.0,\n",
" 'kyoto_photo': -1.0,\n",
" 'mark_kyoto': -1.0,\n",
" 'my_photo': 1.0,\n",
" 'photo_attach': 0.0})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_perceptron(w, x2, y2)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us predict labels for instances $x_1$ and $x_2$."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-3.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
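{
"cell_type": "markdown",
"metadata": {},
"source": [
"The manual updates above can be wrapped in a training loop that repeats over the data until an epoch finishes without a mistake — a minimal sketch using the functions defined earlier:\n",
"\n",
"```\n",
"w = collections.defaultdict(float)\n",
"for epoch in range(10):\n",
"    mistakes = 0\n",
"    for x, y in [(x1, y1), (x2, y2)]:\n",
"        if predict(w, x) != y:\n",
"            update_perceptron(w, x, y)\n",
"            mistakes += 1\n",
"    if mistakes == 0:\n",
"        break\n",
"```\n",
"\n",
"On this toy data the loop makes the same two updates as above and then stops, since the second epoch classifies both instances correctly."
]
},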
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
" \"\"\"Inner product, w \\cdot x.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the inner product, w \\cdot x.\n",
"\n",
" \"\"\"\n",
" a = 0.\n",
" for f, v in x.items():\n",
" a += w.get(f, 0.) * v\n",
" return a\n",
"\n",
"def probability(w, x):\n",
" \"\"\"Compute P(+1|x).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the probability of the instance x being classified as positive.\n",
"\n",
" \"\"\"\n",
" a = dot_product(w, x)\n",
" return 1. / (1 + math.exp(-a)) if -100. < a else 0. # guard against overflow in math.exp for very negative a\n",
"\n",
"def update_logress(w, x, y, eta=1.0):\n",
" \"\"\"Update the model with a training instance (x, y).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector of the training instance as a mapping object.\n",
" y: label of the training instance, -1 or +1.\n",
" eta: the learning rate for updating the model (default: 1.0).\n",
"\n",
" \"\"\"\n",
"\n",
" # Update the model (feature weights) with a training instance (x, y)\n",
" y = (y + 1) // 2 # convert {-1,1} to {0,1}\n",
" p = probability(w, x)\n",
" g = y - p\n",
" for f, v in x.items():\n",
" w[f] += eta * g * v"
]
},
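{
"cell_type": "markdown",
"metadata": {},
"source": [
"The update rule in `update_logress` is one step of stochastic gradient ascent on the log-likelihood. Writing $p = P(+1|x) = \\sigma(w \\cdot x)$ with $\\sigma(a) = 1/(1+e^{-a})$ and the target $y \\in \\{0, 1\\}$, the log-likelihood of one instance is\n",
"\n",
"$$\\ell(w) = y \\log p + (1 - y) \\log (1 - p).$$\n",
"\n",
"Because $\\sigma'(a) = \\sigma(a)(1 - \\sigma(a))$, the gradient with respect to each weight $w_f$ reduces to $(y - p)\\, x_f$, which is exactly the code's `w[f] += eta * g * v` with `g = y - p`."
]
},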
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weight vector (model) is a dictionary object that automatically treats missing values as zero (`collections.defaultdict`). The initial model is empty (no features)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute $P(+1|x_1)$ on the initial model."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.5"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probability (0.5) means that the model has no clue for classifying the instance $x_1$ (because the model is empty)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_1, y_1)$."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 0.5,\n",
" 'attach_file': 0.5,\n",
" 'darl_my': 0.5,\n",
" 'hi_darl': 0.5,\n",
" 'my_photo': 0.5,\n",
" 'photo_attach': 0.5})"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_logress(w, x1, y1)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weights for the features in the instance $x_1$ are set to $0.5$ based on the amount of the error, $(y - p) = (1 - 0.5) = 0.5$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute $P(+1|x_2)$ on the current model (which has been updated with $(x_1, y_1)$)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8175744761936437"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probability $P(+1|x_2)$ should ideally be zero (in other words, $P(-1|x_2) = 1 - P(+1|x_2)$ should be one) because $y_2 = -1$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_2, y_2)$."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': -0.31757447619364365,\n",
" 'attach_file': -0.31757447619364365,\n",
" 'darl_my': 0.5,\n",
" 'hi_darl': 0.5,\n",
" 'hi_mark': -0.8175744761936437,\n",
" 'kyoto_photo': -0.8175744761936437,\n",
" 'mark_kyoto': -0.8175744761936437,\n",
" 'my_photo': 0.5,\n",
" 'photo_attach': -0.31757447619364365})"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_logress(w, x2, y2)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The amount of the error for the instance $x_2$ is $(y - p) = (0 - 0.81757...) = -0.81757...$. We can interpret the feature weights as follows:\n",
"\n",
"* 0.5: the feature appears only in $x_1$\n",
"* -0.8...: the feature appears only in $x_2$\n",
"* -0.3...: the feature appears in both $x_1$ and $x_2$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us predict labels for instances $x_1$ and $x_2$."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.6335035042481402"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.032125669946444585"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both predictions look good, but the classifier leans negative because it received a larger error from $x_2$ than from $x_1$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us build a sentiment predictor (positive/negative) by using [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) distributed in [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preparing the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, download the dataset and extract the files from the tarball (`*.tar.gz`)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2015-11-20 13:35:13-- http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\n",
"Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 128.84.154.137\n",
"Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|128.84.154.137|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 487770 (476K) [application/x-gzip]\n",
"Saving to: ‘rt-polaritydata.tar.gz’\n",
"\n",
"100%[======================================>] 487,770 413KB/s in 1.2s \n",
"\n",
"2015-11-20 13:35:14 (413 KB/s) - ‘rt-polaritydata.tar.gz’ saved [487770/487770]\n",
"\n"
]
}
],
"source": [
"!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rt-polaritydata.README.1.0.txt\n",
"rt-polaritydata/rt-polarity.neg\n",
"rt-polaritydata/rt-polarity.pos\n"
]
}
],
"source": [
"!tar xvzf rt-polaritydata.tar.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the training instances in the tar-ball."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \r\n",
"the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . \r\n",
"effective but too-tepid biopic\r\n",
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \r\n",
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.pos"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"simplistic , silly and tedious . \r\n",
"it's so laddish and juvenile , only teenage boys could possibly find it funny . \r\n",
"exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \r\n",
"[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . \r\n",
"a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.neg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Merge the positive and negative instances after inserting '+1' at the beginning of each line in the positive data and '-1' at the beginning of each line in the negative data. Then shuffle the instances into a random order."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/+1 /g\" rt-polaritydata/rt-polarity.pos > positives.txt"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/-1 /g\" rt-polaritydata/rt-polarity.neg > negatives.txt"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sort -R positives.txt negatives.txt > data.txt"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+1 a pleasant enough comedy that should have found a summer place . \r\n",
"-1 the only thing in pauline and paulette that you haven't seen before is a scene featuring a football field-sized oriental rug crafted out of millions of vibrant flowers . \r\n",
"-1 the problematic characters and overly convenient plot twists foul up shum's good intentions . \r\n",
"-1 it will probably prove interesting to ram dass fans , but to others it may feel like a parody of the mellow , peace-and-love side of the '60s counterculture . \r\n",
"-1 if all of eight legged freaks was as entertaining as the final hour , i would have no problem giving it an unqualified recommendation . \r\n",
"+1 sweetly sexy , funny and touching . \r\n",
"-1 the film seems all but destined to pop up on a television screen in the background of a scene in a future quentin tarantino picture\r\n",
"+1 while not all that bad of a movie , it's nowhere near as good as the original . \r\n",
"+1 it remains to be seen whether statham can move beyond the crime-land action genre , but then again , who says he has to ? \r\n",
"-1 tom green and an ivy league college should never appear together on a marquee , especially when the payoff is an unschooled comedy like stealing harvard , which fails to keep 80 minutes from seeming like 800 . \r\n"
]
}
],
"source": [
"!head -n 10 data.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of positive and negative instances."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^+1' data.txt | wc -l"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^-1' data.txt | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementing a feature extractor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use a stop list distributed on the Web."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2015-11-20 13:35:17-- http://www.textfixer.com/resources/common-english-words.txt\n",
"Resolving www.textfixer.com (www.textfixer.com)... 216.172.104.5\n",
"Connecting to www.textfixer.com (www.textfixer.com)|216.172.104.5|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 551 [text/plain]\n",
"Saving to: ‘common-english-words.txt’\n",
"\n",
"100%[======================================>] 551 --.-K/s in 0s \n",
"\n",
"2015-11-20 13:35:17 (67.9 MB/s) - ‘common-english-words.txt’ saved [551/551]\n",
"\n"
]
}
],
"source": [
"!wget http://www.textfixer.com/resources/common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your"
]
}
],
"source": [
"!cat common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from stemming.porter2 import stem\n",
"\n",
"stoplist = set(open('common-english-words.txt').read().split(','))\n",
"\n",
"def is_non_stop(x):\n",
" return x not in stoplist\n",
"\n",
"def has_alnum(x):\n",
" return any((c.isalnum() for c in x))\n",
"\n",
"def feature(s):\n",
" \"\"\"Feature extractor (from a sequence of words).\n",
" \n",
" Args:\n",
" s: a list of words in a sentence.\n",
" Returns:\n",
" feature vector as a mapping object: feature -> value.\n",
" \n",
" \"\"\"\n",
" # Remove stop words (find words x \\in s where is_non_stop(x) is True)\n",
" x = list(filter(is_non_stop, s))\n",
" # Apply stemming (apply stem(i) for all i \\in x)\n",
" x = list(map(stem, x))\n",
" # Remove non alphanumeric words.\n",
" x = list(filter(has_alnum, x))\n",
" # Append the bias feature\n",
" x.append('@bias')\n",
" # Unigram features (the number of occurrences of each word)\n",
" return collections.Counter(x)\n",
"\n",
"def T2F(text):\n",
" \"\"\"Feature extractor (from a natural sentence).\n",
" \n",
" Args:\n",
" text: a sentence.\n",
" Returns:\n",
" feature vector as a mapping object: feature -> value.\n",
" \n",
" \"\"\"\n",
" return feature(text.lower().split(' '))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the feature extractor."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'@bias': 1, 'silli': 1, 'simplist': 1, 'tedious': 1})"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F('simplistic , silly and tedious .')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'@bias': 1,\n",
" 'boy': 1,\n",
" 'find': 1,\n",
" 'funni': 1,\n",
" 'it': 1,\n",
" 'juvenil': 1,\n",
" 'laddish': 1,\n",
" 'possibl': 1,\n",
" 'teenag': 1})"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F(\"it's so laddish and juvenile , only teenage boys could possibly find it funny . \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the instances in `data.txt` and store each instance in an `Instance` object."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class Instance:\n",
" def __init__(self, x, y, text):\n",
" self.x = x\n",
" self.y = y\n",
" self.text = text\n",
" def __repr__(self):\n",
" return repr((self.y, self.x))\n",
"\n",
"D = []\n",
"for line in open('data.txt'):\n",
" pos = line.find(' ')\n",
" if pos == -1:\n",
" continue\n",
" y = int(line[:pos])\n",
" x = T2F(line[pos+1:])\n",
" D.append(Instance(x, y, line))"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1, Counter({'summer': 1, 'comedi': 1, '@bias': 1, 'pleasant': 1, 'enough': 1, 'place': 1, 'found': 1}))"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with perceptron"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def training_with_perceptron(D, max_iterations=10):\n",
" \"\"\"Training a linear binary classifier with perceptron.\n",
" \n",
" Args:\n",
" D: training set, a list of Instance objects.\n",
" max_iterations: the number of iterations.\n",
" Returns:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
"\n",
" \"\"\"\n",
" w = collections.defaultdict(float)\n",
" for epoch in range(max_iterations):\n",
" random.shuffle(D) # This lazy implementation alters D.\n",
" for d in D:\n",
" update_perceptron(w, d.x, d.y)\n",
" return w"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"w = training_with_perceptron(D)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-8.0"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.items(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('appar', -12.0),\n",
" ('snake', -11.0),\n",
" ('well-intent', -11.0),\n",
" ('unless', -11.0),\n",
" ('schneider', -10.0),\n",
" ('prettiest', -10.0),\n",
" ('demm', -10.0),\n",
" ('incoher', -10.0),\n",
" ('ballist', -10.0),\n",
" ('purport', -9.0)]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('explod', 9.0),\n",
" ('glorious', 9.0),\n",
" ('frailti', 9.0),\n",
" ('resist', 9.0),\n",
" ('smith', 9.0),\n",
" ('confid', 10.0),\n",
" ('tape', 10.0),\n",
" ('optimist', 10.0),\n",
" ('refresh', 11.0),\n",
" ('engross', 13.0)]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with logistic regression"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def training_with_logistic_regression(D, max_iterations=10, eta0=0.25):\n",
" \"\"\"Training a linear binary classifier with logistic regression.\n",
" \n",
" Args:\n",
" D: training set, a list of Instance objects.\n",
" max_iterations: the number of iterations.\n",
" eta0: the initial learning rate.\n",
" Returns:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
"\n",
" \"\"\"\n",
" t = 0\n",
" T = len(D) * max_iterations\n",
" w = collections.defaultdict(float)\n",
" for epoch in range(max_iterations):\n",
" random.shuffle(D) # This lazy implementation alters D.\n",
" for d in D:\n",
" eta = eta0 * (1 - t / float(T + 1)) # decay the learning rate linearly over the updates\n",
" update_logress(w, d.x, d.y, eta)\n",
" t += 1\n",
" return w"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"w = training_with_logistic_regression(D)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0049367426713377415"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7515928142237298"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.items(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('bore', -3.8791828412417098),\n",
" ('unless', -3.7829369145348104),\n",
" ('appar', -3.758309568480576),\n",
" ('wast', -3.6960486949296003),\n",
" ('snake', -3.6184562166109298),\n",
" ('mediocr', -3.553742416416723),\n",
" ('routin', -3.4720750656289727),\n",
" (\"wasn't\", -3.42828223524663),\n",
" ('incoher', -3.393891755683168),\n",
" ('generic', -3.387698473383638)]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('optimist', 3.1985022632691846),\n",
" ('lane', 3.2043386515334706),\n",
" ('examin', 3.2555649003689764),\n",
" ('confid', 3.3621029766188024),\n",
" ('resist', 3.399846251977437),\n",
" ('unexpect', 3.468014876606892),\n",
" ('smarter', 3.7142599381159553),\n",
" ('glorious', 3.8256576447392465),\n",
" ('refresh', 4.131514135332804),\n",
" ('engross', 4.666381021792194)]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Closed evaluation"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def predict(w, d):\n",
" d.label = +1 if dot_product(w, d.x) > 0 else -1\n",
"\n",
"def predict_all(w, D):\n",
" for d in D:\n",
" predict(w, d)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predict_all(w, D)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0].label"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def num_correct_predictions(D):\n",
" return sum(1 for d in D if d.y == d.label)\n",
"\n",
"def num_true_positives(D):\n",
" return sum(1 for d in D if d.y == 1 and d.y == d.label)\n",
"\n",
"def num_gold_positives(D):\n",
" return sum(1 for d in D if d.y == 1)\n",
"\n",
"def num_predicted_positives(D):\n",
" return sum(1 for d in D if d.label == 1)\n",
" \n",
"def compute_accuracy(D):\n",
" return num_correct_predictions(D) / float(len(D))\n",
"\n",
   "def compute_precision(D):\n",
   "    # Guard against division by zero when nothing is predicted positive.\n",
   "    pp = num_predicted_positives(D)\n",
   "    return num_true_positives(D) / float(pp) if pp else 0.\n",
   "\n",
   "def compute_recall(D):\n",
   "    # Guard against division by zero when the data contain no positives.\n",
   "    gp = num_gold_positives(D)\n",
   "    return num_true_positives(D) / float(gp) if gp else 0.\n",
"\n",
"def compute_f1(D):\n",
" p = compute_precision(D)\n",
" r = compute_recall(D)\n",
" return 2 * p * r / (p + r) if 0 < p + r else 0."
]
},
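  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, the metrics above can be verified on a tiny hand-made example. The `Toy` class below is only a stand-in for the instance objects used in this notebook (anything with `y` and `label` attributes works) and is not part of the original code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "class Toy:\n",
    "    def __init__(self, y, label):\n",
    "        self.y = y          # gold label\n",
    "        self.label = label  # predicted label\n",
    "\n",
    "# Two true positives, one false positive, one false negative:\n",
    "toy = [Toy(1, 1), Toy(1, 1), Toy(-1, 1), Toy(1, -1)]\n",
    "# Expect precision = recall = F1 = 2/3.\n",
    "compute_precision(toy), compute_recall(toy), compute_f1(toy)"
   ]
  },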
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10344"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_correct_predictions(D)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10662"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(D)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9701744513224536"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9881207400194741"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9517914087413243"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_recall(D)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9696158991018535"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_f1(D)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross validation (open evaluation)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
   "# 10-fold cross validation: train on nine folds, predict on the held-out fold,\n",
   "# so that every instance in D ends up with a held-out prediction.\n",
   "N = 10\n",
   "for n in range(N):\n",
   "    train_set = [D[i] for i in range(len(D)) if i % N != n]\n",
   "    test_set = [D[i] for i in range(len(D)) if i % N == n]\n",
   "    w = training_with_logistic_regression(train_set)\n",
   "    predict_all(w, test_set)"
]
},
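  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every instance in `D` has now received its prediction from a model that never saw it during training, so the scores computed next are an open (held-out) evaluation. As an extra illustration (not in the original notebook), the same fold split can be reused to inspect the accuracy of each fold separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Accuracy on each of the ten held-out folds.\n",
    "[compute_accuracy([D[i] for i in range(len(D)) if i % N == n]) for n in range(N)]"
   ]
  },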
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7481710748452448"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(0.7451816160118606, 0.7542674920277621, 0.7496970261955812)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D), compute_recall(D), compute_f1(D)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}