pkipsy/author-classification.ipynb

## author-classification.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Author Classification Using NLTK\n",
    "\n",
    "### December 2012"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An important new area of research in computational linguistics is **stylometrics**, the quantitative study of linguistic style. The underlying idea is that in creating texts, we index something of ourselves - often in ways that are below our conscious perception, but that are nevertheless amenable to quantitative study. \n",
    "\n",
    "This is supported by research into **author identification**, where investigators have found that a simple measure of lexical diversity can serve as a [linguistic signature](https://arxiv.org/abs/0909.4385), reliably differentiating texts by famous authors such as Herman Melville, Thomas Hardy, and D.H. Lawrence. The fact that a writer's style remains relatively consistent over the years, and across their various works, can offer insight into how the ravages of old-age and dementia affect verbal dexterity. That Agatha Christie and Iris Murdoch suffered from Alzheimer’s disease is clearly evident in their later writing. An [analysis](https://www.theguardian.com/books/2009/apr/03/agatha-christie-alzheimers-research) of their final works reveals that the breadth of their vocabularies had diminished sharply from their hey-days, while repetition — and the use of stop words — had increased markedly in kind. In hindsight, their texts could have served as useful diagnostic tools. \n",
    "\n",
    "These are but some of the interesting applications of **quantitative textual analysis**. Over the past decade, analyses of this type have emerged in everything from questions of authorship (how much help did Jane Austen have from [her editor](http://www.npr.org/templates/story/story.php?storyId=130838304)?) to automatic personality assessments (depressives use more [personal pronouns](http://www.slate.com/blogs/xx_factor/2013/10/09/pronoun_study_how_often_you_say_i_reflects_your_status_power_and_gender.html) than happier folk).\n",
    "\n",
    "Today, we will examine the stylometric problem of **classification**: Can we automatically classify texts according to a particular dimension of interest, such as genre, register, or author? What properties of the texts allow them to be classified in these ways?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Can a classifier reliably identify the author of a text, given two possible alternatives? \n",
    "For this project, we will be working with texts from [Project Gutenberg](https://www.gutenberg.org/) and running a classifier available through the [Natural Language Toolkit](http://www.nltk.org/). Our first step is to retrieve the available text corpus from the web, using [Wget](http://www.gnu.org/software/wget/) at the command line.\n",
    "```\n",
    "#get txt files from Project Gutenberg\n",
    "wget -w 2 -m -H http://www.gutenberg.org/robot/harvest?filetypes[]=txtl&langs[]=en\n",
    "```\n",
    "We will be working with two novels from [Project Gutenberg](https://www.gutenberg.org/): **Jane Eyre**, written by [Charlotte Bronte](https://en.wikipedia.org/wiki/Charlotte_Bront%C3%AB), and **Pride and Prejudice**, written by [Jane Austen](https://en.wikipedia.org/wiki/Jane_Austen). The files are tagged with their author’s name and read into a list as tokens (e.g., ([‘word1’, ‘word2’…], ‘austen’), creating an accessible tagged corpus of texts. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#build corpus of files\n",
    "from __future__ import division\n",
    "import nltk, re, pprint, os\n",
    "\n",
    "#iterate through directory, building tagged corpus for NLTK\n",
    "path = '/users/[username]/Desktop/corpus/'\n",
    "corpus = []\n",
    "listing = os.listdir(path)\n",
    "for infile in listing:\n",
    "    if infile.startswith('.'):\n",
    "        continue               #ignore .DS_Store file\n",
    "    else:\n",
    "        url = path + infile\n",
    "        f = open(url);\n",
    "        raw = f.read()\n",
    "        f.close()\n",
    "        tokens = nltk.word_tokenize(raw)\n",
    "        text = nltk.Text(tokens)\n",
    "        if infile.startswith('austen_pp'):\n",
    "            tupp = (tokens, 'austen')\n",
    "            corpus.append(tupp)\n",
    "        elif infile.startswith('bronte_je'):\n",
    "            tupp = (tokens, 'bronte')\n",
    "            corpus.append(tupp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The order of documents in the corpus is then **shuffled**, to randomize which are seen in training and which at test. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#shuffle documents\n",
    "import random\n",
    "random.shuffle(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A **word frequency analysis** is then run over all the documents, creating, for each text, a word feature vector that reports whether that word occurred in the document or not. Word features cover the top five-hundred most frequent words with a length greater than four. By constraining word length, we select for content words. (Alternately, we could exclude function words like \"the\" and \"a\" that occur in a stop list.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create single corpus text over which to run frequency analyses\n",
    "all_text = []\n",
    "for i in range(0,len(corpus)-1):\n",
    "    for w in corpus[i][0]:\n",
    "        all_text.append(w)\n",
    "\n",
    "#choose the top 500 most frequent words in each text with length > 4 characters\n",
    "all_text_freq = nltk.FreqDist(w.lower() for w in all_text if len(w) > 4)\n",
    "word_features = all_text_freq.keys()[0:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first half of the feature sets are fed to the **classifer** as the training set, the second half as the testing set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#function to extract document features\n",
    "def document_features(document): \n",
    "    document_words = set(document)\n",
    "    features = {}\n",
    "    for word in word_features:\n",
    "        features['contains(%s)' % word] = (word in document_words)\n",
    "    return features\n",
    "\n",
    "#define featuresets, training set, testing set, and classifier\n",
    "featuresets = [(document_features(d), c) for (d,c) in corpus]\n",
    "train_set, test_set = featuresets[0:19], featuresets[20:39]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then test **accuracy**, and find it is at ceiling.\n",
    "```python\n",
    ">>> print nltk.classify.accuracy(classifier, test_set)\n",
    "1.0\n",
    "```\n",
    "We can also examine the **most informative features** that separate the two novelists.\n",
    "```python\n",
    ">>> classifier.show_most_informative_features(10)\n",
    "Most Informative Features\n",
    "    contains(light) = True           bronte : austen =      6.6 : 1.0\n",
    "    contains(taking) = True          bronte : austen =      6.6 : 1.0\n",
    "    contains(chair) = False          austen : bronte =      5.8 : 1.0\n",
    "    contains(black) = False          austen : bronte =      5.8 : 1.0\n",
    "    contains(hands) = False          austen : bronte =      5.8 : 1.0\n",
    "    contains(thoughts) = True        bronte : austen =      5.7 : 1.0\n",
    "    contains(watched) = True         bronte : austen =      5.7 : 1.0\n",
    "    contains(white) = True           bronte : austen =      5.7 : 1.0\n",
    "    contains(strange) = True         bronte : austen =      5.7 : 1.0\n",
    "    contains(expected) = True        bronte : austen =      5.7 : 1.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Can the classifier perform just as well when trained on diverse materials? \n",
    "To explore this question, we now train and test the classifier on four novels rather than two, adding **Vitella**, written by Bronte, and **Sense and Sensibility**, written by Austen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create new test corpus\n",
    "path = '/users/[username]/Desktop/corpus/'\n",
    "test_corpus = []\n",
    "listing = os.listdir(path)\n",
    "for infile in listing:\n",
    "    if infile.startswith('.'):\n",
    "        continue               #ignore .DS_Store file\n",
    "    else:\n",
    "        url = path + infile\n",
    "        f = open(url);\n",
    "        raw = f.read()\n",
    "        f.close()\n",
    "        tokens = nltk.word_tokenize(raw)\n",
    "        text = nltk.Text(tokens)\n",
    "        if infile.startswith('austen'):\n",
    "            tupp = (tokens, 'austen')\n",
    "            test_corpus.append(tupp)\n",
    "        elif infile.startswith('bronte'):\n",
    "            tupp = (tokens, 'bronte')\n",
    "            test_corpus.append(tupp)\n",
    "\n",
    "#shuffle corpora\n",
    "random.shuffle(test_corpus)\n",
    "\n",
    "#create single corpus text over which to run frequency analyses\n",
    "all_text = []\n",
    "for i in range(0,len(test_corpus)-1):\n",
    "    for w in test_corpus[i][0]:\n",
    "        all_text.append(w)\n",
    "\n",
    "#choose the top 500 most frequent words in each text with length > 4 characters\n",
    "all_text_freq = nltk.FreqDist(w.lower() for w in all_text if len(w) > 2)\n",
    "word_features = all_text_freq.keys()[0:500]\n",
    "\n",
    "#function to extract document features\n",
    "def document_features(document):\n",
    "    document_words = set(document)\n",
    "    features = {}\n",
    "    for word in word_features:\n",
    "        features['contains(%s)' % word] = (word in document_words)\n",
    "    return features\n",
    "\n",
    "#define featuresets, training set, testing set, and classifier\n",
    "featuresets = [(document_features(d), c) for (d,c) in test_corpus]\n",
    "train_set, test_set = featuresets[0:29], featuresets[30:59]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "When we test **accuracy**, it is again at ceiling.\n",
    "```python\n",
    "#check classification accuracy\n",
    ">>> print nltk.classify.accuracy(classifier, test_set)\n",
    "1.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## But wait.. Is the length of chapters controlled for?\n",
    "The classifier seems to find this problem easy.. but is it too easy? What if chapters from one writer are significantly longer than chapters from the other, skewing the frequency distributions?  To assess this, we can check the **length** of the chapters.\n",
    "```python\n",
    ">>> for i in range(0, len(corpus)-1):\n",
    "        print len(corpus[i][0]), corpus[i][1]\n",
    "992 austen\n",
    "2564 austen\n",
    "1860 austen\n",
    "739 austen\n",
    "1918 austen\n",
    "1265 austen\n",
    "1888 austen\n",
    "3866 austen\n",
    "1427 austen\n",
    "5838 austen\n",
    "2163 austen\n",
    "969 austen\n",
    "1883 austen\n",
    "1905 austen\n",
    "1190 austen\n",
    "1137 austen\n",
    "2682 austen\n",
    "2306 austen\n",
    "2277 austen\n",
    "2011 austen\n",
    "2277 bronte\n",
    "5107 bronte\n",
    "7584 bronte\n",
    "4887 bronte\n",
    "4856 bronte\n",
    "5951 bronte\n",
    "5906 bronte\n",
    "4485 bronte\n",
    "9538 bronte\n",
    "6920 bronte\n",
    "4676 bronte\n",
    "3226 bronte\n",
    "7122 bronte\n",
    "3821 bronte\n",
    "6863 bronte\n",
    "5842 bronte\n",
    "3425 bronte\n",
    "4119 bronte\n",
    "3528 bronte\n",
    "```\n",
    "On visual inspection, there appears to be a difference. So let's quantify that difference..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#count the word length of each corpus\n",
    "len_dict = {'austen':0, 'bronte':0}\n",
    "for i in range(0, len(corpus)-1):\n",
    "    if corpus[i][1] == 'austen':\n",
    "        len_dict['austen'] += len(corpus[i][0])\n",
    "    elif corpus[i][1] == 'bronte':\n",
    "        len_dict['bronte'] += len(corpus[i][0])\n",
    "        \n",
    "#what is the ratio of words in the bronte corpus to the austen corpus?        \n",
    "len_dict['bronte']/len_dict['austen']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This ratio analysis tells us that Bronte’s are, on average, **2½ times** longer than Austen’s!  To correct for this, we can create a new corpus of texts using a proportional slicing measure of 1,000 words each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create word lists of each novel, without chapter breaks\n",
    "def loadcorpus(filename,author,corpus):\n",
    "    corp = []\n",
    "    path = '/users/[username]/Desktop/corpus/'\n",
    "    listing = os.listdir(path)\n",
    "    for infile in listing:\n",
    "        if infile.startswith('.'):\n",
    "            continue               \n",
    "        else:\n",
    "            url = path + infile\n",
    "            f = open(url);\n",
    "            raw = f.read()\n",
    "            f.close()\n",
    "            tokens = nltk.word_tokenize(raw)\n",
    "            text = nltk.Text(tokens)\n",
    "            if infile.startswith(filename):\n",
    "                corp.append(tokens)\n",
    "    for i in range(0,len(corp)-1):\n",
    "        for w in corp[i]:\n",
    "            corpus.append(w)\n",
    "\n",
    "#pride and prejudice\n",
    "pp = []\n",
    "loadcorpus('austen_pp','austen',pp)\n",
    "\n",
    "#sense and sensibility\n",
    "ss = []\n",
    "loadcorpus('austen_ss','austen',ss)\n",
    "\n",
    "#jane eyre\n",
    "je = []\n",
    "loadcorpus('bronte_je','bronte',je)\n",
    "\n",
    "#vitella\n",
    "v = []\n",
    "loadcorpus('bronte_v','bronte',v)\n",
    "\n",
    "#document corpus\n",
    "document_file = []\n",
    "corpora = (pp,'austen'), (ss,'austen'), (je,'bronte'), (v,'bronte')\n",
    "ranges = (0,999),(1000,1999),(2000,2999),(3000,3999),(4000,4999),(5000,5999),(6000,6999),(7000,7999),(8000,8999),(9000,9999)\n",
    "for c in corpora:\n",
    "    for r in ranges:\n",
    "        slice = (c[0][r[0]:r[1]], c[1])\n",
    "        document_file.append(slice)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do our earlier findings **replicate**, now that length has been controlled for?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#randomize\n",
    "random.shuffle(document_file)\n",
    "\n",
    "#define featuresets, training set, testing set, and classifier\n",
    "featuresets = [(document_features(d), c) for (d,c) in document_file]\n",
    "train_set, test_set = featuresets[0:19], featuresets[20:39]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#check classification accuracy\n",
    "print nltk.classify.accuracy(classifier, test_set)\n",
    "0.526\n",
    "```\n",
    "Nope! Classification accuracy plummets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## But wait.. How is the random shuffle function working?\n",
    "However.. If we rerun this a few times, we might notice that classification accuracy seems to vary significantly with different random shuffles of the testing and training data. This could very well be because the texts seen in testing and training are not being adequately **counterbalanced**.  For example, in an extreme case, the classifier might get only Austen texts in training, and then be tested on all Bronte texts, leading it to mislabel them as Austen texts. To correct for this possibility, we can manually counterbalanced the texts. (Ideally, we would redo this with a random seed counterbalancer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#counterbalance training and test sets with selections from novels, authors\n",
    "austen_train = document_file[0:5] + document_file[10:15]\n",
    "austen_test = document_file[5:10] + document_file[15:20]\n",
    "bronte_train = document_file[20:25] + document_file[30:35]\n",
    "bronte_test = document_file[25:30] + document_file[35:40]\n",
    "\n",
    "documents = austen_train + bronte_train + austen_test + bronte_test\n",
    "\n",
    "#define featuresets, training set, testing set, and classifier\n",
    "featuresets = [(document_features(d), c) for (d,c) in documents]\n",
    "train_set, test_set = featuresets[0:20], featuresets[20:40]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#results are much better\n",
    "print nltk.classify.accuracy(classifier, test_set)\n",
    "0.95\n",
    "```\n",
    "Once training and testing data are adjusted in this way, classification accuracy rebounds, stabilizing at ~95%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What if the classifier is only trained on one novel and needs to ‘generalize’ at test to another? \n",
    "To answer this question, we can train the classifier on **Jane Eyre** and **Pride and Prejudice**, and test on **Vitella** and **Sense and Sensibility**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#counterbalance training and test sets with selections from novels, authors\n",
    "austen_train = document_file[0:10]  #pride and prejudice\n",
    "austen_test = document_file[10:20]  #sense and sensibility\n",
    "bronte_train = document_file[20:30] #jane eyre\n",
    "bronte_test = document_file[30:40]  #vitella\n",
    "\n",
    "train = austen_train + bronte_train\n",
    "test = austen_test + bronte_test\n",
    "documents =  train + test\n",
    "\n",
    "#define featuresets, training set, testing set, and classifier\n",
    "featuresets = [(document_features(d), c) for (d,c) in documents]\n",
    "train_set, test_set = featuresets[0:20], featuresets[20:40]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#classifier is still very accurate\n",
    "print nltk.classify.accuracy(classifier, test_set)\n",
    "0.95\n",
    "```\n",
    "Apparently this is not a hard task: Accuracy remains stable, at ~95%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is the classifier getting wrong and what is it getting right?\n",
    "To answer this question we can build a **confusion matrix**, with precision, recall and f-scores, which will allow us to rapidly identify where the classifier is fumbling. Since a confusion matrix relies on positives and negatives, the results can only be interpreted this way if one of the authors (e.g., Austen) is identified as the 'target', and the other (e.g., Bronte) as the lure. To get around this issue, we can alternate which one is the target, calculating the scores for each author in turn, and then comparing them. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def checkresults(corpus):\n",
    "    info_dict = {'TP':0,'TN':0,'FP':0,'FN':0}\n",
    "    print '\\t'+'\\t'+'Answer'+' '+'Guess'\n",
    "    for file in corpus:\n",
    "        correct = file[1]\n",
    "        guess = classifier.classify(document_features(file[0]))\n",
    "        if correct == guess:\n",
    "            if correct == 'austen':\n",
    "                info_dict['TP'] += 1\n",
    "                print 'True Positive:'+'\\t'+correct+' '+guess\n",
    "            elif correct == 'bronte':\n",
    "                info_dict['TN'] += 1\n",
    "                print 'True Negative:'+'\\t'+correct+' '+guess\n",
    "        elif correct != guess:\n",
    "            if correct == 'austen':\n",
    "                info_dict['FN'] += 1\n",
    "                print 'False Negative:'+'\\t'+correct+' '+guess\n",
    "            elif correct == 'bronte':\n",
    "                info_dict['FP'] += 1\n",
    "                print 'False Positive:'+'\\t'+correct+' '+guess\n",
    "    precision_a = info_dict['TP']/(info_dict['TP'] + info_dict['FP'])\n",
    "    recall_a = info_dict['TP']/(info_dict['TP'] + info_dict['FN'])\n",
    "    f_score_a = (2*precision_a*recall_a)/(precision_a+recall_a)\n",
    "    precision_b = info_dict['TN']/(info_dict['TN'] + info_dict['FN'])\n",
    "    recall_b = info_dict['TN']/(info_dict['TN'] + info_dict['FP'])\n",
    "    f_score_b = (2*precision_b*recall_b)/(precision_b+recall_b)\n",
    "    print 'The percentage of Austen guesses that were relevant: %4.2f' % float(precision_a)\n",
    "    print 'The percentage of Austen tags that we identified: %4.2f' % float(recall_a)\n",
    "    print 'The percentage of Bronte guesses that were relevant: %4.2f' % float(precision_b)\n",
    "    print 'The percentage of Bronte tags that we identified: %4.2f' % float(recall_a)\n",
    "    print 'Austen F Score: %4.2f' % float(f_score_a)\n",
    "    print 'Bronte F Score: %4.2f' % float(f_score_a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "checkresults(train)\n",
    "                Answer Guess\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "True Negative:\tbronte bronte\n",
    "The percentage of Austen guesses that were relevant: 1.0\n",
    "The percentage of Austen tags that we identified: 1.0\n",
    "The percentage of Bronte guesses that were relevant: 1.0\n",
    "The percentage of Bronte tags that we identified: 1.0\n",
    "Austen F Score: 1.0\n",
    "Bronte F Score: 1.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The classifier is already at ceiling. But what would it get wrong if we only trained it on one author and tested on both?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train = austen_train\n",
    "test = austen_test + bronte_test\n",
    "documents =  train + test\n",
    "\n",
    "#define featuresets, training set, testing set, and classifier\n",
    "featuresets = [(document_features(d), c) for (d,c) in documents]\n",
    "train_set, test_set = featuresets[0:10], featuresets[10:30]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#accuracy drops to 1/2\n",
    ">>> print 'Accuracy: %4.2f' % nltk.classify.accuracy(classifier, test_set)\n",
    "Accuracy: 0.50\n",
    "#run confusion matrix\n",
    ">>> checkresults(test)\n",
    "\t\t        Answer Guess\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "True Positive:\tausten austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "False Positive:\tbronte austen\n",
    "```\n",
    "Here, accuracy drops to 50%, because the classifier **overgeneralizes**, misclassifying every instance of Bronte as Austen."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What if we train the classifier on bigrams, rather than unigrams?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.collocations import BigramCollocationFinder\n",
    "from nltk.metrics import BigramAssocMeasures as BAM\n",
    "from itertools import chain\n",
    "\n",
    "def bigram_word_features(words, score_fn=BAM.chi_sq, n=200):\n",
    "    bigram_finder = BigramCollocationFinder.from_words(words)\n",
    "    bigrams = bigram_finder.nbest(score_fn, n)\n",
    "    return dict((bg, True) for bg in chain(words, bigrams))\n",
    "\n",
    "featuresets = [(bigram_word_features(d), c) for (d,c) in documents]\n",
    "train_set, test_set = featuresets[0:20], featuresets[20:40]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#produces perfect accuracy\n",
    ">>> print 'Accuracy: %4.2f' % nltk.classify.accuracy(classifier, test_set)\n",
    "Accuracy: 1.00\n",
    "```\n",
    "If bigrams are trained instead of unigrams, accuracy jumps back up to ceiling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What if we use a TF-IDF measure?\n",
    "\n",
    "Today, many **information retrieval** models make use of a document-level statistic **tf-idf** to identify the best keywords. tf-idf selects for terms that occur many times in a small number of documents, penalizing terms that occur only a few times in a document, or that occur in many documents. \n",
    "\n",
    "$idf(t,D)= -log_2\\frac{df_t}{D}$\n",
    "\n",
    "Notably, tf-idf incorporates both **term frequency** and **inverse document frequency**. Term frequency (TF)—i.e., a word’s frequency within a specific document—can be computed in multiple ways, such as a raw frequency count, a logarithmically-scaled frequency count, an adjusted count that normalizes for document length, and a Boolean value (1 if present in the document, 0 if absent).\n",
    "\n",
    "Here is a code snippet to get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Use a TF-IDF measure\n",
    "texts = []\n",
    "path = '/users/[username]/Desktop/corpus/'\n",
    "listing = os.listdir(path)\n",
    "for infile in listing:\n",
    "    if infile.startswith('.'):\n",
    "        continue              \n",
    "    url = path + infile\n",
    "    f = open(url);\n",
    "    raw = f.read()\n",
    "    f.close()\n",
    "    tokens = nltk.word_tokenize(raw)\n",
    "    text = nltk.Text(tokens)\n",
    "    texts.append(text)\n",
    "\n",
    "#load the texts into a textcollection object\n",
    "collection = nltk.TextCollection(texts)\n",
    "unique_terms = list(set(collection))\n",
    "\n",
    "def TFIDF(document):\n",
    "    word_tfidf = []\n",
    "    for word in unique_terms:\n",
    "        word_tfidf.append(collection.tf_idf(word,document))\n",
    "    return word_tfidf\n",
    "\n",
    "import numpy\n",
    "vectors = [numpy.array(TFIDF(f)) for f in texts]"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}