fayeip/Kaggle-NLP.ipynb

## Kaggle-NLP.ipynb
{
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "<b>Strategy:</b> <br>\nThe strategy that produced our best score is the SVM Classifier using TF-IDF bigrams as features. Other classifiers we have tried is Naïve Bayes and MaxEnt, with TF-IDF unigrams and bigrams. We used the built-in classifier functions from scikit-learn, with helper functions from NLTK. \n\n<b>Things that worked: </b> <br>\nJanine’s code for preprocessing text (removing punctuation, removing stopwords, lemmatizing) was instrumental in boosting our accuracy scores. This preprocessing was done before passing the training data through the TfidfVectorizer for more preprocessing and tokenizing. \n\n<b>Things that didn’t work: </b> <br>\nWe tried stemming the words with the Porter Stemmer, but that did not increase our scores.<br>\nWe tried using the Decision Tree Classifier in scikit-learn, but our data proved too sparse for this classifier. <br>\nWe had an interesting problem with the OrderedDict data structure. We learned the hard way that the OrderedDict treats identical keys as the same key, hence overwriting earlier versions while keeping the original order. Since we are using questions as keys, this leads to insidious errors when there are two identical questions in the data set. <br>\nWe tried using a spell checker to further normalize the words in the preprocessing phase, but this did not help scores (it actually hurt our scores somewhat). \n\n \n"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import nltk \nfrom nltk.corpus import gutenberg\nimport random\nfrom random import shuffle \nfrom collections import Counter\nfrom nltk.stem.wordnet import WordNetLemmatizer\nfrom nltk.tokenize.punkt import PunktWordTokenizer\nimport nltk.tag, nltk.data\nfrom nltk.corpus import wordnet\nfrom nltk.stem.porter import PorterStemmer\nimport numpy as np\nfrom sklearn.naive_bayes import MultinomialNB\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.svm import LinearSVC\nfrom sklearn import metrics\nfrom operator import itemgetter\nfrom sklearn.metrics import classification_report\nimport csv\nimport os\nimport collections\nimport operator\nfrom collections import OrderedDict\nimport string\nimport re\nfrom nltk.corpus import reuters\nfrom nltk.corpus import brown",
     "prompt_number": 219,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#function to break up the data into training and test data\ndef prep_data():\n    f = open('data/train.txt','rb')\n    train_raw = f.read()\n    f.close() \n    train_split = train_raw.split('\\n')\n    train_tuples = [ ( line[2:], line[0]) for line in train_split if line != '']\n    shuffle(train_tuples)\n    total_size = len(train_tuples)\n    train_size = int(total_size * 0.9) \n    return train_tuples[:train_size], train_tuples[train_size:]",
     "prompt_number": 220,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#get the test and training sets, print the size\nyahoo_train, yahoo_test = prep_data()\nprint 'The training set size is ' + str(len(yahoo_train))\nprint 'The test set size is ' + str(len(yahoo_test)) ",
     "prompt_number": 221,
     "outputs": [
      {
       "output_type": "stream",
       "text": "The training set size is 2428\nThe test set size is 270\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#transform the training set into a dict, get the keys and vals for the tet and train\ndef get_train_and_test_dicts(yahoo_train_dict, yahoo_test_dict):\n    yahoo_train_keys = yahoo_train_dict.keys()\n    #yahoo_train_keys = [unicode(word) for word in yahoo_train_keys]\n    yahoo_train_vals = yahoo_train_dict.values()\n    yahoo_test_keys = yahoo_test_dict.keys()\n    #yahoo_test_keys = [unicode(word) for word in yahoo_test_keys ]\n    yahoo_test_vals = yahoo_test_dict.values()\n    return np.array(yahoo_train_keys), np.array(yahoo_train_vals), np.array(yahoo_test_keys), np.array(yahoo_test_vals)",
     "prompt_number": 222,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#Created a vocabulary from the training data (the fit in .fit_transform())\n#then turn the words into a TF-IDF weighted word vector(the transform in .fit_transform())\n#Convert a collection of raw documents to a matrix of TF-IDF features\n\n#lol. Don't necessarily need to do preprocessing- Scipylearn can do all the preprocessing for you\n#want to use tfidf vectors to just grab the words that are most relevant\n#Convert a collection of raw documents to a matrix of TF-IDF features.\n#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n\nvectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', strip_accents='unicode')\n",
     "prompt_number": 223,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#split the test and train into numpy arrays so they can be vectorized and trained with\nyahoo_train_dict  =  dict(yahoo_train)\nyahoo_test_dict  =  dict(yahoo_test)\ntrain_words, train_cats,  test_words,  test_cats = get_train_and_test_dicts(yahoo_train_dict, yahoo_test_dict)",
     "prompt_number": 224,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# print train_cats[1]\n# print train_words[1]\n# print test_cats[1]\n# print test_words[1]",
     "prompt_number": 225,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#example of how the vectorizer works\ntest_string = unicode(train_words[0])\nprint \"Example string: \" + test_string\nprint \"Preprocessed string: \" + vectorizer.build_preprocessor()(test_string)\nprint \"Tokenized string:\" + str(vectorizer.build_tokenizer()(test_string))\nprint \"N-gram data string:\" + str(vectorizer.build_analyzer()(test_string))\nprint \"\\n\"\n\n#Created a vocabulary from the training data (the fit in .fit_transform())\",\n#then turn the words into a TF-IDF weighted word vector(the transform in .fit_transform())\"\nX_train_words = vectorizer.fit_transform(train_words)\n\n#transformed the test data into a TF-IDF weighted word vector in the vocab space of the training data(.transform())\nX_test_words = vectorizer.transform(test_words)\n",
     "prompt_number": 226,
     "outputs": [
      {
       "output_type": "stream",
       "text": "Example string: in reference to the saying what goes up must come down; as the universe expands must'nt it also collapse? \nPreprocessed string: in reference to the saying what goes up must come down; as the universe expands must'nt it also collapse? \nTokenized string:[u'in', u'reference', u'to', u'the', u'saying', u'what', u'goes', u'up', u'must', u'come', u'down', u'as', u'the', u'universe', u'expands', u'must', u'nt', u'it', u'also', u'collapse']\nN-gram data string:[u'reference', u'saying', u'goes', u'come', u'universe', u'expands', u'nt', u'collapse', u'reference saying', u'saying goes', u'goes come', u'come universe', u'universe expands', u'expands nt', u'nt collapse']\n\n\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#function to evaluate results of each model\n# code parts inspired from this presentation: http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/#1\ndef evaluate_results(model_name, test_cats, predicted, myclassifier, myvectorizer, cv_scores, prwords = True):\n        print model_name\n        print 'The precision for this classifier is ' + str(metrics.precision_score(test_cats, predicted))\n        print 'The recall for this classifier is ' + str(metrics.recall_score(test_cats, predicted))\n        print 'The f1 for this classifier is ' + str(metrics.f1_score(test_cats, predicted))\n        print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(test_cats, predicted))\n        print '\\nHere is the classification report:'\n        print classification_report(test_cats, predicted)\n        print '\\nHere is the confusion matrix:'\n        print metrics.confusion_matrix(test_cats, predicted)\n        print \"\\nCross-validation scores: \" \n        print cv_scores \n        ##print the top 10 words for each category\n        if prwords == True:\n            N = 10\n            vocabulary = np.array([t for t, i in sorted(myvectorizer.vocabulary_.iteritems(), key=itemgetter(1))])\n            for i, label in enumerate(set(test_cats)):\n                topN = np.argsort(myclassifier.coef_[i])[-N:]\n                print \"\\nThe top %d most informative features for topic code %s: \\n%s\" % (N, label, \" \".join(vocabulary[topN]))\n        print \"\\n\"",
     "prompt_number": 227,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "##bayes classifier\ndef get_bayes(X_train_words, train_cats, X_test_words):\n    #build the classifier\n    bayes_classifier = MultinomialNB().fit(X_train_words,train_cats )\n    yahoo_bayes_predicted = bayes_classifier.predict(X_test_words)\n    scores = cross_validation.cross_val_score(bayes_classifier, X_train_words, train_cats, cv=5)\n    return bayes_classifier, yahoo_bayes_predicted, scores",
     "prompt_number": 228,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#SVM classifier\ndef get_svm(X_train_words, train_cats, X_test_words):\n    svm_classifier = LinearSVC().fit(X_train_words, train_cats)\n    yahoo_svm_predicted = svm_classifier.predict(X_test_words)\n    scores = cross_validation.cross_val_score(svm_classifier, X_train_words, train_cats, cv=5)\n    return svm_classifier, yahoo_svm_predicted, scores",
     "prompt_number": 229,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "###now try max ent as a classifier\ndef get_maxent(X_train_words, train_cats, X_test_words):\n    maxent_classifier = LogisticRegression().fit(X_train_words, train_cats)\n    yahoo_maxent_predicted = maxent_classifier.predict(X_test_words)\n    scores = cross_validation.cross_val_score(maxent_classifier, X_train_words, train_cats, cv=5)\n    return maxent_classifier, yahoo_maxent_predicted, scores ",
     "prompt_number": 230,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#get the bayes results \nbayes_classifier, yahoo_bayes_predicted, bayes_cv_scores = get_bayes(X_train_words, train_cats, X_test_words )\nbayes_model_name =  \"MODEL: Multinomial Naive Bayes\"\nevaluate_results(bayes_model_name, test_cats, yahoo_bayes_predicted, bayes_classifier, vectorizer, bayes_cv_scores)\n#then get the SVM results:\nprint \"*****\"\nsvm_classifier, yahoo_svm_predicted, svm_cv_scores = get_svm(X_train_words, train_cats, X_test_words )\nsvm_model_name = \"MODEL: Linear SVC\"\nevaluate_results(svm_model_name, test_cats, yahoo_svm_predicted, svm_classifier, vectorizer, svm_cv_scores)\nprint \"*****\"\n#then max ent results\nmaxent_classifier, yahoo_maxent_predicted, maxent_cv_scores = get_maxent(X_train_words, train_cats, X_test_words )\nmaxent_model_name = \"MODEL: Maximum Entropy\"\nevaluate_results(maxent_model_name, test_cats, yahoo_maxent_predicted, maxent_classifier, vectorizer, maxent_cv_scores)",
     "prompt_number": 231,
     "outputs": [
      {
       "output_type": "stream",
       "text": "MODEL: Multinomial Naive Bayes\nThe precision for this classifier is 0.408151312741\nThe recall for this classifier is 0.366666666667\nThe f1 for this classifier is 0.283308958503\nThe accuracy for this classifier is 0.366666666667\n\nHere is the classification report:\n             precision    recall  f1-score   support\n\n          1       0.30      0.91      0.45        75\n          2       0.69      0.44      0.54        45\n          3       0.70      0.17      0.27        42\n          4       0.80      0.12      0.21        34\n          5       0.00      0.00      0.00        19\n          6       0.00      0.00      0.00        25\n          7       0.00      0.00      0.00        30\n\navg / total       0.41      0.37      0.28       270\n\n\nHere is the confusion matrix:\n[[68  4  2  1  0  0  0]\n [25 20  0  0  0  0  0]\n [31  4  7  0  0  0  0]\n [30  0  0  4  0  0  0]\n [17  1  1  0  0  0  0]\n [25  0  0  0  0  0  0]\n [30  0  0  0  0  0  0]]\n\nCross-validation scores: \n[ 0.36363636  0.35743802  0.36157025  0.35330579  0.35123967]\n\nThe top 10 most informative features for topic code 1: \nxa credit money want like people does know yahoo best\n\nThe top 10 most informative features for topic code 3: \nfree software internet windows web best use computer xa yahoo\n\nThe top 10 most informative features for topic code 2: \nrock like tv music best did movie favorite song xa",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\nThe top 10 most informative features for topic code 5: \nsex want know women boyfriend friend guy like girl love\n\nThe top 10 most informative features for topic code 4: \nworld did does language good need words college word school\n\nThe top 10 most informative features for topic code 7: \nsurgery diet bad cold smoking know way pain does best\n\nThe top 10 most informative features for topic code 6: \nnumber moon stars planet possible gas xa earth world does\n\n\n*****\nMODEL: Linear SVC",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\nThe precision for this classifier is 0.578776474296\nThe recall for this classifier is 0.562962962963\nThe f1 for this classifier is 0.557371402402\nThe accuracy for this classifier is 0.562962962963\n\nHere is the classification report:\n             precision    recall  f1-score   support\n\n          1       0.46      0.61      0.52        75\n          2       0.74      0.76      0.75        45\n          3       0.62      0.60      0.61        42\n          4       0.57      0.62      0.59        34\n          5       0.29      0.21      0.24        19\n          6       0.68      0.52      0.59        25\n          7       0.69      0.30      0.42        30\n\navg / total       0.58      0.56      0.56       270\n\n\nHere is the confusion matrix:\n[[46  5  8  9  3  2  2]\n [ 9 34  1  1  0  0  0]\n [10  5 25  0  2  0  0]\n [11  0  1 21  0  1  0]\n [ 6  1  5  1  4  1  1]\n [ 7  1  0  2  1 13  1]\n [12  0  0  3  4  2  9]]\n\nCross-validation scores: \n[ 0.51652893  0.5392562   0.57644628  0.5392562   0.52479339]\n\nThe top 10 most informative features for topic code 1: \nbusiness bank rich change buy stock usa job credit money",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\nThe top 10 most informative features for topic code 3: \nlaptop java web linux page software windows internet use computer\n\nThe top 10 most informative features for topic code 2: \nfilm episode movies favorite magazine rock music tv movie song\n\nThe top 10 most informative features for topic code 5: \nex women boyfriend guy date relationship girl friend marriage love\n\nThe top 10 most informative features for topic code 4: \nlargest words university atom colleges education study word school college\n\nThe top 10 most informative features for topic code 7: \naids treatment dandruff tamiflu curl weight smoking surgery pain diet\n\nThe top 10 most informative features for topic code 6: \npaupisi acid sky moon earth theory math universe planet stars\n\n\n*****\nMODEL: Maximum Entropy",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\nThe precision for this classifier is 0.628769973987\nThe recall for this classifier is 0.451851851852\nThe f1 for this classifier is 0.410805632041\nThe accuracy for this classifier is 0.451851851852\n\nHere is the classification report:\n             precision    recall  f1-score   support\n\n          1       0.34      0.83      0.48        75\n          2       0.64      0.56      0.60        45\n          3       0.80      0.38      0.52        42\n          4       0.70      0.41      0.52        34\n          5       0.50      0.05      0.10        19\n          6       0.75      0.12      0.21        25\n          7       1.00      0.03      0.06        30\n\navg / total       0.63      0.45      0.41       270\n\n\nHere is the confusion matrix:\n[[62  7  3  3  0  0  0]\n [20 25  0  0  0  0  0]\n [22  4 16  0  0  0  0]\n [20  0  0 14  0  0  0]\n [16  1  1  0  1  0  0]\n [17  2  0  3  0  3  0]\n [27  0  0  0  1  1  1]]\n\nCross-validation scores: \n[ 0.39256198  0.41528926  0.41735537  0.39256198  0.38429752]\n\nThe top 10 most informative features for topic code 1: \nhome stock change business rich people buy job credit money\n\nThe top 10 most informative features for topic code 3: \nxp page linux software web internet windows yahoo use computer",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\nThe top 10 most informative features for topic code 2: \nmovies magazine did xa rock tv music favorite movie song\n\nThe top 10 most informative features for topic code 5: \ngirls marriage relationship date women boyfriend friend guy girl love\n\nThe top 10 most informative features for topic code 4: \nschools spanish colleges education language study words word college school\n\nThe top 10 most informative features for topic code 7: \ncauses blood cure fight rid bad smoking surgery diet pain\n\nThe top 10 most informative features for topic code 6: \neyes math world sky universe moon gas planet stars earth\n\n\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#want to see if preprocessing collection helps accuracy. ",
     "prompt_number": 232,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#spellchecker! tries to correct mispelled words\n#adapted from http://norvig.com/spell-correct.html\ndef words(text): return re.findall('[a-z]+', text.lower()) \n\ndef train(features):\n    model = collections.defaultdict(lambda: 1)\n    for f in features:\n        model[f] += 1\n    return model\n\ndef get_txt_words():\n     mystring = ''\n     gstring = \" \".join(reuters.words())\n     mystring =  mystring + gstring\n     gstring = \" \".join(brown.words())\n     mystring =  mystring + gstring\n     return mystring\n\nNWORDS = train(words(get_txt_words()))\n\nalphabet = 'abcdefghijklmnopqrstuvwxyz'\n\ndef edits1(word):\n   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n   deletes    = [a + b[1:] for a, b in splits if b]\n   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]\n   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]\n   inserts    = [a + c + b     for a, b in splits for c in alphabet]\n   return set(deletes + transposes + replaces + inserts)\n\ndef known_edits2(word):\n    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)\n\ndef known(words): return set(w for w in words if w in NWORDS)\n\ndef correct(word):\n    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n    return max(candidates, key=NWORDS.get)",
     "prompt_number": 233,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#Class to pre-process the collection\n#class is full of goodies/functions for preprocessing data- remove stopwords,stemm, punk, and convert to lower, lemmatize \nclass PreprocessText:\n\n    #function to remove punct. \n    def remove_punct(self, text):\n      exclude = set(string.punctuation)\n      table = string.maketrans(\"\",\"\")\n      text = text.translate(table, string.punctuation)\n      return text\n\n    #remove stopwords-> A quick way to reduce elminate words that aren't valid key words.\n    def removestopwords(self, tokens):\n      stopwords = nltk.corpus.stopwords.words('english')\n      tokens = [w for w in tokens if w.lower().strip() not in stopwords]\n      return tokens\n\n    ##lemmatize the words to reduce dimensionality. Also,option to do lemmatization based on POS. \n    #wordnet lemmatizer assumes everything is a noun unless otherwise specified, so we need to give\n    #it the wordnet pos if we don't want the default noun lookup. \n    def lemmatize(self, tokens, lemmatize_pos):\n        def get_wordnet_pos( pos_tag):\n            if pos_tag[1].startswith('J'):\n                return (pos_tag[0], wordnet.ADJ)\n            elif pos_tag[1].startswith('V'):\n                return (pos_tag[0], wordnet.VERB)\n            elif pos_tag[1].startswith('N'):\n                return (pos_tag[0], wordnet.NOUN)\n            elif pos_tag[1].startswith('R'):\n                return (pos_tag[0], wordnet.ADV)\n            else:\n                return (pos_tag[0], wordnet.NOUN)\n        lemmatizer = WordNetLemmatizer()\n        if lemmatize_pos:\n            tokens_pos = nltk.tag.pos_tag(tokens)\n            tokens_pos_wordnet = [ get_wordnet_pos(token) for token in tokens_pos]\n            tokens_lemm = [lemmatizer.lemmatize(token[0], token[1]) for token in tokens_pos_wordnet]\n        else:\n            tokens_lemm = [lemmatizer.lemmatize(token) for token in tokens] \n        return tokens_lemm\n    \n    #function that combines above functions in one routine\n    #lots of args to specify what preprocessing routine you want to use\n    def preprocess_txt(self, text, convertlower=True, nopunk=True, stopwords=True, lemmatize_doc=True, lemmatize_pos=True, stemmed=False, correct_wd = True):\n      #convert to lower\n      if convertlower:\n        text = text.lower()\n      # remove punctuation\n      if nopunk:\n        text = self.remove_punct(text)\n      #tokenize text\n      tokens = PunktWordTokenizer().tokenize(text)\n      #remove extra whitespaces\n      tokens = [token.strip() for token in tokens]\n      if correct_wd:\n          tokens = [correct(token) for token in tokens]\n      #remove stopwords\n      if stopwords:\n        tokens = self.removestopwords(tokens)\n      #lemmatize\n      if lemmatize_doc:\n        tokens = self.lemmatize(tokens,lemmatize_pos)\n      #stem\n      if stemmed:\n        porter = PorterStemmer()\n        tokens = [ porter.stem(token) for token in tokens ]\n      #combine the tokens back into a string...need to do this for the tfidf vectorizer\n      token_line = \" \".join(tokens)\n      return token_line\n",
     "prompt_number": 234,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "##now try to pre-process the text before using the classifiers\npt = PreprocessText()\nprint yahoo_train[:1]\n#spell checker didn't work- need more training corpuss for the spell checker....\n#also spell checker slows down the pre-processing. \nyahoo_train_pr = [(pt.preprocess_txt(item[0], stemmed=False, correct_wd=False), item[1]) for item in yahoo_train ]\nprint  yahoo_train_pr[:1]\nyahoo_test_pr = [(pt.preprocess_txt(item[0], stemmed=False, correct_wd = False), item[1]) for item in yahoo_test ]\nyahoo_train_dict_pr  =  OrderedDict(yahoo_train_pr)\nyahoo_test_dict_pr  =  OrderedDict(yahoo_test_pr)\ntrain_words_pr, train_cats_pr, test_words_pr, test_cats_pr = get_train_and_test_dicts(yahoo_train_dict_pr, yahoo_test_dict_pr)\n#create new vectorizer, this time without stop words since we removed them already\nvectorizer_pr = TfidfVectorizer(ngram_range=(1, 2), strip_accents='unicode')\nX_train_words_pr = vectorizer_pr.fit_transform(train_words_pr)\n#transformed the test data into a TF-IDF weighted word vector in the vocab space of the training data(.transform())\nX_test_words_pr = vectorizer_pr.transform(test_words_pr)\n",
     "prompt_number": 235,
     "outputs": [
      {
       "output_type": "stream",
       "text": "[(\"i am looking for a movie theater in the los angeles area that don't mind the sounds of crying babies.? \", '3')]\n[('look movie theater los angeles area dont mind sound cry baby', '3')]",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#get the bayes resuts \nbayes_classifier_pr, yahoo_bayes_predicted_pr, bayes_cv_scores_pr = get_bayes(X_train_words_pr, train_cats_pr, X_test_words_pr )\nbayes_model_name_pr =  \"MODEL: Multinomial Naive Bayes\"\nevaluate_results(bayes_model_name_pr, test_cats_pr, yahoo_bayes_predicted_pr, bayes_classifier_pr, vectorizer_pr, bayes_cv_scores_pr)\n#then get the SVM results:\nprint \"*****\"\nsvm_classifier_pr, yahoo_svm_predicted_pr, svm_cv_scores_pr = get_svm(X_train_words_pr, train_cats_pr, X_test_words_pr )\nsvm_model_name_pr = \"MODEL: Linear SVC\"\nevaluate_results(svm_model_name_pr, test_cats_pr, yahoo_svm_predicted_pr, svm_classifier_pr, vectorizer_pr, svm_cv_scores_pr)\nprint \"*****\"\n#then max ent results\nmaxent_classifier_pr, yahoo_maxent_predicted_pr, maxent_cv_scores_pr = get_maxent(X_train_words_pr, train_cats_pr, X_test_words_pr )\nmaxent_model_name_pr = \"MODEL: Maximum Entropy\"\nevaluate_results(maxent_model_name_pr, test_cats_pr, yahoo_maxent_predicted_pr, maxent_classifier_pr, vectorizer_pr, maxent_cv_scores_pr)",
     "prompt_number": 236,
     "outputs": [
      {
       "output_type": "stream",
       "text": "MODEL: Multinomial Naive Bayes\nThe precision for this classifier is 0.385193065406\nThe recall for this classifier is 0.337037037037\nThe f1 for this classifier is 0.243863242335\nThe accuracy for this classifier is 0.337037037037\n\nHere is the classification report:\n             precision    recall  f1-score   support\n\n          1       0.29      0.91      0.44        75\n          2       0.64      0.36      0.46        45\n          3       0.67      0.10      0.17        42\n          4       0.75      0.09      0.16        34\n          5       0.00      0.00      0.00        19\n          6       0.00      0.00      0.00        25\n          7       0.00      0.00      0.00        30\n\navg / total       0.39      0.34      0.24       270\n\n\nHere is the confusion matrix:\n[[68  5  1  1  0  0  0]\n [29 16  0  0  0  0  0]\n [35  3  4  0  0  0  0]\n [31  0  0  3  0  0  0]\n [17  1  1  0  0  0  0]\n [25  0  0  0  0  0  0]\n [30  0  0  0  0  0  0]]\n\nCross-validation scores: \n[ 0.36157025  0.34575569  0.36438923  0.33954451  0.34782609]\n\nThe top 10 most informative features for topic code 1: \nyahoo make want would people like know best find get\n\nThe top 10 most informative features for topic code 3: \nemail site software internet get web best use yahoo computer\n\nThe top 10 most informative features for topic code 2: \nthink get first like show music best favorite song movie\n\nThe top 10 most informative features for topic code 5: \nrelationship date get want friend guy like woman girl love\n\nThe top 10 most informative features for topic code 4: \nsomeone good find study need go mean college word school",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\nThe top 10 most informative features for topic code 7: \ngood help bad cold diet know way pain best get\n\nThe top 10 most informative features for topic code 6: \nacid possible gas make work moon number world earth many\n\n\n*****\nMODEL: Linear SVC",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\nThe precision for this classifier is 0.568485556735\nThe recall for this classifier is 0.562962962963\nThe f1 for this classifier is 0.558371568277\nThe accuracy for this classifier is 0.562962962963\n\nHere is the classification report:\n             precision    recall  f1-score   support\n\n          1       0.46      0.56      0.50        75\n          2       0.69      0.76      0.72        45\n          3       0.64      0.60      0.62        42\n          4       0.63      0.71      0.67        34\n          5       0.35      0.32      0.33        19\n          6       0.61      0.44      0.51        25\n          7       0.59      0.33      0.43        30\n\navg / total       0.57      0.56      0.56       270\n\n\nHere is the confusion matrix:\n[[42  8  9  9  3  1  3]\n [ 6 34  1  2  1  1  0]\n [12  2 25  0  1  0  2]\n [ 9  0  0 24  0  1  0]\n [ 6  3  2  1  6  1  0]\n [ 7  1  1  2  1 11  2]\n [10  1  1  0  5  3 10]]\n\nCross-validation scores: \n[ 0.57024793  0.57763975  0.54451346  0.53623188  0.5320911 ]\n\nThe top 10 most informative features for topic code 1: \nsanta favorite color house bank business tax credit stock job money",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\nThe top 10 most informative features for topic code 3: \nweb file linux window software page yahoo internet use computer\n\nThe top 10 most informative features for topic code 2: \ncelebrity tv film show favorite rock magazine music movie song\n\nThe top 10 most informative features for topic code 5: \nfamily friend date guy boyfriend marriage relationship woman girl love\n\nThe top 10 most informative features for topic code 4: \nlanguage fast education atom handwritingwizardcom university study word school college\n\nThe top 10 most informative features for topic code 7: \nbad disease tamiflu scurl weight aid treatment surgery diet pain\n\nThe top 10 most informative features for topic code 6: \nsyndicalist paupisi cubit planet gas earth universe acid math moon\n\n\n*****\nMODEL: Maximum Entropy",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\nThe precision for this classifier is 0.612243670224\nThe recall for this classifier is 0.459259259259\nThe f1 for this classifier is 0.419022110779\nThe accuracy for this classifier is 0.459259259259\n\nHere is the classification report:\n             precision    recall  f1-score   support\n\n          1       0.34      0.81      0.48        75\n          2       0.68      0.60      0.64        45\n          3       0.75      0.36      0.48        42\n          4       0.62      0.44      0.52        34\n          5       0.33      0.05      0.09        19\n          6       0.80      0.16      0.27        25\n          7       1.00      0.03      0.06        30\n\navg / total       0.61      0.46      0.42       270\n\n\nHere is the confusion matrix:\n[[61  6  3  5  0  0  0]\n [17 27  0  1  0  0  0]\n [24  3 15  0  0  0  0]\n [19  0  0 15  0  0  0]\n [16  1  1  0  1  0  0]\n [15  2  1  3  0  4  0]\n [25  1  0  0  2  1  1]]\n\nCross-validation scores: \n[ 0.43181818  0.40165631  0.41200828  0.41821946  0.4057971 ]\n\nThe top 10 most informative features for topic code 1: \ntax company buy stock people business find credit job money",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\nThe top 10 most informative features for topic code 3: \nfile linux window page web software internet yahoo use computer\n\nThe top 10 most informative features for topic code 2: \ncelebrity first magazine tv rock show music favorite song movie\n\nThe top 10 most informative features for topic code 5: \nlike marriage date boyfriend relationship friend guy woman girl love\n\nThe top 10 most informative features for topic code 4: \ndegree spanish university education high language study word college school\n\nThe top 10 most informative features for topic code 7: \nfight cold smoking weight aid bad treatment surgery diet pain\n\nThe top 10 most informative features for topic code 6: \nhuman sky math planet universe acid number gas moon earth\n\n\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# Displaying some errors to see what's going on. \ndef print_errors(predicted_pr, test_dict_pr, test_words_pr, length):\n    for i in range(length):\n        if predicted_pr[i] != test_dict_pr[test_words_pr[i]]:\n            print test_words_pr[i]\n            print \"predicted: \" + predicted_pr[i]\n            print \"correct: \" + test_dict_pr[test_words_pr[i]]\n\nprint \"Bayes errors: \"\nprint_errors(yahoo_bayes_predicted_pr, yahoo_test_dict_pr, test_words_pr, 5)\nprint \"*******\"\nprint \"SVM errors: \"\nprint_errors(yahoo_svm_predicted_pr, yahoo_test_dict_pr, test_words_pr, 5)\nprint \"*******\"\nprint \"MaxEnt errors: \"\nprint_errors(yahoo_maxent_predicted_pr, yahoo_test_dict_pr, test_words_pr, 5)",
     "prompt_number": 237,
     "outputs": [
      {
       "output_type": "stream",
       "text": "Bayes errors: \nneed lap top 60000 need prior christmaswhere go\npredicted: 1\ncorrect: 2\ninfer reality class inclusion\npredicted: 1\ncorrect: 5\nwife lie break promise much\npredicted: 1\ncorrect: 4\nnickname georgia besides peach state\npredicted: 1\ncorrect: 7\n*******\nSVM errors: \nneed lap top 60000 need prior christmaswhere go\npredicted: 5\ncorrect: 2\nnickname georgia besides peach state\npredicted: 1\ncorrect: 7\n*******\nMaxEnt errors: \nneed lap top 60000 need prior christmaswhere go\npredicted: 1\ncorrect: 2\ninfer reality class inclusion\npredicted: 1\ncorrect: 5\nnickname georgia besides peach state\npredicted: 1\ncorrect: 7\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#get the actual test data from kaggle\ndef prep_kaggle_testdata():\n    f = open('data/test.csv','rb')\n    test_kaggle_raw = f.read()\n    f.close() \n    test_kaggle_split_lines = test_kaggle_raw.split('\\n')\n    test_kaggle_split =  [line.split(\",\") for line in  test_kaggle_split_lines]\n    test_kaggle_tuples  = [ ( line[0], \" \".join(line[1:]) )  for line in test_kaggle_split]\n    return test_kaggle_tuples[1: ]   ",
     "prompt_number": 238,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "kaggle_test = prep_kaggle_testdata()\nkaggle_test_pr = [(pt.preprocess_txt(item[0], stemmed=False, correct_wd = False), item[1]) for item in kaggle_test ]",
     "prompt_number": 239,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def get_kaggle_test(kaggle_test):\n    kaggle_test_keys =  np.array(kaggle_test.keys())\n    kaggle_test_vals =  np.array(kaggle_test.values())\n    return kaggle_test_keys, kaggle_test_vals",
     "prompt_number": 240,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "##prepare the kaggle test words- turn into a ordered dict \nkaggle_test_dict_pr =  OrderedDict(kaggle_test_pr)\nkaggle_test_number_pr, kaggle_test_words_pr =  get_kaggle_test(kaggle_test_dict_pr)\n#transformed the test data into a TF-IDF weighted word vector in the vocab space of the training data(.transform())\nX_kaggle_test_words_pr = vectorizer_pr.transform(kaggle_test_words_pr)\nkaggle_svm_predicted = svm_classifier_pr.predict(X_kaggle_test_words_pr)\n",
     "prompt_number": 241,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "print kaggle_test_pr[726]\nprint kaggle_test_words_pr[726]\nprint kaggle_test_number_pr[726]",
     "prompt_number": 242,
     "outputs": [
      {
       "output_type": "stream",
       "text": "('727', 'what is your favorite color? ')\nwhat is your favorite color? \n727\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#transformed the test data into a TF-IDF weighted word vector in the vocab space of the training data(.transform())\nX_kaggle_test_words_pr = vectorizer_pr.transform(kaggle_test_words_pr)\nkaggle_svm_predicted_pr = svm_classifier_pr.predict(X_kaggle_test_words_pr)",
     "prompt_number": 243,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#now combined the values back together again....\nkaggle_finished_dict =  zip(kaggle_test_number_pr, kaggle_svm_predicted_pr )\n#print the results into a file: \n\n#delete the keys with no items\nf = open('results_3.csv', 'wb')\nf.write (\"Id,Category\\n\")\nfor k,v in kaggle_finished_dict:\n    f.write(k +\",\" + v + \"\\n\")\nf.close()",
     "prompt_number": 244,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "prompt_number": 244,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "prompt_number": 244,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:010641d07fc4cc4cdadd0672bb3e406502171fb0e0391c34c43f852f7ed6b2f2"
 },
 "nbformat": 3
}