jasonost/Kaggle competition.ipynb

## Kaggle competition.ipynb
{
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Overview and discussion\n\nThis notebook contains most of the code we wrote to load and process the text data, develop features, and test various modeling algorithms and strategies. It also contains our final scoring algorithms for each submission we made in the Kaggle contest.\n\n#### Data loading and processing\n\nLoading the data was fairly straightforward, although we noticed some HTML-escaped characters in the text (e.g., \"&lt;\" instead of \"<\"). So we converted these back to regular ASCII characters before removing any instance of \"&lt;br&gt;\".\n\nWe attempted to identify the title of each question so we could give it more weight in our algorithms. This was simple when the title ended in a question mark (after which the text continued); in these cases, we just took anything up to that first question mark. If there was no mid-text question mark, we instead looked for the first occurrence of a question word (who, what, when, where, why, or how) in the middle of the text and took everything before that as the title. Any non-title sentence that exactly replicated the title was removed.\n\nAfter experimenting with various title weights, we ended up simply replicating the title once so that its words and other orthographic features were doubled in our count features.\n\n#### Feature engineering\n\nWe developed a large number of features, including:\n* total count of (non-punctuation) tokens, or \"words\"\n* average length of words\n* total count of stopwords\n* share of total words that are stopwords\n* total count and presence of numerals\n* total count and presence of various question words\n* total count of any type of punctuation\n* total count of question marks\n* total count of first person words, such as \"I\", \"me\", \"my\", and \"mine\"\n* total count of second person words, such as \"you\", \"your\", and \"yours\"\n* total counts of words tagged as nouns, adjectives, verbs, prepositions, or adverbs\n* most popular non-noun part-of-speech tag\n* most frequent hypernyms of tokens in the text\n\nWe also experimented with lemmatizing each token before feature or tf-idf matrix generation.\n\n#### Model testing\n\nWe tried a variety of models, mostly building off a tf-idf model in scikit-learn (\"sklearn\") that employed a stochastic gradient descent (SGD) version of support vector machines (our approach was inspired by [this sklearn example](http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html#example-grid-search-text-feature-extraction-py)). We performed cross validation for each model we tested (nominally 10-fold, although the cutoffs used in the code below yield nine folds per round), and ran each cross validation experiment ten times to ensure that our models were stable and that we weren't making decisions based purely on random variation in the development samples.\n\nThe basic tf-idf model with no lemmatization performed surprisingly well. By tuning a few parameters (removing stopwords from the tf-idf matrix and using a different loss function in the SGD model), it reached an average accuracy of around 55-56%. This was very difficult to improve upon.\n\nWe tried various sets of the features that we created, but usually these made the model perform worse. After finding the most important non-term-frequency features in our model, we pruned the extra features to just the top three: total count of words, total count of stopwords, and total count of question marks. These, in combination with a tf-idf matrix that used lemmatized tokens and from which stopwords were removed, were a slight improvement over the basic tf-idf model and ultimately seemed to be our best model (see below).\n\nWe tried a voting model, where we compared the prediction from the basic model and the lemmatized-with-additional-features model. If the models predicted different categories for a given document, the algorithm would compare the confidence each model had in its category prediction, and select the prediction with the higher confidence. This appeared to perform just slightly worse than the lemmatized-with-additional-features model, perhaps because it required using a slightly worse loss function in the SGD models (log loss rather than Huber loss) in order to get the prediction confidence levels.\n\nWe also tried adding common bigrams and trigrams into the lemmatized tf-idf model. After testing various thresholds, we took bigrams that appeared in between 0.5% and 5% of documents, and trigrams that appeared in between 0.2% and 2% of documents. Limiting the lower and upper bounds of (relative) frequency appeared to improve the predictive accuracy of the models by focusing attention on more *meaningful* bigrams and trigrams. But these models, even when adding in our additional features or combining in a voting algorithm with the simple tf-idf model, still did not appear to outperform the lemmatized-with-additional-features tf-idf model.\n\nFinally, we experimented with a number of additional algorithms, such as maximum entropy, naive Bayes, traditional support vector machines, and an algorithm we designed based on identifying category-specific collocations, yet none of these came close to our tf-idf models using SGD. We have not included any of this code here.\n\n#### Best model(?)\n\nThe model that performed consistently best on our cross-validation tests, and ultimately gave our highest score on the Kaggle leaderboard, was our [lemmatized-with-additional-features tf-idf model](http://nbviewer.ipython.org/gist/jasonost/14542d75059c0d46c8cc#-mark-tf-idf-on-lemmatized-tokens-along-with-top-additional-features-_best-model_-mark-). This model includes:\n* identifying the title of a question, removing any (exact) duplicate sentences in the question body, then duplicating the title so it will have twice the weight in our term counts and other frequency-based features\n* a tf-idf matrix of lemmatized words with stopwords removed, where the term frequency vectors are normalized prior to applying the inverse document frequency weights\n* three frequency features (each normalized to the range 0 to 1 before inclusion in the model): total word count, total stopword count, and total question mark count\n\nThe classification algorithm was support vector machines with stochastic gradient descent, using the Huber loss function.\n\nWe believe this is the best model based on the Kaggle test set accuracy, but the accuracy of other models we tried is very close. In particular, the version of this model that [also includes the top non-stopword hypernyms](http://nbviewer.ipython.org/gist/jasonost/14542d75059c0d46c8cc#lemmatized-top-features-and-top-two-hypernyms-for-each-non-stopword-possibly-best-model-) in the tf-idf matrix looks even better on our cross-validation development samples, and it would not be surprising to see this perform better on the remainder of the Kaggle test set. Put simply, we cannot be sure this model's top performance is not simply the result of some random variation, especially since a nontrivial number of questions themselves appear to have been miscategorized by the individuals who originally labeled them.\n\n#### Individual contributions\n\nWe divided the feature engineering and model development work equally. Chalenge created the features involving punctuation, first- and second-person words, part of speech tags, and hypernyms, while Jason generated features involving general word counts, stopwords, numerals, and question words. Chalenge tested maximum entropy and our custom collocations algorithm, and Jason tested naive Bayes and the tf-idf combined with SGD algorithm. We met a few times to discuss approaches and iterate on models and features."
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Importing modules"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# __future__ imports belong at the very top so that / is true division everywhere below\nfrom __future__ import division\n\n# basic NLP\nimport nltk, codecs, string, random, HTMLParser, math\nfrom collections import Counter\nfrom nltk.corpus import brown, wordnet as wn\n\n# scikit-learn\nfrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.svm import SVC\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.feature_selection import RFE\nimport scipy.sparse as sps\nimport numpy as np\n\n# shared helpers: HTML unescaper, sentence tokenizer, and English stopword set\nh= HTMLParser.HTMLParser()\nsent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')\nstopset = set(nltk.corpus.stopwords.words('english'))",
     "prompt_number": 12,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Data loading and processing"
    },
    {
     "metadata": {},
     "cell_type": "code",
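     "input": "# in each row, the category label is the first character and the question text starts at index 2\nwith codecs.open('train.txt', 'r', 'utf-8') as f:\n    raw = f.readlines()\n\nquestions = []\nfor row in raw:\n    # unescape HTML entities (e.g. &lt; becomes <), then replace literal <br> tags with spaces\n    text = h.unescape(row[2:]).replace('<br>',' ')\n    questions.append((text.strip('\\n'), row[0]))",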
     "prompt_number": 2,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Function to process question text into lists of lists of tokens"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def process_text(q):\n    '''\n    Procedure to split string into sentences, tokenize each sentence, find the question title, delete any later\n    sentences that duplicate the title, and append a second copy of the title\n    \n    q: string, such as that in the first position of the tuple returned when reading in the data\n    returns: a list of lists of tokens, with the title first and repeated as the last element\n    '''\n    new_q = [nltk.word_tokenize(s) for s in sent_tokenizer.tokenize(q)]\n    \n    # if there's only one sentence in the question, check to see if there's a question word mid-sentence\n    # if so, take everything before that word and make it the \"title\", otherwise the whole thing is a title\n    if len(new_q) == 1:\n        sent = new_q[0]\n        # search from position 1 so that a question word that also opens the sentence cannot yield an empty title\n        first_qword = min([sent.index(w, 1) if w in sent[1:] else 1e6 for w in ['who','what','when','where','why','how']])\n        new_q2 = []\n        if first_qword < 1e6:\n            new_q2.append(sent[:first_qword])\n            new_q2.append(sent[first_qword:])\n        else:\n            new_q2.append(sent)\n        new_q = new_q2\n\n    # remove any non-title sentences that are duplicates of the title\n    if len(new_q) > 1:\n        new_q2 = [new_q[0]] + [s for s in new_q[1:] if tuple(s) != tuple(new_q[0])]\n        new_q = new_q2\n    \n    # duplicate title sentence\n    new_q = new_q + [new_q[0]]\n    \n    return new_q",
     "prompt_number": 3,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Feature engineering"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "POS tagging"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#Function to build text tagger\ndef build_backoff_tagger(train_sents):\n    t0 = nltk.DefaultTagger('NN')\n    t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n    t2 = nltk.BigramTagger(train_sents, backoff=t1)\n    t3 = nltk.TrigramTagger(train_sents, backoff=t2)\n    return t3\n\n# Train a tagger on all of the Brown corpus\ntraining_data = brown.tagged_sents()\nngram_tagger = build_backoff_tagger(training_data)",
     "prompt_number": 4,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def tagged_counts(qtext):\n    '''\n    Function to count tokens by major part of speech, using the first two characters of each Brown tag.\n    Returns five (5) counts, in order: NN (nouns), JJ (adjectives), VB (verbs), IN (prepositions), RB (adverbs)\n    qtext: a list of lists of tokens, such as the result of process_text()\n    '''\n    # tag every sentence with the backoff tagger\n    tagged_text = []\n    for sent in qtext:\n        tagged_text += ngram_tagger.tag(sent)\n    \n    # count tags by their two-character prefix (e.g. NNS and NN-TL both count as NN)\n    tag_counts = Counter(tag[:2] for (word, tag) in tagged_text)\n    \n    # a Counter returns 0 for any tag that never occurs\n    return tag_counts['NN'], tag_counts['JJ'], tag_counts['VB'], tag_counts['IN'], tag_counts['RB']",
     "prompt_number": 5,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Wordnet hypernyms"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def gethypernyms(qtext):\n    '''\n    Function to collect the WordNet hypernyms of the non-stopword tokens in a question.\n    Returns the distinct hypernym names found, as the keys of an nltk.FreqDist.\n    qtext: a list of lists of tokens, such as the result of process_text()\n    '''\n    hypterms = []\n    syn_sets = []\n    wordlist = [word for sent in qtext for word in sent]\n    for word in wordlist:                  # for each term\n        # Convert the word to its base form with morphy; skip stopwords and words WordNet does not know\n        word_morphy = wn.morphy(word.lower())\n        if word_morphy and word_morphy not in stopset:\n            syn_sets += wn.synsets(word_morphy)[:2] # keep the first two synsets of each word\n\n    for syn in syn_sets:                      # for each synset\n        for hyp in syn.hypernyms():    # It has a list of hypernyms\n            hypterms += [hyp.name]  # hypernym name (an attribute in NLTK 2; a method, hyp.name(), in NLTK 3)\n\n    hypfd = nltk.FreqDist(hypterms)\n    return hypfd.keys()",
     "prompt_number": 6,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Main function to get features"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def ortho(qtext):\n    '''\n    Procedure to calculate for a given question text:\n        - total count of (non-punctuation) tokens, or \"words\"\n        - average length of words\n        - total count of stopwords\n        - share of total words that are stopwords\n        - total count and presence of numerals\n        - total count and presence of various question words\n        - total count of any type of punctuation\n        - total count of question marks\n        - presence of first person words, such as \"I\", \"me\", \"my\", and \"mine\"\n        - presence of second person words, such as \"you\", \"your\", and \"yours\"\n        - total counts of words tagged as nouns, adjectives, verbs, prepositions, or adverbs\n        - most popular non-noun part-of-speech tag\n    qtext: a list of lists of tokens, such as the result of process_text()\n    '''\n    \n    # pulling out basic word stats (a token counts as a word if it has at least one non-punctuation character)\n    actual_words = [word for sent in qtext for word in sent if sum(1 for c in word if c not in string.punctuation) > 0]\n    total_words = len(actual_words)\n    avg_length = sum(len(word) for word in actual_words) / total_words if total_words else 0\n    \n    # stopwords stats (stopset is the module-level set of English stopwords)\n    stopwords_count = len([word for word in actual_words if word.lower() in stopset])\n    share_stopwords = stopwords_count / total_words if total_words else 0\n    \n    # numerals\n    total_numerals = sum([1 for c in ''.join(actual_words) if c in '1234567890'])\n    any_numerals = 1 if total_numerals > 0 else 0\n    \n    features = {'total_words': total_words,\n                'average_length': avg_length,\n                'total_stopwords': stopwords_count,\n                'share_stopwords': share_stopwords,\n                'total_numerals': total_numerals,\n                'any_numerals': any_numerals}\n\n    for w in ['who','what','when','where','why','how']:\n        features[w + '_count'] = sum([1 for sent in qtext for word in sent if word == w])\n        features[w + '_any'] = max([1 if word == w else 0 for sent in qtext for word in sent])\n\n    # punctuations etc\n    features['punctuations'] = sum([sent.count(pmark) for sent in qtext for pmark in string.punctuation])\n    features['question_marks'] = sum([sent.count('?') for sent in qtext])\n\n    # first and second person words are matched on lowercased tokens so that \"I\" and \"You\" are found\n    lower_tokens = [word.lower() for sent in qtext for word in sent]\n    features['first_person'] = 1 if any(w in lower_tokens for w in ['i','me','my','mine']) else 0\n    features['second_person'] = 1 if any(w in lower_tokens for w in ['you','your','yours']) else 0\n\n    # part-of-speech counts, plus a flag for the most common non-noun tag\n    features['NN'], features['JJ'], features['VB'], features['IN'], features['RB'] = tagged_counts(qtext) \n    tags = ['JJ','VB','IN','RB']\n    for t in tags:\n        features['top_tag_' + t] = 1 if features[t] == max([features[tt] for tt in tags]) else 0\n    #features['hypernyms'] = gethypernyms(qtext)\n\n    return features",
     "prompt_number": 7,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Lemmatization"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "lmtzr = nltk.WordNetLemmatizer()\n\ndef lemtize(qtext):\n    tokens = [w for sent in qtext for w in sent]\n    for i in range(len(tokens)):\n        res = lmtzr.lemmatize(tokens[i])\n        if res == tokens[i]:\n            tokens[i] = lmtzr.lemmatize(tokens[i], 'v')\n        else:\n            tokens[i] = res\n    return tokens",
     "prompt_number": 8,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Procedure to create feature matrix for sklearn models"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# global variable for feature name array\nfeat_names = []\n\ndef feat_matrix(qlist, feature_list=None, norm=True):\n    '''\n    Procedure to turn a list of N strings into an NxM feature matrix, where M is the number of features generated by\n    the ortho() procedure. \n    \n    qlist: list of strings\n    feature_list: optional array of feature names to include in matrix; if None (the default), all features will be included\n    norm: boolean indicating whether to normalize features to range between 0 and 1 (default is to normalize)\n    '''\n    global feat_names\n\n    # list of features in order, since these are coming out as a dictionary\n    if feature_list:\n        feat_order = feature_list[:]\n        feat_names = feature_list[:]\n    else:\n        feat_order = []\n    \n    for i, q in enumerate(qlist):\n        # create list of list of tokens\n        proc_q = process_text(q)\n        \n        # get features and fill feat_order if necessary\n        feats = ortho(proc_q)\n        \n        # if first string processed, set feat_order and initialize matrix\n        # otherwise, add the array to the bottom of the matrix\n        if i == 0: \n            if not feat_order: \n                feat_order = feats.keys()\n                feat_names = feat_order[:]\n            feat_mat = np.array([[feats[f] for f in feat_order]])\n        else:\n            feat_mat = np.concatenate((feat_mat, np.array([[feats[f] for f in feat_order]])), axis=0)\n    \n    # normalize all feature columns to min of 0 and max of 1\n    if norm:\n        mf = feat_mat.T\n        feat_mat = np.array([(mf[i] - min(mf[i])) / (max(mf[i]) - min(mf[i])) for i in range(mf.shape[0])]).T\n    \n    return feat_mat        ",
     "prompt_number": 9,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Feature selection"
    },
    {
     "metadata": {},
     "cell_type": "code",
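     "input": "# the inner loop variable is named hyp, not h: Python 2 list comprehensions leak their variable,\n# and h must keep referring to the HTML parser created above\ntrain_X = [' '.join(lemtize(process_text(q)) + [hyp.replace('.','_') for hyp in gethypernyms(process_text(q))]) for q, c in questions]\ntrain_y = [c for q, c in questions]\n\ntrain_addX = feat_matrix(train_X)\ntfidf = TfidfVectorizer(stop_words=stopset)\ntrain_mat = tfidf.fit_transform(train_X, train_y)\n\n# append the additional feature columns to the right of the tf-idf columns\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\n\n# rank all features by recursive feature elimination, dropping one feature per step\nclf = SGDClassifier(loss='huber')\nfeat = RFE(estimator=clf, n_features_to_select=1, step=1)\nfeat.fit(train_mat_add, train_y)",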
     "prompt_number": 32,
     "outputs": [
      {
       "text": "RFE(estimator=SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,\n       fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',\n       loss='huber', n_iter=5, n_jobs=1, penalty='l2', power_t=0.5,\n       random_state=None, rho=None, shuffle=False, verbose=0,\n       warm_start=False),\n  estimator_params={}, n_features_to_select=1, step=1, verbose=0)",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 32
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "sorted(zip(feat_names,[r for i, r in enumerate(list(feat.ranking_)) if i >= train_mat.shape[1]]), key=lambda x: x[1])",
     "prompt_number": 33,
     "outputs": [
      {
       "text": "[('any_numerals', 9),\n ('total_words', 252),\n ('NN', 253),\n ('question_marks', 394),\n ('total_numerals', 421),\n ('second_person', 423),\n ('total_stopwords', 440),\n ('how_any', 1015),\n ('how_count', 1091),\n ('share_stopwords', 1140),\n ('top_tag_VB', 1186),\n ('IN', 1224),\n ('RB', 1227),\n ('why_count', 1885),\n ('VB', 2122),\n ('average_length', 2164),\n ('JJ', 2410),\n ('who_any', 2497),\n ('first_person', 2559),\n ('what_any', 2654),\n ('what_count', 2855),\n ('punctuations', 3124),\n ('where_any', 3829),\n ('when_count', 3961),\n ('why_any', 3986),\n ('top_tag_IN', 4153),\n ('where_count', 4197),\n ('top_tag_JJ', 5067),\n ('when_any', 5741),\n ('top_tag_RB', 6285),\n ('who_count', 6295)]",
       "output_type": "pyout",
       "metadata": {},
       "prompt_number": 33
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## tf-idf models"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Basic tf-idf"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# 10 rounds of trials; each round cross-validates over the 9 folds defined by the 10 cutoffs below\nfor j in range(10):\n    scores = []\n    qlist = questions[:]\n    random.shuffle(qlist)\n\n    cutoffs = np.linspace(0,len(qlist),10).astype(int)\n\n    for i in range(len(cutoffs)-1):\n        # hold out the slice between consecutive cutoffs and train on everything else\n        train_X = [q for q, c in qlist[:cutoffs[i]] + qlist[cutoffs[i+1]:]]\n        train_y = [c for q, c in qlist[:cutoffs[i]] + qlist[cutoffs[i+1]:]]\n        test_X = [q for q, c in qlist[cutoffs[i]:cutoffs[i+1]]]\n        test_y = [c for q, c in qlist[cutoffs[i]:cutoffs[i+1]]]\n\n        tfidf = TfidfVectorizer(stop_words=stopset)\n        train_mat = tfidf.fit_transform(train_X, train_y)\n        test_mat = tfidf.transform(test_X)\n\n        clf = SGDClassifier(loss='huber')\n        fin = clf.fit(train_mat, train_y)\n\n        scores.append(fin.score(test_mat, test_y))\n\n    # mean accuracy for the round, followed by the accuracy of each fold\n    print sum(scores) / len(scores)\n    print ('\\t%0.4f' * len(scores)) % tuple(scores)",
     "prompt_number": 12,
     "outputs": [
      {
       "output_type": "stream",
       "text": "0.558569305091\n\t0.5886\t0.5000\t0.5733\t0.5833\t0.5485\t0.5167\t0.5633\t0.5633\t0.5900\n0.55670506627",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5485\t0.5567\t0.6233\t0.5533\t0.5552\t0.5067\t0.5400\t0.5900\t0.5367\n0.563755728973",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5987\t0.5967\t0.4967\t0.5733\t0.5418\t0.5567\t0.5800\t0.5867\t0.5433\n0.557092778397",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5686\t0.5800\t0.5633\t0.5300\t0.5819\t0.5267\t0.5767\t0.5233\t0.5633\n0.554856930509",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5686\t0.5700\t0.6200\t0.5233\t0.5452\t0.5567\t0.5500\t0.5333\t0.5267\n0.560045831785",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5853\t0.5933\t0.5333\t0.5600\t0.5385\t0.5100\t0.5567\t0.5600\t0.6033\n0.55963334572",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5318\t0.5733\t0.5933\t0.5900\t0.4783\t0.5833\t0.5633\t0.5667\t0.5567\n0.552635946984",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5585\t0.5333\t0.4933\t0.6100\t0.5585\t0.6000\t0.5500\t0.5033\t0.5667\n0.552640901771",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5719\t0.5933\t0.5633\t0.5200\t0.5585\t0.5667\t0.5333\t0.5367\t0.5300\n0.553033568686",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6187\t0.5933\t0.5200\t0.5700\t0.5719\t0.5533\t0.4800\t0.5500\t0.5200\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "### <mark>tf-idf on lemmatized tokens, along with top additional features (_best model_)</mark>"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# 10 rounds of trials; each round cross-validates over the 9 folds defined by the 10 cutoffs below\nfor j in range(10):\n    scores = []\n    qlist = questions[:]\n    random.shuffle(qlist)\n    \n    # generating data sets\n    full_X = [' '.join(lemtize(process_text(q))) for q, c in qlist]\n    full_y = [c for q, c in qlist]\n\n    features = ['total_words',\n                'total_stopwords',\n                'question_marks']\n    full_addX = feat_matrix(full_X, feature_list=features)\n\n    # set cutoff values for cross-validation\n    cutoffs = np.linspace(0,len(qlist),10).astype(int)\n\n    for i in range(len(cutoffs)-1):\n        # hold out the slice between consecutive cutoffs and train on everything else\n        train_X = full_X[:cutoffs[i]] + full_X[cutoffs[i+1]:]\n        train_y = full_y[:cutoffs[i]] + full_y[cutoffs[i+1]:]\n        test_X = full_X[cutoffs[i]:cutoffs[i+1]]\n        test_y = full_y[cutoffs[i]:cutoffs[i+1]]\n\n        tfidf = TfidfVectorizer(stop_words=stopset)\n        train_mat = tfidf.fit_transform(train_X, train_y)\n        test_mat = tfidf.transform(test_X)\n\n        train_addX = np.concatenate((full_addX[:cutoffs[i]], full_addX[cutoffs[i+1]:]), axis=0)\n        test_addX = full_addX[cutoffs[i]:cutoffs[i+1]]\n\n        # append the additional feature columns to the right of the tf-idf columns\n        train_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\n        test_mat_add = sps.csr_matrix(np.concatenate((test_mat.toarray(), test_addX), axis=1))\n\n        clf = SGDClassifier(loss='huber')\n        fin = clf.fit(train_mat_add, train_y)\n\n        scores.append(fin.score(test_mat_add, test_y))\n\n    # mean accuracy for the round, followed by the accuracy of each fold\n    print sum(scores) / len(scores)\n    print ('\\t%0.4f' * len(scores)) % tuple(scores)",
     "prompt_number": 25,
     "outputs": [
      {
       "output_type": "stream",
       "text": "0.57819521863\n\t0.5753\t0.5767\t0.5267\t0.6033\t0.5518\t0.6333\t0.6167\t0.5600\t0.5600\n0.57042239564",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5819\t0.5467\t0.5967\t0.5567\t0.5585\t0.5700\t0.5900\t0.5700\t0.5633\n0.565961848136",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5284\t0.6233\t0.6067\t0.5267\t0.5686\t0.5633\t0.5300\t0.5700\t0.5767\n0.570787811223",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5619\t0.5933\t0.6133\t0.6100\t0.5652\t0.5567\t0.5833\t0.5100\t0.5433\n0.568590362938",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5585\t0.5567\t0.6000\t0.5467\t0.6355\t0.5433\t0.5667\t0.5633\t0.5467\n0.571163136381",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5886\t0.5300\t0.5567\t0.5867\t0.5518\t0.5833\t0.5567\t0.5867\t0.6000\n0.56710764276",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6488\t0.5433\t0.6067\t0.5667\t0.5418\t0.5367\t0.5133\t0.5733\t0.5733\n0.574117428465",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5485\t0.6000\t0.6000\t0.5733\t0.5686\t0.5600\t0.5767\t0.6033\t0.5367\n0.564856930509",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5452\t0.5933\t0.5767\t0.5400\t0.5686\t0.6100\t0.5833\t0.5100\t0.5567\n0.567092778397",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5719\t0.5600\t0.5633\t0.6000\t0.5786\t0.5533\t0.5367\t0.5667\t0.5733\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Combined voting of prior two algorithms"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# 10 rounds of trials; each round cross-validates over the 9 folds defined by the 10 cutoffs below\nfor j in range(10):\n    # combined voting\n    scores = []\n    qlist = questions[:]\n    random.shuffle(qlist)\n\n    # generating data sets\n    full_X = [' '.join(lemtize(process_text(q))) for q, c in qlist]\n    full_y = [c for q, c in qlist]\n\n    features = ['total_words',\n                'total_stopwords',\n                'question_marks']\n    full_addX = feat_matrix(full_X, feature_list=features)\n\n    cutoffs = np.linspace(0,len(qlist),10).astype(int)\n\n    for i in range(len(cutoffs) - 1):\n        # first model: basic tf-idf\n        train_X = [q for q, c in qlist[:cutoffs[i]] + qlist[cutoffs[i+1]:]]\n        train_y = [c for q, c in qlist[:cutoffs[i]] + qlist[cutoffs[i+1]:]]\n        test_X = [q for q, c in qlist[cutoffs[i]:cutoffs[i+1]]]\n        test_y = [c for q, c in qlist[cutoffs[i]:cutoffs[i+1]]]\n\n        tfidf = TfidfVectorizer(stop_words=stopset)\n        train_mat = tfidf.fit_transform(train_X, train_y)\n        test_mat = tfidf.transform(test_X)\n\n        # log loss is used here because it provides predict_proba\n        clf = SGDClassifier(loss='log')\n        fin = clf.fit(train_mat, train_y)\n\n        pred1 = fin.predict(test_mat)\n        pred1_prob = fin.predict_proba(test_mat)\n\n        # second model: lemmatized tf-idf with additional features\n        train_X = full_X[:cutoffs[i]] + full_X[cutoffs[i+1]:]\n        train_y = full_y[:cutoffs[i]] + full_y[cutoffs[i+1]:]\n        test_X = full_X[cutoffs[i]:cutoffs[i+1]]\n        test_y = full_y[cutoffs[i]:cutoffs[i+1]]\n\n        tfidf = TfidfVectorizer(stop_words=stopset)\n        train_mat = tfidf.fit_transform(train_X, train_y)\n        test_mat = tfidf.transform(test_X)\n\n        train_addX = np.concatenate((full_addX[:cutoffs[i]], full_addX[cutoffs[i+1]:]), axis=0)\n        test_addX = full_addX[cutoffs[i]:cutoffs[i+1]]\n\n        train_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\n        test_mat_add = sps.csr_matrix(np.concatenate((test_mat.toarray(), test_addX), axis=1))\n\n        clf = SGDClassifier(loss='log')\n        fin = clf.fit(train_mat_add, train_y)\n\n        pred2 = fin.predict(test_mat_add)\n        pred2_prob = fin.predict_proba(test_mat_add)\n\n        # when the models disagree, keep the prediction whose probability is largest relative to the\n        # average probability of the other classes (the indexing assumes integer class labels starting at 1)\n        guesses = []\n        for k in range(len(pred1)):\n            if pred1[k] == pred2[k]:\n                guesses.append(pred1[k])\n            else:\n                avg_nonpred1 = sum([pred1_prob[k][n] for n in range(len(pred1_prob[k])) if n != int(pred1[k]) - 1]) / \\\n                                (len(pred1_prob[k]) - 1)\n                avg_nonpred2 = sum([pred2_prob[k][n] for n in range(len(pred2_prob[k])) if n != int(pred2[k]) - 1]) / \\\n                                (len(pred2_prob[k]) - 1)\n                if pred1_prob[k][int(pred1[k]) - 1] / avg_nonpred1 > pred2_prob[k][int(pred2[k]) - 1] / avg_nonpred2:\n                    guesses.append(pred1[k])\n                else:\n                    guesses.append(pred2[k])\n\n        scores.append(sum([1 for k in range(len(test_y)) if test_y[k] == guesses[k]]) / len(test_y))\n\n    # mean accuracy for the round, followed by the accuracy of each fold\n    print sum(scores) / len(scores)\n    print ('\\t%0.4f' * len(scores)) % tuple(scores)",
     "prompt_number": 17,
     "outputs": [
      {
       "output_type": "stream",
       "text": "0.562270531401\n\t0.5686\t0.5733\t0.5933\t0.5367\t0.5619\t0.5967\t0.5433\t0.5600\t0.5267\n0.563374210331",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5552\t0.5433\t0.5333\t0.6000\t0.5552\t0.5967\t0.5533\t0.5500\t0.5833\n0.574136008919",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6254\t0.5900\t0.5700\t0.5633\t0.5418\t0.5700\t0.6133\t0.5533\t0.5400\n0.565956893348",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5251\t0.5767\t0.5700\t0.6000\t0.5585\t0.6033\t0.5567\t0.5267\t0.5767\n0.579680416202",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5585\t0.5733\t0.5633\t0.5267\t0.5786\t0.6233\t0.5467\t0.6100\t0.6367\n0.572637185681",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5619\t0.5833\t0.5833\t0.5333\t0.5585\t0.5467\t0.5900\t0.6333\t0.5633\n0.571533506751",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5719\t0.5833\t0.6033\t0.5900\t0.5686\t0.5433\t0.5500\t0.5667\t0.5667\n0.5633865973",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5987\t0.5067\t0.5467\t0.5733\t0.5452\t0.5633\t0.5900\t0.5633\t0.5833\n0.56782484826",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5619\t0.5300\t0.5667\t0.6267\t0.5652\t0.5367\t0.5567\t0.5933\t0.5733\n0.565254552211",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5719\t0.5133\t0.5433\t0.5567\t0.6154\t0.5700\t0.5833\t0.5833\t0.5500\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Including bigrams and trigrams in tf-idf matrix"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "for j in range(10):\n    scores = []\n    qlist = questions[:]\n    random.shuffle(qlist)\n\n    cutoffs = np.linspace(0,len(qlist),10).astype(int)\n\n    X = [' '.join(lemtize(process_text(q))) for q, c in qlist]\n    y = [c for q, c in qlist]\n\n    cv = CountVectorizer(stop_words=stopset)\n    cv.fit(X)\n    vecs = cv.transform(X)\n\n    bigrams = [[bg for sent in process_text(q) \n                for bg in nltk.bigrams([w for w in sent if w[0] not in string.punctuation])]\n                   for q, c in qlist]\n\n    bigram_freq = nltk.FreqDist([b for q in bigrams for b in q])\n    bigram_set = [b for b,c in bigram_freq.items() if c < len(qlist) * 0.05 and c >= len(qlist) * 0.005]\n    bigram_array = np.asarray([np.asarray([d.count(b) for b in bigram_set]) for d in bigrams])\n    \n    trigrams = [[bg for sent in process_text(q) \n                 for bg in nltk.trigrams([w for w in sent if w[0] not in string.punctuation])]\n                    for q, c in qlist]\n\n    trigram_freq = nltk.FreqDist([b for q in trigrams for b in q])\n    trigram_set = [b for b,c in trigram_freq.items() if c < len(qlist) * 0.02 and c >= len(qlist) * 0.002]\n    trigram_array = np.asarray([np.asarray([d.count(b) for b in trigram_set]) for d in trigrams])\n    \n    count_mat = sps.csr_matrix(np.concatenate((vecs.toarray(), bigram_array, trigram_array), axis=1))\n    \n    tfidf = TfidfTransformer()\n    tfidf_mat = tfidf.fit_transform(count_mat).toarray()\n\n    for i in range(len(cutoffs)-1):\n        train_mat = sps.csr_matrix(np.concatenate((tfidf_mat[:cutoffs[i]], tfidf_mat[cutoffs[i+1]:])))\n        test_mat = sps.csr_matrix(tfidf_mat[cutoffs[i]:cutoffs[i+1]])\n\n        train_y = y[:cutoffs[i]] + y[cutoffs[i+1]:]\n        test_y = y[cutoffs[i]:cutoffs[i+1]]\n\n        clf = SGDClassifier(loss='huber')\n        fin = clf.fit(train_mat, train_y)\n\n        scores.append(fin.score(test_mat, test_y))\n\n    print sum(scores) / len(scores)\n    print ('\\t%0.4f' * 
9) % tuple(scores)",
     "prompt_number": 15,
     "outputs": [
      {
       "output_type": "stream",
       "text": "0.564121144556\n\t0.5853\t0.5700\t0.5867\t0.5267\t0.5418\t0.5733\t0.5300\t0.5500\t0.6133\n0.571528551963",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6120\t0.5633\t0.5700\t0.6100\t0.5151\t0.5900\t0.5667\t0.5567\t0.5600\n0.566345844172",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5652\t0.5633\t0.5700\t0.5733\t0.5686\t0.5600\t0.5467\t0.5733\t0.5767\n0.557452000495",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5753\t0.5600\t0.5867\t0.5600\t0.5452\t0.5667\t0.5167\t0.5167\t0.5900\n0.567826086957",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5418\t0.5667\t0.5733\t0.5833\t0.5886\t0.5100\t0.5900\t0.5900\t0.5667\n0.565999009042",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6321\t0.5700\t0.5367\t0.6200\t0.5652\t0.5400\t0.5667\t0.5233\t0.5400\n0.568564350303",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5151\t0.6100\t0.5533\t0.5367\t0.6087\t0.5633\t0.5800\t0.5567\t0.5933\n0.565252074817",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5953\t0.5467\t0.5667\t0.5167\t0.5853\t0.5700\t0.5367\t0.6067\t0.5633\n0.568929765886",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5318\t0.5833\t0.5933\t0.5033\t0.5786\t0.5633\t0.5900\t0.5833\t0.5933\n0.565233494364",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5819\t0.4800\t0.5700\t0.5733\t0.5485\t0.6267\t0.6200\t0.5367\t0.5500\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Bigrams and trigrams tf-idf with top additional features"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "for j in range(10):\n    scores = []\n    qlist = questions[:]\n    random.shuffle(qlist)\n\n    cutoffs = np.linspace(0,len(qlist),10).astype(int)\n\n    X = [' '.join(lemtize(process_text(q))) for q, c in qlist]\n    y = [c for q, c in qlist]\n\n    features = ['total_words',\n                'total_stopwords',\n                'question_marks']\n    full_addX = feat_matrix(X, feature_list=features)\n\n    cv = CountVectorizer(stop_words=stopset)\n    cv.fit(X)\n    vecs = cv.transform(X)\n\n    bigrams = [[bg for sent in process_text(q) \n                for bg in nltk.bigrams([w for w in sent if w[0] not in string.punctuation])]\n                   for q, c in qlist]\n\n    bigram_freq = nltk.FreqDist([b for q in bigrams for b in q])\n    bigram_set = [b for b,c in bigram_freq.items() if c < len(qlist) * 0.05 and c >= len(qlist) * 0.005]\n    bigram_array = np.asarray([np.asarray([d.count(b) for b in bigram_set]) for d in bigrams])\n    \n    trigrams = [[bg for sent in process_text(q) \n                 for bg in nltk.trigrams([w for w in sent if w[0] not in string.punctuation])]\n                    for q, c in qlist]\n\n    trigram_freq = nltk.FreqDist([b for q in trigrams for b in q])\n    trigram_set = [b for b,c in trigram_freq.items() if c < len(qlist) * 0.02 and c >= len(qlist) * 0.002]\n    trigram_array = np.asarray([np.asarray([d.count(b) for b in trigram_set]) for d in trigrams])\n    \n    count_mat = sps.csr_matrix(np.concatenate((vecs.toarray(), bigram_array, trigram_array), axis=1))\n    \n    tfidf = TfidfTransformer()\n    tfidf_mat = np.concatenate((tfidf.fit_transform(count_mat).toarray(), full_addX), axis=1)\n\n    for i in range(len(cutoffs)-1):\n        train_mat = sps.csr_matrix(np.concatenate((tfidf_mat[:cutoffs[i]], tfidf_mat[cutoffs[i+1]:])))\n        test_mat = sps.csr_matrix(tfidf_mat[cutoffs[i]:cutoffs[i+1]])\n\n        train_y = y[:cutoffs[i]] + y[cutoffs[i+1]:]\n        test_y = y[cutoffs[i]:cutoffs[i+1]]\n\n       
 clf = SGDClassifier(loss='huber')\n        fin = clf.fit(train_mat, train_y)\n\n        scores.append(fin.score(test_mat, test_y))\n\n    print sum(scores) / len(scores)\n    print ('\\t%0.4f' * 9) % tuple(scores)",
     "prompt_number": 16,
     "outputs": [
      {
       "output_type": "stream",
       "text": "0.561885296668\n\t0.5351\t0.5600\t0.6033\t0.5600\t0.5552\t0.5867\t0.5367\t0.5533\t0.5667\n0.565608819522",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5886\t0.5600\t0.5867\t0.5733\t0.5552\t0.5467\t0.5467\t0.5600\t0.5733\n0.559682893596",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6020\t0.5233\t0.5333\t0.5900\t0.5418\t0.5533\t0.5333\t0.5767\t0.5833\n0.5563346959",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5385\t0.5133\t0.5833\t0.5533\t0.5652\t0.5500\t0.5500\t0.5800\t0.5733\n0.573369255543",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5418\t0.5867\t0.5133\t0.5767\t0.5552\t0.6133\t0.5967\t0.6033\t0.5733\n0.570056980057",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5452\t0.5933\t0.6300\t0.5567\t0.6087\t0.5100\t0.6133\t0.5267\t0.5467\n0.56337049424",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5452\t0.6067\t0.5967\t0.5633\t0.5552\t0.5567\t0.5800\t0.5467\t0.5200\n0.566358231141",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6020\t0.5667\t0.5433\t0.5500\t0.5652\t0.5400\t0.5567\t0.5900\t0.5833\n0.572256905735",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5184\t0.5633\t0.5500\t0.6067\t0.5753\t0.6133\t0.5700\t0.5533\t0.6000\n0.561548371114",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5819\t0.5800\t0.5467\t0.5400\t0.5987\t0.5967\t0.5000\t0.5433\t0.5667\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "##### Lemmatized, top features, and top two hypernyms for each non-stopword (possibly best model)"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# 10 rounds of trials, each with 10-fold cross-validation\nfor j in range(10):\n    scores = []\n    qlist = questions[:]\n    random.shuffle(qlist)\n    \n    # generating data sets\n    full_X = [' '.join(lemtize(process_text(q)) + [h.replace('.','_') for h in gethypernyms(process_text(q))]) for q, c in qlist]\n    full_y = [c for q, c in qlist]\n\n    features = ['total_words',\n                'total_stopwords',\n                'total_numerals',\n                'question_marks']\n    full_addX = feat_matrix(full_X, feature_list=features)\n\n    # set cutoff values for cross-validation\n    cutoffs = np.linspace(0,len(qlist),10).astype(int)\n\n    for i in range(len(cutoffs)-1):\n        train_X = full_X[:cutoffs[i]] + full_X[cutoffs[i+1]:]\n        train_y = full_y[:cutoffs[i]] + full_y[cutoffs[i+1]:]\n        test_X = full_X[cutoffs[i]:cutoffs[i+1]]\n        test_y = full_y[cutoffs[i]:cutoffs[i+1]]\n\n        tfidf = TfidfVectorizer(stop_words=stopset)\n        train_mat = tfidf.fit_transform(train_X, train_y)\n        test_mat = tfidf.transform(test_X)\n\n        train_addX = np.concatenate((full_addX[:cutoffs[i]], full_addX[cutoffs[i+1]:]), axis=0)\n        test_addX = full_addX[cutoffs[i]:cutoffs[i+1]]\n\n        train_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\n        test_mat_add = sps.csr_matrix(np.concatenate((test_mat.toarray(), test_addX), axis=1))\n\n        clf = SGDClassifier(loss='huber')\n        fin = clf.fit(train_mat_add, train_y)\n\n        scores.append(fin.score(test_mat_add, test_y))\n\n    print sum(scores) / len(scores)\n    print ('\\t%0.4f' * 9) % tuple(scores)",
     "prompt_number": 21,
     "outputs": [
      {
       "output_type": "stream",
       "text": "0.573381642512\n\t0.5719\t0.5867\t0.5700\t0.5767\t0.5585\t0.5467\t0.5967\t0.5633\t0.5900\n0.583781741608",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6388\t0.5600\t0.6333\t0.5600\t0.5719\t0.5667\t0.5533\t0.5700\t0.6000\n0.575624922581",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5686\t0.5400\t0.5567\t0.5833\t0.6187\t0.5867\t0.5967\t0.5667\t0.5633\n0.580063173541",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6154\t0.5767\t0.5767\t0.5600\t0.5552\t0.5400\t0.6633\t0.5300\t0.6033\n0.583013749535",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5418\t0.5467\t0.6200\t0.6067\t0.5953\t0.5567\t0.5733\t0.6033\t0.6033\n0.576352037656",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5786\t0.5967\t0.5233\t0.6067\t0.5719\t0.5400\t0.6167\t0.5633\t0.5900\n0.581902638424",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5518\t0.6233\t0.6400\t0.5567\t0.5853\t0.5733\t0.5300\t0.5533\t0.6233\n0.571906354515",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5819\t0.5700\t0.5400\t0.5867\t0.5652\t0.6100\t0.5433\t0.5933\t0.5567\n0.581917502787",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.6020\t0.5667\t0.6067\t0.5833\t0.5753\t0.6200\t0.5633\t0.5767\t0.5433\n0.583764399851",
       "stream": "stdout"
      },
      {
       "output_type": "stream",
       "text": "\n\t0.5719\t0.6133\t0.5200\t0.5567\t0.5920\t0.6300\t0.5633\t0.5633\t0.6433\n",
       "stream": "stdout"
      }
     ],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Score output"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "raw = codecs.open('test.csv', 'r', 'utf-8').readlines()\ntestq = []\nfor row in raw:\n    if not row.startswith('Id,'):\n        text = h.unescape(row[row.index(',')+1:]).replace('<br>',' ')\n        testq.append((text.strip('\\n'), row[:row.index(',')]))",
     "prompt_number": 13,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 1"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [q for q, c in questions]\ntrain_y = [c for q, c in questions]\n\ntfidf = TfidfVectorizer(max_df=0.75)\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform([q for q, i in testq])\n\nclf = SGDClassifier()\nfin = clf.fit(train_mat, train_y)\n\ntest_guess = fin.predict(apply_mat)",
     "prompt_number": 19,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141029_1.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 20,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 2"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [q for q, c in questions]\ntrain_y = [c for q, c in questions]\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform([q for q, i in testq])\n\nclf = SGDClassifier(loss='huber')\nfin = clf.fit(train_mat, train_y)\n\ntest_guess = fin.predict(apply_mat)",
     "prompt_number": 156,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141029_2.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 158,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 3"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [q for q, c in questions]\ntrain_y = [c for q, c in questions]\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform([q for q, i in testq])\n\nclf = SGDClassifier(loss='log')\nfin = clf.fit(train_mat, train_y)\n\ntest_guess = fin.predict(apply_mat)",
     "prompt_number": 64,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141029_3.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 65,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 5"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "#k in ['total_stopwords','total_words','NN','share_stopwords','IN']\n\ntrain_X = [' '.join(lemtize(process_text(q))) for q, c in questions]\ntrain_y = [c for q, c in questions]\ntrain_addX = feat_matrix(train_X)\n\napply_X = [' '.join(lemtize(process_text(q))) for q, i in testq]\napply_addX = feat_matrix(apply_X)\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\napply_mat = tfidf.transform(apply_X)\napply_mat_add = sps.csr_matrix(np.concatenate((apply_mat.toarray(), apply_addX), axis=1))\n\nclf = SGDClassifier(loss='huber')\nfin = clf.fit(train_mat_add, train_y)\n\ntest_guess = fin.predict(apply_mat_add)",
     "prompt_number": 103,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141031_5.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 105,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 6"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# first vote: basic tf-idf\ntrain_X = [q for q, c in questions]\ntrain_y = [c for q, c in questions]\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform([q for q, i in testq])\n\nclf = SGDClassifier(loss='log')\nfin = clf.fit(train_mat, train_y)\n\ntest_guess1 = fin.predict(apply_mat)\ntest_guess1_prob = fin.predict_proba(apply_mat)\n\n# second vote: enhanced lemmatized tf-idf\ntrain_X = [' '.join(lemtize(process_text(q))) for q, c in questions]\ntrain_y = [c for q, c in questions]\ntrain_addX = feat_matrix(train_X)\n\napply_X = [' '.join(lemtize(process_text(q))) for q, i in testq]\napply_addX = feat_matrix(apply_X)\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\napply_mat = tfidf.transform(apply_X)\napply_mat_add = sps.csr_matrix(np.concatenate((apply_mat.toarray(), apply_addX), axis=1))\n\nclf = SGDClassifier(loss='log')\nfin = clf.fit(train_mat_add, train_y)\n\ntest_guess2 = fin.predict(apply_mat_add)\ntest_guess2_prob = fin.predict_proba(apply_mat_add)",
     "prompt_number": 108,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# combining votes\nguess_final = []\nfor i in range(len(test_guess1)):\n    if test_guess1[i] == test_guess2[i]:\n        guess_final.append(test_guess1[i])\n    elif test_guess1_prob[i][int(test_guess1[i]) - 1] > test_guess2_prob[i][int(test_guess2[i]) - 1]:\n        guess_final.append(test_guess1[i])\n    else:\n        guess_final.append(test_guess2[i])",
     "prompt_number": 121,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141031_6.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],guess_final[i]))\nf.close()",
     "prompt_number": 122,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 7"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [' '.join(lemtize(process_text(q))) for q, c in questions]\ntrain_y = [c for q, c in questions]\ntrain_addX = feat_matrix(train_X)\n\napply_X = [' '.join(lemtize(process_text(q))) for q, i in testq]\napply_addX = feat_matrix(apply_X)\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\napply_mat = tfidf.transform(apply_X)\napply_mat_add = sps.csr_matrix(np.concatenate((apply_mat.toarray(), apply_addX), axis=1))\n\nclf = SGDClassifier(loss='huber')\nfin = clf.fit(train_mat_add, train_y)\n\ntest_guess = fin.predict(apply_mat_add)",
     "prompt_number": 29,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141031_7.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 31,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 8"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# first vote: basic tf-idf\ntrain_X = [q for q, c in questions]\ntrain_y = [c for q, c in questions]\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform([q for q, i in testq])\n\nclf = SGDClassifier(loss='log')\nfin = clf.fit(train_mat, train_y)\n\ntest_guess1 = fin.predict(apply_mat)\ntest_guess1_prob = fin.predict_proba(apply_mat)\n\n# second vote: enhanced lemmatized tf-idf\ntrain_X = [' '.join(lemtize(process_text(q))) for q, c in questions]\ntrain_y = [c for q, c in questions]\ntrain_addX = feat_matrix(train_X)\n\napply_X = [' '.join(lemtize(process_text(q))) for q, i in testq]\napply_addX = feat_matrix(apply_X)\n\ntfidf = TfidfVectorizer()\ntrain_mat = tfidf.fit_transform(train_X, train_y)\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\napply_mat = tfidf.transform(apply_X)\napply_mat_add = sps.csr_matrix(np.concatenate((apply_mat.toarray(), apply_addX), axis=1))\n\nclf = SGDClassifier(loss='log')\nfin = clf.fit(train_mat_add, train_y)\n\ntest_guess2 = fin.predict(apply_mat_add)\ntest_guess2_prob = fin.predict_proba(apply_mat_add)",
     "prompt_number": 32,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "# combining votes\nguess_final = []\nfor i in range(len(test_guess1)):\n    if test_guess1[i] == test_guess2[i]:\n        guess_final.append(test_guess1[i])\n    elif test_guess1_prob[i][int(test_guess1[i]) - 1] > test_guess2_prob[i][int(test_guess2[i]) - 1]:\n        guess_final.append(test_guess1[i])\n    else:\n        guess_final.append(test_guess2[i])",
     "prompt_number": 33,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141031_8.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],guess_final[i]))\nf.close()",
     "prompt_number": 34,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 9"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [' '.join(lemtize(process_text(q))) for q, c in questions]\ntrain_y = [c for q, c in questions]\napply_X = [' '.join(lemtize(process_text(q))) for q, i in testq]\n\n# creating unigram count vectors\ncv = CountVectorizer(stop_words=stopset)\ncv.fit(train_X)\nvecs = cv.transform(train_X)\nvecs_apply = cv.transform(apply_X)\n\n# adding bigrams (that appear in between 0.5% and 5% of documents)\nbigrams = [[bg for sent in process_text(q) \n            for bg in nltk.bigrams([w for w in sent if w[0] not in string.punctuation])]\n               for q, c in questions]\nbigrams_apply = [[bg for sent in process_text(q) \n                    for bg in nltk.bigrams([w for w in sent if w[0] not in string.punctuation])]\n                       for q, i in testq]\n\nbigram_freq = nltk.FreqDist([b for q in bigrams for b in q])\nbigram_set = [b for b,c in bigram_freq.items() if c < len(questions) * 0.05 and c >= len(questions) * 0.005]\nbigram_array = np.asarray([np.asarray([d.count(b) for b in bigram_set]) for d in bigrams])\nbigram_array_apply = np.asarray([np.asarray([d.count(b) for b in bigram_set]) for d in bigrams_apply])\n\n# adding trigrams (that appear in between 0.2% and 2% of documents)\ntrigrams = [[bg for sent in process_text(q) \n             for bg in nltk.trigrams([w for w in sent if w[0] not in string.punctuation])]\n                for q, c in questions]\ntrigrams_apply = [[bg for sent in process_text(q) \n                     for bg in nltk.trigrams([w for w in sent if w[0] not in string.punctuation])]\n                        for q, i in testq]\n\ntrigram_freq = nltk.FreqDist([b for q in trigrams for b in q])\ntrigram_set = [b for b,c in trigram_freq.items() if c < len(questions) * 0.02 and c >= len(questions) * 0.002]\ntrigram_array = np.asarray([np.asarray([d.count(b) for b in trigram_set]) for d in trigrams])\ntrigram_array_apply = np.asarray([np.asarray([d.count(b) for b in trigram_set]) for d in trigrams_apply])\n\n# concatenating 
and converting to sparse matrices for tf-idf transform (normalized vectors and idf transformation)\ncount_mat = sps.csr_matrix(np.concatenate((vecs.toarray(), bigram_array, trigram_array), axis=1))\ncount_mat_apply = sps.csr_matrix(np.concatenate((vecs_apply.toarray(), bigram_array_apply, trigram_array_apply), axis=1))\n\ntfidf = TfidfTransformer()\ntrain_mat = tfidf.fit_transform(count_mat)\n# reuse the idf weights fitted on the training counts; refitting on the test set would change the feature scaling\napply_mat = tfidf.transform(count_mat_apply)\n\nclf = SGDClassifier(loss='huber')\nfin = clf.fit(train_mat, train_y)\n\ntest_guess = fin.predict(apply_mat)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141031_9.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 24,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 10"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [' '.join(lemtize(process_text(q))) for q, c in questions]\ntrain_y = [c for q, c in questions]\n\napply_X = [' '.join(lemtize(process_text(q))) for q, i in testq]\n\nfeatures = ['total_words',\n            'total_stopwords',\n            'question_marks']\ntrain_addX = feat_matrix(train_X, feature_list=features)\napply_addX = feat_matrix(apply_X, feature_list=features)\n\ntfidf = TfidfVectorizer(stop_words=stopset)\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform(apply_X)\n\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\napply_mat_add = sps.csr_matrix(np.concatenate((apply_mat.toarray(), apply_addX), axis=1))\n\nclf = SGDClassifier(loss='huber')\nfin = clf.fit(train_mat_add, train_y)\n\ntest_guess = fin.predict(apply_mat_add)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141031_10.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 27,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Round 11"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "train_X = [' '.join(lemtize(process_text(q)) + [h.replace('.','_') for h in gethypernyms(process_text(q))]) for q, c in questions]\ntrain_y = [c for q, c in questions]\n\napply_X = [' '.join(lemtize(process_text(q)) + [h.replace('.','_') for h in gethypernyms(process_text(q))]) for q, i in testq]\n\nfeatures = ['total_words',\n            'total_stopwords',\n            'total_numerals',\n            'question_marks']\ntrain_addX = feat_matrix(train_X, feature_list=features)\napply_addX = feat_matrix(apply_X, feature_list=features)\n\ntfidf = TfidfVectorizer(stop_words=stopset)\ntrain_mat = tfidf.fit_transform(train_X, train_y)\napply_mat = tfidf.transform(apply_X)\n\ntrain_mat_add = sps.csr_matrix(np.concatenate((train_mat.toarray(), train_addX), axis=1))\napply_mat_add = sps.csr_matrix(np.concatenate((apply_mat.toarray(), apply_addX), axis=1))\n\nclf = SGDClassifier(loss='huber')\nfin = clf.fit(train_mat_add, train_y)\n\ntest_guess = fin.predict(apply_mat_add)",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "f = open('submit20141102_11.csv','w')\nf.write('Id,Category\\n')\nfor i in range(len(testq)):\n    f.write('%s,%s\\n' % (testq[i][1],test_guess[i]))\nf.close()",
     "prompt_number": 15,
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:d7a739b653943ff3b0853a500c684d03cae55c23940687bcf71db3adee1a90e2",
  "gist_id": "14542d75059c0d46c8cc"
 },
 "nbformat": 3
}