rjweiss/Python_classification

## Python_classification
{
 "metadata": {
  "name": "Python_classification.ipynb"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Document-level text analysis\n",
      "\n",
      "Document-level analysis treats the whole text (an article, for example) as the unit of interest, rather than tokens such as sentences or words. The most basic example is labeling documents according to some classification scheme, hence **text classification**. When you don't know your scheme ahead of time, or you want to explore a large set of data, you can try **topic modeling**.\n",
      "\n",
      "We're going to go over a couple of examples of document-level text analysis using some of the most common classifier models. We'll walk through the code to train your own model and discuss the results we see.\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Supervised learning: Text classification in Python\n",
      "\n",
      "We're going to go over examples of how to use the excellent [scikit-learn](http://scikit-learn.org/stable/) library to train some text classifiers.\n",
      "\n",
      "The dataset consists of the titles and topic codes from the `NYTimes` dataset that comes with the RTextTools library in `R`. Each record is a title from NYTimes front-page news, paired with a topic code assigned according to [Amber Boydstun's classification scheme](http://www.policyagendas.org/sites/policyagendas.org/files/Boydstun_NYT_FrontPage_Codebook_0.pdf).\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.naive_bayes import MultinomialNB\n",
      "from sklearn.feature_extraction.text import TfidfVectorizer\n",
      "from sklearn import metrics\n",
      "from operator import itemgetter\n",
      "from sklearn.metrics import classification_report\n",
      "import csv\n",
      "import os\n",
      "import numpy as np\n",
      "\n",
      "os.chdir('/Users/rweiss/Dropbox/presentations/MozFest2013/data/')\n",
      "\n",
      "#note that if you generated this from R, you will need to delete the row\n",
      "#\"NYT_sample.Topic.Code\",\"NYT_sample.Title\"\n",
      "#from the top of the file.\n",
      "nyt = open('../data/nyt_title_data.csv') # check the structure of this file!\n",
      "nyt_data = []\n",
      "nyt_labels = []\n",
      "csv_reader = csv.reader(nyt)\n",
      "\n",
      "for line in csv_reader:\n",
      "    nyt_labels.append(int(line[0]))\n",
      "    nyt_data.append(line[1])\n",
      "\n",
      "nyt.close()\n",
      "\n",
      "trainset_size = int(round(len(nyt_data)*0.75)) # i chose this threshold arbitrarily...to discuss\n",
      "print 'The training set size for this classifier is ' + str(trainset_size) + '\\n'\n",
      "\n",
      "X_train = np.array([''.join(el) for el in nyt_data[0:trainset_size]])\n",
      "y_train = np.array([el for el in nyt_labels[0:trainset_size]])\n",
      "\n",
      "# note: these slices start at trainset_size+1, so the item at index trainset_size\n",
      "# ends up in neither the training set nor the test set\n",
      "X_test = np.array([''.join(el) for el in nyt_data[trainset_size+1:len(nyt_data)]])\n",
      "y_test = np.array([el for el in nyt_labels[trainset_size+1:len(nyt_labels)]])\n",
      "\n",
      "#print(X_train)\n",
      "\n",
      "vectorizer = TfidfVectorizer(min_df=2, \n",
      " ngram_range=(1, 2), \n",
      " stop_words='english', \n",
      " strip_accents='unicode', \n",
      " norm='l2')\n",
      " \n",
      "test_string = unicode(nyt_data[0])\n",
      "\n",
      "print \"Example string: \" + test_string\n",
      "print \"Preprocessed string: \" + vectorizer.build_preprocessor()(test_string)\n",
      "print \"Tokenized string:\" + str(vectorizer.build_tokenizer()(test_string))\n",
      "print \"N-gram data string:\" + str(vectorizer.build_analyzer()(test_string))\n",
      "print \"\\n\"\n",
      " \n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The training set size for this classifier is 1621\n",
        "\n",
        "Example string: Dole Courts Democrats\n",
        "Preprocessed string: dole courts democrats\n",
        "Tokenized string:[u'Dole', u'Courts', u'Democrats']\n",
        "N-gram data string:[u'dole', u'courts', u'democrats', u'dole courts', u'courts democrats']\n",
        "\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X_train = vectorizer.fit_transform(X_train)\n",
      "X_test = vectorizer.transform(X_test)\n",
      "\n",
      "nb_classifier = MultinomialNB().fit(X_train, y_train)\n",
      "\n",
      "y_nb_predicted = nb_classifier.predict(X_test)\n",
      "\n",
      "print \"MODEL: Multinomial Naive Bayes\\n\"\n",
      "\n",
      "print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_nb_predicted))\n",
      "print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_nb_predicted))\n",
      "print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_nb_predicted))\n",
      "print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_nb_predicted))\n",
      "\n",
      "print '\\nHere is the classification report:'\n",
      "print classification_report(y_test, y_nb_predicted)\n",
      "\n",
      "#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)\n",
      "#we could also modify the vectorizer to stem or lemmatize\n",
      "print '\\nHere is the confusion matrix:'\n",
      "print metrics.confusion_matrix(y_test, y_nb_predicted, labels=np.unique(nyt_labels))\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MODEL: Multinomial Naive Bayes\n",
        "\n",
        "The precision for this classifier is 0.678610380886\n",
        "The recall for this classifier is 0.549165120594\n",
        "The f1 for this classifier is 0.506785046956\n",
        "The accuracy for this classifier is 0.549165120594\n",
        "\n",
        "Here is the classification report:\n",
        " precision recall f1-score support\n",
        "\n",
        " 3 1.00 0.23 0.38 47\n",
        " 12 0.75 0.08 0.15 37\n",
        " 15 1.00 0.10 0.19 39\n",
        " 16 0.59 0.56 0.58 112\n",
        " 19 0.46 0.88 0.60 162\n",
        " 20 0.70 0.64 0.67 99\n",
        " 29 1.00 0.21 0.35 43\n",
        "\n",
        "avg / total 0.68 0.55 0.51 539\n",
        "\n",
        "\n",
        "Here is the confusion matrix:\n",
        "[[ 11 0 0 5 24 7 0]\n",
        " [ 0 3 0 5 24 5 0]\n",
        " [ 0 0 4 4 26 5 0]\n",
        " [ 0 1 0 63 44 4 0]\n",
        " [ 0 0 0 18 143 1 0]\n",
        " [ 0 0 0 8 28 63 0]\n",
        " [ 0 0 0 4 25 5 9]]\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#What are the top N most predictive features per class?\n",
      "N = 10\n",
      "vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])\n",
      "\n",
      "# coef_ has one row per class, in the same order as classes_\n",
      "for i, label in enumerate(nb_classifier.classes_):\n",
      "    topN = np.argsort(nb_classifier.coef_[i])[-N:]\n",
      "    print \"\\nThe top %d most informative features for topic code %s: \\n%s\" % (N, label, \" \".join(vocabulary[topN]))\n",
      "    #print topN"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "The top 10 most informative features for topic code 3: \n",
        "study hospitals aids medicare cancer health tobacco care drug new\n",
        "\n",
        "The top 10 most informative features for topic code 12: \n",
        "special special report new drug sniper suspect report crime police case\n",
        "\n",
        "The top 10 most informative features for topic code 15: \n",
        "market chief wall new billion big stocks enron deal microsoft\n",
        "\n",
        "The top 10 most informative features for topic code 16: \n",
        "iraqi military baghdad 11 bush challenged nation challenged nation war iraq\n",
        "\n",
        "The top 10 most informative features for topic code 19: \n",
        "russia japan war mideast russian india leader new israel china\n",
        "\n",
        "The top 10 most informative features for topic code 20: \n",
        "2000 campaign 2000 clinton testing testing president politics bush democrats campaign president\n",
        "\n",
        "The top 10 most informative features for topic code 29: \n",
        "victory bowl knicks win world yankees playoffs series game baseball\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.svm import LinearSVC\n",
      "\n",
      "svm_classifier = LinearSVC().fit(X_train, y_train)\n",
      "\n",
      "y_svm_predicted = svm_classifier.predict(X_test)\n",
      "print \"MODEL: Linear SVC\\n\"\n",
      "\n",
      "print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_svm_predicted))\n",
      "print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_svm_predicted))\n",
      "print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_svm_predicted))\n",
      "print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_svm_predicted))\n",
      "\n",
      "print '\\nHere is the classification report:'\n",
      "print classification_report(y_test, y_svm_predicted)\n",
      "\n",
      "#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)\n",
      "#we could also modify the vectorizer to stem or lemmatize\n",
      "print '\\nHere is the confusion matrix:'\n",
      "print metrics.confusion_matrix(y_test, y_svm_predicted, labels=np.unique(nyt_labels))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MODEL: Linear SVC\n",
        "\n",
        "The precision for this classifier is 0.63396261297\n",
        "The recall for this classifier is 0.623376623377\n",
        "The f1 for this classifier is 0.620561226142\n",
        "The accuracy for this classifier is 0.623376623377\n",
        "\n",
        "Here is the classification report:\n",
        " precision recall f1-score support\n",
        "\n",
        " 3 0.69 0.47 0.56 47\n",
        " 12 0.43 0.49 0.46 37\n",
        " 15 0.68 0.44 0.53 39\n",
        " 16 0.60 0.59 0.59 112\n",
        " 19 0.60 0.77 0.67 162\n",
        " 20 0.72 0.66 0.69 99\n",
        " 29 0.73 0.56 0.63 43\n",
        "\n",
        "avg / total 0.63 0.62 0.62 539\n",
        "\n",
        "\n",
        "Here is the confusion matrix:\n",
        "[[ 22 5 1 6 7 5 1]\n",
        " [ 1 18 1 2 10 2 3]\n",
        " [ 1 3 17 1 10 6 1]\n",
        " [ 1 7 3 66 28 6 1]\n",
        " [ 4 5 2 23 124 4 0]\n",
        " [ 3 2 1 9 16 65 3]\n",
        " [ 0 2 0 3 12 2 24]]\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#What are the top N most predictive features per class?\n",
      "N = 10\n",
      "vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])\n",
      "\n",
      "# coef_ has one row per class, in the same order as classes_\n",
      "for i, label in enumerate(svm_classifier.classes_):\n",
      "    topN = np.argsort(svm_classifier.coef_[i])[-N:]\n",
      "    print \"\\nThe top %d most informative features for topic code %s: \\n%s\" % (N, label, \" \".join(vocabulary[topN]))\n",
      "    #print topN"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "The top 10 most informative features for topic code 3: \n",
        "schiavo tissue baby scientists fat cancer gene medicare hospitals tobacco\n",
        "\n",
        "The top 10 most informative features for topic code 12: \n",
        "fallen limiting charged police rampage suspect murder crime sniper gun\n",
        "\n",
        "The top 10 most informative features for topic code 15: \n",
        "profit workers merger response storm pricing deal stocks enron microsoft\n",
        "\n",
        "The top 10 most informative features for topic code 16: \n",
        "base generals afghanistan navy force hussein nation nato 11 iraq\n",
        "\n",
        "The top 10 most informative features for topic code 19: \n",
        "pakistan india russian japan europe china africa mideast israel chinese\n",
        "\n",
        "The top 10 most informative features for topic code 20: \n",
        "impeachment whitewater race gingrich lewinsky senate president politics campaign democrats\n",
        "\n",
        "The top 10 most informative features for topic code 29: \n",
        "bowl armstrong match knicks playoffs play yankees series baseball game\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.linear_model import LogisticRegression\n",
      "\n",
      "maxent_classifier = LogisticRegression().fit(X_train, y_train)\n",
      "\n",
      "y_maxent_predicted = maxent_classifier.predict(X_test)\n",
      "print \"MODEL: Maximum Entropy\\n\"\n",
      "\n",
      "print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_maxent_predicted))\n",
      "print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_maxent_predicted))\n",
      "print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_maxent_predicted))\n",
      "print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_maxent_predicted))\n",
      "\n",
      "print '\\nHere is the classification report:'\n",
      "print classification_report(y_test, y_maxent_predicted)\n",
      "\n",
      "#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)\n",
      "#we could also modify the vectorizer to stem or lemmatize\n",
      "print '\\nHere is the confusion matrix:'\n",
      "print metrics.confusion_matrix(y_test, y_maxent_predicted, labels=np.unique(nyt_labels))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MODEL: Maximum Entropy\n",
        "\n",
        "The precision for this classifier is 0.654004346593\n",
        "The recall for this classifier is 0.549165120594\n",
        "The f1 for this classifier is 0.524774091511\n",
        "The accuracy for this classifier is 0.549165120594\n",
        "\n",
        "Here is the classification report:\n",
        " precision recall f1-score support\n",
        "\n",
        " 3 0.94 0.32 0.48 47\n",
        " 12 0.67 0.16 0.26 37\n",
        " 15 0.71 0.13 0.22 39\n",
        " 16 0.64 0.54 0.59 112\n",
        " 19 0.44 0.85 0.58 162\n",
        " 20 0.72 0.60 0.65 99\n",
        " 29 1.00 0.28 0.44 43\n",
        "\n",
        "avg / total 0.65 0.55 0.52 539\n",
        "\n",
        "\n",
        "Here is the confusion matrix:\n",
        "[[ 15 0 0 4 25 3 0]\n",
        " [ 1 6 0 3 24 3 0]\n",
        " [ 0 1 5 1 30 2 0]\n",
        " [ 0 2 0 61 43 6 0]\n",
        " [ 0 0 1 19 138 4 0]\n",
        " [ 0 0 1 7 32 59 0]\n",
        " [ 0 0 0 1 25 5 12]]\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#What are the top N most predictive features per class?\n",
      "N = 10\n",
      "vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])\n",
      "\n",
      "# coef_ has one row per class, in the same order as classes_\n",
      "for i, label in enumerate(maxent_classifier.classes_):\n",
      "    topN = np.argsort(maxent_classifier.coef_[i])[-N:]\n",
      "    print \"\\nThe top %d most informative features for topic code %s: \\n%s\" % (N, label, \" \".join(vocabulary[topN]))\n",
      "    #print topN"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "The top 10 most informative features for topic code 3: \n",
        "study scientists aids health hospitals medicare cancer care drug tobacco\n",
        "\n",
        "The top 10 most informative features for topic code 12: \n",
        "murder mexico officer drug gun suspect sniper police case crime\n",
        "\n",
        "The top 10 most informative features for topic code 15: \n",
        "big chief wall pay market billion stocks deal enron microsoft\n",
        "\n",
        "The top 10 most informative features for topic code 16: \n",
        "challenged iraqis nato force arms nation challenged war nation 11 iraq\n",
        "\n",
        "The top 10 most informative features for topic code 19: \n",
        "europe leader russia chinese japan russian mideast india israel china\n",
        "\n",
        "The top 10 most informative features for topic code 20: \n",
        "2000 dole clinton bush race senate politics president democrats campaign\n",
        "\n",
        "The top 10 most informative features for topic code 29: \n",
        "team play armstrong bowl knicks yankees playoffs series game baseball\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Unsupervised learning: Topic modeling in Python\n",
      "\n",
      "Now we're going to go over a typical topic modeling example using the popular [Gensim](http://radimrehurek.com/gensim/) library.\n",
      "\n",
      "The nice thing about Gensim is that it's ready to be applied to large datasets, since it incorporates both the online version of LDA and distributed computing capability.\n",
      "\n",
      "We won't go over those features in this tutorial, since it would take hours to show even a single example, and the NYTimes dataset is small enough to run on a single machine."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from gensim import corpora, models, similarities\n",
      "from itertools import chain\n",
      "import nltk\n",
      "from nltk.corpus import stopwords\n",
      "from operator import itemgetter\n",
      "import re\n",
      "\n",
      "url_pattern = r'https?:\\/\\/(.*[\\r\\n]*)+'\n",
      "\n",
      "documents = [nltk.clean_html(document) for document in nyt_data]\n",
      "stoplist = stopwords.words('english')\n",
      "texts = [[word for word in document.lower().split() if word not in stoplist]\n",
      "         for document in documents]\n",
      "\n",
      "dictionary = corpora.Dictionary(texts)\n",
      "corpus = [dictionary.doc2bow(text) for text in texts]\n",
      "\n",
      "tfidf = models.TfidfModel(corpus)\n",
      "corpus_tfidf = tfidf[corpus]\n",
      "\n",
      "#lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)\n",
      "#lsi.print_topics(20)\n",
      "\n",
      "n_topics = 60\n",
      "lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=n_topics)\n",
      "\n",
      "for i in range(0, n_topics):\n",
      "    temp = lda.show_topic(i, 10)\n",
      "    terms = []\n",
      "    for term in temp:\n",
      "        terms.append(term[1])\n",
      "    print \"Top 10 terms for topic #\" + str(i) + \": \"+ \", \".join(terms)\n",
      " \n",
      "print \n",
      "print 'Which LDA topic maximally describes a document?\\n'\n",
      "print 'Original document: ' + documents[1]\n",
      "print 'Preprocessed document: ' + str(texts[1])\n",
      "print 'Matrix Market format: ' + str(corpus[1])\n",
      "print 'Topic probability mixture: ' + str(lda[corpus[1]])\n",
      "print 'Maximally probable topic: topic #' + str(max(lda[corpus[1]],key=itemgetter(1))[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Top 10 terms for topic #0: company, delays, reforms, money;, 6, whose, rights, giving, nazis, running\n",
        "Top 10 terms for topic #1: garden, he's, lavish, networks,, trade-off, scripts, head, plan, pentagon, hit\n",
        "Top 10 terms for topic #2: revival, now,, last,, guns,, killing,, chamber, tries, split, friends;, russians;\n",
        "Top 10 terms for topic #3: legal, beckons, best, resistance, won't, privileged, mexico's, presidency,, tilts, production\n",
        "Top 10 terms for topic #4: europe, r, entrepreneurs,, ambitious, moscow, tough, home,, said, call, one\n",
        "Top 10 terms for topic #5: pakistan, gathering, claimed, signs, raise, taliban;, supplying, hesitantly, ended, reviewing\n",
        "Top 10 terms for topic #6: gaining,, losing,, torricelli, take, super, beijing, bowl, two, find, bush's\n",
        "Top 10 terms for topic #7: terror, turmoil, must, shift, suspects, level, 2, vast, left, anguish\n",
        "Top 10 terms for topic #8: colombia, swing, justices, effort, holding, suspect, pledge, talks, peres, military\n",
        "Top 10 terms for topic #9: california;, title, balance, takes, governor, move, u.n., tough, hill's, seat-squirmer\n",
        "Top 10 terms for topic #10: syria, baseball;, afghanistan,, pushes, swiftly, sharon, report, halt, israel, ace\n",
        "Top 10 terms for topic #11: adoptions, rewarding, pause, recovery,, saudis, hundreds, diplomacy;, betrayed, laotians, mideast\n",
        "Top 10 terms for topic #12: sars, timetable, clash, hotel, aide, urges, close, iraq, bush, toronto\n",
        "Top 10 terms for topic #13: inside, care, business, planes, take, rebels, ill, mexico, town, war:\n",
        "Top 10 terms for topic #14: insiders, kerry, suspected, woman, bid, giuliani, describes, saudi, foster, court\n",
        "Top 10 terms for topic #15: aid, briefly, airstrike, zarqawi, survived, balks, numbers, bush, get, game\n",
        "Top 10 terms for topic #16: cardinals, choosy, loner,, series, world, region, short, blasts, control, feared,\n",
        "Top 10 terms for topic #17: fronts, disease, project, stretching,, sadly, lessons, excruciating, air, came, trial\n",
        "Top 10 terms for topic #18: alzheimer's, drugs, cover, treatments, emergency, program, streets, doctor's, many,, room,\n",
        "Top 10 terms for topic #19: weigh, documents, incentives, dissuade, gifts, bush, east, stay, leaves, trial\n",
        "Top 10 terms for topic #20: veto, resolution, pursue, islamic, shiite, leadership;, iraq,, northeast, back, g.o.p.\n",
        "Top 10 terms for topic #21: game, fear, c.i.a., sees, intelligence;, hindered, hindsight,, terror, tide, all-around\n",
        "Top 10 terms for topic #22: public, conservatives, protest, terrorism, amid, starts, tobacco, impasse, white, rules\n",
        "Top 10 terms for topic #23: arafat, fled, detainee, homes, mentally, heart, chechen, afghanistan, approves, money\n",
        "Top 10 terms for topic #24: iran, rivera, send, confession, star, deaths, spy, yankees, croatian's, scale\n",
        "Top 10 terms for topic #25: talks, defines, palestinians, german, vote, final, taliban, plan, war:, nation\n",
        "Top 10 terms for topic #26: international, business;, passes, opportunity, vows, expand, seek, sell, crime, asks\n",
        "Top 10 terms for topic #27: all,, treatment, crash, economic, executive, medicare, covering, goes, '96;, plea\n",
        "Top 10 terms for topic #28: fall, coach, analysis;, news, charges, caught, (again), costly, near, feet\n",
        "Top 10 terms for topic #29: missile, chief, taiwan, forced, mccain, test, dole's, canceled, vote, unable\n",
        "Top 10 terms for topic #30: states, murder, shift,, 3, locked, sets, it's, backing, koreans, sites\n",
        "Top 10 terms for topic #31: middle, executives, twice, widowed, india,, scorned, sent, inquiry, (again), hand,\n",
        "Top 10 terms for topic #32: challenged:, nation, time,, insurance, efforts, sept., 11, keep, peace, panel\n",
        "Top 10 terms for topic #33: president:, testing, peace, overview;, israel, rehiring, ex-president, morgan, rises, hope\n",
        "Top 10 terms for topic #34: saudi, blast, high, 13, insurance, pact, government, costs, day, lowest\n",
        "Top 10 terms for topic #35: proposes, medicare, found, troops;, guilty, tighten, remains, set, crisis, case\n",
        "Top 10 terms for topic #36: search, baseball, 5, upsets, move, one, playoffs, bush, may, swiftly\n",
        "Top 10 terms for topic #37: responses:, threats, strategy, disease;, budget, rich, security;, presence, damage, toss\n",
        "Top 10 terms for topic #38: al, qaeda, witness, personal, let, $1, memo;, russia:, tour, bold\n",
        "Top 10 terms for topic #39: failure, benefit, cutback, cheney's, companies, drug, bush, slave, adopt, fund\n",
        "Top 10 terms for topic #40: military, losses, target, bars, cloning, ban, california's, clear, book,, novel\n",
        "Top 10 terms for topic #41: smoke, front;, role, battle:, vast, spin, fierce, contest, vote:, clear\n",
        "Top 10 terms for topic #42: life, he'll, court, dean's, reshape, faulted, contest, choice, falling, steps\n",
        "Top 10 terms for topic #43: toward, children, light, hurrying, warmth, aiding, russians, try, presents, missiles\n",
        "Top 10 terms for topic #44: surges, mode,, rightist, bolster, pick, york, attack, dies, new, cries\n",
        "Top 10 terms for topic #45: confinement, guantanamo,, interrogation, rebuilding, corporations, goldman, hurting, deal;, appears, iowa\n",
        "Top 10 terms for topic #46: tells, record, vow, israel, yankees, temper, leaders, runner, mile, working\n",
        "Top 10 terms for topic #47: points, arms, 10%, arbitration, approval, three, picked, woo, populist, f.d.a.\n",
        "Top 10 terms for topic #48: uneasiness, tigers, iraq, inflamed:, reconstruction;, repayment, fee, sees, worst, region\n",
        "Top 10 terms for topic #49: sosa, mcgwire, expert's, point, plan, blair, gains, caribbean, coverage, jet\n",
        "Top 10 terms for topic #50: 1998, resolve, base, campaign:, aids, family, foley, cleric, republicans, elections:\n",
        "Top 10 terms for topic #51: peace,, politics, crashes, hollywood, enclaves, cockfights, flourishing, war, die, u.s.,\n",
        "Top 10 terms for topic #52: green, cloning, bus, flaws, hubris, eccentric's, frenzy, defense, avoid, pro\n",
        "Top 10 terms for topic #53: giants, declares, fence, victory, paler, rosy, sometimes, investors,, forecasts, results\n",
        "Top 10 terms for topic #54: bridgeport, dispute, health, ballot, insurer, options, away, mccain,, crack, uninsured\n",
        "Top 10 terms for topic #55: time, milosevic, travel, savor, stalls, tuesdays:, clues, among, grants, africans\n",
        "Top 10 terms for topic #56: victims, texaco, democracy, batter, jews, using, back, nigerians, lurching, nafta\n",
        "Top 10 terms for topic #57: city, school, shaken, weapons, pills., sunscreen., spray., checklist, bug, camp:\n",
        "Top 10 terms for topic #58: g.i.'s, counting, command, ways, east, marines, charter, john, changes, assailing\n",
        "Top 10 terms for topic #59: jordan, egypt, rape, brings, asian, rise, young, kabul, first, leader\n",
        "\n",
        "Which LDA topic maximally describes a document?\n",
        "\n",
        "Original document: Yanks End Drought; Mets Fall in Opener\n",
        "Preprocessed document: ['yanks', 'end', 'drought;', 'mets', 'fall', 'opener']\n",
        "Matrix Market format: [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]\n",
        "Topic probability mixture: [(19, 0.43120206027031704), (27, 0.14505309958538579), (28, 0.14521071520874509), (38, 0.14520079160221899)]\n",
        "Maximally probable topic: topic #19\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/Users/rweiss/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim-0.8.7-py2.7.egg/gensim/__init__.py:12: UserWarning: Module IPython was already imported from /Applications/Canopy.app/appdata/canopy-1.0.3.1262.macosx-x86_64/Canopy.app/Contents/lib/python2.7/site-packages/IPython/__init__.pyc, but /Users/rweiss/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/ipython-1.0.0-py2.7.egg is being added to sys.path\n",
        " __version__ = __import__('pkg_resources').get_distribution('gensim').version\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Big picture questions:\n",
      "\n",
      "1. How do the different supervised models compare against each other?\n",
      "    1. What are the tradeoffs between the metrics per model?\n",
      "    2. What about per class? Are some models better than others at certain classes?\n",
      "    3. What if we had more data? Would some models improve more than others?\n",
      "    4. What if our *observations* had more data? What if, instead of titles, we used lead paragraphs or even the full document?\n",
      "    5. What if our feature space was different? What if, instead of unigrams or bigrams, we used trigrams? Parts of speech?\n",
      "    6. Is there something about the **underlying language** structure that makes certain models better than others?\n",
      "2. How do the supervised models compare against the unsupervised model?\n",
      "    1. Are they \"better\"? If so, how?\n",
      "    2. What did we need to train a supervised model? What did we need to train an unsupervised model?\n",
      "    3. On that note, when is it more appropriate to use an unsupervised model over a supervised model?\n",
      "    4. How do you choose *k*, the number of topics, for an unsupervised model?\n",
      "    5. What happens if you run the unsupervised model again? What about the supervised model?"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}