{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 94-775/95-865: Topic Modeling Demo\n", | |
"\n", | |
"Author: George H. Chen (georgechen [at symbol] cmu.edu)\n", | |
"\n", | |
"The beginning part of this demo is a shortened and modified version of sklearn's LDA & NMF demo (http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Latent Dirichlet Allocation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.datasets import fetch_20newsgroups\n", | |
"num_articles = 10000\n", | |
"data = fetch_20newsgroups(shuffle=True, random_state=0,\n", | |
" remove=('headers', 'footers', 'quotes')).data[:num_articles]" | |
] | |
}, | |
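{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional check (not needed for the rest of the demo), we can confirm how many documents we actually loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: number of documents loaded\n",
"len(data)"
]
},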
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\n", | |
"\n", | |
"The last name is Niedermayer, as in New Jersey's Scott's last name, because\n", | |
"(you guessed it) they are brothers. But Rob Niedermayer is a center, not\n", | |
"a defenseman.\n", | |
"\n", | |
"I am not sure that the Sharks will take Kariya. They aren't saying much, but\n", | |
"they apparently like Niedermayer and Victor Kozlov, along with Kariya. Chris\n", | |
"Pronger's name has also been mentioned. My guess is that they'll take\n", | |
"Niedermayer. They may take Pronger, except that they already have too many\n", | |
"defensive prospects.\n" | |
] | |
} | |
], | |
"source": [ | |
"# you can take a look at what individual documents look like by replacing what index we look at\n", | |
"print(data[5])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"vocab_size = 1000\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"\n", | |
"# CountVectorizer does tokenization and can remove terms that occur too frequently, not frequently enough, or that are stop words\n", | |
"\n", | |
"# document frequency (df) means number of documents a word appears in\n", | |
"tf_vectorizer = CountVectorizer(max_df=0.95,\n", | |
" min_df=2,\n", | |
" max_features=vocab_size,\n", | |
" stop_words='english')\n", | |
"tf = tf_vectorizer.fit_transform(data)" | |
] | |
}, | |
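{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the resulting term-frequency matrix `tf` should have one row per document and one column per vocabulary word."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# (number of documents, vocabulary size)\n",
"tf.shape"
]
},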
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"scipy.sparse.csr.csr_matrix" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"type(tf)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['00', '000', '02', '03', '04', '0d', '0t', '10', '100', '11', '12', '128', '13', '14', '145', '15', '16', '17', '18', '19', '1990', '1991', '1992', '1993', '1d9', '1st', '1t', '20', '200', '21', '22', '23', '24', '25', '250', '26', '27', '28', '29', '2di', '2tm', '30', '300', '31', '32', '33', '34', '34u', '35', '36', '37', '38', '39', '3d', '3t', '40', '41', '42', '43', '44', '45', '46', '48', '50', '500', '55', '60', '64', '6um', '70', '75', '75u', '7ey', '80', '800', '86', '90', '91', '92', '93', '9v', 'a86', 'ability', 'able', 'ac', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'addition', 'address', 'administration', 'advance', 'age', 'ago', 'agree', 'ah', 'air', 'al', 'algorithm', 'allow', 'allowed', 'alt', 'america', 'american', 'analysis', 'anonymous', 'answer', 'answers', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'applications', 'appreciate', 'appreciated', 'approach', 'appropriate', 'apr', 'april', 'archive', 'area', 'areas', 'aren', 'argument', 'armenia', 'armenian', 'armenians', 'arms', 'army', 'article', 'articles', 'ask', 'asked', 'asking', 'assume', 'atheism', 'attack', 'attempt', 'au', 'author', 'authority', 'available', 'average', 'avoid', 'away', 'ax', 'b8f', 'bad', 'base', 'based', 'basic', 'basically', 'basis', 'belief', 'believe', 'best', 'better', 'bh', 'bhj', 'bible', 'big', 'bike', 'bit', 'bits', 'bj', 'black', 'block', 'blood', 'board', 'body', 'book', 'books', 'bought', 'box', 'break', 'bring', 'brought', 'btw', 'buf', 'build', 'building', 'built', 'bus', 'business', 'buy', 'bxn', 'ca', 'cable', 'california', 'called', 'calls', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cd', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'changes', 'check', 'chicago', 'child', 'children', 'chip', 'chips', 'choice', 'christ', 'christian', 'christianity', 'christians', 'church', 'citizens', 'city', 'claim', 'claims', 'class', 'clear', 'clearly', 'clinton', 'clipper', 'close', 'code', 'color', 'com', 'come', 'comes', 'coming', 'command', 'comments', 'commercial', 'committee', 'common', 'community', 'comp', 'company', 'complete', 'completely', 'computer', 'condition', 'conference', 'congress', 'consider', 'considered', 'contact', 'contains', 'context', 'continue', 'control', 'controller', 'copy', 'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cross', 'cs', 'current', 'currently', 'cut', 'cx', 'data', 'date', 'dave', 'david', 'day', 'days', 'db', 'dc', 'dead', 'deal', 'death', 'dec', 'decided', 'defense', 'define', 'deleted', 'department', 'des', 'design', 'designed', 'details', 'development', 'device', 'devices', 'did', 'didn', 'difference', 'different', 'difficult', 'digital', 'directly', 'directory', 'discussion', 'disk', 'display', 'distribution', 'division', 'dod', 'does', 'doesn', 'doing', 'don', 'door', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'drives', 'drug', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'effect', 'electronic', 'email', 'encryption', 'end', 'enforcement', 'engine', 'entire', 'entry', 'environment', 'error', 'escrow', 'especially', 'event', 'events', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'existence', 'exists', 'expect', 'experience', 'explain', 'export', 'extra', 'face', 'fact', 'faith', 'false', 'family', 'faq', 'far', 'fast', 'faster', 'father', 'fax', 'fbi', 'features', 'federal', 'feel', 'field', 'figure', 'file', 'files', 'final', 
'finally', 'fine', 'firearms', 'floppy', 'folks', 'follow', 'following', 'food', 'force', 'form', 'format', 'free', 'freedom', 'friend', 'ftp', 'function', 'functions', 'future', 'g9v', 'game', 'games', 'gas', 'gave', 'general', 'generally', 'gets', 'getting', 'gif', 'given', 'gives', 'giz', 'gk', 'gm', 'goal', 'god', 'goes', 'going', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greek', 'ground', 'group', 'groups', 'guess', 'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'hard', 'hardware', 'haven', 'having', 'head', 'health', 'hear', 'heard', 'held', 'hell', 'help', 'hi', 'high', 'higher', 'history', 'hit', 'hockey', 'hold', 'home', 'hope', 'hours', 'house', 'hp', 'human', 'ibm', 'ide', 'idea', 'ideas', 'ii', 'image', 'images', 'imagine', 'important', 'include', 'included', 'includes', 'including', 'individual', 'info', 'information', 'input', 'inside', 'installed', 'instead', 'insurance', 'int', 'interested', 'interesting', 'interface', 'internal', 'international', 'internet', 'involved', 'isn', 'israel', 'israeli', 'issue', 'issues', 'jesus', 'jewish', 'jews', 'jim', 'job', 'jobs', 'john', 'jpeg', 'just', 'key', 'keyboard', 'keys', 'kill', 'killed', 'kind', 'knew', 'know', 'knowledge', 'known', 'knows', 'la', 'land', 'language', 'large', 'late', 'later', 'law', 'laws', 'league', 'learn', 'leave', 'left', 'legal', 'let', 'letter', 'level', 'library', 'life', 'light', 'like', 'likely', 'limited', 'line', 'lines', 'list', 'little', 'live', 'lives', 'living', 'll', 'local', 'long', 'longer', 'look', 'looked', 'looking', 'looks', 'lord', 'lost', 'lot', 'lots', 'love', 'low', 'lower', 'mac', 'machine', 'machines', 'mail', 'main', 'major', 'make', 'makes', 'making', 'man', 'manager', 'manual', 'mark', 'market', 'mass', 'master', 'material', 'matter', 'max', 'maybe', 'mb', 'mean', 'meaning', 'means', 'media', 'medical', 'members', 'memory', 'men', 'mention', 'mentioned', 'message', 'mike', 'miles', 'military', 'million', 'mind', 'mit', 'mode', 'model', 'modem', 'money', 'monitor', 'month', 'months', 'moral', 'mother', 'motif', 'mouse', 'mr', 'ms', 'multiple', 'nasa', 'national', 'nature', 'near', 'necessary', 'need', 'needed', 'needs', 'net', 'network', 'new', 'news', 'newsgroup', 'nhl', 'nice', 'night', 'non', 'normal', 'note', 'nsa', 'number', 'numbers', 'object', 'obvious', 'obviously', 'offer', 'office', 'official', 'oh', 'ok', 'old', 'ones', 'open', 'opinion', 'opinions', 'orbit', 'order', 'org', 'organization', 'original', 'os', 'output', 'outside', 'package', 'page', 'paper', 'particular', 'parts', 'party', 'past', 'paul', 'pay', 'pc', 'peace', 'people', 'perfect', 'performance', 'period', 'person', 'personal', 'phone', 'pick', 'picture', 'pin', 'pittsburgh', 'pl', 'place', 'places', 'plan', 'play', 'played', 'player', 'players', 'plus', 'point', 'points', 'police', 'policy', 'political', 'population', 'port', 'position', 'possible', 'possibly', 'post', 'posted', 'posting', 'power', 'pp', 'present', 'president', 'press', 'pretty', 'previous', 'price', 'printer', 'privacy', 'private', 'pro', 'probably', 'problem', 'problems', 'process', 'product', 'program', 'programs', 'project', 'protect', 'provide', 'provides', 'pub', 'public', 'published', 'purpose', 'qq', 'quality', 'question', 'questions', 'quite', 'radio', 'ram', 'range', 'rate', 'read', 'reading', 'real', 'really', 'reason', 'reasonable', 'reasons', 'received', 'recent', 'recently', 'record', 'red', 'reference', 'regular', 'related', 'release', 'religion', 'religious', 'remember', 'reply', 
'report', 'reports', 'request', 'require', 'required', 'requires', 'research', 'resources', 'response', 'rest', 'result', 'results', 'return', 'right', 'rights', 'road', 'rom', 'room', 'round', 'rules', 'run', 'running', 'runs', 'russian', 'safety', 'said', 'sale', 'san', 'save', 'saw', 'say', 'saying', 'says', 'school', 'sci', 'science', 'scientific', 'screen', 'scsi', 'search', 'season', 'second', 'secret', 'section', 'secure', 'security', 'seen', 'self', 'sell', 'send', 'sense', 'sent', 'serial', 'series', 'server', 'service', 'set', 'shall', 'shipping', 'short', 'shot', 'shuttle', 'similar', 'simple', 'simply', 'sin', 'single', 'site', 'sites', 'situation', 'size', 'small', 'society', 'software', 'solution', 'son', 'soon', 'sorry', 'sort', 'sound', 'sounds', 'source', 'sources', 'south', 'soviet', 'space', 'special', 'specific', 'speed', 'spirit', 'st', 'standard', 'start', 'started', 'state', 'statement', 'states', 'station', 'stephanopoulos', 'steve', 'stop', 'story', 'stream', 'street', 'strong', 'study', 'stuff', 'subject', 'suggest', 'sun', 'support', 'supports', 'supposed', 'sure', 'systems', 'taken', 'takes', 'taking', 'talk', 'talking', 'tape', 'tar', 'tax', 'team', 'teams', 'technical', 'technology', 'tell', 'term', 'terms', 'test', 'text', 'thank', 'thanks', 'theory', 'thing', 'things', 'think', 'thinking', 'thought', 'time', 'times', 'title', 'tm', 'today', 'told', 'took', 'tools', 'total', 'trade', 'transfer', 'tried', 'true', 'truth', 'try', 'trying', 'turkey', 'turkish', 'turn', 'tv', 'type', 'uk', 'understand', 'unfortunately', 'unit', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'useful', 'usenet', 'user', 'users', 'uses', 'using', 'usually', 'value', 'values', 'van', 'various', 've', 'version', 'vga', 'video', 'view', 'voice', 'volume', 'vs', 'wait', 'want', 'wanted', 'wants', 'war', 'washington', 'wasn', 'watch', 'water', 'way', 'ways', 'weapons', 'week', 'weeks', 'went', 'white', 'wide', 'widget', 'willing', 'win', 'window', 'windows', 'wish', 'wm', 'women', 'won', 'word', 'words', 'work', 'worked', 'working', 'works', 'world', 'worth', 'wouldn', 'write', 'writing', 'written', 'wrong', 'wrote', 'x11', 'xt', 'year', 'years', 'yes', 'york', 'young']\n" | |
] | |
} | |
], | |
"source": [ | |
"print(tf_vectorizer.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"965" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"tf_vectorizer.vocabulary_['week']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 2 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n", | |
" 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n", | |
" 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n" | |
] | |
} | |
], | |
"source": [ | |
"print(tf[0].toarray())" | |
] | |
}, | |
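{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw count vector above is hard to read. As a small aside (the variable names below are just for illustration), we can map the nonzero entries back to the words they correspond to."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# show only the words with nonzero counts in the first document\n",
"feature_names = tf_vectorizer.get_feature_names()\n",
"first_doc_counts = tf[0].toarray().flatten()\n",
"for word_idx in first_doc_counts.nonzero()[0]:\n",
"    print(feature_names[word_idx], ':', first_doc_counts[word_idx])"
]
},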
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n", | |
" evaluate_every=-1, learning_decay=0.7,\n", | |
" learning_method='online', learning_offset=10.0,\n", | |
" max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,\n", | |
" n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,\n", | |
" random_state=0, topic_word_prior=None,\n", | |
" total_samples=1000000.0, verbose=0)" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"num_topics = 10\n", | |
"\n", | |
"from sklearn.decomposition import LatentDirichletAllocation\n", | |
"lda = LatentDirichletAllocation(n_components=num_topics, learning_method='online', random_state=0)\n", | |
"lda.fit(tf)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(10, 1000)" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"lda.components_.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"topic_word_distributions = np.array([topic_word_pseudocounts / np.sum(topic_word_pseudocounts)\n", | |
" for topic_word_pseudocounts in lda.components_])" | |
] | |
}, | |
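{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of `topic_word_distributions` should now be a probability distribution over the 1000 vocabulary words; as an optional check, every row should sum to (approximately) 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional check: each topic's word distribution sums to 1\n",
"topic_word_distributions.sum(axis=1)"
]
},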
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Displaying the top 20 words per topic and their probabilities within the topic...\n", | |
"\n", | |
"[Topic 0]\n", | |
"year : 0.03297118094739015\n", | |
"team : 0.01920466634388107\n", | |
"good : 0.016830138855415556\n", | |
"gun : 0.01617856783548117\n", | |
"new : 0.015603672962549879\n", | |
"mr : 0.015050811381843318\n", | |
"president : 0.014311296435057167\n", | |
"games : 0.014180220098802127\n", | |
"season : 0.01293113138543077\n", | |
"league : 0.01049137855069683\n", | |
"players : 0.010349770248162016\n", | |
"play : 0.01027439631422798\n", | |
"hockey : 0.009974003061195056\n", | |
"time : 0.009694174123867787\n", | |
"best : 0.009409285751708707\n", | |
"price : 0.009381674293086624\n", | |
"years : 0.009373344778562087\n", | |
"win : 0.009083391252969874\n", | |
"stephanopoulos : 0.008137787607874035\n", | |
"got : 0.007609708926778383\n", | |
"\n", | |
"[Topic 1]\n", | |
"edu : 0.040291106052065644\n", | |
"file : 0.03401605842663057\n", | |
"com : 0.022202304614878616\n", | |
"ftp : 0.01489187024945181\n", | |
"available : 0.01487234441017172\n", | |
"program : 0.013900099687007892\n", | |
"files : 0.013469412901007923\n", | |
"mail : 0.01273118625252846\n", | |
"list : 0.012302522449955501\n", | |
"server : 0.012191608776669673\n", | |
"pub : 0.0117819009260124\n", | |
"send : 0.01161808163325555\n", | |
"information : 0.011328191056635484\n", | |
"email : 0.010819419501123815\n", | |
"faq : 0.010338836510401846\n", | |
"use : 0.010135228627400334\n", | |
"anonymous : 0.00967466854735926\n", | |
"entry : 0.009404082662349876\n", | |
"source : 0.008823763832099556\n", | |
"sun : 0.00874015157494\n", | |
"\n", | |
"[Topic 2]\n", | |
"space : 0.017520829980870224\n", | |
"government : 0.014764857355700099\n", | |
"law : 0.013984172594173342\n", | |
"public : 0.012742615885066335\n", | |
"new : 0.011835560307473852\n", | |
"university : 0.011455663004367131\n", | |
"use : 0.010558752454926011\n", | |
"information : 0.010523671907643252\n", | |
"national : 0.010005648977110936\n", | |
"research : 0.009837721562709869\n", | |
"state : 0.009253714677482188\n", | |
"states : 0.008693409057582452\n", | |
"data : 0.008137174327814675\n", | |
"general : 0.007869014386848868\n", | |
"1993 : 0.007725106577330365\n", | |
"privacy : 0.007621978472840824\n", | |
"nasa : 0.007618877183926785\n", | |
"control : 0.007514320482818136\n", | |
"center : 0.007103239931985107\n", | |
"technology : 0.006967116433727512\n", | |
"\n", | |
"[Topic 3]\n", | |
"people : 0.02589540888252561\n", | |
"don : 0.019051730185797963\n", | |
"just : 0.016890775051117458\n", | |
"think : 0.014747845951330912\n", | |
"know : 0.014199565695037612\n", | |
"like : 0.014113213308107737\n", | |
"said : 0.01087441440577609\n", | |
"time : 0.01068227137221108\n", | |
"right : 0.010536202795306702\n", | |
"did : 0.009462650925254168\n", | |
"say : 0.008821296230055412\n", | |
"going : 0.0079178707316679\n", | |
"want : 0.0075225944490121535\n", | |
"ve : 0.007291645923099413\n", | |
"way : 0.007098601161894257\n", | |
"didn : 0.007067349290356856\n", | |
"make : 0.00628634181601172\n", | |
"really : 0.006220033365741916\n", | |
"years : 0.006030323009314091\n", | |
"ll : 0.005758539222688551\n", | |
"\n", | |
"[Topic 4]\n", | |
"10 : 0.047226062225035684\n", | |
"00 : 0.033267653799259465\n", | |
"11 : 0.031173500801189236\n", | |
"12 : 0.03012370037786803\n", | |
"15 : 0.02990359767341557\n", | |
"25 : 0.029398719591355014\n", | |
"20 : 0.0287112182643199\n", | |
"14 : 0.026568760431583568\n", | |
"16 : 0.024643647803909763\n", | |
"17 : 0.02321676042163341\n", | |
"13 : 0.022892614277262482\n", | |
"18 : 0.020309457032141473\n", | |
"24 : 0.018746898985639325\n", | |
"40 : 0.017416012742783437\n", | |
"30 : 0.017264702358222788\n", | |
"55 : 0.017249979067923197\n", | |
"19 : 0.016906732935191217\n", | |
"21 : 0.016492869919936096\n", | |
"23 : 0.01548132532684356\n", | |
"22 : 0.015325721360768488\n", | |
"\n", | |
"[Topic 5]\n", | |
"windows : 0.027479651671251597\n", | |
"db : 0.02008130260617209\n", | |
"software : 0.019461723980915106\n", | |
"dos : 0.016645871396920555\n", | |
"card : 0.016629462252449215\n", | |
"image : 0.014905777621753546\n", | |
"disk : 0.014882547183197141\n", | |
"graphics : 0.014394671626531437\n", | |
"data : 0.014324999278027082\n", | |
"pc : 0.01285284399133799\n", | |
"color : 0.012760377141737953\n", | |
"mac : 0.012521671837075525\n", | |
"memory : 0.012288066764058091\n", | |
"window : 0.01172543924698812\n", | |
"version : 0.011359671702652232\n", | |
"use : 0.011144096720246894\n", | |
"display : 0.01070306970233049\n", | |
"using : 0.010212944450226524\n", | |
"bit : 0.010158024468026963\n", | |
"screen : 0.009825308617938854\n", | |
"\n", | |
"[Topic 6]\n", | |
"key : 0.036657227264248624\n", | |
"thanks : 0.03330919224244328\n", | |
"know : 0.02754614302969939\n", | |
"does : 0.023024362368053112\n", | |
"chip : 0.020030336089087997\n", | |
"use : 0.017336023309540434\n", | |
"encryption : 0.01713683715884816\n", | |
"help : 0.016496899538802713\n", | |
"like : 0.016002653922314678\n", | |
"mail : 0.015804000270816125\n", | |
"need : 0.01534427106958563\n", | |
"keys : 0.013747762178046729\n", | |
"looking : 0.013312665254122464\n", | |
"clipper : 0.012823434108060519\n", | |
"used : 0.012362006854300497\n", | |
"sound : 0.012103168531042614\n", | |
"hi : 0.011949674369115187\n", | |
"advance : 0.010790313074703295\n", | |
"information : 0.010637453739816545\n", | |
"bit : 0.010237134231452456\n", | |
"\n", | |
"[Topic 7]\n", | |
"god : 0.03677897297320968\n", | |
"jesus : 0.01755382907913231\n", | |
"does : 0.015816424138174412\n", | |
"believe : 0.013958023130356094\n", | |
"game : 0.012171745895151097\n", | |
"people : 0.01104281007944074\n", | |
"say : 0.011006544304419667\n", | |
"christian : 0.010853081080413829\n", | |
"true : 0.010630093025426314\n", | |
"bible : 0.01033931356085625\n", | |
"think : 0.009774171445882959\n", | |
"church : 0.00966870470103312\n", | |
"life : 0.00883349489937276\n", | |
"way : 0.007852699506432994\n", | |
"religion : 0.00759097523503712\n", | |
"christians : 0.0075449880919839655\n", | |
"christ : 0.0075396160943345175\n", | |
"faith : 0.007439660543789043\n", | |
"point : 0.007427660316865335\n", | |
"good : 0.007186456856356701\n", | |
"\n", | |
"[Topic 8]\n", | |
"drive : 0.022295739864303166\n", | |
"power : 0.01950734168199434\n", | |
"like : 0.018522559790376272\n", | |
"just : 0.016924517655010334\n", | |
"car : 0.016675980375469957\n", | |
"use : 0.01546275692548456\n", | |
"scsi : 0.013950686273385776\n", | |
"ve : 0.01392586768350613\n", | |
"good : 0.011218067098463793\n", | |
"speed : 0.011115387347522983\n", | |
"hard : 0.01099407199094689\n", | |
"used : 0.010715407919410537\n", | |
"don : 0.010212854864543285\n", | |
"problem : 0.01010922379681309\n", | |
"work : 0.009567260783500347\n", | |
"drives : 0.00822283821010156\n", | |
"buy : 0.00800970254393546\n", | |
"better : 0.007788251352635117\n", | |
"high : 0.0077583107983032525\n", | |
"does : 0.00731337975997651\n", | |
"\n", | |
"[Topic 9]\n", | |
"ax : 0.7750508616859831\n", | |
"max : 0.05676346028145709\n", | |
"g9v : 0.017637993832254468\n", | |
"b8f : 0.015315278818004914\n", | |
"a86 : 0.012646093392057575\n", | |
"145 : 0.010168944298400002\n", | |
"pl : 0.01012365428397651\n", | |
"1d9 : 0.008174180919536261\n", | |
"1t : 0.0065319516608289266\n", | |
"0t : 0.006459809882532864\n", | |
"bhj : 0.006110218348574932\n", | |
"giz : 0.005453049410747113\n", | |
"3t : 0.005447285579836966\n", | |
"34u : 0.005285937363655366\n", | |
"2di : 0.005090173874145282\n", | |
"75u : 0.00463261937952306\n", | |
"wm : 0.004518731940833898\n", | |
"2tm : 0.004222775559250272\n", | |
"7ey : 0.0036648191849338423\n", | |
"bxn : 0.0032500074436691783\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"num_top_words = 20\n", | |
"\n", | |
"print('Displaying the top %d words per topic and their probabilities within the topic...' % num_top_words)\n", | |
"print()\n", | |
"\n", | |
"for topic_idx in range(num_topics):\n", | |
" print('[Topic ', topic_idx, ']', sep='')\n", | |
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n", | |
" for rank in range(num_top_words):\n", | |
" word_idx = sort_indices[rank]\n", | |
" print(tf_vectorizer.get_feature_names()[word_idx], ':', topic_word_distributions[topic_idx, word_idx])\n", | |
" print()" | |
] | |
}, | |
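{
"cell_type": "markdown",
"metadata": {},
"source": [
"LDA also associates each document with a distribution over topics. As a brief aside (not needed for the rest of the demo; the variable name `doc_topic_distributions` is just for illustration), we can get these per-document topic proportions with `lda.transform`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# each row is a document's distribution over the 10 topics\n",
"doc_topic_distributions = lda.transform(tf)\n",
"print(doc_topic_distributions.shape)\n",
"print(doc_topic_distributions[0])  # topic proportions for the first document"
]
},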
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Computing co-occurrences of words\n", | |
"\n", | |
"Here, we count the number of newsgroup posts in which two words both occur. This part of the demo should feel like a review of co-occurrence analysis from earlier in the course, except now we use scikit-learn's built-in CountVectorizer. Conceptually everything else in the same as before." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"word1 = 'year'\n", | |
"word2 = 'team'\n", | |
"\n", | |
"word1_column_idx = tf_vectorizer.vocabulary_[word1]\n", | |
"word2_column_idx = tf_vectorizer.vocabulary_[word2]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2" | |
] | |
}, | |
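{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick look at the raw counts (using the variables defined above), we can print how many documents contain each word and how many contain both."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(word1, ':', documents_with_word1.sum())\n",
"print(word2, ':', documents_with_word2.sum())\n",
"print('both :', documents_with_both_word1_and_word2.sum())"
]
},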
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Next, we compute the log of the conditional probability of word 1 appearing given that word 2 appeared, where we add in a little bit of a fudge factor in the numerator (in this case, it's actually not needed but some times you do have two words that do not co-occur for which you run into a numerical issue due to taking the log of 0)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-1.5482462194376105" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"eps = 0.1\n", | |
"np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def prob_see_word1_given_see_word2(word1, word2, vectorizer, eps=0.1):\n", | |
" word1_column_idx = vectorizer.vocabulary_[word1]\n", | |
" word2_column_idx = vectorizer.vocabulary_[word2]\n", | |
" documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)\n", | |
" documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)\n", | |
" documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2\n", | |
" return np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())" | |
] | |
}, | |
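{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, calling the helper on the same pair of words as before (with the default fudge factor) should reproduce the value we computed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# should match the log2 conditional probability computed earlier\n",
"prob_see_word1_given_see_word2('year', 'team', tf_vectorizer)"
]
},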
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Topic coherence\n", | |
"\n", | |
"The below code shows how one implements the topic coherence calculation from lecture." | |
] | |
}, | |
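{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, the loop below restates the calculation as follows: for a topic whose top 20 words are $w_1, \\dots, w_{20}$, its coherence is\n",
"\n",
"$$\\sum_{i \\neq j} \\log_2 \\frac{\\#\\{\\text{docs containing both } w_i \\text{ and } w_j\\} + \\epsilon}{\\#\\{\\text{docs containing } w_j\\}},$$\n",
"\n",
"where the sum is over ordered pairs and $\\epsilon = 0.1$ is the fudge factor from above."
]
},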
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[Topic 0]\n", | |
"Coherence: -1356.3836721926853\n", | |
"\n", | |
"[Topic 1]\n", | |
"Coherence: -969.252344768849\n", | |
"\n", | |
"[Topic 2]\n", | |
"Coherence: -1038.5936491181455\n", | |
"\n", | |
"[Topic 3]\n", | |
"Coherence: -752.9744085675202\n", | |
"\n", | |
"[Topic 4]\n", | |
"Coherence: -641.5683154733748\n", | |
"\n", | |
"[Topic 5]\n", | |
"Coherence: -1155.6763255419658\n", | |
"\n", | |
"[Topic 6]\n", | |
"Coherence: -1177.3645847380105\n", | |
"\n", | |
"[Topic 7]\n", | |
"Coherence: -948.0033411181123\n", | |
"\n", | |
"[Topic 8]\n", | |
"Coherence: -1054.2809655411477\n", | |
"\n", | |
"[Topic 9]\n", | |
"Coherence: -217.079440424438\n", | |
"\n", | |
"Average coherence: -931.1177047484249\n" | |
] | |
} | |
], | |
"source": [ | |
"average_coherence = 0\n", | |
"for topic_idx in range(num_topics):\n", | |
" print('[Topic ', topic_idx, ']', sep='')\n", | |
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n", | |
" coherence = 0.\n", | |
" for top_word_idx1 in sort_indices[:num_top_words]:\n", | |
" word1 = tf_vectorizer.get_feature_names()[top_word_idx1]\n", | |
" for top_word_idx2 in sort_indices[:num_top_words]:\n", | |
" word2 = tf_vectorizer.get_feature_names()[top_word_idx2]\n", | |
" if top_word_idx1 != top_word_idx2:\n", | |
" coherence += prob_see_word1_given_see_word2(word1, word2, tf_vectorizer, 0.1)\n", | |
" print('Coherence:', coherence)\n", | |
" print()\n", | |
" average_coherence += coherence\n", | |
"average_coherence /= num_topics\n", | |
"print('Average coherence:', average_coherence)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Number of unique words\n", | |
"\n", | |
"The below code shows how one implements the number of unique words calculation from lecture." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[Topic 0]\n", | |
"Number of unique top words: 16\n", | |
"\n", | |
"[Topic 1]\n", | |
"Number of unique top words: 17\n", | |
"\n", | |
"[Topic 2]\n", | |
"Number of unique top words: 16\n", | |
"\n", | |
"[Topic 3]\n", | |
"Number of unique top words: 9\n", | |
"\n", | |
"[Topic 4]\n", | |
"Number of unique top words: 20\n", | |
"\n", | |
"[Topic 5]\n", | |
"Number of unique top words: 17\n", | |
"\n", | |
"[Topic 6]\n", | |
"Number of unique top words: 12\n", | |
"\n", | |
"[Topic 7]\n", | |
"Number of unique top words: 14\n", | |
"\n", | |
"[Topic 8]\n", | |
"Number of unique top words: 12\n", | |
"\n", | |
"[Topic 9]\n", | |
"Number of unique top words: 20\n", | |
"\n", | |
"Average number of unique top words: 15.3\n" | |
] | |
} | |
], | |
"source": [ | |
"average_number_of_unique_top_words = 0\n", | |
"for topic_idx1 in range(num_topics):\n", | |
" print('[Topic ', topic_idx1, ']', sep='')\n", | |
" sort_indices1 = np.argsort(topic_word_distributions[topic_idx1])[::-1]\n", | |
" num_unique_top_words = 0\n", | |
" for top_word_idx1 in sort_indices1[:num_top_words]:\n", | |
" word1 = tf_vectorizer.get_feature_names()[top_word_idx1]\n", | |
" break_ = False\n", | |
" for topic_idx2 in range(num_topics):\n", | |
" if topic_idx1 != topic_idx2:\n", | |
" sort_indices2 = np.argsort(topic_word_distributions[topic_idx2])[::-1]\n", | |
" for top_word_idx2 in sort_indices2[:num_top_words]:\n", | |
" word2 = tf_vectorizer.get_feature_names()[top_word_idx2]\n", | |
" if word1 == word2:\n", | |
" break_ = True\n", | |
" break\n", | |
" if break_:\n", | |
" break\n", | |
" else:\n", | |
" num_unique_top_words += 1\n", | |
" print('Number of unique top words:', num_unique_top_words)\n", | |
" print()\n", | |
" \n", | |
" average_number_of_unique_top_words += num_unique_top_words\n", | |
"average_number_of_unique_top_words /= num_topics\n", | |
"print('Average number of unique top words:', average_number_of_unique_top_words)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |