{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 94-775/95-865: Topic Modeling Demo\n",
"\n",
"Author: George H. Chen (georgechen [at symbol] cmu.edu)\n",
"\n",
"The beginning part of this demo is a shortened and modified version of sklearn's LDA & NMF demo (http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Latent Dirichlet Allocation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"num_articles = 10000\n",
"data = fetch_20newsgroups(shuffle=True, random_state=0,\n",
" remove=('headers', 'footers', 'quotes')).data[:num_articles]"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"The last name is Niedermayer, as in New Jersey's Scott's last name, because\n",
"(you guessed it) they are brothers. But Rob Niedermayer is a center, not\n",
"a defenseman.\n",
"\n",
"I am not sure that the Sharks will take Kariya. They aren't saying much, but\n",
"they apparently like Niedermayer and Victor Kozlov, along with Kariya. Chris\n",
"Pronger's name has also been mentioned. My guess is that they'll take\n",
"Niedermayer. They may take Pronger, except that they already have too many\n",
"defensive prospects.\n"
]
}
],
"source": [
"# you can take a look at what individual documents look like by replacing what index we look at\n",
"print(data[5])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 1000\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# CountVectorizer does tokenization and can remove terms that occur too frequently, not frequently enough, or that are stop words\n",
"\n",
"# document frequency (df) means number of documents a word appears in\n",
"tf_vectorizer = CountVectorizer(max_df=0.95,\n",
" min_df=2,\n",
" max_features=vocab_size,\n",
" stop_words='english')\n",
"tf = tf_vectorizer.fit_transform(data)"
]
},
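{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the term-document count matrix more concrete, here is a small illustrative sketch (not part of the original demo) that runs CountVectorizer on a made-up toy corpus and prints the resulting vocabulary and count matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# toy illustration (hypothetical example, not from the original demo):\n",
"# each row of the count matrix is a document, each column is a vocabulary term\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"toy_docs = ['the cat sat on the mat',\n",
"            'the dog sat on the log',\n",
"            'cats chase dogs']\n",
"toy_vectorizer = CountVectorizer()  # default settings; no df filtering for this tiny corpus\n",
"toy_counts = toy_vectorizer.fit_transform(toy_docs)\n",
"print(toy_vectorizer.get_feature_names())\n",
"print(toy_counts.toarray())"
]
},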
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"scipy.sparse.csr.csr_matrix"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(tf)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['00', '000', '02', '03', '04', '0d', '0t', '10', '100', '11', '12', '128', '13', '14', '145', '15', '16', '17', '18', '19', '1990', '1991', '1992', '1993', '1d9', '1st', '1t', '20', '200', '21', '22', '23', '24', '25', '250', '26', '27', '28', '29', '2di', '2tm', '30', '300', '31', '32', '33', '34', '34u', '35', '36', '37', '38', '39', '3d', '3t', '40', '41', '42', '43', '44', '45', '46', '48', '50', '500', '55', '60', '64', '6um', '70', '75', '75u', '7ey', '80', '800', '86', '90', '91', '92', '93', '9v', 'a86', 'ability', 'able', 'ac', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'addition', 'address', 'administration', 'advance', 'age', 'ago', 'agree', 'ah', 'air', 'al', 'algorithm', 'allow', 'allowed', 'alt', 'america', 'american', 'analysis', 'anonymous', 'answer', 'answers', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'applications', 'appreciate', 'appreciated', 'approach', 'appropriate', 'apr', 'april', 'archive', 'area', 'areas', 'aren', 'argument', 'armenia', 'armenian', 'armenians', 'arms', 'army', 'article', 'articles', 'ask', 'asked', 'asking', 'assume', 'atheism', 'attack', 'attempt', 'au', 'author', 'authority', 'available', 'average', 'avoid', 'away', 'ax', 'b8f', 'bad', 'base', 'based', 'basic', 'basically', 'basis', 'belief', 'believe', 'best', 'better', 'bh', 'bhj', 'bible', 'big', 'bike', 'bit', 'bits', 'bj', 'black', 'block', 'blood', 'board', 'body', 'book', 'books', 'bought', 'box', 'break', 'bring', 'brought', 'btw', 'buf', 'build', 'building', 'built', 'bus', 'business', 'buy', 'bxn', 'ca', 'cable', 'california', 'called', 'calls', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cd', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'changes', 'check', 'chicago', 'child', 'children', 'chip', 'chips', 'choice', 'christ', 'christian', 'christianity', 'christians', 'church', 'citizens', 'city', 'claim', 'claims', 'class', 'clear', 'clearly', 'clinton', 'clipper', 'close', 'code', 'color', 'com', 'come', 'comes', 'coming', 'command', 'comments', 'commercial', 'committee', 'common', 'community', 'comp', 'company', 'complete', 'completely', 'computer', 'condition', 'conference', 'congress', 'consider', 'considered', 'contact', 'contains', 'context', 'continue', 'control', 'controller', 'copy', 'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cross', 'cs', 'current', 'currently', 'cut', 'cx', 'data', 'date', 'dave', 'david', 'day', 'days', 'db', 'dc', 'dead', 'deal', 'death', 'dec', 'decided', 'defense', 'define', 'deleted', 'department', 'des', 'design', 'designed', 'details', 'development', 'device', 'devices', 'did', 'didn', 'difference', 'different', 'difficult', 'digital', 'directly', 'directory', 'discussion', 'disk', 'display', 'distribution', 'division', 'dod', 'does', 'doesn', 'doing', 'don', 'door', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'drives', 'drug', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'effect', 'electronic', 'email', 'encryption', 'end', 'enforcement', 'engine', 'entire', 'entry', 'environment', 'error', 'escrow', 'especially', 'event', 'events', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'existence', 'exists', 'expect', 'experience', 'explain', 'export', 'extra', 'face', 'fact', 'faith', 'false', 'family', 'faq', 'far', 'fast', 'faster', 'father', 'fax', 'fbi', 'features', 'federal', 'feel', 'field', 'figure', 'file', 'files', 'final', 
'finally', 'fine', 'firearms', 'floppy', 'folks', 'follow', 'following', 'food', 'force', 'form', 'format', 'free', 'freedom', 'friend', 'ftp', 'function', 'functions', 'future', 'g9v', 'game', 'games', 'gas', 'gave', 'general', 'generally', 'gets', 'getting', 'gif', 'given', 'gives', 'giz', 'gk', 'gm', 'goal', 'god', 'goes', 'going', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greek', 'ground', 'group', 'groups', 'guess', 'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'hard', 'hardware', 'haven', 'having', 'head', 'health', 'hear', 'heard', 'held', 'hell', 'help', 'hi', 'high', 'higher', 'history', 'hit', 'hockey', 'hold', 'home', 'hope', 'hours', 'house', 'hp', 'human', 'ibm', 'ide', 'idea', 'ideas', 'ii', 'image', 'images', 'imagine', 'important', 'include', 'included', 'includes', 'including', 'individual', 'info', 'information', 'input', 'inside', 'installed', 'instead', 'insurance', 'int', 'interested', 'interesting', 'interface', 'internal', 'international', 'internet', 'involved', 'isn', 'israel', 'israeli', 'issue', 'issues', 'jesus', 'jewish', 'jews', 'jim', 'job', 'jobs', 'john', 'jpeg', 'just', 'key', 'keyboard', 'keys', 'kill', 'killed', 'kind', 'knew', 'know', 'knowledge', 'known', 'knows', 'la', 'land', 'language', 'large', 'late', 'later', 'law', 'laws', 'league', 'learn', 'leave', 'left', 'legal', 'let', 'letter', 'level', 'library', 'life', 'light', 'like', 'likely', 'limited', 'line', 'lines', 'list', 'little', 'live', 'lives', 'living', 'll', 'local', 'long', 'longer', 'look', 'looked', 'looking', 'looks', 'lord', 'lost', 'lot', 'lots', 'love', 'low', 'lower', 'mac', 'machine', 'machines', 'mail', 'main', 'major', 'make', 'makes', 'making', 'man', 'manager', 'manual', 'mark', 'market', 'mass', 'master', 'material', 'matter', 'max', 'maybe', 'mb', 'mean', 'meaning', 'means', 'media', 'medical', 'members', 'memory', 'men', 'mention', 'mentioned', 'message', 'mike', 'miles', 'military', 'million', 'mind', 'mit', 'mode', 'model', 'modem', 'money', 'monitor', 'month', 'months', 'moral', 'mother', 'motif', 'mouse', 'mr', 'ms', 'multiple', 'nasa', 'national', 'nature', 'near', 'necessary', 'need', 'needed', 'needs', 'net', 'network', 'new', 'news', 'newsgroup', 'nhl', 'nice', 'night', 'non', 'normal', 'note', 'nsa', 'number', 'numbers', 'object', 'obvious', 'obviously', 'offer', 'office', 'official', 'oh', 'ok', 'old', 'ones', 'open', 'opinion', 'opinions', 'orbit', 'order', 'org', 'organization', 'original', 'os', 'output', 'outside', 'package', 'page', 'paper', 'particular', 'parts', 'party', 'past', 'paul', 'pay', 'pc', 'peace', 'people', 'perfect', 'performance', 'period', 'person', 'personal', 'phone', 'pick', 'picture', 'pin', 'pittsburgh', 'pl', 'place', 'places', 'plan', 'play', 'played', 'player', 'players', 'plus', 'point', 'points', 'police', 'policy', 'political', 'population', 'port', 'position', 'possible', 'possibly', 'post', 'posted', 'posting', 'power', 'pp', 'present', 'president', 'press', 'pretty', 'previous', 'price', 'printer', 'privacy', 'private', 'pro', 'probably', 'problem', 'problems', 'process', 'product', 'program', 'programs', 'project', 'protect', 'provide', 'provides', 'pub', 'public', 'published', 'purpose', 'qq', 'quality', 'question', 'questions', 'quite', 'radio', 'ram', 'range', 'rate', 'read', 'reading', 'real', 'really', 'reason', 'reasonable', 'reasons', 'received', 'recent', 'recently', 'record', 'red', 'reference', 'regular', 'related', 'release', 'religion', 'religious', 'remember', 'reply', 
'report', 'reports', 'request', 'require', 'required', 'requires', 'research', 'resources', 'response', 'rest', 'result', 'results', 'return', 'right', 'rights', 'road', 'rom', 'room', 'round', 'rules', 'run', 'running', 'runs', 'russian', 'safety', 'said', 'sale', 'san', 'save', 'saw', 'say', 'saying', 'says', 'school', 'sci', 'science', 'scientific', 'screen', 'scsi', 'search', 'season', 'second', 'secret', 'section', 'secure', 'security', 'seen', 'self', 'sell', 'send', 'sense', 'sent', 'serial', 'series', 'server', 'service', 'set', 'shall', 'shipping', 'short', 'shot', 'shuttle', 'similar', 'simple', 'simply', 'sin', 'single', 'site', 'sites', 'situation', 'size', 'small', 'society', 'software', 'solution', 'son', 'soon', 'sorry', 'sort', 'sound', 'sounds', 'source', 'sources', 'south', 'soviet', 'space', 'special', 'specific', 'speed', 'spirit', 'st', 'standard', 'start', 'started', 'state', 'statement', 'states', 'station', 'stephanopoulos', 'steve', 'stop', 'story', 'stream', 'street', 'strong', 'study', 'stuff', 'subject', 'suggest', 'sun', 'support', 'supports', 'supposed', 'sure', 'systems', 'taken', 'takes', 'taking', 'talk', 'talking', 'tape', 'tar', 'tax', 'team', 'teams', 'technical', 'technology', 'tell', 'term', 'terms', 'test', 'text', 'thank', 'thanks', 'theory', 'thing', 'things', 'think', 'thinking', 'thought', 'time', 'times', 'title', 'tm', 'today', 'told', 'took', 'tools', 'total', 'trade', 'transfer', 'tried', 'true', 'truth', 'try', 'trying', 'turkey', 'turkish', 'turn', 'tv', 'type', 'uk', 'understand', 'unfortunately', 'unit', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'useful', 'usenet', 'user', 'users', 'uses', 'using', 'usually', 'value', 'values', 'van', 'various', 've', 'version', 'vga', 'video', 'view', 'voice', 'volume', 'vs', 'wait', 'want', 'wanted', 'wants', 'war', 'washington', 'wasn', 'watch', 'water', 'way', 'ways', 'weapons', 'week', 'weeks', 'went', 'white', 'wide', 'widget', 'willing', 'win', 'window', 'windows', 'wish', 'wm', 'women', 'won', 'word', 'words', 'work', 'worked', 'working', 'works', 'world', 'worth', 'wouldn', 'write', 'writing', 'written', 'wrong', 'wrote', 'x11', 'xt', 'year', 'years', 'yes', 'york', 'young']\n"
]
}
],
"source": [
"print(tf_vectorizer.get_feature_names())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"965"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf_vectorizer.vocabulary_['week']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 2 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n",
" 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0\n",
" 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n",
" 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n"
]
}
],
"source": [
"print(tf[0].toarray())"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n",
" evaluate_every=-1, learning_decay=0.7,\n",
" learning_method='online', learning_offset=10.0,\n",
" max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,\n",
" n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,\n",
" random_state=0, topic_word_prior=None,\n",
" total_samples=1000000.0, verbose=0)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_topics = 10\n",
"\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"lda = LatentDirichletAllocation(n_components=num_topics, learning_method='online', random_state=0)\n",
"lda.fit(tf)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 1000)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lda.components_.shape"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"topic_word_distributions = np.array([topic_word_pseudocounts / np.sum(topic_word_pseudocounts)\n",
" for topic_word_pseudocounts in lda.components_])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"year : 0.03297118094739015\n",
"team : 0.01920466634388107\n",
"good : 0.016830138855415556\n",
"gun : 0.01617856783548117\n",
"new : 0.015603672962549879\n",
"mr : 0.015050811381843318\n",
"president : 0.014311296435057167\n",
"games : 0.014180220098802127\n",
"season : 0.01293113138543077\n",
"league : 0.01049137855069683\n",
"players : 0.010349770248162016\n",
"play : 0.01027439631422798\n",
"hockey : 0.009974003061195056\n",
"time : 0.009694174123867787\n",
"best : 0.009409285751708707\n",
"price : 0.009381674293086624\n",
"years : 0.009373344778562087\n",
"win : 0.009083391252969874\n",
"stephanopoulos : 0.008137787607874035\n",
"got : 0.007609708926778383\n",
"\n",
"[Topic 1]\n",
"edu : 0.040291106052065644\n",
"file : 0.03401605842663057\n",
"com : 0.022202304614878616\n",
"ftp : 0.01489187024945181\n",
"available : 0.01487234441017172\n",
"program : 0.013900099687007892\n",
"files : 0.013469412901007923\n",
"mail : 0.01273118625252846\n",
"list : 0.012302522449955501\n",
"server : 0.012191608776669673\n",
"pub : 0.0117819009260124\n",
"send : 0.01161808163325555\n",
"information : 0.011328191056635484\n",
"email : 0.010819419501123815\n",
"faq : 0.010338836510401846\n",
"use : 0.010135228627400334\n",
"anonymous : 0.00967466854735926\n",
"entry : 0.009404082662349876\n",
"source : 0.008823763832099556\n",
"sun : 0.00874015157494\n",
"\n",
"[Topic 2]\n",
"space : 0.017520829980870224\n",
"government : 0.014764857355700099\n",
"law : 0.013984172594173342\n",
"public : 0.012742615885066335\n",
"new : 0.011835560307473852\n",
"university : 0.011455663004367131\n",
"use : 0.010558752454926011\n",
"information : 0.010523671907643252\n",
"national : 0.010005648977110936\n",
"research : 0.009837721562709869\n",
"state : 0.009253714677482188\n",
"states : 0.008693409057582452\n",
"data : 0.008137174327814675\n",
"general : 0.007869014386848868\n",
"1993 : 0.007725106577330365\n",
"privacy : 0.007621978472840824\n",
"nasa : 0.007618877183926785\n",
"control : 0.007514320482818136\n",
"center : 0.007103239931985107\n",
"technology : 0.006967116433727512\n",
"\n",
"[Topic 3]\n",
"people : 0.02589540888252561\n",
"don : 0.019051730185797963\n",
"just : 0.016890775051117458\n",
"think : 0.014747845951330912\n",
"know : 0.014199565695037612\n",
"like : 0.014113213308107737\n",
"said : 0.01087441440577609\n",
"time : 0.01068227137221108\n",
"right : 0.010536202795306702\n",
"did : 0.009462650925254168\n",
"say : 0.008821296230055412\n",
"going : 0.0079178707316679\n",
"want : 0.0075225944490121535\n",
"ve : 0.007291645923099413\n",
"way : 0.007098601161894257\n",
"didn : 0.007067349290356856\n",
"make : 0.00628634181601172\n",
"really : 0.006220033365741916\n",
"years : 0.006030323009314091\n",
"ll : 0.005758539222688551\n",
"\n",
"[Topic 4]\n",
"10 : 0.047226062225035684\n",
"00 : 0.033267653799259465\n",
"11 : 0.031173500801189236\n",
"12 : 0.03012370037786803\n",
"15 : 0.02990359767341557\n",
"25 : 0.029398719591355014\n",
"20 : 0.0287112182643199\n",
"14 : 0.026568760431583568\n",
"16 : 0.024643647803909763\n",
"17 : 0.02321676042163341\n",
"13 : 0.022892614277262482\n",
"18 : 0.020309457032141473\n",
"24 : 0.018746898985639325\n",
"40 : 0.017416012742783437\n",
"30 : 0.017264702358222788\n",
"55 : 0.017249979067923197\n",
"19 : 0.016906732935191217\n",
"21 : 0.016492869919936096\n",
"23 : 0.01548132532684356\n",
"22 : 0.015325721360768488\n",
"\n",
"[Topic 5]\n",
"windows : 0.027479651671251597\n",
"db : 0.02008130260617209\n",
"software : 0.019461723980915106\n",
"dos : 0.016645871396920555\n",
"card : 0.016629462252449215\n",
"image : 0.014905777621753546\n",
"disk : 0.014882547183197141\n",
"graphics : 0.014394671626531437\n",
"data : 0.014324999278027082\n",
"pc : 0.01285284399133799\n",
"color : 0.012760377141737953\n",
"mac : 0.012521671837075525\n",
"memory : 0.012288066764058091\n",
"window : 0.01172543924698812\n",
"version : 0.011359671702652232\n",
"use : 0.011144096720246894\n",
"display : 0.01070306970233049\n",
"using : 0.010212944450226524\n",
"bit : 0.010158024468026963\n",
"screen : 0.009825308617938854\n",
"\n",
"[Topic 6]\n",
"key : 0.036657227264248624\n",
"thanks : 0.03330919224244328\n",
"know : 0.02754614302969939\n",
"does : 0.023024362368053112\n",
"chip : 0.020030336089087997\n",
"use : 0.017336023309540434\n",
"encryption : 0.01713683715884816\n",
"help : 0.016496899538802713\n",
"like : 0.016002653922314678\n",
"mail : 0.015804000270816125\n",
"need : 0.01534427106958563\n",
"keys : 0.013747762178046729\n",
"looking : 0.013312665254122464\n",
"clipper : 0.012823434108060519\n",
"used : 0.012362006854300497\n",
"sound : 0.012103168531042614\n",
"hi : 0.011949674369115187\n",
"advance : 0.010790313074703295\n",
"information : 0.010637453739816545\n",
"bit : 0.010237134231452456\n",
"\n",
"[Topic 7]\n",
"god : 0.03677897297320968\n",
"jesus : 0.01755382907913231\n",
"does : 0.015816424138174412\n",
"believe : 0.013958023130356094\n",
"game : 0.012171745895151097\n",
"people : 0.01104281007944074\n",
"say : 0.011006544304419667\n",
"christian : 0.010853081080413829\n",
"true : 0.010630093025426314\n",
"bible : 0.01033931356085625\n",
"think : 0.009774171445882959\n",
"church : 0.00966870470103312\n",
"life : 0.00883349489937276\n",
"way : 0.007852699506432994\n",
"religion : 0.00759097523503712\n",
"christians : 0.0075449880919839655\n",
"christ : 0.0075396160943345175\n",
"faith : 0.007439660543789043\n",
"point : 0.007427660316865335\n",
"good : 0.007186456856356701\n",
"\n",
"[Topic 8]\n",
"drive : 0.022295739864303166\n",
"power : 0.01950734168199434\n",
"like : 0.018522559790376272\n",
"just : 0.016924517655010334\n",
"car : 0.016675980375469957\n",
"use : 0.01546275692548456\n",
"scsi : 0.013950686273385776\n",
"ve : 0.01392586768350613\n",
"good : 0.011218067098463793\n",
"speed : 0.011115387347522983\n",
"hard : 0.01099407199094689\n",
"used : 0.010715407919410537\n",
"don : 0.010212854864543285\n",
"problem : 0.01010922379681309\n",
"work : 0.009567260783500347\n",
"drives : 0.00822283821010156\n",
"buy : 0.00800970254393546\n",
"better : 0.007788251352635117\n",
"high : 0.0077583107983032525\n",
"does : 0.00731337975997651\n",
"\n",
"[Topic 9]\n",
"ax : 0.7750508616859831\n",
"max : 0.05676346028145709\n",
"g9v : 0.017637993832254468\n",
"b8f : 0.015315278818004914\n",
"a86 : 0.012646093392057575\n",
"145 : 0.010168944298400002\n",
"pl : 0.01012365428397651\n",
"1d9 : 0.008174180919536261\n",
"1t : 0.0065319516608289266\n",
"0t : 0.006459809882532864\n",
"bhj : 0.006110218348574932\n",
"giz : 0.005453049410747113\n",
"3t : 0.005447285579836966\n",
"34u : 0.005285937363655366\n",
"2di : 0.005090173874145282\n",
"75u : 0.00463261937952306\n",
"wm : 0.004518731940833898\n",
"2tm : 0.004222775559250272\n",
"7ey : 0.0036648191849338423\n",
"bxn : 0.0032500074436691783\n",
"\n"
]
}
],
"source": [
"num_top_words = 20\n",
"\n",
"print('Displaying the top %d words per topic and their probabilities within the topic...' % num_top_words)\n",
"print()\n",
"\n",
"for topic_idx in range(num_topics):\n",
" print('[Topic ', topic_idx, ']', sep='')\n",
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n",
" for rank in range(num_top_words):\n",
" word_idx = sort_indices[rank]\n",
" print(tf_vectorizer.get_feature_names()[word_idx], ':', topic_word_distributions[topic_idx, word_idx])\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing co-occurrences of words\n",
"\n",
"Here, we count the number of newsgroup posts in which two words both occur. This part of the demo should feel like a review of co-occurrence analysis from earlier in the course, except now we use scikit-learn's built-in CountVectorizer. Conceptually everything else in the same as before."
]
},
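{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, here is a vectorized sketch (not in the original demo) of how the full word-by-word co-occurrence count matrix can be computed directly from the sparse count matrix tf: binarize the counts and take an inner product, so that entry (i, j) is the number of documents containing both word i and word j."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# vectorized sketch (assumes tf and tf_vectorizer from the cells above):\n",
"# for every pair of vocabulary words, count the documents in which both appear\n",
"binary_tf = (tf > 0).astype(int)  # 1 if a word appears in a document, else 0\n",
"cooccurrence_counts = binary_tf.T.dot(binary_tf).toarray()  # shape: (vocab_size, vocab_size)\n",
"\n",
"year_idx = tf_vectorizer.vocabulary_['year']\n",
"team_idx = tf_vectorizer.vocabulary_['team']\n",
"print(cooccurrence_counts[year_idx, team_idx])  # number of documents containing both 'year' and 'team'"
]
},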
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"word1 = 'year'\n",
"word2 = 'team'\n",
"\n",
"word1_column_idx = tf_vectorizer.vocabulary_[word1]\n",
"word2_column_idx = tf_vectorizer.vocabulary_[word2]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we compute the log of the conditional probability of word 1 appearing given that word 2 appeared, where we add in a little bit of a fudge factor in the numerator (in this case, it's actually not needed but some times you do have two words that do not co-occur for which you run into a numerical issue due to taking the log of 0)."
]
},
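{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols, writing $D(\\cdot)$ for the number of documents containing a word (or a pair of words) and $\\epsilon$ for the fudge factor, the next cell computes\n",
"\n",
"$$\\log_2 \\frac{D(w_1, w_2) + \\epsilon}{D(w_2)},$$\n",
"\n",
"i.e., the (smoothed) log of the fraction of documents containing word 2 that also contain word 1."
]
},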
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1.5482462194376105"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eps = 0.1\n",
"np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def prob_see_word1_given_see_word2(word1, word2, vectorizer, eps=0.1):\n",
" word1_column_idx = vectorizer.vocabulary_[word1]\n",
" word2_column_idx = vectorizer.vocabulary_[word2]\n",
" documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)\n",
" documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)\n",
" documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2\n",
" return np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Topic coherence\n",
"\n",
"The below code shows how one implements the topic coherence calculation from lecture."
]
},
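{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, for a topic whose top $M$ words are $w_1, \\ldots, w_M$ (here $M = 20$), the coherence computed below sums the smoothed log conditional probabilities from the previous section over all ordered pairs of distinct top words:\n",
"\n",
"$$\\text{coherence} = \\sum_{i \\neq j} \\log_2 \\frac{D(w_i, w_j) + \\epsilon}{D(w_j)}.$$"
]
},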
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Topic 0]\n",
"Coherence: -1356.3836721926853\n",
"\n",
"[Topic 1]\n",
"Coherence: -969.252344768849\n",
"\n",
"[Topic 2]\n",
"Coherence: -1038.5936491181455\n",
"\n",
"[Topic 3]\n",
"Coherence: -752.9744085675202\n",
"\n",
"[Topic 4]\n",
"Coherence: -641.5683154733748\n",
"\n",
"[Topic 5]\n",
"Coherence: -1155.6763255419658\n",
"\n",
"[Topic 6]\n",
"Coherence: -1177.3645847380105\n",
"\n",
"[Topic 7]\n",
"Coherence: -948.0033411181123\n",
"\n",
"[Topic 8]\n",
"Coherence: -1054.2809655411477\n",
"\n",
"[Topic 9]\n",
"Coherence: -217.079440424438\n",
"\n",
"Average coherence: -931.1177047484249\n"
]
}
],
"source": [
"average_coherence = 0\n",
"for topic_idx in range(num_topics):\n",
" print('[Topic ', topic_idx, ']', sep='')\n",
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n",
" coherence = 0.\n",
" for top_word_idx1 in sort_indices[:num_top_words]:\n",
" word1 = tf_vectorizer.get_feature_names()[top_word_idx1]\n",
" for top_word_idx2 in sort_indices[:num_top_words]:\n",
" word2 = tf_vectorizer.get_feature_names()[top_word_idx2]\n",
" if top_word_idx1 != top_word_idx2:\n",
" coherence += prob_see_word1_given_see_word2(word1, word2, tf_vectorizer, 0.1)\n",
" print('Coherence:', coherence)\n",
" print()\n",
" average_coherence += coherence\n",
"average_coherence /= num_topics\n",
"print('Average coherence:', average_coherence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of unique words\n",
"\n",
"The below code shows how one implements the number of unique words calculation from lecture."
]
},
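{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, a compact set-based sketch (not part of the original demo) of the same idea: a topic's top word counts as unique if it is not among the top words of any other topic. It assumes the variables defined above (topic_word_distributions, tf_vectorizer, num_topics, num_top_words)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# set-based sketch (assumes topic_word_distributions, tf_vectorizer,\n",
"# num_topics, and num_top_words are already defined as in the cells above)\n",
"feature_names = tf_vectorizer.get_feature_names()\n",
"top_word_sets = [set(feature_names[idx]\n",
"                     for idx in np.argsort(dist)[::-1][:num_top_words])\n",
"                 for dist in topic_word_distributions]\n",
"for topic_idx, top_words in enumerate(top_word_sets):\n",
"    other_top_words = set().union(*[top_word_sets[other_idx]\n",
"                                    for other_idx in range(num_topics)\n",
"                                    if other_idx != topic_idx])\n",
"    print('[Topic %d] number of unique top words: %d'\n",
"          % (topic_idx, len(top_words - other_top_words)))"
]
},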
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Topic 0]\n",
"Number of unique top words: 16\n",
"\n",
"[Topic 1]\n",
"Number of unique top words: 17\n",
"\n",
"[Topic 2]\n",
"Number of unique top words: 16\n",
"\n",
"[Topic 3]\n",
"Number of unique top words: 9\n",
"\n",
"[Topic 4]\n",
"Number of unique top words: 20\n",
"\n",
"[Topic 5]\n",
"Number of unique top words: 17\n",
"\n",
"[Topic 6]\n",
"Number of unique top words: 12\n",
"\n",
"[Topic 7]\n",
"Number of unique top words: 14\n",
"\n",
"[Topic 8]\n",
"Number of unique top words: 12\n",
"\n",
"[Topic 9]\n",
"Number of unique top words: 20\n",
"\n",
"Average number of unique top words: 15.3\n"
]
}
],
"source": [
"average_number_of_unique_top_words = 0\n",
"for topic_idx1 in range(num_topics):\n",
" print('[Topic ', topic_idx1, ']', sep='')\n",
" sort_indices1 = np.argsort(topic_word_distributions[topic_idx1])[::-1]\n",
" num_unique_top_words = 0\n",
" for top_word_idx1 in sort_indices1[:num_top_words]:\n",
" word1 = tf_vectorizer.get_feature_names()[top_word_idx1]\n",
" break_ = False\n",
" for topic_idx2 in range(num_topics):\n",
" if topic_idx1 != topic_idx2:\n",
" sort_indices2 = np.argsort(topic_word_distributions[topic_idx2])[::-1]\n",
" for top_word_idx2 in sort_indices2[:num_top_words]:\n",
" word2 = tf_vectorizer.get_feature_names()[top_word_idx2]\n",
" if word1 == word2:\n",
" break_ = True\n",
" break\n",
" if break_:\n",
" break\n",
" else:\n",
" num_unique_top_words += 1\n",
" print('Number of unique top words:', num_unique_top_words)\n",
" print()\n",
" \n",
" average_number_of_unique_top_words += num_unique_top_words\n",
"average_number_of_unique_top_words /= num_topics\n",
"print('Average number of unique top words:', average_number_of_unique_top_words)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}