{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 94-775/95-865: Topic Modeling Demo\n", | |
"\n", | |
"Author: George H. Chen (georgechen [at symbol] cmu.edu)\n", | |
"\n", | |
"The beginning part of this demo is a shortened and modified version of sklearn's LDA & NMF demo (http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Latent Dirichlet Allocation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.datasets import fetch_20newsgroups\n", | |
"num_articles = 10000\n", | |
"data = fetch_20newsgroups(shuffle=True, random_state=0,\n", | |
" remove=('headers', 'footers', 'quotes')).data[:num_articles]" | |
] | |
}, | |
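{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional check (not needed for the rest of the demo), we can confirm how many documents we actually loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: number of documents loaded\n",
"len(data)"
]
},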
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\n", | |
"\n", | |
"The last name is Niedermayer, as in New Jersey's Scott's last name, because\n", | |
"(you guessed it) they are brothers. But Rob Niedermayer is a center, not\n", | |
"a defenseman.\n", | |
"\n", | |
"I am not sure that the Sharks will take Kariya. They aren't saying much, but\n", | |
"they apparently like Niedermayer and Victor Kozlov, along with Kariya. Chris\n", | |
"Pronger's name has also been mentioned. My guess is that they'll take\n", | |
"Niedermayer. They may take Pronger, except that they already have too many\n", | |
"defensive prospects.\n" | |
] | |
} | |
], | |
"source": [ | |
"# you can take a look at what individual documents look like by replacing what index we look at\n", | |
"print(data[5])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"vocab_size = 1000\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"\n", | |
"# CountVectorizer does tokenization and can remove terms that occur too frequently, not frequently enough, or that are stop words\n", | |
"\n", | |
"# document frequency (df) means number of documents a word appears in\n", | |
"tf_vectorizer = CountVectorizer(max_df=0.95,\n", | |
" min_df=2,\n", | |
" max_features=vocab_size,\n", | |
" stop_words='english')\n", | |
"tf = tf_vectorizer.fit_transform(data)" | |
] | |
}, | |
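{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the resulting term-frequency matrix `tf` should have one row per document and one column per vocabulary word."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# (number of documents, vocabulary size)\n",
"tf.shape"
]
},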
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"scipy.sparse.csr.csr_matrix" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"type(tf)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['00', '000', '02', '03', '04', '0d', '0t', '10', '100', '11', '12', '128', '13', '14', '145', '15', '16', '17', '18', '19', '1990', '1991', '1992', '1993', '1d9', '1st', '1t', '20', '200', '21', '22', '23', '24', '25', '250', '26', '27', '28', '29', '2di', '2tm', '30', '300', '31', '32', '33', '34', '34u', '35', '36', '37', '38', '39', '3d', '3t', '40', '41', '42', '43', '44', '45', '46', '48', '50', '500', '55', '60', '64', '6um', '70', '75', '75u', '7ey', '80', '800', '86', '90', '91', '92', '93', '9v', 'a86', 'ability', 'able', 'ac', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'addition', 'address', 'administration', 'advance', 'age', 'ago', 'agree', 'ah', 'air', 'al', 'algorithm', 'allow', 'allowed', 'alt', 'america', 'american', 'analysis', 'anonymous', 'answer', 'answers', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'applications', 'appreciate', 'appreciated', 'approach', 'appropriate', 'apr', 'april', 'archive', 'area', 'areas', 'aren', 'argument', 'armenia', 'armenian', 'armenians', 'arms', 'army', 'article', 'articles', 'ask', 'asked', 'asking', 'assume', 'atheism', 'attack', 'attempt', 'au', 'author', 'authority', 'available', 'average', 'avoid', 'away', 'ax', 'b8f', 'bad', 'base', 'based', 'basic', 'basically', 'basis', 'belief', 'believe', 'best', 'better', 'bh', 'bhj', 'bible', 'big', 'bike', 'bit', 'bits', 'bj', 'black', 'block', 'blood', 'board', 'body', 'book', 'books', 'bought', 'box', 'break', 'bring', 'brought', 'btw', 'buf', 'build', 'building', 'built', 'bus', 'business', 'buy', 'bxn', 'ca', 'cable', 'california', 'called', 'calls', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cd', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'changes', 'check', 'chicago', 'child', 'children', 'chip', 'chips', 'choice', 'christ', 'christian', 'christianity', 'christians', 'church', 'citizens', 'city', 'claim', 'claims', 'class', 'clear', 'clearly', 'clinton', 'clipper', 'close', 'code', 'color', 'com', 'come', 'comes', 'coming', 'command', 'comments', 'commercial', 'committee', 'common', 'community', 'comp', 'company', 'complete', 'completely', 'computer', 'condition', 'conference', 'congress', 'consider', 'considered', 'contact', 'contains', 'context', 'continue', 'control', 'controller', 'copy', 'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cross', 'cs', 'current', 'currently', 'cut', 'cx', 'data', 'date', 'dave', 'david', 'day', 'days', 'db', 'dc', 'dead', 'deal', 'death', 'dec', 'decided', 'defense', 'define', 'deleted', 'department', 'des', 'design', 'designed', 'details', 'development', 'device', 'devices', 'did', 'didn', 'difference', 'different', 'difficult', 'digital', 'directly', 'directory', 'discussion', 'disk', 'display', 'distribution', 'division', 'dod', 'does', 'doesn', 'doing', 'don', 'door', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'drives', 'drug', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'effect', 'electronic', 'email', 'encryption', 'end', 'enforcement', 'engine', 'entire', 'entry', 'environment', 'error', 'escrow', 'especially', 'event', 'events', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'existence', 'exists', 'expect', 'experience', 'explain', 'export', 'extra', 'face', 'fact', 'faith', 'false', 'family', 'faq', 'far', 'fast', 'faster', 'father', 'fax', 'fbi', 'features', 'federal', 'feel', 'field', 'figure', 'file', 'files', 'final', 
'finally', 'fine', 'firearms', 'floppy', 'folks', 'follow', 'following', 'food', 'force', 'form', 'format', 'free', 'freedom', 'friend', 'ftp', 'function', 'functions', 'future', 'g9v', 'game', 'games', 'gas', 'gave', 'general', 'generally', 'gets', 'getting', 'gif', 'given', 'gives', 'giz', 'gk', 'gm', 'goal', 'god', 'goes', 'going', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greek', 'ground', 'group', 'groups', 'guess', 'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'hard', 'hardware', 'haven', 'having', 'head', 'health', 'hear', 'heard', 'held', 'hell', 'help', 'hi', 'high', 'higher', 'history', 'hit', 'hockey', 'hold', 'home', 'hope', 'hours', 'house', 'hp', 'human', 'ibm', 'ide', 'idea', 'ideas', 'ii', 'image', 'images', 'imagine', 'important', 'include', 'included', 'includes', 'including', 'individual', 'info', 'information', 'input', 'inside', 'installed', 'instead', 'insurance', 'int', 'interested', 'interesting', 'interface', 'internal', 'international', 'internet', 'involved', 'isn', 'israel', 'israeli', 'issue', 'issues', 'jesus', 'jewish', 'jews', 'jim', 'job', 'jobs', 'john', 'jpeg', 'just', 'key', 'keyboard', 'keys', 'kill', 'killed', 'kind', 'knew', 'know', 'knowledge', 'known', 'knows', 'la', 'land', 'language', 'large', 'late', 'later', 'law', 'laws', 'league', 'learn', 'leave', 'left', 'legal', 'let', 'letter', 'level', 'library', 'life', 'light', 'like', 'likely', 'limited', 'line', 'lines', 'list', 'little', 'live', 'lives', 'living', 'll', 'local', 'long', 'longer', 'look', 'looked', 'looking', 'looks', 'lord', 'lost', 'lot', 'lots', 'love', 'low', 'lower', 'mac', 'machine', 'machines', 'mail', 'main', 'major', 'make', 'makes', 'making', 'man', 'manager', 'manual', 'mark', 'market', 'mass', 'master', 'material', 'matter', 'max', 'maybe', 'mb', 'mean', 'meaning', 'means', 'media', 'medical', 'members', 'memory', 'men', 'mention', 'mentioned', 'message', 'mike', 'miles', 'military', 'million', 'mind', 'mit', 'mode', 'model', 'modem', 'money', 'monitor', 'month', 'months', 'moral', 'mother', 'motif', 'mouse', 'mr', 'ms', 'multiple', 'nasa', 'national', 'nature', 'near', 'necessary', 'need', 'needed', 'needs', 'net', 'network', 'new', 'news', 'newsgroup', 'nhl', 'nice', 'night', 'non', 'normal', 'note', 'nsa', 'number', 'numbers', 'object', 'obvious', 'obviously', 'offer', 'office', 'official', 'oh', 'ok', 'old', 'ones', 'open', 'opinion', 'opinions', 'orbit', 'order', 'org', 'organization', 'original', 'os', 'output', 'outside', 'package', 'page', 'paper', 'particular', 'parts', 'party', 'past', 'paul', 'pay', 'pc', 'peace', 'people', 'perfect', 'performance', 'period', 'person', 'personal', 'phone', 'pick', 'picture', 'pin', 'pittsburgh', 'pl', 'place', 'places', 'plan', 'play', 'played', 'player', 'players', 'plus', 'point', 'points', 'police', 'policy', 'political', 'population', 'port', 'position', 'possible', 'possibly', 'post', 'posted', 'posting', 'power', 'pp', 'present', 'president', 'press', 'pretty', 'previous', 'price', 'printer', 'privacy', 'private', 'pro', 'probably', 'problem', 'problems', 'process', 'product', 'program', 'programs', 'project', 'protect', 'provide', 'provides', 'pub', 'public', 'published', 'purpose', 'qq', 'quality', 'question', 'questions', 'quite', 'radio', 'ram', 'range', 'rate', 'read', 'reading', 'real', 'really', 'reason', 'reasonable', 'reasons', 'received', 'recent', 'recently', 'record', 'red', 'reference', 'regular', 'related', 'release', 'religion', 'religious', 'remember', 'reply', 
'report', 'reports', 'request', 'require', 'required', 'requires', 'research', 'resources', 'response', 'rest', 'result', 'results', 'return', 'right', 'rights', 'road', 'rom', 'room', 'round', 'rules', 'run', 'running', 'runs', 'russian', 'safety', 'said', 'sale', 'san', 'save', 'saw', 'say', 'saying', 'says', 'school', 'sci', 'science', 'scientific', 'screen', 'scsi', 'search', 'season', 'second', 'secret', 'section', 'secure', 'security', 'seen', 'self', 'sell', 'send', 'sense', 'sent', 'serial', 'series', 'server', 'service', 'set', 'shall', 'shipping', 'short', 'shot', 'shuttle', 'similar', 'simple', 'simply', 'sin', 'single', 'site', 'sites', 'situation', 'size', 'small', 'society', 'software', 'solution', 'son', 'soon', 'sorry', 'sort', 'sound', 'sounds', 'source', 'sources', 'south', 'soviet', 'space', 'special', 'specific', 'speed', 'spirit', 'st', 'standard', 'start', 'started', 'state', 'statement', 'states', 'station', 'stephanopoulos', 'steve', 'stop', 'story', 'stream', 'street', 'strong', 'study', 'stuff', 'subject', 'suggest', 'sun', 'support', 'supports', 'supposed', 'sure', 'systems', 'taken', 'takes', 'taking', 'talk', 'talking', 'tape', 'tar', 'tax', 'team', 'teams', 'technical', 'technology', 'tell', 'term', 'terms', 'test', 'text', 'thank', 'thanks', 'theory', 'thing', 'things', 'think', 'thinking', 'thought', 'time', 'times', 'title', 'tm', 'today', 'told', 'took', 'tools', 'total', 'trade', 'transfer', 'tried', 'true', 'truth', 'try', 'trying', 'turkey', 'turkish', 'turn', 'tv', 'type', 'uk', 'understand', 'unfortunately', 'unit', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'useful', 'usenet', 'user', 'users', 'uses', 'using', 'usually', 'value', 'values', 'van', 'various', 've', 'version', 'vga', 'video', 'view', 'voice', 'volume', 'vs', 'wait', 'want', 'wanted', 'wants', 'war', 'washington', 'wasn', 'watch', 'water', 'way', 'ways', 'weapons', 'week', 'weeks', 'went', 'white', 'wide', 'widget', 'willing', 'win', 'window', 'windows', 'wish', 'wm', 'women', 'won', 'word', 'words', 'work', 'worked', 'working', 'works', 'world', 'worth', 'wouldn', 'write', 'writing', 'written', 'wrong', 'wrote', 'x11', 'xt', 'year', 'years', 'yes', 'york', 'young']\n" | |
] | |
} | |
], | |
"source": [ | |
"print(tf_vectorizer.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"965" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"tf_vectorizer.vocabulary_['week']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 2 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n", | |
" 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1\n", | |
" 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n" | |
] | |
} | |
], | |
"source": [ | |
"print(tf[0].toarray())" | |
] | |
}, | |
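{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw count vector above is hard to read. As a small aside (the variable names below are just for illustration), we can map the nonzero entries back to the words they correspond to."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# show only the words with nonzero counts in the first document\n",
"feature_names = tf_vectorizer.get_feature_names()\n",
"first_doc_counts = tf[0].toarray().flatten()\n",
"for word_idx in first_doc_counts.nonzero()[0]:\n",
"    print(feature_names[word_idx], ':', first_doc_counts[word_idx])"
]
},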
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n", | |
" evaluate_every=-1, learning_decay=0.7,\n", | |
" learning_method='online', learning_offset=10.0,\n", | |
" max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,\n", | |
" n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,\n", | |
" random_state=0, topic_word_prior=None,\n", | |
" total_samples=1000000.0, verbose=0)" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"num_topics = 10\n", | |
"\n", | |
"from sklearn.decomposition import LatentDirichletAllocation\n", | |
"lda = LatentDirichletAllocation(n_components=num_topics, learning_method='online', random_state=0)\n", | |
"lda.fit(tf)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(10, 1000)" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"lda.components_.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"topic_word_distributions = np.array([topic_word_pseudocounts / np.sum(topic_word_pseudocounts)\n", | |
" for topic_word_pseudocounts in lda.components_])" | |
] | |
}, | |
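{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of `topic_word_distributions` should now be a probability distribution over the 1000 vocabulary words; as an optional check, every row should sum to (approximately) 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional check: each topic's word distribution sums to 1\n",
"topic_word_distributions.sum(axis=1)"
]
},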
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Displaying the top 20 words per topic and their probabilities within the topic...\n", | |
"\n", | |
"[Topic 0]\n", | |
"year : 0.03297118094739015\n", | |
"team : 0.01920466634388107\n", | |
"good : 0.016830138855415556\n", | |
"gun : 0.01617856783548117\n", | |
"new : 0.015603672962549879\n", | |
"mr : 0.015050811381843318\n", | |
"president : 0.014311296435057167\n", | |
"games : 0.014180220098802127\n", | |
"season : 0.01293113138543077\n", | |
"league : 0.01049137855069683\n", | |
"players : 0.010349770248162016\n", | |
"play : 0.01027439631422798\n", | |
"hockey : 0.009974003061195056\n", | |
"time : 0.009694174123867787\n", | |
"best : 0.009409285751708707\n", | |
"price : 0.009381674293086624\n", | |
"years : 0.009373344778562087\n", | |
"win : 0.009083391252969874\n", | |
"stephanopoulos : 0.008137787607874035\n", | |
"got : 0.007609708926778383\n", | |
"\n", | |
"[Topic 1]\n", | |
"edu : 0.040291106052065644\n", | |
"file : 0.03401605842663057\n", | |
"com : 0.022202304614878616\n", | |
"ftp : 0.01489187024945181\n", | |
"available : 0.01487234441017172\n", | |
"program : 0.013900099687007892\n", | |
"files : 0.013469412901007923\n", | |
"mail : 0.01273118625252846\n", | |
"list : 0.012302522449955501\n", | |
"server : 0.012191608776669673\n", | |
"pub : 0.0117819009260124\n", | |
"send : 0.01161808163325555\n", | |
"information : 0.011328191056635484\n", | |
"email : 0.010819419501123815\n", | |
"faq : 0.010338836510401846\n", | |
"use : 0.010135228627400334\n", | |
"anonymous : 0.00967466854735926\n", | |
"entry : 0.009404082662349876\n", | |
"source : 0.008823763832099556\n", | |
"sun : 0.00874015157494\n", | |
"\n", | |
"[Topic 2]\n", | |
"space : 0.017520829980870224\n", | |
"government : 0.014764857355700099\n", | |
"law : 0.013984172594173342\n", | |
"public : 0.012742615885066335\n", | |
"new : 0.011835560307473852\n", | |
"university : 0.011455663004367131\n", | |
"use : 0.010558752454926011\n", | |
"information : 0.010523671907643252\n", | |
"national : 0.010005648977110936\n", | |
"research : 0.009837721562709869\n", | |
"state : 0.009253714677482188\n", | |
"states : 0.008693409057582452\n", | |
"data : 0.008137174327814675\n", | |
"general : 0.007869014386848868\n", | |
"1993 : 0.007725106577330365\n", | |
"privacy : 0.007621978472840824\n", | |
"nasa : 0.007618877183926785\n", | |
"control : 0.007514320482818136\n", | |
"center : 0.007103239931985107\n", | |
"technology : 0.006967116433727512\n", | |
"\n", | |
"[Topic 3]\n", | |
"people : 0.02589540888252561\n", | |
"don : 0.019051730185797963\n", | |
"just : 0.016890775051117458\n", | |
"think : 0.014747845951330912\n", | |
"know : 0.014199565695037612\n", | |
"like : 0.014113213308107737\n", | |
"said : 0.01087441440577609\n", | |
"time : 0.01068227137221108\n", | |
"right : 0.010536202795306702\n", | |
"did : 0.009462650925254168\n", | |
"say : 0.008821296230055412\n", | |
"going : 0.0079178707316679\n", | |
"want : 0.0075225944490121535\n", | |
"ve : 0.007291645923099413\n", | |
"way : 0.007098601161894257\n", | |
"didn : 0.007067349290356856\n", | |
"make : 0.00628634181601172\n", | |
"really : 0.006220033365741916\n", | |
"years : 0.006030323009314091\n", | |
"ll : 0.005758539222688551\n", | |
"\n", | |
"[Topic 4]\n", | |
"10 : 0.047226062225035684\n", | |
"00 : 0.033267653799259465\n", | |
"11 : 0.031173500801189236\n", | |
"12 : 0.03012370037786803\n", | |
"15 : 0.02990359767341557\n", | |
"25 : 0.029398719591355014\n", | |
"20 : 0.0287112182643199\n", | |
"14 : 0.026568760431583568\n", | |
"16 : 0.024643647803909763\n", | |
"17 : 0.02321676042163341\n", | |
"13 : 0.022892614277262482\n", | |
"18 : 0.020309457032141473\n", | |
"24 : 0.018746898985639325\n", | |
"40 : 0.017416012742783437\n", | |
"30 : 0.017264702358222788\n", | |
"55 : 0.017249979067923197\n", | |
"19 : 0.016906732935191217\n", | |
"21 : 0.016492869919936096\n", | |
"23 : 0.01548132532684356\n", | |
"22 : 0.015325721360768488\n", | |
"\n", | |
"[Topic 5]\n", | |
"windows : 0.027479651671251597\n", | |
"db : 0.02008130260617209\n", | |
"software : 0.019461723980915106\n", | |
"dos : 0.016645871396920555\n", | |
"card : 0.016629462252449215\n", | |
"image : 0.014905777621753546\n", | |
"disk : 0.014882547183197141\n", | |
"graphics : 0.014394671626531437\n", | |
"data : 0.014324999278027082\n", | |
"pc : 0.01285284399133799\n", | |
"color : 0.012760377141737953\n", | |
"mac : 0.012521671837075525\n", | |
"memory : 0.012288066764058091\n", | |
"window : 0.01172543924698812\n", | |
"version : 0.011359671702652232\n", | |
"use : 0.011144096720246894\n", | |
"display : 0.01070306970233049\n", | |
"using : 0.010212944450226524\n", | |
"bit : 0.010158024468026963\n", | |
"screen : 0.009825308617938854\n", | |
"\n", | |
"[Topic 6]\n", | |
"key : 0.036657227264248624\n", | |
"thanks : 0.03330919224244328\n", | |
"know : 0.02754614302969939\n", | |
"does : 0.023024362368053112\n", | |
"chip : 0.020030336089087997\n", | |
"use : 0.017336023309540434\n", | |
"encryption : 0.01713683715884816\n", | |
"help : 0.016496899538802713\n", | |
"like : 0.016002653922314678\n", | |
"mail : 0.015804000270816125\n", | |
"need : 0.01534427106958563\n", | |
"keys : 0.013747762178046729\n", | |
"looking : 0.013312665254122464\n", | |
"clipper : 0.012823434108060519\n", | |
"used : 0.012362006854300497\n", | |
"sound : 0.012103168531042614\n", | |
"hi : 0.011949674369115187\n", | |
"advance : 0.010790313074703295\n", | |
"information : 0.010637453739816545\n", | |
"bit : 0.010237134231452456\n", | |
"\n", | |
"[Topic 7]\n", | |
"god : 0.03677897297320968\n", | |
"jesus : 0.01755382907913231\n", | |
"does : 0.015816424138174412\n", | |
"believe : 0.013958023130356094\n", | |
"game : 0.012171745895151097\n", | |
"people : 0.01104281007944074\n", | |
"say : 0.011006544304419667\n", | |
"christian : 0.010853081080413829\n", | |
"true : 0.010630093025426314\n", | |
"bible : 0.01033931356085625\n", | |
"think : 0.009774171445882959\n", | |
"church : 0.00966870470103312\n", | |
"life : 0.00883349489937276\n", | |
"way : 0.007852699506432994\n", | |
"religion : 0.00759097523503712\n", | |
"christians : 0.0075449880919839655\n", | |
"christ : 0.0075396160943345175\n", | |
"faith : 0.007439660543789043\n", | |
"point : 0.007427660316865335\n", | |
"good : 0.007186456856356701\n", | |
"\n", | |
"[Topic 8]\n", | |
"drive : 0.022295739864303166\n", | |
"power : 0.01950734168199434\n", | |
"like : 0.018522559790376272\n", | |
"just : 0.016924517655010334\n", | |
"car : 0.016675980375469957\n", | |
"use : 0.01546275692548456\n", | |
"scsi : 0.013950686273385776\n", | |
"ve : 0.01392586768350613\n", | |
"good : 0.011218067098463793\n", | |
"speed : 0.011115387347522983\n", | |
"hard : 0.01099407199094689\n", | |
"used : 0.010715407919410537\n", | |
"don : 0.010212854864543285\n", | |
"problem : 0.01010922379681309\n", | |
"work : 0.009567260783500347\n", | |
"drives : 0.00822283821010156\n", | |
"buy : 0.00800970254393546\n", | |
"better : 0.007788251352635117\n", | |
"high : 0.0077583107983032525\n", | |
"does : 0.00731337975997651\n", | |
"\n", | |
"[Topic 9]\n", | |
"ax : 0.7750508616859831\n", | |
"max : 0.05676346028145709\n", | |
"g9v : 0.017637993832254468\n", | |
"b8f : 0.015315278818004914\n", | |
"a86 : 0.012646093392057575\n", | |
"145 : 0.010168944298400002\n", | |
"pl : 0.01012365428397651\n", | |
"1d9 : 0.008174180919536261\n", | |
"1t : 0.0065319516608289266\n", | |
"0t : 0.006459809882532864\n", | |
"bhj : 0.006110218348574932\n", | |
"giz : 0.005453049410747113\n", | |
"3t : 0.005447285579836966\n", | |
"34u : 0.005285937363655366\n", | |
"2di : 0.005090173874145282\n", | |
"75u : 0.00463261937952306\n", | |
"wm : 0.004518731940833898\n", | |
"2tm : 0.004222775559250272\n", | |
"7ey : 0.0036648191849338423\n", | |
"bxn : 0.0032500074436691783\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"num_top_words = 20\n", | |
"\n", | |
"print('Displaying the top %d words per topic and their probabilities within the topic...' % num_top_words)\n", | |
"print()\n", | |
"\n", | |
"for topic_idx in range(num_topics):\n", | |
" print('[Topic ', topic_idx, ']', sep='')\n", | |
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n", | |
" for rank in range(num_top_words):\n", | |
" word_idx = sort_indices[rank]\n", | |
" print(tf_vectorizer.get_feature_names()[word_idx], ':', topic_word_distributions[topic_idx, word_idx])\n", | |
" print()" | |
] | |
}, | |
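{
"cell_type": "markdown",
"metadata": {},
"source": [
"LDA also associates each document with a distribution over topics. As a brief aside (not needed for the rest of the demo; the variable name `doc_topic_distributions` is just for illustration), we can get these per-document topic proportions with `lda.transform`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# each row is a document's distribution over the 10 topics\n",
"doc_topic_distributions = lda.transform(tf)\n",
"print(doc_topic_distributions.shape)\n",
"print(doc_topic_distributions[0])  # topic proportions for the first document"
]
},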
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Computing co-occurrences of words\n", | |
"\n", | |
"Here, we count the number of newsgroup posts in which two words both occur. This part of the demo should feel like a review of co-occurrence analysis from earlier in the course, except now we use scikit-learn's built-in CountVectorizer. Conceptually everything else in the same as before." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"word1 = 'year'\n", | |
"word2 = 'team'\n", | |
"\n", | |
"word1_column_idx = tf_vectorizer.vocabulary_[word1]\n", | |
"word2_column_idx = tf_vectorizer.vocabulary_[word2]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2" | |
] | |
}, | |
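{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick look at the raw counts (using the variables defined above), we can print how many documents contain each word and how many contain both."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(word1, ':', documents_with_word1.sum())\n",
"print(word2, ':', documents_with_word2.sum())\n",
"print('both :', documents_with_both_word1_and_word2.sum())"
]
},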
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Next, we compute the log of the conditional probability of word 1 appearing given that word 2 appeared, where we add in a little bit of a fudge factor in the numerator (in this case, it's actually not needed but some times you do have two words that do not co-occur for which you run into a numerical issue due to taking the log of 0)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-1.5482462194376105" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"eps = 0.1\n", | |
"np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def prob_see_word1_given_see_word2(word1, word2, vectorizer, eps=0.1):\n", | |
" word1_column_idx = vectorizer.vocabulary_[word1]\n", | |
" word2_column_idx = vectorizer.vocabulary_[word2]\n", | |
" documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)\n", | |
" documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)\n", | |
" documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2\n", | |
" return np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())" | |
] | |
}, | |
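{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, calling the helper on the same pair of words as before (with the default fudge factor) should reproduce the value we computed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# should match the log2 conditional probability computed earlier\n",
"prob_see_word1_given_see_word2('year', 'team', tf_vectorizer)"
]
},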
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Topic coherence\n", | |
"\n", | |
"The below code shows how one implements the topic coherence calculation from lecture." | |
] | |
}, | |
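{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, the loop below restates the calculation as follows: for a topic whose top 20 words are $w_1, \\dots, w_{20}$, its coherence is\n",
"\n",
"$$\\sum_{i \\neq j} \\log_2 \\frac{\\#\\{\\text{docs containing both } w_i \\text{ and } w_j\\} + \\epsilon}{\\#\\{\\text{docs containing } w_j\\}},$$\n",
"\n",
"where the sum is over ordered pairs and $\\epsilon = 0.1$ is the fudge factor from above."
]
},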
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[Topic 0]\n", | |
"Coherence: -1356.3836721926853\n", | |
"\n", | |
"[Topic 1]\n", | |
"Coherence: -969.252344768849\n", | |
"\n", | |
"[Topic 2]\n", | |
"Coherence: -1038.5936491181455\n", | |
"\n", | |
"[Topic 3]\n", | |
"Coherence: -752.9744085675202\n", | |
"\n", | |
"[Topic 4]\n", | |
"Coherence: -641.5683154733748\n", | |
"\n", | |
"[Topic 5]\n", | |
"Coherence: -1155.6763255419658\n", | |
"\n", | |
"[Topic 6]\n", | |
"Coherence: -1177.3645847380105\n", | |
"\n", | |
"[Topic 7]\n", | |
"Coherence: -948.0033411181123\n", | |
"\n", | |
"[Topic 8]\n", | |
"Coherence: -1054.2809655411477\n", | |
"\n", | |
"[Topic 9]\n", | |
"Coherence: -217.079440424438\n", | |
"\n", | |
"Average coherence: -931.1177047484249\n" | |
] | |
} | |
], | |
"source": [ | |
"average_coherence = 0\n", | |
"for topic_idx in range(num_topics):\n", | |
" print('[Topic ', topic_idx, ']', sep='')\n", | |
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n", | |
" coherence = 0.\n", | |
" for top_word_idx1 in sort_indices[:num_top_words]:\n", | |
" word1 = tf_vectorizer.get_feature_names()[top_word_idx1]\n", | |
" for top_word_idx2 in sort_indices[:num_top_words]:\n", | |
" word2 = tf_vectorizer.get_feature_names()[top_word_idx2]\n", | |
" if top_word_idx1 != top_word_idx2:\n", | |
" coherence += prob_see_word1_given_see_word2(word1, word2, tf_vectorizer, 0.1)\n", | |
" print('Coherence:', coherence)\n", | |
" print()\n", | |
" average_coherence += coherence\n", | |
"average_coherence /= num_topics\n", | |
"print('Average coherence:', average_coherence)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Number of unique words\n", | |
"\n", | |
"The below code shows how one implements the number of unique words calculation from lecture." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[Topic 0]\n", | |
"Number of unique top words: 16\n", | |
"\n", | |
"[Topic 1]\n", | |
"Number of unique top words: 17\n", | |
"\n", | |
"[Topic 2]\n", | |
"Number of unique top words: 16\n", | |
"\n", | |
"[Topic 3]\n", | |
"Number of unique top words: 9\n", | |
"\n", | |
"[Topic 4]\n", | |
"Number of unique top words: 20\n", | |
"\n", | |
"[Topic 5]\n", | |
"Number of unique top words: 17\n", | |
"\n", | |
"[Topic 6]\n", | |
"Number of unique top words: 12\n", | |
"\n", | |
"[Topic 7]\n", | |
"Number of unique top words: 14\n", | |
"\n", | |
"[Topic 8]\n", | |
"Number of unique top words: 12\n", | |
"\n", | |
"[Topic 9]\n", | |
"Number of unique top words: 20\n", | |
"\n", | |
"Average number of unique top words: 15.3\n" | |
] | |
} | |
], | |
"source": [ | |
"average_number_of_unique_top_words = 0\n", | |
"for topic_idx1 in range(num_topics):\n", | |
" print('[Topic ', topic_idx1, ']', sep='')\n", | |
" sort_indices1 = np.argsort(topic_word_distributions[topic_idx1])[::-1]\n", | |
" num_unique_top_words = 0\n", | |
" for top_word_idx1 in sort_indices1[:num_top_words]:\n", | |
" word1 = tf_vectorizer.get_feature_names()[top_word_idx1]\n", | |
" break_ = False\n", | |
" for topic_idx2 in range(num_topics):\n", | |
" if topic_idx1 != topic_idx2:\n", | |
" sort_indices2 = np.argsort(topic_word_distributions[topic_idx2])[::-1]\n", | |
" for top_word_idx2 in sort_indices2[:num_top_words]:\n", | |
" word2 = tf_vectorizer.get_feature_names()[top_word_idx2]\n", | |
" if word1 == word2:\n", | |
" break_ = True\n", | |
" break\n", | |
" if break_:\n", | |
" break\n", | |
" else:\n", | |
" num_unique_top_words += 1\n", | |
" print('Number of unique top words:', num_unique_top_words)\n", | |
" print()\n", | |
" \n", | |
" average_number_of_unique_top_words += num_unique_top_words\n", | |
"average_number_of_unique_top_words /= num_topics\n", | |
"print('Average number of unique top words:', average_number_of_unique_top_words)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |