{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 94-775/95-865: Topic Modeling with Latent Dirichlet Allocation\n",
"\n",
"Author: George H. Chen (georgechen [at symbol] cmu.edu)\n",
"\n",
"The beginning part of this demo is a shortened and modified version of sklearn's LDA & NMF demo (http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use NumPy."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"np.set_printoptions(precision=5, suppress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Latent Dirichlet Allocation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first load in 10,000 posts from the 20 Newsgroups dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"num_articles = 10000\n",
"data = fetch_20newsgroups(shuffle=True, random_state=0,\n",
" remove=('headers', 'footers', 'quotes')).data[:num_articles]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify that there are 10,000 posts, and we can look at an example post."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10000"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(data)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Koberg,\n",
"\n",
"\tJust a couple of minor corrections here...\n",
"\n",
"\t1) The Churches of Christ do not usually believe in speaking in\n",
"tongues, in fact many of them are known for being strongly opposed to\n",
"Pentecostal teaching. You are probably thinking of Church of God in\n",
"Christ, the largest African-American Pentecostal denomination.\n",
"\n",
"\t2) I'm not sure what you mean by \"signifying believers\" but it\n",
"should be pointed out that the Assemblies of God does not now, nor has it\n",
"ever, held that speaking in tongues is the sign that one is a Christian. \n",
"The doctrine that traditional Pentecostals (including the A/G) maintain is\n",
"that speaking in tongues is the sign of a second experience after becoming\n",
"a Christian in which one is \"Baptized in the Holy Spirit\" That may be\n",
"what you were referring to, but I point this out because Pentecostals are\n",
"frequently labeled as believing that you have to speak in tongues in order\n",
"to be a Christian. Such a position is only held by some groups and not the\n",
"majority of Pentecostals. Many Pentecostals will quote the passage in\n",
"Mark 16 about \"these signs following them that believe\" but they generally\n",
"do not interpret this as meaning if you don't pactice the signs you aren't\n",
"\"saved\".\n",
"\n",
"\t3) I know it's hard to summarize the beliefs of a movement that\n",
"has such diversity, but I think you've made some pretty big\n",
"generalizations here. Do \"Neo-Pentecostals\" only believe in tongues as a\n",
"sign and tongues as prayer but NOT tongues as revelatory with a message? \n",
"I've never heard of that before. In fact I would have characterized them\n",
"as believing the same as Pentecostals except less likely to see tongues as\n",
"a sign of Spirit Baptism. Also, while neo-Pentecostals may not be\n",
"inclined to speak in tongues in the non-Pentecostal churches they attend,\n",
"they do have their own meetings and, in many cases, a whole church will be\n",
"charismatic.\n"
]
}
],
"source": [
"# you can take a look at what individual documents look like by replacing what index we look at\n",
"print(data[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now fit a `CountVectorizer` model that will compute, for each post, what its raw word count histograms are (the \"term frequencies\" we saw in week 1).\n",
"\n",
"The output of the following cell is the term-frequencies matrix, where rows index different posts/text documents, and columns index 1000 different vocabulary words. A note about the arguments to `CountVectorizer`:\n",
"\n",
"- `max_df`: we only keep words that appear in at most this fraction of the documents\n",
"- `min_df`: we only keep words that appear in at least this many documents\n",
"- `stop_words`: whether to remove stop words\n",
"- `max_features`: among words that don't get removed due to the above 3 arguments, we keep the top `max_features` number of most frequently occuring words"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 1000\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# CountVectorizer does tokenization and can remove terms that occur too frequently, not frequently enough, or that are stop words\n",
"\n",
"# document frequency (df) means number of documents a word appears in\n",
"tf_vectorizer = CountVectorizer(max_df=0.95,\n",
" min_df=2,\n",
" stop_words='english',\n",
" max_features=vocab_size)\n",
"tf = tf_vectorizer.fit_transform(data)\n"
]
},
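To make the filtering arguments concrete, here is a small sketch on a made-up three-document corpus (the corpus and thresholds below are purely illustrative, not the notebook's data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: "the" and "and" are English stop words, "cat" appears in
# all 3 documents, "ran" appears in 2, and the rest appear in only 1
toy_corpus = ["the cat sat",
              "the cat ran home",
              "the cat and dog ran"]

toy_vectorizer = CountVectorizer(max_df=0.9,            # drop words in > 90% of docs ("cat")
                                 min_df=2,              # drop words in < 2 docs ("sat", "home", "dog")
                                 stop_words='english')  # drop stop words ("the", "and")
toy_tf = toy_vectorizer.fit_transform(toy_corpus)
print(sorted(toy_vectorizer.vocabulary_))  # only 'ran' survives all three filters
```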
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify that there are 10,000 rows (corresponding to posts), and 1000 columns (corresponding to words)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10000, 1000)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note about the `tf` matrix: this actually is stored as what's called a sparse matrix (rather than a 2D NumPy array that you're more familiar with). The reason is that often these matrices are really large and the vast majority of entries are 0, so it's possible to save space by not storing where the 0's are."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"scipy.sparse.csr.csr_matrix"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(tf)"
]
},
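To see the space savings concretely, here is a small sketch comparing dense and sparse storage of a mostly-zero matrix (the matrix below is made up for illustration, much smaller than the real `tf`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# a 1000 x 1000 matrix with only 2 nonzero entries
dense = np.zeros((1000, 1000), dtype=np.int64)
dense[0, 0] = 3
dense[5, 7] = 1

sparse = csr_matrix(dense)

# the dense array stores all 1,000,000 entries (8 bytes each);
# the CSR matrix stores only the 2 nonzero values plus index bookkeeping
print(dense.nbytes)
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)
```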
{
"cell_type": "markdown",
"metadata": {},
"source": [
" To convert `tf` to a 2D NumPy table, you can run `tf.toarray()` (this does not modify the original `tf` variable)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(tf.toarray())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10000, 1000)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.toarray().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can figure out what words the different columns correspond to by using the `get_feature_names()` function; the output is in the same order as the column indices. In particular, we can index into the following list (i.e., so given a column index, we can figure out which word it corresponds to)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['00', '000', '02', '03', '04', '0d', '0t', '10', '100', '11', '12', '128', '13', '14', '145', '15', '16', '17', '18', '19', '1990', '1991', '1992', '1993', '1d9', '1st', '1t', '20', '200', '21', '22', '23', '24', '25', '250', '26', '27', '28', '29', '2di', '2tm', '30', '300', '31', '32', '33', '34', '34u', '35', '36', '37', '38', '39', '3d', '3t', '40', '41', '42', '43', '44', '45', '46', '48', '50', '500', '55', '60', '64', '6um', '70', '75', '75u', '7ey', '80', '800', '86', '90', '91', '92', '93', '9v', 'a86', 'ability', 'able', 'ac', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'addition', 'address', 'administration', 'advance', 'age', 'ago', 'agree', 'ah', 'air', 'al', 'algorithm', 'allow', 'allowed', 'alt', 'america', 'american', 'analysis', 'anonymous', 'answer', 'answers', 'anti', 'anybody', 'apparently', 'appears', 'apple', 'application', 'applications', 'appreciate', 'appreciated', 'approach', 'appropriate', 'apr', 'april', 'archive', 'area', 'areas', 'aren', 'argument', 'armenia', 'armenian', 'armenians', 'arms', 'army', 'article', 'articles', 'ask', 'asked', 'asking', 'assume', 'atheism', 'attack', 'attempt', 'au', 'author', 'authority', 'available', 'average', 'avoid', 'away', 'ax', 'b8f', 'bad', 'base', 'based', 'basic', 'basically', 'basis', 'belief', 'believe', 'best', 'better', 'bh', 'bhj', 'bible', 'big', 'bike', 'bit', 'bits', 'bj', 'black', 'block', 'blood', 'board', 'body', 'book', 'books', 'bought', 'box', 'break', 'bring', 'brought', 'btw', 'buf', 'build', 'building', 'built', 'bus', 'business', 'buy', 'bxn', 'ca', 'cable', 'california', 'called', 'calls', 'came', 'canada', 'car', 'card', 'cards', 'care', 'carry', 'cars', 'case', 'cases', 'cause', 'cd', 'center', 'certain', 'certainly', 'chance', 'change', 'changed', 'changes', 'check', 'chicago', 'child', 'children', 'chip', 'chips', 'choice', 'christ', 'christian', 'christianity', 'christians', 'church', 'citizens', 'city', 'claim', 'claims', 'class', 'clear', 
'clearly', 'clinton', 'clipper', 'close', 'code', 'color', 'com', 'come', 'comes', 'coming', 'command', 'comments', 'commercial', 'committee', 'common', 'community', 'comp', 'company', 'complete', 'completely', 'computer', 'condition', 'conference', 'congress', 'consider', 'considered', 'contact', 'contains', 'context', 'continue', 'control', 'controller', 'copy', 'correct', 'cost', 'couldn', 'country', 'couple', 'course', 'court', 'cover', 'create', 'created', 'crime', 'cross', 'cs', 'current', 'currently', 'cut', 'cx', 'data', 'date', 'dave', 'david', 'day', 'days', 'db', 'dc', 'dead', 'deal', 'death', 'dec', 'decided', 'defense', 'define', 'deleted', 'department', 'des', 'design', 'designed', 'details', 'development', 'device', 'devices', 'did', 'didn', 'difference', 'different', 'difficult', 'digital', 'directly', 'directory', 'discussion', 'disk', 'display', 'distribution', 'division', 'dod', 'does', 'doesn', 'doing', 'don', 'door', 'dos', 'doubt', 'dr', 'drive', 'driver', 'drivers', 'drives', 'drug', 'early', 'earth', 'easily', 'east', 'easy', 'ed', 'edu', 'effect', 'electronic', 'email', 'encryption', 'end', 'enforcement', 'engine', 'entire', 'entry', 'environment', 'error', 'escrow', 'especially', 'event', 'events', 'evidence', 'exactly', 'example', 'excellent', 'exist', 'existence', 'exists', 'expect', 'experience', 'explain', 'export', 'extra', 'face', 'fact', 'faith', 'false', 'family', 'faq', 'far', 'fast', 'faster', 'father', 'fax', 'fbi', 'features', 'federal', 'feel', 'field', 'figure', 'file', 'files', 'final', 'finally', 'fine', 'firearms', 'floppy', 'folks', 'follow', 'following', 'food', 'force', 'form', 'format', 'free', 'freedom', 'friend', 'ftp', 'function', 'functions', 'future', 'g9v', 'game', 'games', 'gas', 'gave', 'general', 'generally', 'gets', 'getting', 'gif', 'given', 'gives', 'giz', 'gk', 'gm', 'goal', 'god', 'goes', 'going', 'good', 'got', 'gov', 'government', 'graphics', 'great', 'greek', 'ground', 'group', 'groups', 'guess', 
'gun', 'guns', 'guy', 'half', 'hand', 'happen', 'happened', 'happens', 'hard', 'hardware', 'haven', 'having', 'head', 'health', 'hear', 'heard', 'held', 'hell', 'help', 'hi', 'high', 'higher', 'history', 'hit', 'hockey', 'hold', 'home', 'hope', 'hours', 'house', 'hp', 'human', 'ibm', 'ide', 'idea', 'ideas', 'ii', 'image', 'images', 'imagine', 'important', 'include', 'included', 'includes', 'including', 'individual', 'info', 'information', 'input', 'inside', 'installed', 'instead', 'insurance', 'int', 'interested', 'interesting', 'interface', 'internal', 'international', 'internet', 'involved', 'isn', 'israel', 'israeli', 'issue', 'issues', 'jesus', 'jewish', 'jews', 'jim', 'job', 'jobs', 'john', 'jpeg', 'just', 'key', 'keyboard', 'keys', 'kill', 'killed', 'kind', 'knew', 'know', 'knowledge', 'known', 'knows', 'la', 'land', 'language', 'large', 'late', 'later', 'law', 'laws', 'league', 'learn', 'leave', 'left', 'legal', 'let', 'letter', 'level', 'library', 'life', 'light', 'like', 'likely', 'limited', 'line', 'lines', 'list', 'little', 'live', 'lives', 'living', 'll', 'local', 'long', 'longer', 'look', 'looked', 'looking', 'looks', 'lord', 'lost', 'lot', 'lots', 'love', 'low', 'lower', 'mac', 'machine', 'machines', 'mail', 'main', 'major', 'make', 'makes', 'making', 'man', 'manager', 'manual', 'mark', 'market', 'mass', 'master', 'material', 'matter', 'max', 'maybe', 'mb', 'mean', 'meaning', 'means', 'media', 'medical', 'members', 'memory', 'men', 'mention', 'mentioned', 'message', 'mike', 'miles', 'military', 'million', 'mind', 'mit', 'mode', 'model', 'modem', 'money', 'monitor', 'month', 'months', 'moral', 'mother', 'motif', 'mouse', 'mr', 'ms', 'multiple', 'nasa', 'national', 'nature', 'near', 'necessary', 'need', 'needed', 'needs', 'net', 'network', 'new', 'news', 'newsgroup', 'nhl', 'nice', 'night', 'non', 'normal', 'note', 'nsa', 'number', 'numbers', 'object', 'obvious', 'obviously', 'offer', 'office', 'official', 'oh', 'ok', 'old', 'ones', 'open', 'opinion', 
'opinions', 'orbit', 'order', 'org', 'organization', 'original', 'os', 'output', 'outside', 'package', 'page', 'paper', 'particular', 'parts', 'party', 'past', 'paul', 'pay', 'pc', 'peace', 'people', 'perfect', 'performance', 'period', 'person', 'personal', 'phone', 'pick', 'picture', 'pin', 'pittsburgh', 'pl', 'place', 'places', 'plan', 'play', 'played', 'player', 'players', 'plus', 'point', 'points', 'police', 'policy', 'political', 'population', 'port', 'position', 'possible', 'possibly', 'post', 'posted', 'posting', 'power', 'pp', 'present', 'president', 'press', 'pretty', 'previous', 'price', 'printer', 'privacy', 'private', 'pro', 'probably', 'problem', 'problems', 'process', 'product', 'program', 'programs', 'project', 'protect', 'provide', 'provides', 'pub', 'public', 'published', 'purpose', 'qq', 'quality', 'question', 'questions', 'quite', 'radio', 'ram', 'range', 'rate', 'read', 'reading', 'real', 'really', 'reason', 'reasonable', 'reasons', 'received', 'recent', 'recently', 'record', 'red', 'reference', 'regular', 'related', 'release', 'religion', 'religious', 'remember', 'reply', 'report', 'reports', 'request', 'require', 'required', 'requires', 'research', 'resources', 'response', 'rest', 'result', 'results', 'return', 'right', 'rights', 'road', 'rom', 'room', 'round', 'rules', 'run', 'running', 'runs', 'russian', 'safety', 'said', 'sale', 'san', 'save', 'saw', 'say', 'saying', 'says', 'school', 'sci', 'science', 'scientific', 'screen', 'scsi', 'search', 'season', 'second', 'secret', 'section', 'secure', 'security', 'seen', 'self', 'sell', 'send', 'sense', 'sent', 'serial', 'series', 'server', 'service', 'set', 'shall', 'shipping', 'short', 'shot', 'shuttle', 'similar', 'simple', 'simply', 'sin', 'single', 'site', 'sites', 'situation', 'size', 'small', 'society', 'software', 'solution', 'son', 'soon', 'sorry', 'sort', 'sound', 'sounds', 'source', 'sources', 'south', 'soviet', 'space', 'special', 'specific', 'speed', 'spirit', 'st', 'standard', 
'start', 'started', 'state', 'statement', 'states', 'station', 'stephanopoulos', 'steve', 'stop', 'story', 'stream', 'street', 'strong', 'study', 'stuff', 'subject', 'suggest', 'sun', 'support', 'supports', 'supposed', 'sure', 'systems', 'taken', 'takes', 'taking', 'talk', 'talking', 'tape', 'tar', 'tax', 'team', 'teams', 'technical', 'technology', 'tell', 'term', 'terms', 'test', 'text', 'thank', 'thanks', 'theory', 'thing', 'things', 'think', 'thinking', 'thought', 'time', 'times', 'title', 'tm', 'today', 'told', 'took', 'tools', 'total', 'trade', 'transfer', 'tried', 'true', 'truth', 'try', 'trying', 'turkey', 'turkish', 'turn', 'tv', 'type', 'uk', 'understand', 'unfortunately', 'unit', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'useful', 'usenet', 'user', 'users', 'uses', 'using', 'usually', 'value', 'values', 'van', 'various', 've', 'version', 'vga', 'video', 'view', 'voice', 'volume', 'vs', 'wait', 'want', 'wanted', 'wants', 'war', 'washington', 'wasn', 'watch', 'water', 'way', 'ways', 'weapons', 'week', 'weeks', 'went', 'white', 'wide', 'widget', 'willing', 'win', 'window', 'windows', 'wish', 'wm', 'women', 'won', 'word', 'words', 'work', 'worked', 'working', 'works', 'world', 'worth', 'wouldn', 'write', 'writing', 'written', 'wrong', 'wrote', 'x11', 'xt', 'year', 'years', 'yes', 'york', 'young']\n"
]
}
],
"source": [
"print(tf_vectorizer.get_feature_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also go in reverse: given a word, we can figure out which column index it corresponds to. To do this, we use the `vocabulary_` attribute."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"116"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf_vectorizer.vocabulary_['apple']"
]
},
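The two lookups are inverses of each other, as this round-trip sketch on a tiny made-up corpus shows (note: in newer scikit-learn versions, `get_feature_names()` was renamed `get_feature_names_out()`, so the sketch falls back accordingly):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["apple banana", "banana cherry"])

# handle both older and newer scikit-learn versions
names = (toy_vectorizer.get_feature_names_out()
         if hasattr(toy_vectorizer, 'get_feature_names_out')
         else toy_vectorizer.get_feature_names())

col = toy_vectorizer.vocabulary_['banana']  # word -> column index
word = names[col]                           # column index -> word
print(col, word)
```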
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can figure out what the raw counts are for the 0-th post as follows."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 2, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf[0].toarray()"
]
},
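The mostly-zero row above is hard to read; a more interpretable display lists only the words with nonzero counts. Here is a sketch of that idea on a made-up two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_tf = toy_vectorizer.fit_transform(["apple banana banana", "cherry"])

# column index -> word (fallback covers both old and new scikit-learn)
names = (toy_vectorizer.get_feature_names_out()
         if hasattr(toy_vectorizer, 'get_feature_names_out')
         else toy_vectorizer.get_feature_names())

row = toy_tf[0].toarray().ravel()
for col in row.nonzero()[0]:       # iterate over only the nonzero columns
    print(names[col], row[col])
```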
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now fit an LDA model to the data."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n",
" evaluate_every=-1, learning_decay=0.7,\n",
" learning_method='batch', learning_offset=10.0,\n",
" max_doc_update_iter=100, max_iter=10,\n",
" mean_change_tol=0.001, n_components=10, n_jobs=None,\n",
" perp_tol=0.1, random_state=0, topic_word_prior=None,\n",
" total_samples=1000000.0, verbose=0)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_topics = 10\n",
"\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"lda = LatentDirichletAllocation(n_components=num_topics, random_state=0)\n",
"lda.fit(tf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitting procedure determines the every topic's distribution over words; this information is stored in the `components_` attribute. There's a catch: we actually have to normalize to get the probability distributions (without this normalization, instead what the model has are pseudocounts for how often different words appear per topic)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 1000)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lda.components_.shape"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([74979.92756, 37471.24963, 33866.78341, 40079.8962 , 41754.73437,\n",
" 44022.38561, 36169.17215, 70224.62485, 41944.93058, 73054.29565])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lda.components_.sum(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"topic_word_distributions = np.array([row / row.sum() for row in lda.components_])"
]
},
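As an aside, the per-row normalization can also be written as a single broadcasting step instead of a Python loop; here is a sketch with a made-up pseudocount matrix standing in for `lda.components_`:

```python
import numpy as np

# made-up nonnegative pseudocounts (2 topics, 3 vocabulary words)
components = np.array([[2.0, 1.0, 1.0],
                       [1.0, 3.0, 4.0]])

# loop version, as in the notebook
by_loop = np.array([row / row.sum() for row in components])

# equivalent vectorized version: divide each row by its own sum
by_broadcast = components / components.sum(axis=1, keepdims=True)

print(np.allclose(by_loop, by_broadcast))  # True
```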
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify that each topic's word distribution sums to 1."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_word_distributions.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also print out what the probabilities for the different words are for a specific topic. This isn't very easy to interpret."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.00011 0.00191 0. 0. 0. 0. 0. 0.00247 0.00257\n",
" 0.00003 0.00033 0. 0.00003 0.00015 0.00002 0.00059 0. 0.\n",
" 0. 0. 0.00049 0.00014 0.00019 0.00004 0. 0.00026 0.\n",
" 0.00164 0.00075 0. 0.00004 0. 0.00001 0.00005 0.00018 0.\n",
" 0. 0. 0. 0. 0. 0.00162 0.00095 0. 0.\n",
" 0. 0. 0. 0.0001 0. 0. 0. 0. 0.\n",
" 0. 0.00053 0. 0. 0. 0. 0.00011 0. 0.00003\n",
" 0.00235 0.00082 0.00001 0.00081 0. 0. 0.00048 0.00022 0.\n",
" 0. 0.0006 0.00024 0. 0.00108 0.00015 0.00025 0.00025 0.\n",
" 0. 0.00082 0.00259 0.00014 0.00021 0. 0.00058 0.00006 0.00022\n",
" 0.0042 0.00115 0.00021 0. 0.00069 0.00016 0.00076 0.00388 0.00136\n",
" 0.00023 0.00212 0.00103 0. 0.00029 0.00068 0. 0.0009 0.00128\n",
" 0.00006 0. 0.00054 0.00008 0.00012 0.00072 0.00069 0.00023 0.\n",
" 0.00003 0.0001 0.00019 0.00008 0.00027 0.0001 0. 0.00031 0.\n",
" 0.00167 0.00045 0.00156 0.00023 0. 0. 0. 0. 0.\n",
" 0.00004 0. 0.00067 0.00052 0.00081 0.0007 0. 0.00001 0.00014\n",
" 0.00001 0. 0.00001 0.00066 0.00247 0.00014 0.00206 0. 0.\n",
" 0.0037 0.00214 0.00123 0.00032 0.001 0.00025 0. 0.00294 0.00556\n",
" 0.00731 0. 0. 0. 0.00451 0. 0.0018 0. 0.\n",
" 0.0011 0.00011 0.00003 0.00008 0.00062 0.0004 0.00003 0.00162 0.00109\n",
" 0.00059 0.00079 0.00024 0.00134 0. 0.00133 0.00058 0.0011 0.\n",
" 0.00172 0.00267 0. 0.0002 0.00027 0.00026 0.00131 0.00027 0.00163\n",
" 0.00093 0.00686 0. 0.00004 0.00156 0.00047 0.00126 0.00251 0.00021\n",
" 0.00038 0.00001 0.00058 0.00074 0.00172 0.00155 0.00122 0.00053 0.00064\n",
" 0.00098 0. 0. 0. 0.00011 0.00042 0.00062 0. 0.\n",
" 0. 0. 0. 0. 0.00063 0.00027 0.00015 0.00099 0.00074\n",
" 0.00059 0.00106 0. 0.00196 0. 0. 0.00001 0.00335 0.00191\n",
" 0.00132 0. 0.0005 0.00041 0. 0.00068 0.00066 0. 0.00186\n",
" 0.00028 0.00081 0.00005 0.00194 0.00021 0.00035 0.0016 0.0003 0.00036\n",
" 0. 0.00012 0.00066 0.00082 0. 0.00004 0.00021 0.00333 0.00096\n",
" 0.00088 0.00241 0.00313 0. 0.00078 0.00057 0.00046 0. 0.00023\n",
" 0. 0.00191 0.00061 0.00075 0. 0.00005 0.00011 0.00085 0.\n",
" 0.003 0.0021 0. 0.00095 0.00032 0.00236 0. 0. 0.00077\n",
" 0.00141 0. 0.00042 0.0002 0. 0.00119 0.00121 0.00025 0.00075\n",
" 0. 0. 0.00572 0.00382 0.00196 0.00213 0.00064 0.00005 0.0006\n",
" 0. 0.00024 0. 0.00001 0. 0.00079 0. 0.00391 0.00365\n",
" 0.00257 0.01337 0.00058 0. 0.00102 0.00006 0.00062 0. 0.\n",
" 0. 0. 0.00195 0.0009 0.00107 0.00052 0.00089 0.00002 0.00043\n",
" 0.00091 0.00019 0. 0. 0.00336 0. 0.00161 0.0006 0.\n",
" 0.00035 0.00002 0. 0.00129 0.00026 0. 0. 0.00122 0.00108\n",
" 0.00227 0.0001 0. 0. 0.00122 0.00159 0.00016 0. 0.00208\n",
" 0.00135 0.00161 0. 0. 0.00023 0. 0.0034 0.00128 0.00023\n",
" 0. 0. 0.00001 0.00034 0. 0.00133 0.00086 0.00101 0.\n",
" 0. 0.00084 0.00115 0.00088 0. 0. 0.0008 0.00031 0.00022\n",
" 0.0014 0.00067 0.00037 0. 0.00123 0.00003 0.0004 0. 0.00001\n",
" 0. 0.00146 0. 0.00842 0.00374 0.00132 0.00076 0.00049 0.00082\n",
" 0.00236 0.0032 0. 0.00171 0.00054 0. 0. 0.00082 0.00192\n",
" 0. 0.00193 0.00736 0.01592 0.0071 0.00001 0.00006 0. 0.00513\n",
" 0. 0.00264 0.00024 0.00004 0.00239 0. 0. 0.00192 0.00167\n",
" 0.00124 0.00083 0.00065 0.00052 0.00247 0.00015 0.00137 0.00157 0.00154\n",
" 0.00003 0.00097 0.00213 0.00003 0.00092 0.00124 0. 0.0046 0.00133\n",
" 0.00025 0.00296 0.00249 0.00074 0.00217 0.00154 0.00076 0.00056 0.00014\n",
" 0. 0. 0. 0.0024 0.00049 0.00018 0. 0. 0.00075\n",
" 0.00247 0.00029 0.00053 0.00011 0.00048 0.00022 0. 0. 0.00018\n",
" 0.0013 0.00008 0.00151 0.00064 0. 0.00105 0.00117 0. 0.00004\n",
" 0. 0. 0.00057 0.00281 0. 0. 0.00071 0.00007 0.\n",
" 0. 0. 0.00007 0.00261 0.0028 0.0005 0. 0.01571 0.00022\n",
" 0. 0. 0.00004 0. 0.00243 0.00046 0.00787 0.0004 0.00077\n",
" 0.00051 0.0006 0.0004 0.00002 0.0017 0.00108 0.00102 0. 0.\n",
" 0.0019 0.00109 0.00034 0.00198 0.00013 0.00388 0.00009 0.00224 0.\n",
" 0.00106 0.00221 0.01585 0.00119 0.00057 0.00185 0.00054 0.00055 0.00491\n",
" 0.0011 0. 0.00052 0.00686 0.00196 0.00509 0.00107 0.00415 0.00124\n",
" 0.00209 0.00151 0. 0.00103 0.00541 0.00138 0.00074 0.00325 0.00123\n",
" 0. 0.00023 0.00002 0.00012 0.00072 0.00178 0.00832 0.00151 0.00157\n",
" 0.00089 0.00066 0.00031 0.00091 0.00177 0.00098 0.00009 0.00014 0.00096\n",
" 0.00002 0.00319 0. 0.00174 0. 0.00097 0.00037 0.00007 0.00003\n",
" 0.00017 0. 0.00076 0.00065 0.00002 0.00189 0.00192 0.00042 0.00123\n",
" 0.00141 0. 0.00001 0.00107 0. 0.00554 0.00004 0.00092 0.00185\n",
" 0. 0.00005 0. 0. 0.00015 0.00019 0. 0.00012 0.00027\n",
" 0.00018 0.00146 0.00016 0.00412 0.00056 0.00144 0.00074 0.00024 0.00877\n",
" 0.00043 0. 0.00147 0.00229 0.00142 0.00074 0.00074 0.00097 0.\n",
" 0.00111 0.00168 0. 0.00035 0.00085 0.00186 0.00063 0.00068 0.00213\n",
" 0.00105 0.00395 0.00155 0.00082 0.00062 0.00044 0.00112 0.00107 0.\n",
" 0.00039 0.0007 0. 0. 0.00043 0.00192 0.00001 0.00033 0.0003\n",
" 0.00144 0.00001 0.00193 0.00026 0.00255 0. 0. 0.00608 0.00076\n",
" 0.00139 0.00052 0.00026 0.00061 0.00041 0.00191 0.0001 0.00001 0.00083\n",
" 0. 0.00183 0.00062 0.00138 0.00435 0.00163 0.00283 0.00394 0.00154\n",
" 0.00433 0.00154 0. 0.0001 0.00004 0. 0.00004 0.00109 0.00146\n",
" 0.00057 0.00087 0.00006 0.00005 0.00546 0. 0.00011 0.00137 0.00016\n",
" 0.00367 0.00069 0.00325 0. 0. 0.00028 0.00052 0.00558 0.00221\n",
" 0.00192 0.0004 0.00053 0.00193 0.00066 0.0007 0. 0.00047 0.\n",
" 0. 0.00003 0. 0.00022 0. 0.00153 0.00158 0.00056 0.00296\n",
" 0.00198 0. 0.00102 0.0011 0.00052 0.00037 0.00291 0.00728 0.00164\n",
" 0.00064 0.00078 0. 0.00066 0.00072 0.0011 0.00132 0.00003 0.00097\n",
" 0.0001 0.00028 0. 0. 0.00285 0. 0.00013 0.00046 0.\n",
" 0.00079 0.00036 0.00012 0.00003 0.00015 0.00015 0.00126 0.00033 0.00035\n",
" 0.00041 0.00539 0. 0.00133 0. 0.00056 0.00054 0.00017 0.00274\n",
" 0.00112 0.00245 0.00069 0.00042 0.00259 0.00068 0.00017 0.00073 0.00063\n",
" 0.00426 0.00094 0.0003 0.0013 0. 0. 0. 0. 0.\n",
" 0. 0.00312 0.00295 0.00019 0. 0. 0. 0.00241 0.\n",
" 0.00246 0.00013 0.0007 0.00011 0. 0.0017 0. 0.0011 0.00104\n",
" 0. 0.00036 0.00189 0.00155 0.00018 0.00099 0.00049 0.00106 0.\n",
" 0.00173 0.00007 0.00002 0.00044 0.00088 0.00262 0.00004 0. 0.00016\n",
" 0. 0.00101 0.00097 0.0015 0.00084 0.0011 0.00017 0.00013 0.00085\n",
" 0.00002 0.00192 0.00043 0.00022 0.002 0. 0.00001 0.00038 0.00267\n",
" 0.00129 0.00094 0.00005 0.00047 0.00095 0. 0.0008 0.00096 0.00058\n",
" 0. 0.00051 0.00079 0.00016 0.00188 0.00009 0.0006 0.00079 0.00192\n",
" 0. 0.00054 0.00489 0.0004 0.00107 0.00084 0.0008 0.001 0.00136\n",
" 0.00002 0. 0.00121 0.0069 0.00236 0.00039 0.00034 0.00187 0.00133\n",
" 0.00058 0.00082 0. 0.00062 0.00022 0.00029 0.00526 0.00455 0.01466\n",
" 0.00091 0.00236 0.01216 0.00204 0. 0.00009 0.0015 0.00119 0.00127\n",
" 0. 0.00079 0.00162 0. 0.00028 0.00135 0.00002 0.00171 0.00149\n",
" 0. 0. 0.00099 0.00186 0.00102 0. 0.0008 0.00053 0.00067\n",
" 0.00008 0.00023 0. 0.00186 0.00013 0.00293 0.00366 0.00058 0.\n",
" 0. 0. 0.00027 0.00118 0.00139 0.00089 0.00007 0. 0.00039\n",
" 0.00806 0. 0. 0.00007 0.0002 0.00025 0.00028 0.00033 0.00107\n",
" 0.00508 0.00136 0.00066 0.00022 0.0003 0.00157 0.00158 0.00098 0.00702\n",
" 0.00043 0. 0.00132 0.00138 0.00144 0.00115 0.0005 0. 0.00095\n",
" 0.00249 0. 0. 0.00041 0. 0. 0.00325 0.00043 0.00045\n",
" 0.00577 0.00086 0.00217 0.00091 0.00206 0.00192 0.00176 0.00004 0.00002\n",
" 0. 0.00154 0.00004 0. 0. 0.01144 0.00844 0.00248 0.00049\n",
" 0.00144]\n"
]
}
],
"source": [
"print(topic_word_distributions[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead, usually people do something like looking at the most probable words per topic, and try to use these words to interpret what the different topics correspond to."
]
},
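One way to extract the top words is to `np.argsort` each row of the normalized topic-word matrix and read off the highest-probability columns. Here is a minimal sketch with a made-up 2-topic, 5-word distribution (the same idea applied to `topic_word_distributions` and the vectorizer's feature names yields displays like the output below):

```python
import numpy as np

vocab = ['apple', 'ball', 'cat', 'dog', 'egg']
# made-up topic-word distributions; each row sums to 1
topic_word = np.array([[0.50, 0.20, 0.15, 0.10, 0.05],
                       [0.05, 0.10, 0.15, 0.20, 0.50]])

num_top_words = 2
for topic_idx, dist in enumerate(topic_word):
    top_cols = np.argsort(dist)[::-1][:num_top_words]  # highest-probability columns first
    print(f"[Topic {topic_idx}]", [(vocab[c], dist[c]) for c in top_cols])
```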
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"good : 0.01592254908129389\n",
"like : 0.01584763067117222\n",
"just : 0.015714974597809874\n",
"think : 0.014658035148150044\n",
"don : 0.01336602502776772\n",
"time : 0.012159230893303024\n",
"year : 0.011442050656933937\n",
"new : 0.008768217977593912\n",
"years : 0.00843922077825026\n",
"game : 0.008416482579473757\n",
"make : 0.008318270139852606\n",
"ve : 0.00805613381872604\n",
"know : 0.00786552901690738\n",
"going : 0.007357414502894818\n",
"better : 0.007305177940555176\n",
"really : 0.007282768897233162\n",
"got : 0.007100242166187475\n",
"way : 0.007020258221618519\n",
"team : 0.006901091494924322\n",
"car : 0.006860678090522195\n",
"\n",
"[Topic 1]\n",
"drive : 0.025114459755967225\n",
"card : 0.01904504522714293\n",
"scsi : 0.01574807346309645\n",
"disk : 0.015086151949241311\n",
"use : 0.01311205775591249\n",
"output : 0.012487568705565076\n",
"file : 0.011474974819227298\n",
"bit : 0.011450491727323115\n",
"hard : 0.010426435918865882\n",
"entry : 0.009962381704950415\n",
"memory : 0.009892936703385204\n",
"mac : 0.009531449582937765\n",
"video : 0.009451338641933656\n",
"drives : 0.009074000962777757\n",
"pc : 0.0090703286112168\n",
"windows : 0.008135023862197355\n",
"16 : 0.00798823814975238\n",
"bus : 0.007927283819698584\n",
"controller : 0.007902057876189581\n",
"program : 0.00784268458596016\n",
"\n",
"[Topic 2]\n",
"10 : 0.032029220388909305\n",
"00 : 0.026964330541393168\n",
"25 : 0.021829691245344042\n",
"15 : 0.0206063577495655\n",
"11 : 0.02060435039972444\n",
"20 : 0.02049576092374077\n",
"12 : 0.020376684447947477\n",
"14 : 0.018055470834875288\n",
"16 : 0.016460265629681656\n",
"13 : 0.016023012407497653\n",
"17 : 0.016018903187206005\n",
"18 : 0.015931416080248326\n",
"30 : 0.013487129884524487\n",
"50 : 0.013323083130055324\n",
"24 : 0.01312690458019883\n",
"19 : 0.012520561557602226\n",
"55 : 0.01250023314639117\n",
"21 : 0.01226424792228292\n",
"40 : 0.01192815257556637\n",
"22 : 0.011207231765196151\n",
"\n",
"[Topic 3]\n",
"key : 0.02286656504255774\n",
"government : 0.020853124988437204\n",
"gun : 0.016677031123281342\n",
"db : 0.015171423239595656\n",
"law : 0.014649075730175279\n",
"use : 0.014111126676381442\n",
"people : 0.012470526208203755\n",
"encryption : 0.010285224622752638\n",
"public : 0.010243636203369282\n",
"state : 0.009947341601966503\n",
"chip : 0.009825392329557068\n",
"keys : 0.008779074363170275\n",
"clipper : 0.00860324584622739\n",
"states : 0.00831822632566845\n",
"control : 0.008226551411556786\n",
"number : 0.008019292972255121\n",
"used : 0.007703329865061306\n",
"guns : 0.007611516117671719\n",
"health : 0.0074015183803604065\n",
"right : 0.006935759695920386\n",
"\n",
"[Topic 4]\n",
"people : 0.02431129507816113\n",
"said : 0.02424559321880428\n",
"mr : 0.014104981910179977\n",
"know : 0.013468288248487321\n",
"armenian : 0.013342199409186437\n",
"did : 0.01241301316278436\n",
"don : 0.011539776119380056\n",
"didn : 0.010944034432906041\n",
"armenians : 0.01092331207436881\n",
"turkish : 0.010132982460837872\n",
"going : 0.009685278853463312\n",
"went : 0.009482314763594613\n",
"time : 0.009318116756590049\n",
"president : 0.009271538629901233\n",
"just : 0.009064415387593214\n",
"war : 0.008272822237163071\n",
"stephanopoulos : 0.008264930815089574\n",
"jews : 0.007943422443838038\n",
"say : 0.00778579920403193\n",
"like : 0.007637486984592752\n",
"\n",
"[Topic 5]\n",
"edu : 0.020183125501381658\n",
"available : 0.01462208677587774\n",
"window : 0.0145175966975094\n",
"windows : 0.014508094728017699\n",
"image : 0.014202932981631174\n",
"version : 0.013567233933254529\n",
"use : 0.013559184672232326\n",
"software : 0.012291845460115209\n",
"program : 0.011764165563758247\n",
"graphics : 0.011586390886004624\n",
"ftp : 0.010473416938225266\n",
"server : 0.010462468444821689\n",
"file : 0.010300988884546303\n",
"using : 0.010279537804707145\n",
"dos : 0.01018612827417759\n",
"com : 0.010041868375764224\n",
"display : 0.00950825069896609\n",
"sun : 0.00841184948299741\n",
"set : 0.008165843948144175\n",
"motif : 0.008043634521501537\n",
"\n",
"[Topic 6]\n",
"space : 0.02122937597565116\n",
"information : 0.020605813849808002\n",
"edu : 0.01308609835593781\n",
"mail : 0.012122425926729597\n",
"data : 0.012063443176285979\n",
"new : 0.011922325100362566\n",
"internet : 0.011805664499152158\n",
"university : 0.011648063401924154\n",
"nasa : 0.011260324762512446\n",
"research : 0.011214434673689003\n",
"computer : 0.010579251202203384\n",
"send : 0.010082352367484646\n",
"file : 0.009827860114832524\n",
"anonymous : 0.009393114153924213\n",
"list : 0.00884288720549653\n",
"email : 0.008556680088168823\n",
"technology : 0.008502540090233126\n",
"address : 0.008453084833023794\n",
"available : 0.008312068272885744\n",
"com : 0.00801043664202556\n",
"\n",
"[Topic 7]\n",
"god : 0.024214358169839564\n",
"people : 0.019831156312240143\n",
"think : 0.012548053861881268\n",
"does : 0.011687743323611263\n",
"jesus : 0.011293758739612999\n",
"believe : 0.011272045532292002\n",
"don : 0.010834089093902355\n",
"say : 0.010114162234632518\n",
"just : 0.009461924824189793\n",
"know : 0.008276216060086664\n",
"true : 0.007825806533253571\n",
"like : 0.007419673532456676\n",
"way : 0.007242156428134814\n",
"life : 0.007196868607195379\n",
"christian : 0.007055010808531854\n",
"time : 0.0070227629792582015\n",
"israel : 0.006934392612716913\n",
"bible : 0.006745791914802942\n",
"question : 0.0064075417201172645\n",
"things : 0.00637398445911145\n",
"\n",
"[Topic 8]\n",
"know : 0.026134132260814825\n",
"like : 0.021317018148473164\n",
"don : 0.020053165678589904\n",
"thanks : 0.019992616431884633\n",
"just : 0.017657856830905833\n",
"does : 0.0150396282979183\n",
"ve : 0.012652637710485493\n",
"help : 0.01128297754768276\n",
"want : 0.010651057192301491\n",
"use : 0.010537774368367085\n",
"mail : 0.010021837288920494\n",
"problem : 0.00966309187171309\n",
"edu : 0.009128531452268497\n",
"good : 0.009109138814944394\n",
"post : 0.008996488682684118\n",
"file : 0.008922092098023733\n",
"need : 0.008706364795622623\n",
"looking : 0.0073136427958200205\n",
"time : 0.007174443660697419\n",
"bike : 0.007034238505179918\n",
"\n",
"[Topic 9]\n",
"ax : 0.7717971886082466\n",
"max : 0.055921756933680296\n",
"g9v : 0.015168170327094601\n",
"b8f : 0.014757516856658555\n",
"a86 : 0.012197776886748507\n",
"pl : 0.010147908394792498\n",
"145 : 0.009912072552667453\n",
"1d9 : 0.008447141779981918\n",
"1t : 0.006448628175420717\n",
"0t : 0.006215924619143329\n",
"bhj : 0.005901090271008161\n",
"3t : 0.005517813662944112\n",
"34u : 0.005463059915719654\n",
"giz : 0.00529879860622197\n",
"2di : 0.005244044809233216\n",
"wm : 0.004677791543976195\n",
"2tm : 0.004450114768534644\n",
"75u : 0.00445011475803213\n",
"7ey : 0.003574054032872233\n",
"0d : 0.0031497118103811115\n",
"\n"
]
}
],
"source": [
"num_top_words = 20\n",
"\n",
"def print_top_words(topic_word_distributions, num_top_words, vectorizer):\n",
" vocab = vectorizer.get_feature_names()  # in scikit-learn >= 1.0, use get_feature_names_out() instead\n",
" num_topics = len(topic_word_distributions)\n",
" print('Displaying the top %d words per topic and their probabilities within the topic...' % num_top_words)\n",
" print()\n",
"\n",
" for topic_idx in range(num_topics):\n",
" print('[Topic ', topic_idx, ']', sep='')\n",
" sort_indices = np.argsort(-topic_word_distributions[topic_idx])\n",
" for rank in range(num_top_words):\n",
" word_idx = sort_indices[rank]\n",
" print(vocab[word_idx], ':',\n",
" topic_word_distributions[topic_idx, word_idx])\n",
" print()\n",
"\n",
"print_top_words(topic_word_distributions, num_top_words, tf_vectorizer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the `transform()` function to figure out, for each document, what fraction of it is explained by each of the topics."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"doc_topic_matrix = lda.transform(tf)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10000, 10)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_topic_matrix.shape"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.00182, 0.00182, 0.02258, 0.00182, 0.00182, 0.00182, 0.00182,\n",
" 0.96288, 0.00182, 0.00182])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_topic_matrix[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this *could* be interpreted as a form of dimensionality reduction: document 0 is converted from its raw counts histogram representation to a 10-dimensional vector of probabilities, indicating its estimated membership in each of the 10 topics."
]
},
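{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of how such a document-topic matrix might be used (a sketch on a tiny made-up matrix, not part of the original demo), we could assign each document to its single most probable topic with `argmax`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# tiny made-up document-topic matrix: 3 documents, 2 topics\n",
"toy_doc_topic = np.array([[0.1, 0.9],\n",
"                          [0.8, 0.2],\n",
"                          [0.5, 0.5]])\n",
"\n",
"# most probable topic per document (ties go to the lower topic index)\n",
"print(toy_doc_topic.argmax(axis=1))  # [1 0 0]\n",
"```"
]
},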
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing co-occurrences of words\n",
"\n",
"Here, we count the number of newsgroup posts in which two words both occur. This part of the demo should feel like a review of co-occurrence analysis from earlier in the course, except now we use scikit-learn's built-in `CountVectorizer`. Conceptually, everything else is the same as before."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"word1 = 'year'\n",
"word2 = 'team'\n",
"\n",
"word1_column_idx = tf_vectorizer.vocabulary_[word1]\n",
"word2_column_idx = tf_vectorizer.vocabulary_[word2]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" ...,\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0]])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array(tf.todense())"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0],\n",
" [0],\n",
" [0],\n",
" ...,\n",
" [0],\n",
" [0],\n",
" [0]])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf[:, word1_column_idx].toarray()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we compute the log of the conditional probability of word 1 appearing given that word 2 appeared, where we add a small fudge factor to the numerator (in this case it isn't actually needed, but sometimes two words never co-occur, and then taking the log of 0 causes a numerical issue)."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1.5482462194376105"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eps = 0.1\n",
"np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"def prob_see_word1_given_see_word2(word1, word2, vectorizer, eps=0.1):\n",
" word1_column_idx = vectorizer.vocabulary_[word1]\n",
" word2_column_idx = vectorizer.vocabulary_[word2]\n",
" documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)\n",
" documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)\n",
" documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2\n",
" return np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())"
]
},
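{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the fudge factor matters, here is a small self-contained sketch (using made-up document indicators rather than the real `tf` matrix): without `eps`, a pair of words that never co-occur would give `log2(0) = -inf`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# made-up indicators of which of 5 documents contain each word\n",
"has_word1 = np.array([True, False, False, False, False])\n",
"has_word2 = np.array([False, True, True, False, False])\n",
"\n",
"both = has_word1 & has_word2  # no document contains both words\n",
"eps = 0.1\n",
"# smoothed log conditional probability: finite instead of -inf\n",
"print(np.log2((both.sum() + eps) / has_word2.sum()))\n",
"```"
]
},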
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Topic coherence\n",
"\n",
"The code below shows how to implement the topic coherence calculation from lecture."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"def compute_average_coherence(topic_word_distributions, num_top_words, vectorizer, verbose=True):\n",
" vocab = vectorizer.get_feature_names()\n",
" num_topics = len(topic_word_distributions)\n",
" average_coherence = 0\n",
" for topic_idx in range(num_topics):\n",
" if verbose:\n",
" print('[Topic ', topic_idx, ']', sep='')\n",
" \n",
" sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]\n",
" coherence = 0.\n",
" for top_word_idx1 in sort_indices[:num_top_words]:\n",
" word1 = vocab[top_word_idx1]\n",
" for top_word_idx2 in sort_indices[:num_top_words]:\n",
" word2 = vocab[top_word_idx2]\n",
" if top_word_idx1 != top_word_idx2:\n",
" coherence += prob_see_word1_given_see_word2(word1, word2, vectorizer, 0.1)\n",
" \n",
" if verbose:\n",
" print('Coherence:', coherence)\n",
" print()\n",
" average_coherence += coherence\n",
" average_coherence /= num_topics\n",
" if verbose:\n",
" print('Average coherence:', average_coherence)\n",
" return average_coherence"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Topic 0]\n",
"Coherence: -883.433278037819\n",
"\n",
"[Topic 1]\n",
"Coherence: -1204.9568944144453\n",
"\n",
"[Topic 2]\n",
"Coherence: -658.5193466692967\n",
"\n",
"[Topic 3]\n",
"Coherence: -1262.9416512877067\n",
"\n",
"[Topic 4]\n",
"Coherence: -1120.0387743529964\n",
"\n",
"[Topic 5]\n",
"Coherence: -1068.283511244093\n",
"\n",
"[Topic 6]\n",
"Coherence: -1018.020273866717\n",
"\n",
"[Topic 7]\n",
"Coherence: -874.4985384459782\n",
"\n",
"[Topic 8]\n",
"Coherence: -982.6298329129335\n",
"\n",
"[Topic 9]\n",
"Coherence: -228.56240302255796\n",
"\n",
"Average coherence: -930.1884504254543\n"
]
},
{
"data": {
"text/plain": [
"-930.1884504254543"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_average_coherence(topic_word_distributions, num_top_words, tf_vectorizer, True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of unique words\n",
"\n",
"The code below shows how to implement the number-of-unique-words calculation from lecture."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"def compute_average_num_unique_words(topic_word_distributions, num_top_words, vectorizer, verbose=True):\n",
" vocab = vectorizer.get_feature_names()\n",
" num_topics = len(topic_word_distributions)\n",
" average_number_of_unique_top_words = 0\n",
" for topic_idx1 in range(num_topics):\n",
" if verbose:\n",
" print('[Topic ', topic_idx1, ']', sep='')\n",
" \n",
" sort_indices1 = np.argsort(topic_word_distributions[topic_idx1])[::-1]\n",
" num_unique_top_words = 0\n",
" for top_word_idx1 in sort_indices1[:num_top_words]:\n",
" word1 = vocab[top_word_idx1]\n",
" break_ = False\n",
" for topic_idx2 in range(num_topics):\n",
" if topic_idx1 != topic_idx2:\n",
" sort_indices2 = np.argsort(topic_word_distributions[topic_idx2])[::-1]\n",
" for top_word_idx2 in sort_indices2[:num_top_words]:\n",
" word2 = vocab[top_word_idx2]\n",
" if word1 == word2:\n",
" break_ = True\n",
" break\n",
" if break_:\n",
" break\n",
" else:\n",
" num_unique_top_words += 1\n",
" if verbose:\n",
" print('Number of unique top words:', num_unique_top_words)\n",
" print()\n",
"\n",
" average_number_of_unique_top_words += num_unique_top_words\n",
" average_number_of_unique_top_words /= num_topics\n",
" \n",
" if verbose:\n",
" print('Average number of unique top words:', average_number_of_unique_top_words)\n",
" \n",
" return average_number_of_unique_top_words"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Topic 0]\n",
"Number of unique top words: 9\n",
"\n",
"[Topic 1]\n",
"Number of unique top words: 15\n",
"\n",
"[Topic 2]\n",
"Number of unique top words: 19\n",
"\n",
"[Topic 3]\n",
"Number of unique top words: 18\n",
"\n",
"[Topic 4]\n",
"Number of unique top words: 12\n",
"\n",
"[Topic 5]\n",
"Number of unique top words: 13\n",
"\n",
"[Topic 6]\n",
"Number of unique top words: 14\n",
"\n",
"[Topic 7]\n",
"Number of unique top words: 10\n",
"\n",
"[Topic 8]\n",
"Number of unique top words: 8\n",
"\n",
"[Topic 9]\n",
"Number of unique top words: 20\n",
"\n",
"Average number of unique top words: 13.8\n"
]
},
{
"data": {
"text/plain": [
"13.8"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_average_num_unique_words(topic_word_distributions, num_top_words, tf_vectorizer, True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting average coherence vs k (number of topics), and average number of unique words vs k\n",
"\n",
"Next, we plot the average coherence vs k and the average number of unique words vs k. Note that these are *not* the only topic model metrics available (much like how the CH index is not the only metric available for clustering).\n",
"\n",
"For both average coherence and average number of unique words, higher is better. In this particular example, it turns out that k=2 yields very high values for both, but if you look at the topics learned for k=2, they are qualitatively quite bad (basically one topic is gibberish and the other is everything else!). This observation reinforces an important idea: even though topic modeling metrics exist (such as coherence and number of unique words), you should definitely still look at the learned topics themselves (e.g., by printing the top words per topic) to help decide what value of k to use.\n",
"\n",
"Also, keep in mind that the results are \"noisy\" in the sense that the LDA fitting procedure is random. We chose a specific `random_state` seed value, but different random seeds can yield different results. For simplicity, and because LDA fitting is quite computationally expensive, we are *not* doing what we did with GMMs, where we tried many different random initializations. Thus, the conclusions we draw about how many topics to use might be different with different random initializations.\n",
"\n",
"At least according to average coherence and average number of unique words for the random seed we use, the results below suggest that k=4 yields average coherence and average number of unique words that are still reasonably high (as good as, or almost as good as, the k=2 result), and the topics learned for k=4 are definitely more interesting than the ones learned for k=2.\n",
"\n",
"From qualitatively looking at the topics, the k=5, k=6, and k=7 results also look decent. When k gets too large (e.g., k=10), some topics start to overlap too much (such as multiple topics that all seem to be about computers).\n",
"\n",
"One thing to look out for is whether there are \"stable\" topics, where even for slightly different values of k and different random initializations, LDA keeps finding certain specific topics (e.g., one on gibberish, one on numbers, etc.)."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Number of topics: 2\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"ax : 0.6670609408224907\n",
"max : 0.04897290356354059\n",
"g9v : 0.013114422643457437\n",
"b8f : 0.01275949912939879\n",
"a86 : 0.010547142707890972\n",
"pl : 0.008819612623778256\n",
"145 : 0.00881675090221303\n",
"1d9 : 0.007305507436215006\n",
"db : 0.007159573058301496\n",
"1t : 0.00557821214241208\n",
"0t : 0.005377089263425004\n",
"25 : 0.005211780342996569\n",
"bhj : 0.005104981847457374\n",
"3t : 0.004773718255446174\n",
"34u : 0.004726396579798662\n",
"giz : 0.004584428325550535\n",
"2di : 0.0045371051926778325\n",
"55 : 0.004286549836996119\n",
"14 : 0.004247079858618981\n",
"wm : 0.004101448384177003\n",
"\n",
"[Topic 1]\n",
"people : 0.00898805018321434\n",
"like : 0.008724015906915197\n",
"don : 0.008445318198979547\n",
"just : 0.008117711165098762\n",
"know : 0.007726560502119272\n",
"use : 0.006807342121575315\n",
"time : 0.006601978036055969\n",
"think : 0.0065897670065797375\n",
"does : 0.00599569573547193\n",
"new : 0.005707662001047904\n",
"good : 0.005477410696991825\n",
"edu : 0.005188926692129768\n",
"make : 0.004467734634022856\n",
"way : 0.004391946942939348\n",
"god : 0.0042037116677492575\n",
"used : 0.00410836241940848\n",
"say : 0.004042354681202132\n",
"ve : 0.004042230793685092\n",
"right : 0.004015407615464605\n",
"file : 0.003959205422499352\n",
"\n",
"\n",
"\n",
"--------------------------------------------------------------------------------\n",
"Number of topics: 3\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"people : 0.013987379438521899\n",
"don : 0.011524660963375346\n",
"just : 0.010967976112691066\n",
"think : 0.01021778463447909\n",
"like : 0.009964277649120085\n",
"know : 0.0092342235712899\n",
"time : 0.008182513956960178\n",
"good : 0.007331228467660352\n",
"god : 0.006936159078250954\n",
"said : 0.0063851562833832035\n",
"say : 0.00634022436925281\n",
"did : 0.005991990842787771\n",
"right : 0.005942281837723304\n",
"new : 0.005741586103397404\n",
"way : 0.0056726621280829325\n",
"does : 0.005643696594475119\n",
"make : 0.005429269467006293\n",
"ve : 0.0048784208108294215\n",
"years : 0.004864967623517497\n",
"going : 0.004673802945477093\n",
"\n",
"[Topic 1]\n",
"use : 0.012447734500652183\n",
"edu : 0.011263460190830913\n",
"file : 0.009899941300464015\n",
"program : 0.007018677072679082\n",
"information : 0.006912590192704495\n",
"available : 0.006853780538031967\n",
"data : 0.006844735735509171\n",
"like : 0.006720850461439026\n",
"com : 0.006638007030520867\n",
"does : 0.006444410483775505\n",
"mail : 0.006332373100478371\n",
"windows : 0.006315044731545729\n",
"using : 0.006245824476178386\n",
"key : 0.006193923454661429\n",
"thanks : 0.005966709211683632\n",
"drive : 0.005867773459190472\n",
"used : 0.005786160568218421\n",
"new : 0.005598871363047172\n",
"know : 0.005332454197272465\n",
"software : 0.005290964182953569\n",
"\n",
"[Topic 2]\n",
"ax : 0.68604063105398\n",
"max : 0.05009072233000451\n",
"g9v : 0.013485575446910123\n",
"b8f : 0.01312055234170118\n",
"a86 : 0.010845241613434808\n",
"pl : 0.009068576651875217\n",
"145 : 0.009035086598061274\n",
"1d9 : 0.00751136296576256\n",
"db : 0.007365489778231076\n",
"1t : 0.005734915903615353\n",
"0t : 0.0055280700414600615\n",
"bhj : 0.005248219820615878\n",
"3t : 0.0049075292549175895\n",
"34u : 0.004858861948257405\n",
"giz : 0.004712853760341552\n",
"2di : 0.0046641840133856026\n",
"25 : 0.004555791938429991\n",
"55 : 0.004226656256448206\n",
"wm : 0.004196597365437893\n",
"14 : 0.0040002631672698685\n",
"\n",
"\n",
"\n",
"--------------------------------------------------------------------------------\n",
"Number of topics: 4\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"10 : 0.020775606453347972\n",
"00 : 0.015890839621149584\n",
"20 : 0.01370862334313512\n",
"25 : 0.013601644987850624\n",
"year : 0.013594635673486039\n",
"game : 0.012965301259011644\n",
"15 : 0.01281446793442682\n",
"team : 0.012416605185666847\n",
"12 : 0.01200158843407778\n",
"11 : 0.011583394324247782\n",
"new : 0.011183206859657917\n",
"14 : 0.010589397196093081\n",
"50 : 0.010024991529015564\n",
"16 : 0.009833008522346818\n",
"30 : 0.009705445241436112\n",
"17 : 0.009353751988371137\n",
"13 : 0.00920312973948339\n",
"games : 0.00882507455924421\n",
"18 : 0.008546215566491976\n",
"40 : 0.007990140347664858\n",
"\n",
"[Topic 1]\n",
"use : 0.013276949661338217\n",
"edu : 0.012267336422302898\n",
"file : 0.011566434456562217\n",
"program : 0.008065671587659575\n",
"available : 0.007681160188932213\n",
"information : 0.007480435404680009\n",
"data : 0.007434941850183757\n",
"windows : 0.007399896886108286\n",
"com : 0.007340443977972629\n",
"like : 0.007152698596414284\n",
"does : 0.007096002799591012\n",
"mail : 0.007033384374437087\n",
"using : 0.0069138854531540644\n",
"thanks : 0.006892248571444717\n",
"drive : 0.006788681866371032\n",
"software : 0.006203964578431319\n",
"used : 0.005965791036499411\n",
"know : 0.005872577507701732\n",
"space : 0.005619311028569055\n",
"files : 0.005423185775158846\n",
"\n",
"[Topic 2]\n",
"ax : 0.7694336151809155\n",
"max : 0.05590620315698597\n",
"g9v : 0.015123723771983425\n",
"b8f : 0.01471432929726028\n",
"a86 : 0.012162431729599878\n",
"pl : 0.010169665152511954\n",
"145 : 0.009876385397348647\n",
"1d9 : 0.00842326069634432\n",
"1t : 0.006430858200869931\n",
"0t : 0.006198902004432631\n",
"bhj : 0.005885015641233298\n",
"3t : 0.005502916442890248\n",
"34u : 0.005448338232758139\n",
"giz : 0.0052846043454822305\n",
"2di : 0.005230018373874223\n",
"wm : 0.004676695241138819\n",
"2tm : 0.004438521790306373\n",
"75u : 0.004438519274554816\n",
"7ey : 0.003565146249628605\n",
"0d : 0.0031420297101864476\n",
"\n",
"[Topic 3]\n",
"people : 0.015795911140035916\n",
"don : 0.012761597900782592\n",
"just : 0.011936575887743144\n",
"think : 0.010999709691428675\n",
"like : 0.010753928501201888\n",
"know : 0.010324843095132303\n",
"time : 0.008613581455996083\n",
"god : 0.007712410358519168\n",
"say : 0.007084985395544734\n",
"said : 0.0068370461766585336\n",
"good : 0.006593197861237151\n",
"does : 0.006503028570432555\n",
"did : 0.0063234864798292\n",
"right : 0.006271311549565499\n",
"way : 0.006110322587408365\n",
"make : 0.005871057599517065\n",
"ve : 0.005157698494190836\n",
"believe : 0.005155273774726601\n",
"going : 0.0050781770001162\n",
"government : 0.004975447044627339\n",
"\n",
"\n",
"\n",
"--------------------------------------------------------------------------------\n",
"Number of topics: 5\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"just : 0.01788986875726945\n",
"don : 0.017557115509672202\n",
"like : 0.01638604814526287\n",
"know : 0.015245768612748994\n",
"think : 0.014491329975876863\n",
"good : 0.012760345039622738\n",
"time : 0.011471087014762496\n",
"people : 0.011219407785017831\n",
"ve : 0.008581726606411383\n",
"did : 0.008116535351038298\n",
"say : 0.007875888166202556\n",
"said : 0.007678048035553147\n",
"way : 0.007542137242486366\n",
"going : 0.007408145097860303\n",
"god : 0.007407188088066578\n",
"really : 0.0072349010252427985\n",
"didn : 0.006692126916253505\n",
"right : 0.00666737865530526\n",
"ll : 0.006630570931032409\n",
"make : 0.00645595939031409\n",
"\n",
"[Topic 1]\n",
"use : 0.014335647877248387\n",
"file : 0.013282860668227264\n",
"edu : 0.011739138588429177\n",
"windows : 0.008689762826347713\n",
"program : 0.008652474941699533\n",
"available : 0.008243612862010934\n",
"using : 0.007731496553926765\n",
"data : 0.00771268913740633\n",
"drive : 0.007664969613300173\n",
"mail : 0.0076012461807667695\n",
"does : 0.007511142367459821\n",
"com : 0.00749771950063221\n",
"thanks : 0.007330801561702599\n",
"software : 0.007274022927197541\n",
"information : 0.007128754550897643\n",
"like : 0.00668596146690744\n",
"files : 0.006285441223146012\n",
"version : 0.006170161674869487\n",
"used : 0.006165552309976765\n",
"ftp : 0.006036171238656378\n",
"\n",
"[Topic 2]\n",
"10 : 0.02435384328905722\n",
"00 : 0.019294286330864857\n",
"25 : 0.015878390555695197\n",
"20 : 0.015827143003794033\n",
"15 : 0.015639730168933063\n",
"12 : 0.015220670126098363\n",
"11 : 0.01453271453828608\n",
"space : 0.014221979059105567\n",
"14 : 0.01320213106771183\n",
"16 : 0.012547268137490174\n",
"17 : 0.011526340617938356\n",
"new : 0.011458708594472993\n",
"13 : 0.011406768858740361\n",
"18 : 0.011265357386054142\n",
"30 : 0.0111825719875938\n",
"50 : 0.011024513142984166\n",
"1993 : 0.01045435942259869\n",
"edu : 0.009536001262584465\n",
"40 : 0.009457996901332416\n",
"24 : 0.009286948231365763\n",
"\n",
"[Topic 3]\n",
"people : 0.01783992569800902\n",
"government : 0.009986220466769662\n",
"law : 0.008382618743941061\n",
"key : 0.008077617684382877\n",
"use : 0.006580405294050514\n",
"does : 0.006188223176450605\n",
"mr : 0.006127602753322014\n",
"god : 0.00605996492996997\n",
"gun : 0.006031913703907318\n",
"state : 0.005851973769972829\n",
"public : 0.005584139747842391\n",
"don : 0.005417000586981049\n",
"db : 0.005401382408991876\n",
"right : 0.00524577124638591\n",
"think : 0.005229693964431443\n",
"time : 0.005053885686666708\n",
"fact : 0.005047232271562811\n",
"believe : 0.00504081794225629\n",
"make : 0.005036157216503344\n",
"president : 0.0049977751324838705\n",
"\n",
"[Topic 4]\n",
"ax : 0.7685586473392418\n",
"max : 0.05579735161776001\n",
"g9v : 0.01510585899971278\n",
"b8f : 0.014696929029633137\n",
"a86 : 0.012147934763857593\n",
"pl : 0.01012367318873562\n",
"145 : 0.009894126036774123\n",
"1d9 : 0.008413042149865622\n",
"1t : 0.006422912313133302\n",
"0t : 0.00619118545087805\n",
"bhj : 0.005877678236271993\n",
"3t : 0.005495997712040083\n",
"34u : 0.005441482595443562\n",
"giz : 0.005277918016504321\n",
"2di : 0.005223393985632815\n",
"wm : 0.00466669166186889\n",
"2tm : 0.004432796847972882\n",
"75u : 0.004432795565325082\n",
"7ey : 0.0035604137127846786\n",
"0d : 0.003137808202915678\n",
"\n",
"\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Number of topics: 6\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"like : 0.016852479569330264\n",
"just : 0.016249088573412254\n",
"don : 0.014321685532009194\n",
"good : 0.014036402102538259\n",
"think : 0.012288120987062754\n",
"time : 0.01097859057511481\n",
"know : 0.010511006543471512\n",
"year : 0.009071025112486874\n",
"ve : 0.008935106759377032\n",
"new : 0.007749648930000111\n",
"make : 0.00720341475373212\n",
"game : 0.006984723567373851\n",
"really : 0.006972371761322122\n",
"got : 0.006968159542501395\n",
"car : 0.006801159432506462\n",
"way : 0.006668961381068071\n",
"ll : 0.0066400254178518945\n",
"going : 0.006382567779276916\n",
"better : 0.006331053807802955\n",
"years : 0.006279998362410046\n",
"\n",
"[Topic 1]\n",
"file : 0.014684399342629462\n",
"use : 0.014099134117677855\n",
"windows : 0.010163634821532153\n",
"program : 0.009171897841265507\n",
"does : 0.008546948639464496\n",
"drive : 0.008483613061925683\n",
"edu : 0.008365693100682596\n",
"software : 0.00819795366793144\n",
"thanks : 0.007978565222922592\n",
"using : 0.007954376688749416\n",
"available : 0.007745095568464668\n",
"data : 0.007258479078655468\n",
"like : 0.007236749360850786\n",
"mail : 0.0071608019425103055\n",
"files : 0.006983935748058571\n",
"card : 0.006944423921097707\n",
"version : 0.006902729510591507\n",
"know : 0.006690034282180716\n",
"ftp : 0.0064153556413757635\n",
"window : 0.006276827647030423\n",
"\n",
"[Topic 2]\n",
"10 : 0.021815903293409624\n",
"edu : 0.01838028547953334\n",
"00 : 0.018034830821376538\n",
"25 : 0.014346641377615101\n",
"15 : 0.01400647330062541\n",
"20 : 0.013852657886336547\n",
"12 : 0.013794799397224347\n",
"11 : 0.013765440740076346\n",
"space : 0.012364813795544135\n",
"14 : 0.012087206613976983\n",
"16 : 0.011563089974008367\n",
"13 : 0.010760464070105882\n",
"17 : 0.010744102379030262\n",
"18 : 0.010581320050873512\n",
"1993 : 0.01044388129602819\n",
"new : 0.009941470382695387\n",
"30 : 0.009414776602491131\n",
"50 : 0.009184951410556747\n",
"24 : 0.008944515500162618\n",
"university : 0.008727802570915317\n",
"\n",
"[Topic 3]\n",
"people : 0.014054283476713092\n",
"key : 0.013209886651381763\n",
"government : 0.01293923622248849\n",
"law : 0.011905292967421746\n",
"use : 0.010790266780879161\n",
"gun : 0.008957383167459906\n",
"public : 0.00861957495511257\n",
"db : 0.008173298248367055\n",
"state : 0.007676815674760137\n",
"encryption : 0.006892653708530183\n",
"used : 0.006831165815658189\n",
"right : 0.006082953920414706\n",
"does : 0.005995112438565497\n",
"make : 0.005696392595867004\n",
"security : 0.005643721068334061\n",
"states : 0.005627995055029529\n",
"don : 0.005382255862901454\n",
"information : 0.005374594584573991\n",
"number : 0.0053469989355769834\n",
"chip : 0.005195152121240392\n",
"\n",
"[Topic 4]\n",
"people : 0.022516118466123662\n",
"god : 0.01895964782814781\n",
"don : 0.013423499150045688\n",
"know : 0.013379340217087548\n",
"said : 0.013067219553566621\n",
"think : 0.011459323017328422\n",
"just : 0.011360218305226896\n",
"say : 0.010329683194737682\n",
"did : 0.009679356460128416\n",
"like : 0.009010560243454594\n",
"jesus : 0.008763317267492179\n",
"time : 0.008523547545555196\n",
"believe : 0.008112383349375727\n",
"does : 0.007360582520571018\n",
"mr : 0.006698183820263976\n",
"way : 0.006385355120724338\n",
"armenian : 0.0061558638436352816\n",
"life : 0.005743294932649253\n",
"world : 0.005738305093824702\n",
"things : 0.0057307759861880934\n",
"\n",
"[Topic 5]\n",
"ax : 0.7683372376306211\n",
"max : 0.055781543045980056\n",
"g9v : 0.015101062245049828\n",
"b8f : 0.014692250116136429\n",
"a86 : 0.01214398848156198\n",
"pl : 0.010118777713597074\n",
"145 : 0.009906887185733303\n",
"1d9 : 0.008410171226900636\n",
"1t : 0.006420617594518415\n",
"0t : 0.006188957377092919\n",
"bhj : 0.005875536516572993\n",
"3t : 0.005493974745897464\n",
"34u : 0.005439469299154595\n",
"giz : 0.005275946436316086\n",
"2di : 0.005221438160548594\n",
"wm : 0.004667039118601915\n",
"2tm : 0.0044310682296251755\n",
"75u : 0.004431067899009933\n",
"7ey : 0.003558935887574178\n",
"cx : 0.0033107144431898\n",
"\n",
"\n",
"\n",
"--------------------------------------------------------------------------------\n",
"Number of topics: 7\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"like : 0.01782903565807633\n",
"just : 0.01696716288308159\n",
"good : 0.015880495154014677\n",
"don : 0.014133543131753438\n",
"think : 0.012098137036437917\n",
"time : 0.011168025495623695\n",
"know : 0.009810679625777263\n",
"year : 0.009551121929627092\n",
"ve : 0.008810499478220418\n",
"new : 0.007977488084654866\n",
"game : 0.007897277764507353\n",
"really : 0.00743691341485462\n",
"make : 0.007243692646968021\n",
"car : 0.007226097605720564\n",
"better : 0.007118680679023737\n",
"got : 0.007051795626240991\n",
"way : 0.006919298167688011\n",
"years : 0.0062912945244246255\n",
"team : 0.006184877235266422\n",
"right : 0.00617063021857011\n",
"\n",
"[Topic 1]\n",
"file : 0.015615156171206259\n",
"use : 0.014186847573469316\n",
"windows : 0.010326170351063787\n",
"program : 0.009297015395362207\n",
"drive : 0.00865816634273918\n",
"software : 0.008390653074867917\n",
"does : 0.008388253612561855\n",
"using : 0.008048226290499713\n",
"thanks : 0.008034733681416589\n",
"edu : 0.0077989378526743605\n",
"available : 0.007692915544929103\n",
"files : 0.007198408425196113\n",
"card : 0.0071693415643614\n",
"like : 0.007153422106531526\n",
"data : 0.007031346468962814\n",
"version : 0.006955266875078989\n",
"mail : 0.006878644710465673\n",
"know : 0.0065469256349839815\n",
"window : 0.006357896321964928\n",
"problem : 0.006357385144621793\n",
"\n",
"[Topic 2]\n",
"10 : 0.02205483183114174\n",
"edu : 0.019491221027876447\n",
"00 : 0.018245947709002296\n",
"25 : 0.014388159272812331\n",
"15 : 0.014186533582213525\n",
"12 : 0.014021331089184162\n",
"20 : 0.013983503222042969\n",
"11 : 0.013962456662952198\n",
"space : 0.01331182514660191\n",
"14 : 0.012054671704414543\n",
"16 : 0.011420189084481636\n",
"13 : 0.010826406189119538\n",
"17 : 0.010765250659190487\n",
"1993 : 0.010577851372959475\n",
"18 : 0.010443679079867334\n",
"new : 0.010046128754133523\n",
"30 : 0.00953035196562446\n",
"50 : 0.009237671914243639\n",
"24 : 0.009071967658496782\n",
"university : 0.00878790722884679\n",
"\n",
"[Topic 3]\n",
"key : 0.018006609069789\n",
"government : 0.014289806229344707\n",
"use : 0.013392912240594145\n",
"gun : 0.011940602564933357\n",
"law : 0.011315145256077441\n",
"db : 0.011021649713148144\n",
"public : 0.010759056534053612\n",
"people : 0.010666502862830718\n",
"encryption : 0.009286849499959561\n",
"chip : 0.007710219777670031\n",
"used : 0.007466027415875507\n",
"security : 0.007361733872826201\n",
"state : 0.0069839577938067415\n",
"control : 0.006699315503031853\n",
"keys : 0.006540765470165356\n",
"number : 0.006536663158213446\n",
"clipper : 0.006401374462094465\n",
"privacy : 0.00621053319250329\n",
"information : 0.005989612884353897\n",
"make : 0.0058940466221002056\n",
"\n",
"[Topic 4]\n",
"people : 0.027940492096063713\n",
"don : 0.017733919199035542\n",
"said : 0.017161206880436826\n",
"know : 0.016831191462503368\n",
"just : 0.013412036399605119\n",
"think : 0.012978914267683314\n",
"did : 0.011581335241183793\n",
"like : 0.010359404215777322\n",
"going : 0.00991886502131875\n",
"say : 0.009660452752960489\n",
"mr : 0.009638214840079104\n",
"time : 0.009511945756932842\n",
"didn : 0.009118489827328711\n",
"armenian : 0.008272743373564529\n",
"right : 0.007113216304728181\n",
"want : 0.0070729285934358664\n",
"president : 0.007006147876163506\n",
"armenians : 0.006773043856663257\n",
"ve : 0.006768157094540099\n",
"turkish : 0.006283041556202246\n",
"\n",
"[Topic 5]\n",
"ax : 0.7684671646505757\n",
"max : 0.05578130264602271\n",
"g9v : 0.015103297797248802\n",
"b8f : 0.014694416421941903\n",
"a86 : 0.012145722768302806\n",
"pl : 0.010119611664950635\n",
"145 : 0.009896058325559759\n",
"1d9 : 0.008411272996077761\n",
"1t : 0.00642138320561334\n",
"0t : 0.0061896837289940184\n",
"bhj : 0.005876208656622851\n",
"3t : 0.00549458468961693\n",
"34u : 0.0054400681477923\n",
"giz : 0.005276516325699887\n",
"2di : 0.005221998811577763\n",
"wm : 0.004667560554911541\n",
"2tm : 0.004431494892029249\n",
"75u : 0.004431494768341797\n",
"7ey : 0.0035592147026092327\n",
"cx : 0.003521838721525389\n",
"\n",
"[Topic 6]\n",
"god : 0.03046622761185441\n",
"jesus : 0.014268222820384745\n",
"people : 0.014077554732074822\n",
"does : 0.013116235482918124\n",
"believe : 0.010887373038090007\n",
"israel : 0.008932027336134853\n",
"christian : 0.008832688982006098\n",
"bible : 0.008561705295926418\n",
"true : 0.008338171554544162\n",
"say : 0.008201329034166304\n",
"think : 0.007794427072508317\n",
"life : 0.007753579926468776\n",
"church : 0.007545619256780749\n",
"question : 0.007398164489349966\n",
"religion : 0.006487366095143032\n",
"faith : 0.006437515350561745\n",
"christ : 0.0062809044390575875\n",
"christians : 0.006258098302843464\n",
"way : 0.0062432929396750965\n",
"point : 0.006015020340420946\n",
"\n",
"\n",
"\n",
"--------------------------------------------------------------------------------\n",
"Number of topics: 8\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"like : 0.017661001366634312\n",
"just : 0.016866606325517113\n",
"good : 0.016185365929317633\n",
"don : 0.014220124835484625\n",
"think : 0.011534850091803652\n",
"time : 0.011208080557975517\n",
"year : 0.010238062877606598\n",
"know : 0.010061470659389418\n",
"ve : 0.009095127769626954\n",
"new : 0.008359869162322606\n",
"game : 0.008229422828407875\n",
"car : 0.00774755648361833\n",
"really : 0.007473629499084065\n",
"make : 0.0073903849890153\n",
"got : 0.007227359419145756\n",
"better : 0.007184019628962949\n",
"way : 0.0068432272093408575\n",
"ll : 0.006464853766416209\n",
"team : 0.006457862531771675\n",
"years : 0.006445413687116738\n",
"\n",
"[Topic 1]\n",
"use : 0.014423728653978465\n",
"file : 0.012381234894502144\n",
"windows : 0.01215212554781952\n",
"drive : 0.010204610555986631\n",
"program : 0.009567308170146097\n",
"does : 0.009358234567371048\n",
"using : 0.008830810635433147\n",
"card : 0.008343135987445774\n",
"thanks : 0.008329872605144277\n",
"software : 0.008231538644113391\n",
"version : 0.007777843273367926\n",
"like : 0.007644264115855421\n",
"window : 0.007497499823544402\n",
"problem : 0.007486899473908847\n",
"dos : 0.007293619631793128\n",
"image : 0.0071517838806143905\n",
"know : 0.007099478844245626\n",
"scsi : 0.006981563986623095\n",
"files : 0.006936792549573861\n",
"graphics : 0.006821902909722337\n",
"\n",
"[Topic 2]\n",
"10 : 0.0321393562384871\n",
"00 : 0.027401034052146143\n",
"25 : 0.022109258720895357\n",
"11 : 0.021123805405801156\n",
"15 : 0.020901114367972533\n",
"12 : 0.020749773281693463\n",
"20 : 0.020730858616827184\n",
"14 : 0.018664930285562584\n",
"16 : 0.018208125180919188\n",
"17 : 0.016491699262545866\n",
"13 : 0.016362321996789678\n",
"18 : 0.015392566359377288\n",
"24 : 0.01373613339743992\n",
"50 : 0.013221113556174459\n",
"30 : 0.0132166210485867\n",
"19 : 0.01257184418146775\n",
"55 : 0.012422147000617457\n",
"21 : 0.01235082366937971\n",
"40 : 0.012341784342925264\n",
"23 : 0.011258793493614343\n",
"\n",
"[Topic 3]\n",
"key : 0.02513317274301868\n",
"government : 0.01911745999987274\n",
"gun : 0.016781967437429032\n",
"use : 0.015563194721814753\n",
"db : 0.015449234604217916\n",
"law : 0.013641702326974241\n",
"encryption : 0.012329348351085252\n",
"chip : 0.011672097424691798\n",
"people : 0.01008871096541619\n",
"public : 0.01008382216625293\n",
"keys : 0.009156908890308364\n",
"clipper : 0.008977680963315216\n",
"used : 0.008576281443049442\n",
"guns : 0.0076810595002643365\n",
"number : 0.007662337749576472\n",
"control : 0.007304496313123503\n",
"security : 0.007262506820998999\n",
"state : 0.006920459974583096\n",
"bit : 0.006900738744266417\n",
"make : 0.006327598108311477\n",
"\n",
"[Topic 4]\n",
"people : 0.024887139697562826\n",
"said : 0.020969477483719582\n",
"know : 0.015062099253364209\n",
"don : 0.013073988635596516\n",
"mr : 0.012202594755682358\n",
"did : 0.011799291498782583\n",
"armenian : 0.011151352972789018\n",
"going : 0.010884620721062213\n",
"didn : 0.010526919971877418\n",
"president : 0.010119118720817959\n",
"just : 0.009918308988397673\n",
"time : 0.009522441062580036\n",
"think : 0.009461136850750475\n",
"armenians : 0.009129748124515875\n",
"turkish : 0.00846922321899605\n",
"like : 0.008284154648535543\n",
"say : 0.00798588999782298\n",
"went : 0.007768814905213023\n",
"war : 0.0071963719818615044\n",
"told : 0.006949709018715218\n",
"\n",
"[Topic 5]\n",
"ax : 0.7711893587818562\n",
"max : 0.0559542273907728\n",
"g9v : 0.015156559803946878\n",
"b8f : 0.014746229918060953\n",
"a86 : 0.012188506975756679\n",
"pl : 0.010150072669210678\n",
"145 : 0.009899146883979172\n",
"1d9 : 0.008440826927797771\n",
"1t : 0.006443887787942356\n",
"0t : 0.006211367948434271\n",
"bhj : 0.005896781707820299\n",
"3t : 0.005513806686448754\n",
"34u : 0.005459096523850719\n",
"giz : 0.005294965122729864\n",
"2di : 0.005240254464921044\n",
"wm : 0.004686555168112836\n",
"2tm : 0.004446950043732632\n",
"75u : 0.004446949965272228\n",
"7ey : 0.0035715796395607833\n",
"0d : 0.003147569707121064\n",
"\n",
"[Topic 6]\n",
"edu : 0.020258005577058102\n",
"information : 0.018054330270947612\n",
"space : 0.015401707346716424\n",
"mail : 0.014022074421910844\n",
"com : 0.012276052521184716\n",
"file : 0.011722450391621772\n",
"send : 0.010574219459879944\n",
"list : 0.01044350563583072\n",
"available : 0.009372790383732387\n",
"university : 0.00924721184464295\n",
"new : 0.00894135853884839\n",
"internet : 0.008889356186057327\n",
"research : 0.008629968472021568\n",
"data : 0.008398654806017737\n",
"email : 0.008376287346056013\n",
"nasa : 0.00837495038392586\n",
"anonymous : 0.007980649560909735\n",
"address : 0.007507564553577955\n",
"ftp : 0.007503942031897771\n",
"computer : 0.007312447254529496\n",
"\n",
"[Topic 7]\n",
"god : 0.021211670592902524\n",
"people : 0.01932816808591011\n",
"think : 0.01264690641131503\n",
"don : 0.012231786264765837\n",
"does : 0.01205294816757208\n",
"just : 0.010700113360800064\n",
"believe : 0.010543942149482385\n",
"jesus : 0.009893312119046212\n",
"say : 0.009839471792239337\n",
"know : 0.009515796489662921\n",
"like : 0.008773553704861584\n",
"time : 0.007256185089332646\n",
"way : 0.007250653178948299\n",
"true : 0.0070260883968934346\n",
"question : 0.006725422490598352\n",
"life : 0.0065649652969491335\n",
"make : 0.006464572558674824\n",
"good : 0.006440231163175911\n",
"things : 0.0063245077339640346\n",
"point : 0.006211538865620111\n",
"\n",
"\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Number of topics: 9\n",
"\n",
"Displaying the top 20 words per topic and their probabilities within the topic...\n",
"\n",
"[Topic 0]\n",
"like : 0.016290666444561043\n",
"just : 0.016109572302259314\n",
"good : 0.01591998408545131\n",
"don : 0.01345682928131839\n",
"think : 0.01267532379437271\n",
"year : 0.012031043180593363\n",
"time : 0.011506208971841377\n",
"game : 0.0095936756698935\n",
"car : 0.008861658870142054\n",
"new : 0.008491789796663759\n",
"ve : 0.00829786816819498\n",
"know : 0.0078067833055090775\n",
"team : 0.007707870425712116\n",
"make : 0.007674871319121082\n",
"got : 0.007590768892011672\n",
"years : 0.007521317603459092\n",
"better : 0.007518048145453325\n",
"really : 0.007346372828973151\n",
"way : 0.007034423360390621\n",
"power : 0.006566448188612813\n",
"\n",
"[Topic 1]\n",
"drive : 0.015355231395836447\n",
"use : 0.012559908422174576\n",
"card : 0.011884579329690953\n",
"file : 0.010715825975593545\n",
"dos : 0.010264631487329215\n",
"scsi : 0.010213702586526673\n",
"disk : 0.00980979067318461\n",
"software : 0.009777678629631031\n",
"windows : 0.009089278470122012\n",
"output : 0.008908559762058713\n",
"program : 0.008850370933872623\n",
"pc : 0.00879408146612969\n",
"mac : 0.008695904083631504\n",
"version : 0.008565966751210267\n",
"bit : 0.008356445695350767\n",
"available : 0.0079203296723622\n",
"data : 0.007862687522739291\n",
"using : 0.007559820136801286\n",
"memory : 0.007374272389157403\n",
"graphics : 0.006996138114178558\n",
"\n",
"[Topic 2]\n",
"10 : 0.03259595500403585\n",
"00 : 0.027458258523894066\n",
"25 : 0.022241790633267193\n",
"11 : 0.021703064879917067\n",
"15 : 0.021100612635025648\n",
"12 : 0.020702019961833518\n",
"20 : 0.02065322306164026\n",
"14 : 0.018963443912077133\n",
"16 : 0.01758453084887431\n",
"17 : 0.017079653636598024\n",
"18 : 0.016823140412694174\n",
"13 : 0.01681217395503477\n