@boskaiolo
Created February 19, 2015 17:13
{
"metadata": {
"name": "",
"signature": "sha256:b1574a938f2e291456a5666357dffb74c74dfd7686240c919057657e1bc6b104"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Gensim and LDA: a quick tour"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, set the verbosity of the logger. In this example we log only warnings; for easier debugging, log all the INFO messages as well."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import logging\n",
"logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.WARNING)\n",
"logging.root.level = logging.WARNING"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, it's time to get some textual data. We're gonna use the 20 newsgroups dataset (more info here: http://qwone.com/~jason/20Newsgroups). As stated by its creators, it is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.\n",
"\n",
"To make things more real, we're remving email headers, footers (like signatures) and quoted messages."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import datasets\n",
"news_dataset = datasets.fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# A list of text document is contained in the data variable\n",
"documents = news_dataset.data\n",
"\n",
"print \"In the dataset there are\", len(documents), \"textual documents\"\n",
"print \"And this is the first one:\\n\", documents[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"In the dataset there are 18846 textual documents\n",
"And this is the first one:\n",
"\n",
"\n",
"I am sure some bashers of Pens fans are pretty confused about the lack\n",
"of any kind of posts about the recent Pens massacre of the Devils. Actually,\n",
"I am bit puzzled too and a bit relieved. However, I am going to put an end\n",
"to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\n",
"are killing those Devils worse than I thought. Jagr just showed you why\n",
"he is much better than his regular season stats. He is also a lot\n",
"fo fun to watch in the playoffs. Bowman should let JAgr have a lot of\n",
"fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\n",
"regular season game. PENS RULE!!!\n",
"\n",
"\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We do now have a collection of documents. Let's start with some preprocessing steps. At first, we're gonna import all the modules we need. Then, we define a word tokenizer (https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)) with stopword removal (common words like \"the\", \"are\" and \"and\" are excuded from the processing, since they don't have discriminative power and they just increase the processing complexity)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import gensim\n",
"from gensim.utils import simple_preprocess\n",
"from gensim.parsing.preprocessing import STOPWORDS"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def tokenize(text):\n",
" return [token for token in gensim.utils.simple_preprocess(text) if token not in gensim.parsing.preprocessing.STOPWORDS]\n",
"\n",
"print \"After the tokenizer, the previous document becomes:\\n\", tokenize(documents[0])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"After the tokenizer, the previous document becomes:\n",
"[u'sure', u'bashers', u'pens', u'fans', u'pretty', u'confused', u'lack', u'kind', u'posts', u'recent', u'pens', u'massacre', u'devils', u'actually', u'bit', u'puzzled', u'bit', u'relieved', u'going', u'end', u'non', u'pittsburghers', u'relief', u'bit', u'praise', u'pens', u'man', u'killing', u'devils', u'worse', u'thought', u'jagr', u'showed', u'better', u'regular', u'season', u'stats', u'lot', u'fo', u'fun', u'watch', u'playoffs', u'bowman', u'let', u'jagr', u'lot', u'fun', u'couple', u'games', u'pens', u'going', u'beat', u'pulp', u'jersey', u'disappointed', u'islanders', u'lose', u'final', u'regular', u'season', u'game', u'pens', u'rule']\n"
]
}
],
"prompt_number": 5
},
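{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what the tokenizer does, here is a rough pure-Python sketch of the same steps: lowercasing, splitting into alphabetic tokens, dropping very short or very long tokens, and removing stopwords. It only approximates `simple_preprocess` (for instance, it handles ASCII letters only), and `toy_tokenize`, `toy_stopwords` and the length limits are illustrative assumptions, not gensim's code."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import re\n",
"\n",
"# Hypothetical stopword list, far smaller than gensim's STOPWORDS\n",
"toy_stopwords = frozenset(['the', 'are', 'and', 'of', 'to'])\n",
"\n",
"def toy_tokenize(text, min_len=2, max_len=15):\n",
"    # Lowercase the text and keep only runs of ASCII letters\n",
"    words = re.findall(r'[a-z]+', text.lower())\n",
"    # Drop tokens that are too short, too long, or in the stopword list\n",
"    return [w for w in words if min_len <= len(w) <= max_len and w not in toy_stopwords]\n",
"\n",
"print(toy_tokenize('The Pens are going to beat the pulp out of Jersey!'))"
],
"language": "python",
"metadata": {},
"outputs": []
},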
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next step: tokenise all the documents and build a count dictionary, that contains the count of the tokens over the complete text corpus."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"processed_docs = [tokenize(doc) for doc in documents]\n",
"word_count_dict = gensim.corpora.Dictionary(processed_docs)\n",
"print \"In the corpus there are\", len(word_count_dict), \"unique tokens\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"In the corpus there are 95507 unique tokens\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We might want to further lower the complexity of the process, removing all the very rare tokens (the ones appearing in less than 20 documents) and the very popular ones (the ones appearing in more than 20% documents; in our case circa 4'000)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"word_count_dict.filter_extremes(no_below=20, no_above=0.1) # word must appear >10 times, and no more than 20% documents"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"After filtering, in the corpus there are only\", len(word_count_dict), \"unique tokens\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"After filtering, in the corpus there are only 8121 unique tokens\n"
]
}
],
"prompt_number": 8
},
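{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the filtering rule concrete, here is a minimal pure-Python sketch of document-frequency filtering on a toy corpus. It is only an illustration of the idea, not gensim's actual implementation; `toy_docs`, `filter_by_doc_freq` and the thresholds used here are made-up names and values. In the toy corpus, 'game' is dropped as too common and tokens such as 'win' are dropped as too rare."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"# Hypothetical toy corpus: each document is a list of tokens\n",
"toy_docs = [['game', 'team', 'win'],\n",
"            ['game', 'space', 'orbit'],\n",
"            ['game', 'team', 'season'],\n",
"            ['game', 'nasa', 'space']]\n",
"\n",
"def filter_by_doc_freq(docs, no_below, no_above):\n",
"    # Count in how many documents each token appears (set() ignores repeats within a document)\n",
"    doc_freq = Counter(token for doc in docs for token in set(doc))\n",
"    # Convert the fraction no_above into an absolute number of documents\n",
"    max_docs = no_above * len(docs)\n",
"    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)\n",
"\n",
"# Keep tokens found in at least 2 documents and in at most 75% of them\n",
"print(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75))"
],
"language": "python",
"metadata": {},
"outputs": []
},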
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's not build the bag of words representation (https://en.wikipedia.org/wiki/Bag-of-words_model) of the text documents, to create a nice vector space model (https://en.wikipedia.org/wiki/Vector_space_model). Within this methaphor, a vector lists the multiplicity of the tokens appearing in the document. The vector is indexed by the dictionary of tokens, previously built. Note that, since a restricted subset of words appears in each document, this vector is often represented in a sparse way."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"bow_doc1 = bag_of_words_corpus[0]\n",
"\n",
"print \"Bag of words representation of the first document (tuples are composed by token_id and multiplicity):\\n\", bow_doc1\n",
"print\n",
"for i in range(5):\n",
" print \"In the document, topic_id {} (word \\\"{}\\\") appears {} time[s]\".format(bow_doc1[i][0], word_count_dict[bow_doc1[i][0]], bow_doc1[i][1])\n",
"print \"...\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Bag of words representation of the first document (tuples are composed by token_id and multiplicity):\n",
"[(219, 1), (770, 2), (780, 2), (1353, 1), (1374, 1), (1567, 1), (1722, 2), (2023, 1), (2698, 1), (3193, 1), (3214, 1), (3352, 1), (3466, 1), (3754, 1), (3852, 1), (3879, 1), (3965, 1), (4212, 1), (4303, 2), (4677, 1), (4702, 1), (4839, 1), (4896, 1), (5000, 1), (5242, 5), (5396, 2), (5403, 1), (5453, 2), (5509, 3), (5693, 1), (5876, 1), (5984, 1), (6211, 1), (6272, 1), (6392, 1), (6436, 1), (6676, 1), (6851, 2), (6884, 1), (7030, 1), (7162, 1), (7185, 1), (7370, 1), (7882, 1)]\n",
"\n",
"In the document, topic_id 219 (word \"showed\") appears 1 time[s]\n",
"In the document, topic_id 770 (word \"jagr\") appears 2 time[s]\n",
"In the document, topic_id 780 (word \"going\") appears 2 time[s]\n",
"In the document, topic_id 1353 (word \"recent\") appears 1 time[s]\n",
"In the document, topic_id 1374 (word \"couple\") appears 1 time[s]\n",
"...\n"
]
}
],
"prompt_number": 10
},
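{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sparse representation above can be mimicked in a few lines of plain Python. This is a sketch of the idea behind `doc2bow`, not gensim's code: `toy_token2id` and `toy_doc2bow` are hypothetical names, and tokens missing from the mapping (here 'rule') are simply ignored, just as filtered-out tokens no longer contribute to the vectors above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"# Hypothetical token -> id mapping; a real one comes from gensim.corpora.Dictionary\n",
"toy_token2id = {'game': 0, 'pens': 1, 'season': 2}\n",
"\n",
"def toy_doc2bow(tokens, token2id):\n",
"    # Count only the tokens known to the mapping, keyed by their integer id\n",
"    counts = Counter(token2id[t] for t in tokens if t in token2id)\n",
"    # Emit (token_id, multiplicity) pairs sorted by token_id\n",
"    return sorted(counts.items())\n",
"\n",
"print(toy_doc2bow(['pens', 'game', 'pens', 'rule'], toy_token2id))"
],
"language": "python",
"metadata": {},
"outputs": []
},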
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, finally, the core algorithm of the analysis: LDA. Gensim offers two implementations: a monocore one, and a multicore. We use the monocore one, setting the number of topics equal to 10 (you can change it, and check the results). Try themulticore to prove the speedup!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# LDA mono-core\n",
"lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)\n",
"\n",
"# LDA multicore (in this configuration, defaulty, uses n_cores-1)\n",
"# lda_model = gensim.models.LdaMulticore(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a list of the words (and their relative weights) for each topic:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"_ = lda_model.print_topics(-1)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's print now the topics composition, and their scores, for the first document. You will see that only few topics are represented; the others have a nil score."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for index, score in sorted(lda_model[bag_of_words_corpus[0]], key=lambda tup: -1*tup[1]):\n",
" print \"Score: {}\\t Topic: {}\".format(score, lda_model.print_topic(index, 10))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Score: 0.853884500928\t Topic: 0.015*game + 0.012*team + 0.010*year + 0.009*games + 0.007*st + 0.007*play + 0.006*season + 0.006*hockey + 0.005*league + 0.005*players\n",
"Score: 0.0846334499472\t Topic: 0.019*space + 0.008*nasa + 0.007*earth + 0.006*science + 0.005*data + 0.005*research + 0.005*launch + 0.005*center + 0.004*program + 0.004*orbit\n",
"Score: 0.0284017012333\t Topic: 0.010*said + 0.010*israel + 0.006*medical + 0.006*children + 0.006*israeli + 0.005*years + 0.005*women + 0.004*arab + 0.004*killed + 0.004*disease\n",
"Score: 0.0227330510447\t Topic: 0.011*turkish + 0.011*db + 0.009*armenian + 0.008*turkey + 0.006*greek + 0.006*armenians + 0.006*jews + 0.006*muslim + 0.006*homosexuality + 0.005*turks\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's wonderful! LDA is able to understand that the article is about a team game, hockey, even though the work hockey *never* appears in the document. Checking the ground truth for that document (the newsgroup category) it's actually correct! It was posted in sport/hockey category. Other topics, if any, account for less than 5%, so they have to be considered marginals (dirt)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"news_dataset.target_names[news_dataset.target[0]]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"'rec.sport.hockey'"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far, we have dealt with documents contained in the training set. What if we need to process an unseed document? Fortunately, we don't need to re-train the system (wasting lots of time), as we can just infer its topics."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"unseen_document = \"In my spare time I either play badmington or drive my car\"\n",
"print \"The unseen document is composed by the following text:\", unseen_document\n",
"print\n",
"\n",
"bow_vector = word_count_dict.doc2bow(tokenize(unseen_document))\n",
"for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):\n",
" print \"Score: {}\\t Topic: {}\".format(score, lda_model.print_topic(index, 5))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The unseen document is composed by the following text: In my spare time I either play badmington or drive my car\n",
"\n",
"Score: 0.631871020975\t Topic: 0.007*car + 0.005*ll + 0.005*got + 0.004*little + 0.004*power\n",
"Score: 0.208106465922\t Topic: 0.015*game + 0.012*team + 0.010*year + 0.009*games + 0.007*st\n",
"Score: 0.0200214219043\t Topic: 0.014*windows + 0.014*dos + 0.012*drive + 0.010*thanks + 0.010*card\n",
"Score: 0.0200004776176\t Topic: 0.010*said + 0.010*israel + 0.006*medical + 0.006*children + 0.006*israeli\n",
"Score: 0.0200003461406\t Topic: 0.009*government + 0.009*key + 0.007*public + 0.005*president + 0.005*law\n",
"Score: 0.0200002155703\t Topic: 0.014*god + 0.006*believe + 0.004*jesus + 0.004*said + 0.004*point\n",
"Score: 0.0200000317801\t Topic: 0.011*turkish + 0.011*db + 0.009*armenian + 0.008*turkey + 0.006*greek\n",
"Score: 0.020000020082\t Topic: 0.013*file + 0.013*edu + 0.010*image + 0.008*available + 0.007*ftp\n",
"Score: 0.0200000000038\t Topic: 0.019*space + 0.008*nasa + 0.007*earth + 0.006*science + 0.005*data\n",
"Score: 0.0200000000037\t Topic: 0.791*ax + 0.059*max + 0.009*pl + 0.007*di + 0.007*tm\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"Log perplexity of the model is\", lda_model.log_perplexity(bag_of_words_corpus)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Log perplexity of the model is "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"-7.58115143751\n"
]
}
],
"prompt_number": 17
},
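{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number above is a per-word likelihood bound, not a perplexity. Gensim's own log message reports the perplexity estimate as 2 raised to the negated bound, so the conversion can be sketched as below; treat it as an assumption about how to read the value rather than a definitive recipe. Also keep in mind that the bound was computed on the training corpus itself, so it is likely optimistic compared with a held-out set."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Value copied from the output of the previous cell\n",
"per_word_bound = -7.58115143751\n",
"\n",
"# Assumed conversion: perplexity = 2 ** (-bound)\n",
"perplexity = 2 ** (-per_word_bound)\n",
"print('Perplexity estimate: {:.1f}'.format(perplexity))"
],
"language": "python",
"metadata": {},
"outputs": []
},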
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
}
],
"metadata": {}
}
]
}