@patrickvankessel
Created April 8, 2019 15:36
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package movie_reviews to\n",
"[nltk_data] /home/pvankessel/nltk_data...\n",
"[nltk_data] Package movie_reviews is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.download(\"movie_reviews\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2000\n"
]
}
],
"source": [
"rows = []\n",
"for fileid in nltk.corpus.movie_reviews.fileids():\n",
" rows.append({\"text\": nltk.corpus.movie_reviews.raw(fileid)})\n",
"df = pd.DataFrame(rows)\n",
"print(len(df))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"21886\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"vectorizer = TfidfVectorizer(\n",
" max_df=.5,\n",
" min_df=10,\n",
" max_features=None,\n",
" ngram_range=(1, 2),\n",
" norm=None,\n",
" binary=True,\n",
" use_idf=False,\n",
" sublinear_tf=False\n",
")\n",
"vectorizer = vectorizer.fit(df['text'])\n",
"tfidf = vectorizer.transform(df['text'])\n",
"vocab = vectorizer.get_feature_names()\n",
"print(len(vocab))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from corextopic import corextopic as ct"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"anchors = []\n",
"model = ct.Corex(n_hidden=8, seed=42)\n",
"model = model.fit(\n",
" tfidf,\n",
" words=vocab\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topic #1: see, me, had, really, don, know, think, my, because, how\n",
"Topic #2: life, he is, both, never, it is, of his, that he, world, performance, to his\n",
"Topic #3: the first, the most, films, from the, many, by the, since, such, at the, while\n",
"Topic #4: comedy, funny, jokes, humor, laughs, funniest, the funniest, hilarious, the jokes, joke\n",
"Topic #5: young, opening, music, follow, portrayal, cinematography, mars, aspect, art, shown\n",
"Topic #6: murder, crime, thriller, police, killer, dead, the police, he has, turns, prison\n",
"Topic #7: plot, action, case, critique, the plot, suspense, none, blair witch, seem, cool\n",
"Topic #8: horror, horror film, scream, slasher, did last, horror films, scary, you did, williamson\n"
]
}
],
"source": [
"for i, topic_ngrams in enumerate(model.get_topics(n_words=10)):\n",
" topic_ngrams = [ngram[0] for ngram in topic_ngrams if ngram[1] > 0]\n",
" print(\"Topic #{}: {}\".format(i+1, \", \".join(topic_ngrams)))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Anchors designed to nudge the model towards measuring specific genres\n",
"anchors = [\n",
" [\"action\", \"adventure\"],\n",
" [\"drama\"],\n",
" [\"comedy\", \"funny\"],\n",
" [\"horror\", \"suspense\"],\n",
" [\"animated\", \"animation\"],\n",
" [\"sci fi\", \"alien\"],\n",
" [\"romance\", \"romantic\"],\n",
" [\"fantasy\"]\n",
"]\n",
"anchors = [\n",
" [a for a in topic if a in vocab]\n",
" for topic in anchors\n",
"]\n",
"\n",
"model = ct.Corex(n_hidden=8, seed=42)\n",
"model = model.fit(\n",
" tfidf,\n",
" words=vocab,\n",
" anchors=anchors, # Pass the anchors in here\n",
" anchor_strength=3 # Tell the model how much it should rely on the anchors\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topic #1: action, adventure, the action, scenes, action sequences, where, action scenes, an action, action film, sequences\n",
"Topic #2: drama, performance, mother, director, both, while, and his, to his, role, performances\n",
"Topic #3: comedy, funny, jokes, laughs, humor, funny and, hilarious, very funny, gags, laugh\n",
"Topic #4: horror, really, think, had, me, did, how, see, because, were\n",
"Topic #5: animated, animation, disney, children, the animation, computer, adults, years, voice of, voice\n",
"Topic #6: alien, sci fi, effects, special effects, fi, aliens, sci, planet, special, earth\n",
"Topic #7: romantic, romance, she, love, with her, of her, that she, relationship, woman, romantic comedy\n",
"Topic #8: life, he is, fantasy, world, it is, that the, perhaps, point, does, through\n"
]
}
],
"source": [
"for i, topic_ngrams in enumerate(model.get_topics(n_words=10)):\n",
" topic_ngrams = [ngram[0] for ngram in topic_ngrams if ngram[1] > 0]\n",
" print(\"Topic #{}: {}\".format(i+1, \", \".join(topic_ngrams)))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"topic_df = pd.DataFrame(\n",
" model.transform(tfidf), \n",
" columns=[\"topic_{}\".format(i+1) for i in range(8)]\n",
").astype(float)\n",
"topic_df.index = df.index\n",
"df = pd.concat([df, topic_df], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>topic_1</th>\n",
" <th>topic_2</th>\n",
" <th>topic_3</th>\n",
" <th>topic_4</th>\n",
" <th>topic_5</th>\n",
" <th>topic_6</th>\n",
" <th>topic_7</th>\n",
" <th>topic_8</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1860</th>\n",
" <td>the verdict : spine-chilling drama from horror...</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>353</th>\n",
" <td>\" the 44 caliber killer has struck again . \" ...</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1333</th>\n",
" <td>in the company of men made a splash at the sun...</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>905</th>\n",
" <td>in the year 2029 , captain leo davidson ( mark...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1289</th>\n",
" <td>[note that followups are directed to rec . art...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text topic_1 topic_2 \\\n",
"1860 the verdict : spine-chilling drama from horror... 1.0 1.0 \n",
"353 \" the 44 caliber killer has struck again . \" ... 0.0 1.0 \n",
"1333 in the company of men made a splash at the sun... 0.0 1.0 \n",
"905 in the year 2029 , captain leo davidson ( mark... 0.0 0.0 \n",
"1289 [note that followups are directed to rec . art... 1.0 0.0 \n",
"\n",
" topic_3 topic_4 topic_5 topic_6 topic_7 topic_8 \n",
"1860 0.0 1.0 0.0 1.0 1.0 0.0 \n",
"353 0.0 1.0 0.0 0.0 0.0 1.0 \n",
"1333 1.0 1.0 0.0 1.0 1.0 1.0 \n",
"905 0.0 0.0 0.0 1.0 1.0 0.0 \n",
"1289 1.0 0.0 0.0 1.0 0.0 0.0 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(5, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "pewthon2",
"language": "python",
"name": "pewthon2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@ImSajeed

ImSajeed commented May 27, 2019

@patrickvankessel, I tried running the CorEx model on a dataset of 600K documents and it's taking more than an hour to process. Is this expected?

@patrickvankessel
Author

@patrickvankessel, I tried running the CorEx model on a dataset of 600K documents and it's taking more than an hour to process. Is this expected?

I've never tried scaling it up to a dataset that large - it may eventually finish, but it could take hours or days, especially if you have longer documents and a large vocabulary. If you want a faster option, you could fit the model on a sample of 50-100k documents, and then apply the model to the full dataset afterwards. You could also try narrowing the vocabulary by tweaking the TF-IDF vectorizer parameters and setting a max_features limit.
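For reference, here's a rough sketch of that sample-then-apply workflow, reusing the variable and parameter names from the notebook above. The `max_features=20000` cap and the 100k sample size are illustrative values, not recommendations, and `get_feature_names()` matches the notebook's scikit-learn version (newer versions call it `get_feature_names_out()`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from corextopic import corextopic as ct

# Fit the vectorizer and the CorEx model on a manageable sample...
sample = df.sample(100000, random_state=42)
vectorizer = TfidfVectorizer(
    max_df=.5,
    min_df=10,
    max_features=20000,  # illustrative cap to narrow the vocabulary
    ngram_range=(1, 2),
    norm=None,
    binary=True,
    use_idf=False,
    sublinear_tf=False
)
tfidf_sample = vectorizer.fit_transform(sample['text'])
vocab = vectorizer.get_feature_names()

model = ct.Corex(n_hidden=8, seed=42)
model = model.fit(tfidf_sample, words=vocab)

# ...then apply the fitted vectorizer and model to the full dataset
tfidf_full = vectorizer.transform(df['text'])
topic_matrix = model.transform(tfidf_full)
```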

@ImSajeed

@patrickvankessel, the whole corpus has a vocabulary of 850k; I'm now trying it on 100k data points.

@cheevahagadog

This was very helpful! Thanks @patrickvankessel

@nguyenhaidang94

Hi @patrickvankessel, in your example there are 8 topics. Is it obligatory to give anchors for all of the topics?
Can I give anchors for only 6 topics? I want the model to learn the other two topics naturally.

@nadia-felix

Hi @patrickvankessel, in your example there are 8 topics. Is it obligatory to give anchors for all of the topics?
Can I give anchors for only 6 topics? I want the model to learn the other two topics naturally.

@patrickvankessel
Author

You can provide anchors for as many or as few topics as you want - it's perfectly fine to leave some (or all) of them empty!
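For example, a sketch like this (reusing the notebook's `tfidf` and `vocab`) anchors only six of the eight topics and lets the model discover the remaining two on its own:

```python
# Anchor 6 of the 8 topics; CorEx learns the last 2 unanchored
anchors = [
    ["action", "adventure"],
    ["drama"],
    ["comedy", "funny"],
    ["horror", "suspense"],
    ["animated", "animation"],
    ["sci fi", "alien"]
]
anchors = [[a for a in topic if a in vocab] for topic in anchors]

model = ct.Corex(n_hidden=8, seed=42)
model = model.fit(tfidf, words=vocab, anchors=anchors, anchor_strength=3)
```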

@ImKH310

ImKH310 commented Feb 24, 2022

Hi! @patrickvankessel, I really appreciate your example; it's exactly what I want to implement! However, I have one question about the final dataframe: each row's text has several topics marked 1.0. How can I assign just one topic per text? And can I get float values other than 0 or 1 out of CorEx?

@GiarteDataTeam

How do you predict topics for new documents? I ran into an issue when calling model.predict() on new documents.

@eduamf

eduamf commented May 3, 2022

Hi @patrickvankessel, I like your blog. I went through the same process, thinking "it's a big mess!", and after some adjustments, with all the topics "overcooked", I was shocked at how incoherent the results seemed (to me)!

My question: you did not remove stop words. In my process I drop them before the last step, but your topics retained them. Was there any reason for that?
