Topic Modeling Amazon Reviews
"source": [
"from nltk.tokenize import RegexpTokenizer\n",
"from nltk.corpus import stopwords\n",
"from stop_words import get_stop_words\n",
"from nltk.stem.snowball import SnowballStemmer\n",
"from gensim import corpora, models\n",
"import gensim"
"### Loading our data\n",
"Warning: you will receive an error message when trying to use nltk's stopwords if you don't explicitly download the stopwords first: \n",
"import nltk\n",
"\"stopwords\") \n",
"Loading the provided reviews subset JSON into a Pandas dataframe:"
"source": [
"import pandas as pd\n",
"import gzip\n",
"# one-review-per-line in json\n",
"def parse(path):\n",
" g =, 'rb')\n",
" for l in g:\n",
" yield eval(l)\n",
"def getDF(path):\n",
" i = 0\n",
" df = {}\n",
" for d in parse(path):\n",
" df[i] = d\n",
" i += 1\n",
" return pd.DataFrame.from_dict(df, orient='index')\n",
"df = getDF('reviews_Automotive_5.json.gz')\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 20473 entries, 0 to 20472\n",
"Data columns (total 9 columns):\n",
"reviewerID 20473 non-null object\n",
"asin 20473 non-null object\n",
"reviewerName 20260 non-null object\n",
"helpful 20473 non-null object\n",
"unixReviewTime 20473 non-null int64\n",
"reviewText 20473 non-null object\n",
"overall 20473 non-null float64\n",
"reviewTime 20473 non-null object\n",
"summary 20473 non-null object\n",
"dtypes: float64(1), int64(1), object(7)\n",
"memory usage: 1.6+ MB\n"

"Now that we have a nice corpus of text, lets go through some of the standard preprocessing required for almost any topic modeling or NLP problem.\n",
"Our Approach will involve:\n",
"1. Tokenizing: converting a document to its atomic elements\n",
"2. Stopping: removing meaningless words\n",
"3. Stemming: merging words that are equivalent in meaning\n",
"### Tokenization\n",
"We have many ways to segment our document into its atomic elements. To start we'll tokenize the document into words. For this instance we'll use NLTK’s `tokenize.regexp` module. You can see how this works in a fun interactive way here: try 'w+' at\n",
"![alt text]( \"\")"
"# Using one of our docs as an example\n",
"tokens = tokenizer.tokenize(doc_1.lower())\n",
"print('{} characters in string vs {} words in a list'.format(len(doc_1), len(tokens)))\n",
"### Stop Words\n",
"Determiners like \"the\" and conjunctions such as \"or\" and \"for\" do not add value to our simple topic model. We refer to these types of words as stop words and want to remove them from our list of tokens. The definition of a stop work changes depending on the context of the documents we are examining. If considering Product Reviews for [children's board games on]( we would not find \"Chutes and Ladders\" as a token and eventually an entity in some other model if we remove the word \"and\" as we'll end up with a distinct \"chutes\" AND \"ladders\" in our list.\n",
"Let's make a super list of stop words from the `stop_words` and `nltk` package below. By the way if you're using Python 3 you can make use of an odd new feature to unpack lists into a new list:\n",
"merged_stopwords = [*nltk_stpwd, *stop_words_stpwd] # Python 3 oddity insanity to merge lists\n",
"nltk_stpwd = stopwords.words('english')\n",
"stop_words_stpwd = get_stop_words('en')\n",
"merged_stopwords = list(set(nltk_stpwd + stop_words_stpwd))\n",
"stopped_tokens = [token for token in tokens if not token in merged_stopwords]\n",
"### Stemming\n",
"Stemming allows us to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance, running and runner to run. Another example:\n",
"*Amazon's catalog contains bike tires in different sizes and colors $\\Rightarrow$ Amazon catalog contain bike tire in differ size and color*\n",
"Stemming is a basic and crude heuristic compared to [Lemmatization]( which understands vocabulary and morphological analysis instead of lobbing off the end of words. Essentially Lemmatization removes inflectional endings to return the word to its base or dictionary form of a word, which is defined as the lemma. Great illustrative examples from Wikipedia:\n",
"1. *The word \"better\" has \"good\" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.*\n",
"2. *The word \"walk\" is the base form for word \"walking\", and hence this is matched in both stemming and lemmatisation.*\n",
"3. *The word \"meeting\" can be either the base form of a noun or a form of a verb (\"to meet\") depending on the context, e.g., \"in our last meeting\" or \"We are meeting again tomorrow\". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.*\n",
"We'll start with the common [Snowball stemming method](, a successor of sorts of the original Porter Stemmer which is implemented in NLTK:"
"stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]\n",
"### Putting together a document-term matrix\n",
"In order to create an LDA model we'll need to put the 3 steps from above (tokenizing, stopping, stemming) together to create a list of documents (list of lists) to then generate a document-term matrix (unique terms as rows, documents or reviews as columns). This matrix will tell us how frequently each term occurs with each individual document. "
"### Transform tokenized documents into an id-term dictionary\n",
"Gensim's Dictionary method encapsulates the mapping between normalized words and their integer ids. Note a term will have an id of some number and in the subsequent bag of words step we can see that id will have a count associated with it."
"metadata": {},
"source": [
"To see the mapping between words and their ids we can use the `token2id` method:"
"cell_type": "code",
"We went from **19216** unique tokens to **2462** after filtering. Looking at the top 10 tokens it looks like we got more specific subjects opposed to adjectives.\n",
"### Creating bag of words\n",
"Next let's turn `texts_dict` into a bag of words instead. doc2bow converts a `document` (a list of words) into the bag-of-words format (list of `(token_id, token_count)` tuples)."
"corpus = [texts_dict.doc2bow(text) for text in texts]\n",
"The corpus is 20473 long, the amount of reviews in our dataset and in our dataframe. Let's dump this bag-of-words into a file to avoid parsing the entire text again:"
"### Training an LDA model\n",
"As a topic modeling newbie this part is unsatisfying to me. In this unsupervised learning application I can see how a lot of people would arbitrarily set a number of topics, similar to centroids in k-means clustering, and then have a human evaluate if the topics \"make sense\". You can go very deep very quickly by researching this online. For now let's plead ignorance and go through with a simple model FULL of assumptions :)\n",
"Training an LDA model using our BOW corpus as training data:"
"1. Performance Parts & Accessories\n",
"2. Replacement Parts\n",
"3. Truck Accessories\n",
"4. Interior Accessories\n",
"5. Exterior Accessories\n",
"6. Tires & Wheels\n",
"7. Car Care\n",
"8. Tools & Equipment\n",
"9. Motorcycle & Powersports Accessories\n",
"10. Car Electronics\n",
"11. Enthusiast Merchandise\n",
"I think these categories could be compressed into 5 general topics. We might consider rolling #9 into 4 & 5, and rolling the products in #3 across other accessory categories and so on."
"### Inferring Topics \n",
"Below are the top 5 words associated with 5 random topics. The float next to each word is the weight showing how much the given word influences this specific topic. In this case, we see that for topic `4`, light and battery are the most telling words. We might interpret that topic `4` might be close to Amazon's Tools & Equipment category which has a sub-category titled \"Jump Starters, Battery Chargers & Portable Power\". Similarly we might infer topic `1` refers to Car Care, maybe sub category \"Exterior Care\".\n"
"# For `num_topics` number of topics, return `num_words` most significant words\n",
Note that LDA is a probabilistic mixture of mixtures (or admixture) model for grouped data. The observed data (words) within the groups (documents) are the result of probabilistically choosing words from a specific topic (multinomial over the vocabulary), where the topic is itself drawn from a document-specific multinomial that has a global Dirichlet prior. This means that words can belong to various topics in various degrees. For example, the word 'pressure' might refer to a category/topic of automotive wash products and a category of tire products (in the case where we think the topics are about classes of products).
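The generative story described above can be sketched with the standard library. Everything here is a made-up toy: the two topics, their word probabilities, and the prior `alpha` are assumptions chosen for illustration, and the code only samples from the model rather than fitting it.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw is a set of normalized independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    # Document-specific topic mixture drawn from the global Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: "pressure" has weight in both.
topics = [
    {"tire": 0.5, "pressure": 0.3, "gauge": 0.2},
    {"wash": 0.5, "pressure": 0.3, "wax": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)
print(words)
```

Because "pressure" appears in both toy topics, seeing it in a document does not identify the topic on its own, which is the sense in which a word belongs to several topics to varying degrees.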
### Querying the LDA Model
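We cannot pass an arbitrary raw string to our model directly. The text first has to go through the same tokenizing, stopping, and stemming steps and be converted with `texts_dict.doc2bow`; only then can we evaluate which topics are most associated with it.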
"### Querying the LDA Model\n",
"We cannot pass an arbitrary string to our model and evaluate what topics are most associated with it."
