{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.tokenize import RegexpTokenizer\n",
"from nltk.corpus import stopwords\n",
"from stop_words import get_stop_words\n",
"from nltk.stem.snowball import SnowballStemmer\n",
"from gensim import corpora, models\n",
"import gensim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading our data\n",
"Warning: you will receive an error message when trying to use nltk's stopwords if you don't explicitly download the stopwords first: \n",
"```python\n",
"import nltk\n",
"nltk.download(\"stopwords\") \n",
"```\n",
"\n",
"Loading the provided reviews subset JSON into a Pandas dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>reviewerID</th>\n",
" <th>asin</th>\n",
" <th>reviewerName</th>\n",
" <th>helpful</th>\n",
" <th>unixReviewTime</th>\n",
" <th>reviewText</th>\n",
" <th>overall</th>\n",
" <th>reviewTime</th>\n",
" <th>summary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A3F73SC1LY51OO</td>\n",
" <td>B00002243X</td>\n",
" <td>Alan Montgomery</td>\n",
" <td>[4, 4]</td>\n",
" <td>1313539200</td>\n",
" <td>I needed a set of jumper cables for my new car...</td>\n",
" <td>5.0</td>\n",
" <td>08 17, 2011</td>\n",
" <td>Work Well - Should Have Bought Longer Ones</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A20S66SKYXULG2</td>\n",
" <td>B00002243X</td>\n",
" <td>alphonse</td>\n",
" <td>[1, 1]</td>\n",
" <td>1315094400</td>\n",
" <td>These long cables work fine for my truck, but ...</td>\n",
" <td>4.0</td>\n",
" <td>09 4, 2011</td>\n",
" <td>Okay long cables</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2I8LFSN2IS5EO</td>\n",
" <td>B00002243X</td>\n",
" <td>Chris</td>\n",
" <td>[0, 0]</td>\n",
" <td>1374710400</td>\n",
" <td>Can't comment much on these since they have no...</td>\n",
" <td>5.0</td>\n",
" <td>07 25, 2013</td>\n",
" <td>Looks and feels heavy Duty</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3GT2EWQSO45ZG</td>\n",
" <td>B00002243X</td>\n",
" <td>DeusEx</td>\n",
" <td>[19, 19]</td>\n",
" <td>1292889600</td>\n",
" <td>I absolutley love Amazon!!! For the price of ...</td>\n",
" <td>5.0</td>\n",
" <td>12 21, 2010</td>\n",
" <td>Excellent choice for Jumper Cables!!!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A3ESWJPAVRPWB4</td>\n",
" <td>B00002243X</td>\n",
" <td>E. Hernandez</td>\n",
" <td>[0, 0]</td>\n",
" <td>1341360000</td>\n",
" <td>I purchased the 12' feet long cable set and th...</td>\n",
" <td>5.0</td>\n",
" <td>07 4, 2012</td>\n",
" <td>Excellent, High Quality Starter Cables</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" reviewerID asin reviewerName helpful unixReviewTime \\\n",
"0 A3F73SC1LY51OO B00002243X Alan Montgomery [4, 4] 1313539200 \n",
"1 A20S66SKYXULG2 B00002243X alphonse [1, 1] 1315094400 \n",
"2 A2I8LFSN2IS5EO B00002243X Chris [0, 0] 1374710400 \n",
"3 A3GT2EWQSO45ZG B00002243X DeusEx [19, 19] 1292889600 \n",
"4 A3ESWJPAVRPWB4 B00002243X E. Hernandez [0, 0] 1341360000 \n",
"\n",
" reviewText overall reviewTime \\\n",
"0 I needed a set of jumper cables for my new car... 5.0 08 17, 2011 \n",
"1 These long cables work fine for my truck, but ... 4.0 09 4, 2011 \n",
"2 Can't comment much on these since they have no... 5.0 07 25, 2013 \n",
"3 I absolutley love Amazon!!! For the price of ... 5.0 12 21, 2010 \n",
"4 I purchased the 12' feet long cable set and th... 5.0 07 4, 2012 \n",
"\n",
" summary \n",
"0 Work Well - Should Have Bought Longer Ones \n",
"1 Okay long cables \n",
"2 Looks and feels heavy Duty \n",
"3 Excellent choice for Jumper Cables!!! \n",
"4 Excellent, High Quality Starter Cables "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import gzip\n",
"\n",
"# one-review-per-line in json\n",
"def parse(path):\n",
" g = gzip.open(path, 'rb')\n",
" for l in g:\n",
" yield eval(l)\n",
"\n",
"def getDF(path):\n",
" i = 0\n",
" df = {}\n",
" for d in parse(path):\n",
" df[i] = d\n",
" i += 1\n",
" return pd.DataFrame.from_dict(df, orient='index')\n",
"\n",
"df = getDF('reviews_Automotive_5.json.gz')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"\"I needed a set of jumper cables for my new car and these had good reviews and were at a good price. They have been used a few times already and do what they are supposed to - no complaints there.What I will say is that 12 feet really isn't an ideal length. Sure, if you pull up front bumper to front bumper they are plenty long, but a lot of times you will be beside another car or can't get really close. Because of this, I would recommend something a little longer than 12'.Great brand - get 16' version though.\""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['reviewText'][0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 20473 entries, 0 to 20472\n",
"Data columns (total 9 columns):\n",
"reviewerID 20473 non-null object\n",
"asin 20473 non-null object\n",
"reviewerName 20260 non-null object\n",
"helpful 20473 non-null object\n",
"unixReviewTime 20473 non-null int64\n",
"reviewText 20473 non-null object\n",
"overall 20473 non-null float64\n",
"reviewTime 20473 non-null object\n",
"summary 20473 non-null object\n",
"dtypes: float64(1), int64(1), object(7)\n",
"memory usage: 1.6+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a nice corpus of text, lets go through some of the standard preprocessing required for almost any topic modeling or NLP problem.\n",
"\n",
"Our Approach will involve:\n",
"1. Tokenizing: converting a document to its atomic elements\n",
"2. Stopping: removing meaningless words\n",
"3. Stemming: merging words that are equivalent in meaning\n",
"\n",
"### Tokenization\n",
"We have many ways to segment our document into its atomic elements. To start we'll tokenize the document into words. For this instance we'll use NLTK’s `tokenize.regexp` module. You can see how this works in a fun interactive way here: try 'w+' at http://regexr.com/:\n",
"![alt text](http://kldavenport.com/wp-content/uploads/2017/03/regex.gif \"regexr.com\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tokenizer = RegexpTokenizer(r'\\w+')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running through part of the first review to demonstrate:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"doc_1 = df.reviewText[0]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"516 characters in string vs 103 words in a list\n",
"['i', 'needed', 'a', 'set', 'of', 'jumper', 'cables', 'for', 'my', 'new']\n"
]
}
],
"source": [
"# Using one of our docs as an example\n",
"tokens = tokenizer.tokenize(doc_1.lower())\n",
"\n",
"print('{} characters in string vs {} words in a list'.format(len(doc_1), len(tokens)))\n",
"print(tokens[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stop Words\n",
"Determiners like \"the\" and conjunctions such as \"or\" and \"for\" do not add value to our simple topic model. We refer to these types of words as stop words and want to remove them from our list of tokens. The definition of a stop work changes depending on the context of the documents we are examining. If considering Product Reviews for [children's board games on Amazon.com](https://www.amazon.com/s/ref=nb_sb_ss_c_0_18?url=search-alias%3Dtoys-and-games&field-keywords=childrens+board+games&sprefix=childrens+board+games%2Caps%2C200&rh=n%3A165793011%2Ck%3Achildrens+board+games) we would not find \"Chutes and Ladders\" as a token and eventually an entity in some other model if we remove the word \"and\" as we'll end up with a distinct \"chutes\" AND \"ladders\" in our list.\n",
"\n",
"Let's make a super list of stop words from the `stop_words` and `nltk` package below. By the way if you're using Python 3 you can make use of an odd new feature to unpack lists into a new list:\n",
"```python\n",
"merged_stopwords = [*nltk_stpwd, *stop_words_stpwd] # Python 3 oddity insanity to merge lists\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"207\n",
"[u'all', u\"she'll\", u'just', u\"don't\", u'being', u'over', u'both', u'through', u'yourselves', u'its']\n"
]
}
],
"source": [
"nltk_stpwd = stopwords.words('english')\n",
"stop_words_stpwd = get_stop_words('en')\n",
"merged_stopwords = list(set(nltk_stpwd + stop_words_stpwd))\n",
"\n",
"print(len(set(merged_stopwords)))\n",
"print(merged_stopwords[:10])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['needed', 'set', 'jumper', 'cables', 'new', 'car', 'good', 'reviews', 'good', 'price']\n"
]
}
],
"source": [
"stopped_tokens = [token for token in tokens if not token in merged_stopwords]\n",
"print(stopped_tokens[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemming\n",
"Stemming allows us to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance, running and runner to run. Another example:\n",
"\n",
"*Amazon's catalog contains bike tires in different sizes and colors $\\Rightarrow$ Amazon catalog contain bike tire in differ size and color*\n",
"\n",
"Stemming is a basic and crude heuristic compared to [Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) which understands vocabulary and morphological analysis instead of lobbing off the end of words. Essentially Lemmatization removes inflectional endings to return the word to its base or dictionary form of a word, which is defined as the lemma. Great illustrative examples from Wikipedia:\n",
"1. *The word \"better\" has \"good\" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.*\n",
"2. *The word \"walk\" is the base form for word \"walking\", and hence this is matched in both stemming and lemmatisation.*\n",
"3. *The word \"meeting\" can be either the base form of a noun or a form of a verb (\"to meet\") depending on the context, e.g., \"in our last meeting\" or \"We are meeting again tomorrow\". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.*\n",
"\n",
"We'll start with the common [Snowball stemming method](http://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg), a successor of sorts of the original Porter Stemmer which is implemented in NLTK:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Instantiate a Snowball stemmer\n",
"sb_stemmer = SnowballStemmer('english') "
]
},
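{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the Wikipedia examples above concrete, here is a minimal sketch contrasting the two approaches. It assumes NLTK's `WordNetLemmatizer` is available and that the wordnet corpus has been downloaded via `nltk.download(\"wordnet\")`:\n",
"```python\n",
"from nltk.stem import WordNetLemmatizer\n",
"from nltk.stem.snowball import SnowballStemmer\n",
"\n",
"stemmer = SnowballStemmer('english')\n",
"lemmatizer = WordNetLemmatizer()\n",
"\n",
"# pos tags: 'a' = adjective, 'n' = noun, 'v' = verb\n",
"for word, pos in [('better', 'a'), ('meeting', 'n'), ('meeting', 'v')]:\n",
"    print('{} -> stem: {}, lemma ({}): {}'.format(\n",
"        word, stemmer.stem(word), pos, lemmatizer.lemmatize(word, pos=pos)))\n",
"```\n",
"The stemmer maps \"meeting\" to \"meet\" regardless of context, while the lemmatizer keeps the noun form and only maps the verb form to \"meet\"; \"better\" only resolves to \"good\" via the dictionary-backed lemmatizer."
]
},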
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that p_stemmer requires all tokens to be type str. p_stemmer returns the string parameter in stemmed form, so we need to loop through our stopped_tokens:\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'need', u'set', u'jumper', u'cabl', u'new', u'car', u'good', u'review', u'good', u'price', u'use', u'time', u'alreadi', u'suppos', u'complaint', u'say', '12', u'feet', u'realli', u'ideal', u'length', u'sure', u'pull', u'front', u'bumper', u'front', u'bumper', u'plenti', u'long', u'lot', u'time', u'besid', u'anoth', u'car', u'get', u'realli', u'close', u'recommend', u'someth', u'littl', u'longer', '12', u'great', u'brand', u'get', '16', u'version', u'though']\n"
]
}
],
"source": [
"stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]\n",
"print(stemmed_tokens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Putting together a document-term matrix\n",
"\n",
"In order to create an LDA model we'll need to put the 3 steps from above (tokenizing, stopping, stemming) together to create a list of documents (list of lists) to then generate a document-term matrix (unique terms as rows, documents or reviews as columns). This matrix will tell us how frequently each term occurs with each individual document. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 36.6 s, sys: 263 ms, total: 36.9 s\n",
"Wall time: 36.9 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"num_reviews = df.shape[0]\n",
"\n",
"doc_set = [df.reviewText[i] for i in range(num_reviews)]\n",
"\n",
"texts = []\n",
"\n",
"for doc in doc_set:\n",
" # putting our three steps together\n",
" tokens = tokenizer.tokenize(doc.lower())\n",
" stopped_tokens = [token for token in tokens if not token in merged_stopwords]\n",
" stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]\n",
" \n",
" # add tokens to list\n",
" texts.append(stemmed_tokens)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'need', u'set', u'jumper', u'cabl', u'new', u'car', u'good', u'review', u'good', u'price', u'use', u'time', u'alreadi', u'suppos', u'complaint', u'say', '12', u'feet', u'realli', u'ideal', u'length', u'sure', u'pull', u'front', u'bumper', u'front', u'bumper', u'plenti', u'long', u'lot', u'time', u'besid', u'anoth', u'car', u'get', u'realli', u'close', u'recommend', u'someth', u'littl', u'longer', '12', u'great', u'brand', u'get', '16', u'version', u'though']\n"
]
}
],
"source": [
"print texts[0] # examine review 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transform tokenized documents into an id-term dictionary\n",
"Gensim's Dictionary method encapsulates the mapping between normalized words and their integer ids. Note a term will have an id of some number and in the subsequent bag of words step we can see that id will have a count associated with it."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dictionary(19216 unique tokens: [u'circuitri', u'html4', u'pathfind', u'spoonssmal', u'suspend']...)\n"
]
}
],
"source": [
"# Gensim's Dictionary encapsulates the mapping between normalized words and their integer ids.\n",
"texts_dict = corpora.Dictionary(texts)\n",
"texts_dict.save('auto_review.dict') # lets save to disk for later use\n",
"# Examine each token’s unique id\n",
"print(texts_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the mapping between words and their ids we can use the `token2id` method:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"IDs 1 through 10: [(u'set', 0), (u'cabl', 1), (u'realli', 2), (u'feet', 3), (u'say', 4), (u'alreadi', 5), (u'long', 6), (u'need', 7), (u'close', 8), (u'use', 9)]\n"
]
}
],
"source": [
"import operator\n",
"print(\"IDs 1 through 10: {}\".format(sorted(texts_dict.token2id.items(), key=operator.itemgetter(1), reverse = False)[:10]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to guess the original work and examine the count difference between our #1 most frequent term and our #10 most frequent term:\n",
"\n",
"print(df.reviewText.str.contains(\"complaint\").value_counts())\n",
"print()\n",
"print(df.reviewText.str.contains(\"lot\").value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have a lot of unique tokens, let's see what happens if we ignore tokens that appear in less than 30 documents or more than 15% documents. Granted this is arbitrary but a quick search shows tons of methods for reducing noise."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dictionary(2464 unique tokens: [u'saver', u'yellow', u'hitch', u'four', u'sleev']...)\n",
"top terms:\n",
"[(u'saver', 0), (u'yellow', 1), (u'hitch', 2), (u'four', 3), (u'sleev', 4), (u'upsid', 5), (u'hate', 6), (u'forget', 7), (u'accur', 8), (u'sorri', 9)]\n"
]
}
],
"source": [
"texts_dict.filter_extremes(no_below=30, no_above=0.15) # inlace filter\n",
"print(texts_dict)\n",
"print(\"top terms:\")\n",
"print(sorted(texts_dict.token2id.items(), key=operator.itemgetter(1), reverse = False)[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We went from **19216** unique tokens to **2462** after filtering. Looking at the top 10 tokens it looks like we got more specific subjects opposed to adjectives.\n",
"\n",
"### Creating bag of words\n",
"Next let's turn `texts_dict` into a bag of words instead. doc2bow converts a `document` (a list of words) into the bag-of-words format (list of `(token_id, token_count)` tuples)."
]
},
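{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `(token_id, token_count)` shape concrete first, here's a toy sketch on a made-up two-document corpus (the ids in the comments are illustrative; actual assignments can differ):\n",
"```python\n",
"toy_dict = corpora.Dictionary([['cabl', 'long', 'cabl'], ['light', 'long']])\n",
"print(toy_dict.token2id)  # e.g. {u'cabl': 0, u'long': 1, u'light': 2}\n",
"print(toy_dict.doc2bow(['cabl', 'cabl', 'light']))  # e.g. [(0, 2), (2, 1)]\n",
"```"
]
},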
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"20473"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corpus = [texts_dict.doc2bow(text) for text in texts]\n",
"len(corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The corpus is 20473 long, the amount of reviews in our dataset and in our dataframe. Let's dump this bag-of-words into a file to avoid parsing the entire text again:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.34 s, sys: 70.2 ms, total: 2.42 s\n",
"Wall time: 2.42 s\n"
]
}
],
"source": [
"%%time \n",
"# Matrix Market format https://radimrehurek.com/gensim/corpora/mmcorpus.html, why exactly? I don't know\n",
"gensim.corpora.MmCorpus.serialize('amzn_auto_review.mm', corpus)"
]
},
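{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a later session we can then reload both saved artifacts without redoing any preprocessing, along these lines:\n",
"```python\n",
"texts_dict = corpora.Dictionary.load('auto_review.dict')\n",
"corpus = gensim.corpora.MmCorpus('amzn_auto_review.mm')  # streamed from disk\n",
"```"
]
},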
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training an LDA model\n",
"As a topic modeling newbie this part is unsatisfying to me. In this unsupervised learning application I can see how a lot of people would arbitrarily set a number of topics, similar to centroids in k-means clustering, and then have a human evaluate if the topics \"make sense\". You can go very deep very quickly by researching this online. For now let's plead ignorance and go through with a simple model FULL of assumptions :)\n",
"\n",
"\n",
"Training an LDA model using our BOW corpus as training data:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![alt text](http://kldavenport.com/wp-content/uploads/2017/03/auto-categories.jpg \"amazon.com\")\n",
"\n",
"The number of topics is arbitrary, I'll use the browse taxonomy visible off https://www.amazon.com/automotive to inform the number we choose: \n",
"1. Performance Parts & Accessories\n",
"2. Replacement Parts\n",
"3. Truck Accessories\n",
"4. Interior Accessories\n",
"5. Exterior Accessories\n",
"6. Tires & Wheels\n",
"7. Car Care\n",
"8. Tools & Equipment\n",
"9. Motorcycle & Powersports Accessories\n",
"10. Car Electronics\n",
"11. Enthusiast Merchandise\n",
"\n",
"I think these categories could be compressed into 5 general topics. We might consider rolling #9 into 4 & 5, and rolling the products in #3 across other accessory categories and so on."
]
},
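{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than trusting that intuition blindly, one sanity check is to compare topic coherence across a few candidate counts. This is only a sketch, assuming your gensim version ships `CoherenceModel` (and it uses fewer passes to keep the comparison cheap):\n",
"```python\n",
"from gensim.models import CoherenceModel\n",
"\n",
"for k in [5, 8, 11]:\n",
"    m = gensim.models.LdaModel(corpus, num_topics=k, id2word=texts_dict, passes=5)\n",
"    cm = CoherenceModel(model=m, texts=texts, dictionary=texts_dict, coherence='c_v')\n",
"    print('{} topics -> c_v coherence {:.3f}'.format(k, cm.get_coherence()))\n",
"```\n",
"Higher coherence loosely tracks human judgments of topic quality, but it's a heuristic, not a substitute for eyeballing the topics."
]
},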
{
"cell_type": "code",
"execution_count": 128,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6min 24s, sys: 2.7 s, total: 6min 27s\n",
"Wall time: 6min 28s\n"
]
}
],
"source": [
"%%time \n",
"lda_model = gensim.models.LdaModel(corpus,alpha='auto', num_topics=5,id2word=texts_dict, passes=20)\n",
"# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = texts_dict, passes=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: Gensim offers a fantastic multicore implementation of LDAModel that reduced my training time by 75%, but it does not have the auto alpha parameter available. Exploring hyperparameter tuning will go beyond the high-level of this post. See here for a great resource: http://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters"
]
},
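{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the multicore call would look roughly like this (a sketch; note the fixed, non-auto alpha):\n",
"```python\n",
"lda_model_mc = gensim.models.LdaMulticore(corpus, num_topics=5, id2word=texts_dict,\n",
"                                          passes=20, workers=3)  # alpha='auto' not supported\n",
"```"
]
},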
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inferring Topics \n",
"Below are the top 5 words associated with 5 random topics. The float next to each word is the weight showing how much the given word influences this specific topic. In this case, we see that for topic `4`, light and battery are the most telling words. We might interpret that topic `4` might be close to Amazon's Tools & Equipment category which has a sub-category titled \"Jump Starters, Battery Chargers & Portable Power\". Similarly we might infer topic `1` refers to Car Care, maybe sub category \"Exterior Care\".\n"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0,\n",
" u'0.031*\"blade\" + 0.025*\"wiper\" + 0.017*\"hose\" + 0.016*\"water\" + 0.012*\"windshield\"'),\n",
" (1,\n",
" u'0.017*\"towel\" + 0.016*\"clean\" + 0.013*\"wash\" + 0.013*\"dri\" + 0.010*\"wax\"'),\n",
" (2,\n",
" u'0.010*\"fit\" + 0.009*\"tire\" + 0.007*\"instal\" + 0.006*\"nice\" + 0.006*\"back\"'),\n",
" (3,\n",
" u'0.013*\"oil\" + 0.011*\"drive\" + 0.011*\"filter\" + 0.009*\"engin\" + 0.008*\"app\"'),\n",
" (4,\n",
" u'0.033*\"light\" + 0.024*\"batteri\" + 0.014*\"power\" + 0.013*\"charg\" + 0.012*\"bulb\"')]"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For `num_topics` number of topics, return `num_words` most significant words\n",
"lda_model.show_topics(num_topics=5,num_words=5)"
]
},
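{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a bit of formal grounding (standard LDA notation, not specific to gensim): with topic-word distributions $\\beta_k$ and per-document topic proportions $\\theta_d$, the assumed generative process for word $n$ of document $d$ is\n",
"\n",
"$$\\theta_d \\sim \\mathrm{Dirichlet}(\\alpha), \\qquad z_{d,n} \\sim \\mathrm{Multinomial}(\\theta_d), \\qquad w_{d,n} \\sim \\mathrm{Multinomial}(\\beta_{z_{d,n}})$$\n",
"\n",
"so each observed word is drawn from the topic assigned to it, and that topic assignment is itself drawn from the document's Dirichlet-governed topic mixture."
]
},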
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that LDA is a probabilistic mixture of mixtures (or admixture) model for grouped data. The observed data (words) within the groups (documents) are the result of probabilistically choosing words from a specific topic (multinomial over the vocabulary), where the topic is itself drawn from a document-specific multinomial that has a global Dirichlet prior. This means that words can belong to various topics in various degrees. For example, the word 'pressure' might refer to a category/topic of automotive wash products and a category of tire products (in the case where we think the topics are about classes of products).\n",
"\n",
"### Querying the LDA Model\n",
"We cannot pass an arbitrary string to our model and evaluate what topics are most associated with it."
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'portabl', u'air', u'compressor']\n"
]
}
],
"source": [
"raw_query = 'portable air compressor'\n",
"\n",
"query_words = raw_query.split()\n",
"query = []\n",
"for word in query_words:\n",
" # ad-hoc reuse steps from above\n",
" q_tokens = tokenizer.tokenize(word.lower())\n",
" q_stopped_tokens = [word for word in q_tokens if not word in merged_stopwords]\n",
" q_stemmed_tokens = [sb_stemmer.stem(word) for word in q_stopped_tokens]\n",
" query.append(q_stemmed_tokens[0]) # different frome above, this is not a lists of lists!\n",
" \n",
"print query"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# translate words in query to ids and frequencies. \n",
"id2word = gensim.corpora.Dictionary()\n",
"_ = id2word.merge_with(texts_dict) # garbage"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(53, 1), (238, 1), (408, 1)]\n"
]
}
],
"source": [
"# translate this document into (word, frequency) pairs\n",
"query = id2word.doc2bow(query)\n",
"print(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we run this constructed query against our trained mode we will get each topic and the likelihood that the `query` relates to that topic. Remember we arbitrarily specified 11 topics when we made the model. When we organize this list to find the most relative topics, we see some intuitive results. We see that our query of 'battery powered inflator' relates most to a topic we thought might align to Amazon's Tools & Equipment category which has a sub-category titled \"Jump Starters, Battery Chargers & Portable Power\"."
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0, 0.017966903726103274),\n",
" (3, 0.027522816624803454),\n",
" (1, 0.029587736744938701),\n",
" (2, 0.049382812891815127),\n",
" (4, 0.87553973001233942)]"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a = list(sorted(lda_model[query], key=lambda x: x[1])) # sort by the second entry in the tuple\n",
"a"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'0.031*\"blade\" + 0.025*\"wiper\" + 0.017*\"hose\" + 0.016*\"water\" + 0.012*\"windshield\" + 0.010*\"mat\" + 0.010*\"instal\" + 0.009*\"rain\" + 0.008*\"fit\" + 0.008*\"tank\"'"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lda_model.print_topic(a[0][0]) #least related"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'0.033*\"light\" + 0.024*\"batteri\" + 0.014*\"power\" + 0.013*\"charg\" + 0.012*\"bulb\" + 0.009*\"led\" + 0.008*\"plug\" + 0.008*\"bright\" + 0.008*\"phone\" + 0.008*\"connect\"'"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lda_model.print_topic(a[-1][0]) #most related"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What can we do with this in production?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could take these inferred topics and analyze the sentiment of their corresponding documents (reviews) to find out what customers are saying (or feeling) about specific products. We can also use an LDA model to extract representative statements or quotes, enabling us to summarize customers’ opinions about products, perhaps even displaying them on the site.We could also use LDA to model groups of customers to topics which are groups of products that frequently occur within some customer's orders over time."
]
}
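,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hypothetical sketch of the first idea (the `dominant_topic` name is made up for illustration), we could tag every review with its highest-probability topic and then run sentiment analysis within each topic bucket:\n",
"```python\n",
"def dominant_topic(bow):\n",
"    # pick the (topic_id, probability) pair with the highest probability\n",
"    return max(lda_model[bow], key=lambda pair: pair[1])[0]\n",
"\n",
"df['dominant_topic'] = [dominant_topic(bow) for bow in corpus]\n",
"print(df['dominant_topic'].value_counts())\n",
"```"
]
}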
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}