{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature engineering for semantic analysis\n",
"Natalie Ho | April 2017\n",
"\n",
"This notebook explores a variety of approaches to common natural language processing (NLP) problems. These techniques will be explained and applied in context of feature engineering a Quora dataset.\n",
"\n",
"I hope that by the end of this notebook, you'll gain familiarity with standard practices as well as recent methods used for NLP tasks. The ever-evolving field has a range of applications from [information retrieval](https://cloud.google.com/natural-language/) to [AI](https://www.ibm.com/developerworks/library/os-ind-watson/), and is well worth a [deeper](https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/index.html) [dive](https://www.ted.com/talks/deb_roy_the_birth_of_a_word).\n",
"\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Introduction](#bullet-1)<br/>\n",
" 1.1 [Data preview & pre-processing](#bullet-2)<br/>\n",
" 1.2 [What is feature engineering?](#bullet-3)<br/>\n",
" 1.3 [What is NLP?](#bullet-4)<br/>\n",
"<br/> \n",
"2. [Syntax](#bullet-5)<br/>\n",
" 2.1 [Basic string cleaning](#bullet-6)<br/>\n",
" 2.2 [Simplify question pairs](#bullet-7)<br/>\n",
" 2.3 [Measuring similarity](#bullet-8)<br/>\n",
"<br/> \n",
"3. [Semantics](#bullet-9)<br/>\n",
" 3.1 [Single word analysis](#bullet-10)<br/>\n",
" 3.2 [Sentence analysis](#bullet-11)<br/>\n",
" 3.3 [Weighted analysis](#bullet-12)<br/>\n",
" 3.4 [Feature creation](#bullet-13)\n",
" \n",
" \n",
"\n",
"## 1.0 Introduction<a class=\"anchor\" id=\"bullet-1\"></a>\n",
"\n",
"Quora is a knowledge sharing platform that functions simply on questions and answers. Their mission, plainly stated: \"We want the Quora answer to be the definitive answer for everybody forever.\" In order to ensure the quality of these answers, Quora must protect the integrity of the questions. They accomplish this by adhering to a principle that each logically distinct question should reside on its own page. Unfortunately, the English language is a fickle thing, and intention can vary significantly with subtle shifts in syntactic structure.\n",
"\n",
"Our goal is to create features for syntactically similar, but semantically distinct pairs of strings. We'll be working with Quora's first [public dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs).\n",
"\n",
"### Load data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('questions.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Data preview & pre-processing<a class=\"anchor\" id=\"bullet-2\"></a>\n",
"The Quora dataset is simple, containing columns for question strings, unique IDs, and a binary variable indicating whether the pair is logically distinct. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(404349, 6)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>qid1</th>\n",
" <th>qid2</th>\n",
" <th>question1</th>\n",
" <th>question2</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>14</td>\n",
" <td>29</td>\n",
" <td>30</td>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?</td>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>15</td>\n",
" <td>31</td>\n",
" <td>32</td>\n",
" <td>What would a Trump presidency mean for current international master’s students on an F1 visa?</td>\n",
" <td>How will a Trump presidency affect the students presently in US or planning to study in US?</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>16</td>\n",
" <td>33</td>\n",
" <td>34</td>\n",
" <td>What does manipulation mean?</td>\n",
" <td>What does manipulation means?</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>17</td>\n",
" <td>35</td>\n",
" <td>36</td>\n",
" <td>Why do girls want to be friends with the guy they reject?</td>\n",
" <td>How do guys feel after rejecting a girl?</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>18</td>\n",
" <td>37</td>\n",
" <td>38</td>\n",
" <td>Why are so many Quora users posting questions that are readily answered on Google?</td>\n",
" <td>Why do people ask Quora questions which can be answered easily by Google?</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id qid1 qid2 \\\n",
"14 14 29 30 \n",
"15 15 31 32 \n",
"16 16 33 34 \n",
"17 17 35 36 \n",
"18 18 37 38 \n",
"\n",
" question1 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada? \n",
"15 What would a Trump presidency mean for current international master’s students on an F1 visa? \n",
"16 What does manipulation mean? \n",
"17 Why do girls want to be friends with the guy they reject? \n",
"18 Why are so many Quora users posting questions that are readily answered on Google? \n",
"\n",
" question2 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan? \n",
"15 How will a Trump presidency affect the students presently in US or planning to study in US? \n",
"16 What does manipulation means? \n",
"17 How do guys feel after rejecting a girl? \n",
"18 Why do people ask Quora questions which can be answered easily by Google? \n",
"\n",
" is_duplicate \n",
"14 0 \n",
"15 1 \n",
"16 1 \n",
"17 0 \n",
"18 1 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import display\n",
"pd.set_option('display.max_colwidth', -1)\n",
"\n",
"# checking for missing values\n",
"df.isnull().any()\n",
"\n",
"# drop rows with missing values\n",
"df=df.dropna()\n",
"\n",
"print df.shape\n",
"display(df[14:19])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 What is feature engineering?<a class=\"anchor\" id=\"bullet-3\"></a>\n",
"\n",
"**Feature engineering** is the practice of generating data attributes that are useful for prediction. Although the task is loosely defined and depends heavily on the domain in question, it is a key process for optimizing model building. The goal is to find information which best describes the target to be predicted. \n",
"\n",
"In our case, the target is logical distinction - will one answer suffice for each pair of questions? This target is described by the binary is_duplicate label in the dataset. We will need to process the Quora data to create features that capture the structure and semantics of each question. This will be accomplished by using natural language processing (NLP) methods on the strings. \n",
"\n",
"### 1.3 What is natural language processing?<a class=\"anchor\" id=\"bullet-4\"></a>\n",
"NLP is the field concerned with computational handling of natural language. Grammar is full of seemingly arbitrary exceptions, vocabulary is constantly transforming, and meaning hinges precariously on culture and context. It is no small feat for a machine to find patterns in this dynamic mess (which is somehow easily grasped by the human brain).\n",
"\n",
"We will start with the simpler task of describing syntax. Skip to [section 3](link) for semantic processing techniques.\n",
"\n",
"## 2.0 Syntax<a class=\"anchor\" id=\"bullet-5\"></a>\n",
"\n",
"A **corpus** is the body of text that we are working with - in this case, the dataset of Quora questions. Our first task is to break the strings down into more manageable units. We will be using the Natural Language Processing Toolkit (NLTK) library to apply the following methods.\n",
"\n",
"\n",
"### 2.1 Basic string cleaning<a class=\"anchor\" id=\"bullet-6\"></a>\n",
"\n",
"**Tokenization**<br/>\n",
"Converting each string into a series of useful units (usually words). We can use NLTK's word_tokenize function to convert a question string into a list of word tokens."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"What can make Physics easy to learn?\n",
"['What', 'can', 'make', 'Physics', 'easy', 'to', 'learn', '?']\n"
]
}
],
"source": [
"import nltk\n",
"from nltk.tokenize import word_tokenize\n",
"\n",
"teststring = df['question1'][12]\n",
"tokens = word_tokenize(df['question1'][12])\n",
"\n",
"print teststring\n",
"print tokens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Stopwords**<br/>\n",
"Common words to the corpus that do not significantly alter meaning. The NLTK library includes a set of English language stopwords (e.g. I, you, this, that), which we'll remove from the list of word tokens."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['What', 'make', 'Physics', 'easy', 'learn']\n"
]
}
],
"source": [
"from nltk.corpus import stopwords\n",
"stop_words = stopwords.words('english')\n",
"stop_words += ['?'] # adding ? character to stop words, since we are working with a corpus of questions\n",
"\n",
"filtered_tokens = [t for t in tokens if not t in stop_words]\n",
"print filtered_tokens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Stemming**<br/>\n",
"Removes prefixes and suffixes to extract the **stem** of a word, which may be derived from a **root**. For example, the word \"destabilized\", has the stem \"destablize\", but the root \"stabil-\". The Porter stemming algorithm is often used in practice to handle this task."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['What', 'make', 'Physic', 'easi', 'learn']\n"
]
}
],
"source": [
"#!pip install stemming\n",
"from stemming.porter2 import stem\n",
"\n",
"stem_tokens = [stem(t) for t in filtered_tokens]\n",
"print stem_tokens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Simplify question pairs <a class=\"anchor\" id=\"bullet-7\"></a>\n",
"We will combine the string cleaning methods into a function, and apply that across both question columns in the dataset. To prepare for basic comparison, the function will also convert the words to lowercase and sort them alphabetically."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question1</th>\n",
" <th>q1_tokens</th>\n",
" <th>question2</th>\n",
" <th>q2_tokens</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>What can make Physics easy to learn?</td>\n",
" <td>easi learn make physic</td>\n",
" <td>How can you make physics easy to learn?</td>\n",
" <td>easi learn make physic</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question1 q1_tokens \\\n",
"12 What can make Physics easy to learn? easi learn make physic \n",
"\n",
" question2 q2_tokens \\\n",
"12 How can you make physics easy to learn? easi learn make physic \n",
"\n",
" is_duplicate \n",
"12 1 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import string\n",
"\n",
"def simplify(s):\n",
" s = str(s).lower().decode('utf-8')\n",
" tokens = word_tokenize(s)\n",
" stop_words = stopwords.words('english')\n",
" stop_words += string.punctuation\n",
" filtered_tokens = [t for t in tokens if not t in stop_words]\n",
" stem_tokens = [stem(t) for t in filtered_tokens]\n",
" sort_tokens = sorted(stem_tokens)\n",
" if sort_tokens is not []:\n",
" tokenstr = \" \".join(sort_tokens)\n",
" else:\n",
" tokenstr = \"\"\n",
" return tokenstr.encode('utf-8')\n",
"\n",
"df['q1_tokens'] = df['question1'].map(simplify)\n",
"df['q2_tokens'] = df['question2'].map(simplify)\n",
"\n",
"simplifydf=df[['question1','q1_tokens','question2','q2_tokens','is_duplicate']]\n",
"display(simplifydf[12:13])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 Measuring similarity<a class=\"anchor\" id=\"bullet-8\"></a>\n",
"\n",
"The simplest way to compare the difference between two strings is by **edit distance**. \n",
"\n",
"**Levenshtein distance**: calculates edit distance by counting the number of operations (add, replace, or delete) that are required to transform one string into another.\n",
"\n",
"**Token sort ratio**: A method from the [FuzzyWuzzy library](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) that uses Levenshtein distance to get the proportion of common tokens between two strings. The score is normalized from 0-100 for easier interpretation.\n",
"\n",
"We'll create our first two features with these methods."
]
},
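{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building those features, here is a minimal sketch of the classic dynamic-programming definition of Levenshtein distance. It is purely illustrative - the python-Levenshtein and FuzzyWuzzy calls in the next cell use optimized implementations of the same idea, and the helper name `levenshtein` below is our own:\n",
"\n",
"```python\n",
"def levenshtein(a, b):\n",
"    # dp[i][j] = number of edits needed to turn a[:i] into b[:j]\n",
"    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]\n",
"    for i in range(len(a) + 1):\n",
"        dp[i][0] = i  # delete every character of a[:i]\n",
"    for j in range(len(b) + 1):\n",
"        dp[0][j] = j  # insert every character of b[:j]\n",
"    for i in range(1, len(a) + 1):\n",
"        for j in range(1, len(b) + 1):\n",
"            cost = 0 if a[i - 1] == b[j - 1] else 1\n",
"            dp[i][j] = min(dp[i - 1][j] + 1,         # delete\n",
"                           dp[i][j - 1] + 1,         # insert\n",
"                           dp[i - 1][j - 1] + cost)  # replace (or match)\n",
"    return dp[len(a)][len(b)]\n",
"\n",
"# should give 8, matching the edit_distance shown for pair 508 below\n",
"print levenshtein('algebra best learn way', '1 algebra fast learn')\n",
"```"
]
},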
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question1</th>\n",
" <th>q1_tokens</th>\n",
" <th>question2</th>\n",
" <th>q2_tokens</th>\n",
" <th>edit_distance</th>\n",
" <th>in_common</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>508</th>\n",
" <td>What is the best way to learn algebra by yourself?</td>\n",
" <td>algebra best learn way</td>\n",
" <td>How do you learn algebra 1 fast?</td>\n",
" <td>1 algebra fast learn</td>\n",
" <td>8</td>\n",
" <td>76</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>509</th>\n",
" <td>How does it feel to retake a class in college?</td>\n",
" <td>class colleg feel retak</td>\n",
" <td>Does retaking subjects in college affect future job prospects?</td>\n",
" <td>affect colleg futur job prospect retak subject</td>\n",
" <td>30</td>\n",
" <td>49</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question1 \\\n",
"508 What is the best way to learn algebra by yourself? \n",
"509 How does it feel to retake a class in college? \n",
"\n",
" q1_tokens \\\n",
"508 algebra best learn way \n",
"509 class colleg feel retak \n",
"\n",
" question2 \\\n",
"508 How do you learn algebra 1 fast? \n",
"509 Does retaking subjects in college affect future job prospects? \n",
"\n",
" q2_tokens edit_distance in_common \\\n",
"508 1 algebra fast learn 8 76 \n",
"509 affect colleg futur job prospect retak subject 30 49 \n",
"\n",
" is_duplicate \n",
"508 1 \n",
"509 0 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#!pip install python-Levenshtein\n",
"from Levenshtein import distance\n",
"#!pip install fuzzywuzzy\n",
"from fuzzywuzzy import fuzz\n",
"\n",
"df['edit_distance'] = df.apply(lambda x: distance(x['q1_tokens'], x['q2_tokens']), axis=1)\n",
"df['in_common'] = df.apply(lambda x: fuzz.token_sort_ratio(x['q1_tokens'], x['q2_tokens']), axis=1)\n",
"\n",
"syntaxdf=df[['question1','q1_tokens','question2','q2_tokens','edit_distance','in_common','is_duplicate']]\n",
"display(syntaxdf[508:510]) # example"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Clearly the edit distance or proportion of common tokens is not sufficient to predict duplicate intention. For example, question pair 508 is duplicate, but has a larger edit distance and smaller proportion of common tokens than pair 509.\n",
"\n",
"Let's try to improve on our features by working with semantic methods.\n",
"\n",
"\n",
"## 3.0 Semantics<a class=\"anchor\" id=\"bullet-9\"></a>\n",
"\n",
"To a machine, words look like characters stored next to one another. Syntax methods allow us to compare words by manipulating them mathematically - counting the number of characters, measuring the amount of work needed to turn one set of characters into another. \n",
"\n",
"Semantic analysis strives to represent how each sequence of characters is related to any other sequence of characters. These relationships can be derived from large bodies of language as a separate machine learning task. A **document** is the group of words in question. In our case, each question from the Quora corpus is one document.\n",
"\n",
"To start, we'll create lists of word tokens (filtered for stopwords, but not stemmed), to support the methods we'll use in this section."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question1</th>\n",
" <th>q1_words</th>\n",
" <th>question2</th>\n",
" <th>q2_words</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>508</th>\n",
" <td>What is the best way to learn algebra by yourself?</td>\n",
" <td>[best, way, learn, algebra]</td>\n",
" <td>How do you learn algebra 1 fast?</td>\n",
" <td>[learn, algebra, 1, fast]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>509</th>\n",
" <td>How does it feel to retake a class in college?</td>\n",
" <td>[feel, retake, class, college]</td>\n",
" <td>Does retaking subjects in college affect future job prospects?</td>\n",
" <td>[retaking, subjects, college, affect, future, job, prospects]</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question1 \\\n",
"508 What is the best way to learn algebra by yourself? \n",
"509 How does it feel to retake a class in college? \n",
"\n",
" q1_words \\\n",
"508 [best, way, learn, algebra] \n",
"509 [feel, retake, class, college] \n",
"\n",
" question2 \\\n",
"508 How do you learn algebra 1 fast? \n",
"509 Does retaking subjects in college affect future job prospects? \n",
"\n",
" q2_words \\\n",
"508 [learn, algebra, 1, fast] \n",
"509 [retaking, subjects, college, affect, future, job, prospects] \n",
"\n",
" is_duplicate \n",
"508 1 \n",
"509 0 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def word_set(s,t,q):\n",
" s = str(s).lower().decode('utf-8')\n",
" t = str(t).lower().decode('utf-8')\n",
" \n",
" s_tokens, t_tokens = word_tokenize(s), word_tokenize(t)\n",
" \n",
" stop_words = stopwords.words('english')\n",
" stop_words += string.punctuation\n",
" \n",
" s_tokens = [x for x in s_tokens if not x in stop_words]\n",
" t_tokens = [x for x in t_tokens if not x in stop_words]\n",
" \n",
" s_temp = set(s_tokens)\n",
" t_temp = set(t_tokens)\n",
" \n",
" s_distinct = [x for x in s_tokens if x not in t_temp]\n",
" t_distinct = [x for x in t_tokens if x not in s_temp]\n",
"\n",
" if q == \"q1_words\":\n",
" return s_tokens\n",
" elif q == \"q2_words\":\n",
" return t_tokens\n",
" elif q == \"q1_distinct\":\n",
" return s_distinct\n",
" elif q == \"q2_distinct\":\n",
" return t_distinct\n",
"\n",
"df['q1_words'] = df.apply(lambda x: word_set(x['question1'], x['question2'],\"q1_words\"), axis=1)\n",
"df['q2_words'] = df.apply(lambda x: word_set(x['question1'], x['question2'],\"q2_words\"), axis=1)\n",
"\n",
"wordsdf=df[['question1','q1_words','question2','q2_words','is_duplicate']]\n",
"display(wordsdf[508:510])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Single word analysis<a class=\"anchor\" id=\"bullet-10\"></a>\n",
"\n",
"**Word embeddings**<br/>\n",
"This method represents individual words as vectors, and semantic relationships as the distance between vectors. The more related words are, the closer they should exist in vector space. Word embeddings come from the field of [distributional semantics](https://en.wikipedia.org/wiki/Distributional_semantics), which suggests that words are semantically related if they are frequently used in similar contexts (i.e. they are often surrounded by the same words).\n",
"\n",
"For example, 'Canada' and 'Toronto' should exist closer together in the vector space than 'Canada' and 'Camara' (which would be closer in edit distance).\n",
"\n",
"**Word2Vec**<br/>\n",
"The mapping of words to vectors is in itself the result of a machine learning algorithm. Developed by Google in 2013, the Word2Vec algorithm is a neural network that takes a large corpus as training data, and produces vector co-ordinates for each word by the word embedding concept. \n",
"\n",
"We will be using an pre-trained model from Google that was created from over 100 billion words from Google News. The model needs to be [downloaded](https://code.google.com/archive/p/word2vec/) and handled using the [gensim library](https://radimrehurek.com/gensim/) for word vectors. The model is a dictionary that contains every word and its corresponding vector representation, which look like 300 dimensional co-ordinates stored in an array.\n",
"\n",
"**Comparing word vectors**<br/>\n",
"To compare word vectors, we can use cosine similarity. As the name suggests, this metric measures similarity by taking the cosine of the angle between vectors. The cosine function scales the similarity between 0 and 1, representing words from least to most semantically related."
]
},
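{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before loading the pre-trained model, here is a minimal sketch of the cosine similarity formula on two toy numpy vectors (the values are made up for illustration; real word2vec vectors are 300-dimensional):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine(u, v):\n",
"    # cos(theta) = (u . v) / (||u|| * ||v||)\n",
"    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
"\n",
"u = np.array([0.2, 0.8, 0.1])  # toy 'word' vectors, not real embeddings\n",
"v = np.array([0.1, 0.9, 0.3])\n",
"print cosine(u, v)\n",
"```"
]
},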
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\IBM_ADMIN\\Anaconda2\\lib\\site-packages\\gensim\\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n",
" warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cosine of angle between Canada, Toronto:\n",
"0.564658820406\n",
"\n",
"Cosine of angle between Canada, Camara:\n",
"0.0305788253005\n"
]
}
],
"source": [
"#!pip install gensim\n",
"import gensim\n",
"\n",
"model = gensim.models.KeyedVectors.load_word2vec_format('googlenews-vectors.bin', binary=True)\n",
"\n",
"# using gensim built-in similarity function for examples\n",
"print \"Cosine of angle between Canada, Toronto:\" + \"\\n\",\n",
"print model.similarity('Canada','Toronto')\n",
"print \"\\n\" + \"Cosine of angle between Canada, Camara:\" + \"\\n\",\n",
"print model.similarity('Canada','Camara')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Sentence analysis<a class=\"anchor\" id=\"bullet-11\"></a>\n",
"\n",
"Since the model works like a dictionary, it can only give us vector representations for single words. There are two ways to get a vector representation of a sentence: \n",
"\n",
"1. Train a model on ordered words (e.g. sentences or phrases). Since word order is included during training, the resulting vectors will preserve the relationships between words. I won't be training a new model in this notebook, as it is computationally heavy, but here are some [resources](https://rare-technologies.com/doc2vec-tutorial) for the curious.\n",
"<br/>\n",
"<br/>\n",
"2. Convert a sentence to a set of words, and get the corresponding set of vectors. Averaging the vector set (summing and dividing by total vector length) will give us a single vector that represents that particular set of words. This method can only give a 'bag of words' representation - i.e. word order is not captured.\n",
"\n",
"*Comment: I think that getting new embeddings specific to a corpus is the best-performing method in practice. For the purpose of illustrating NLP problem-solving, I will do my best with bag-of-words methods.*\n",
"\n",
"The following function implements the second method to get the average embedded vector from a set of words."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from __future__ import division\n",
"\n",
"def vectorize(words):\n",
" V = np.zeros(300)\n",
" \n",
" for w in words:\n",
" try: \n",
" V = np.add(V,model[w]) \n",
" except:\n",
" continue\n",
" else:\n",
" avg_vector = V / np.sqrt((V ** 2).sum())\n",
" return avg_vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how the average vectors compare for question pair 508:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question1</th>\n",
" <th>q1_words</th>\n",
" <th>question2</th>\n",
" <th>q2_words</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>508</th>\n",
" <td>What is the best way to learn algebra by yourself?</td>\n",
" <td>[best, way, learn, algebra]</td>\n",
" <td>How do you learn algebra 1 fast?</td>\n",
" <td>[learn, algebra, 1, fast]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question1 \\\n",
"508 What is the best way to learn algebra by yourself? \n",
"\n",
" q1_words question2 \\\n",
"508 [best, way, learn, algebra] How do you learn algebra 1 fast? \n",
"\n",
" q2_words is_duplicate \n",
"508 [learn, algebra, 1, fast] 1 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Cosine similarity of [best, way, learn, algebra] and [learn, algebra, 1, fast]:\n",
"0.788908032578\n"
]
}
],
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"sent1_q508 = wordsdf['q1_words'][508]\n",
"sent2_q508 = wordsdf['q2_words'][508]\n",
"\n",
"vec1_q508 = vectorize(sent1_q508).reshape(1,-1)\n",
"vec2_q508 = vectorize(sent2_q508).reshape(1,-1)\n",
"\n",
"display(wordsdf[508:509])\n",
"\n",
"print \"\\n\" + \"Cosine similarity of [best, way, learn, algebra] and [learn, algebra, 1, fast]:\" + \"\\n\",\n",
"print cosine_similarity(vec1_q508, vec2_q508)[0][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do the averaged vectors represent the cosine similarities of its components? \n",
"\n",
"Intuitively, if our question pair differs by a closely related word (best vs. ideal) we would get a larger cosine similarity. And if our question pair differs by a very distinct word (algebra vs. juggling), the cosine similarity is smaller."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Distance between [best, way, learn, algebra] and [learn, algebra, best, way]:\n",
"1.0\n",
"\n",
"Distance between [best, way, learn, algebra] and [ideal, way, learn, algebra]:\n",
"0.905135532973\n",
"\n",
"Distance between [best, way, learn, algebra] and [best, way, learn, juggling]:\n",
"0.732704534446\n"
]
}
],
"source": [
"# bag of words, so same set of words in a different order does not matter\n",
"print \"\\n\" + \"Distance between [best, way, learn, algebra] and [learn, algebra, best, way]:\" + \"\\n\",\n",
"print model.n_similarity(['best','way','learn','algebra'],['learn','algebra','best','way'])\n",
"\n",
"# difference is a semantically similar word\n",
"print \"\\n\" + \"Distance between [best, way, learn, algebra] and [ideal, way, learn, algebra]:\" + \"\\n\",\n",
"print model.n_similarity(['best','way','learn','algebra'],['ideal','way','learn','algebra'])\n",
"\n",
"# difference is not semantically similar\n",
"print \"\\n\" + \"Distance between [best, way, learn, algebra] and [best, way, learn, juggling]:\" + \"\\n\",\n",
"print model.n_similarity(['best','way','learn','algebra'],['best','way','learn','juggling'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Word mover's distance** <br/>\n",
"An implementation of Earth mover's distance for natural language processing problems by Kusner et al. <a href=\"#footnote-1\"><sup>[1]</sup></a>\n",
"\n",
"WM distance is an approach that combines the ideas of edit distance with vector representation. It measures the work required to transform one set of vectors into another. Instead of counting edit operations, we use distance between word vectors - how far one vector would have to move to occupy the same spot as the second.\n",
"\n",
"How Word Mover's Distance is calculated:\n",
"</a><br/><img src=\"https://raw.githubusercontent.com/nllho/quora-nlp/master/images/wmd.PNG\" width=\"400\" height=\"400\"/>\n",
"1. All the words in each set are paired off with each other\n",
"2. Calculate the distance between each pair (instead of cosine similarity, Euclidean distance is used here)\n",
"3. Sum the distances between pairs with minimum distances\n",
"\n",
"If the two sets do not have the same number of words, the problem becomes an optimization of another measurement called **flow**.\n",
"</a><br/><img src=\"https://raw.githubusercontent.com/nllho/quora-nlp/master/images/flow.PNG\" width=\"320\" height=\"320\"/>\n",
"\n",
"1. The flow is equal to 1/(number of words in the set), so words from the smaller set have a larger flow<br/>\n",
"(words on the bottom have a flow of 0.33, while words on the top have a flow of 0.25)\n",
"2. Extra flow gets attributed to the next most similar words<br/>\n",
"(see the arrows drawn from the bottom words to more than one word in the top row)\n",
"3. The optimization problem identifies the pairs with minimum distances by solving for minimum flow.\n",
"\n",
"We can use the WM distance method directly from gensim."
]
},
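{
"cell_type": "markdown",
"metadata": {},
"source": [
"For extra intuition, here is a rough sketch of the pairing step for the equal-length case, using scipy's assignment solver on vectors from the `model` loaded earlier. It is illustrative only - gensim's `wmdistance` solves the full flow problem with normalized word weights, so its numbers will differ, and the helper name `wmd_equal_sets` is our own:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.spatial.distance import cdist\n",
"from scipy.optimize import linear_sum_assignment\n",
"\n",
"def wmd_equal_sets(words_a, words_b):\n",
"    # keep only the words that the embedding model knows about\n",
"    vecs_a = np.array([model[w] for w in words_a if w in model.vocab])\n",
"    vecs_b = np.array([model[w] for w in words_b if w in model.vocab])\n",
"    # pairwise Euclidean distances between every word in A and every word in B\n",
"    distances = cdist(vecs_a, vecs_b)\n",
"    # choose the pairing that minimizes the total distance (assignment problem)\n",
"    rows, cols = linear_sum_assignment(distances)\n",
"    return distances[rows, cols].sum()\n",
"\n",
"print wmd_equal_sets(['best', 'way', 'learn', 'algebra'],\n",
"                     ['ideal', 'way', 'learn', 'algebra'])\n",
"```"
]
},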
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"WM distance between [best, way, learn, algebra] and [learn, algebra, 1, fast]:\n",
"1.43031281792\n"
]
}
],
"source": [
"print \"\\n\" + \"WM distance between [best, way, learn, algebra] and [learn, algebra, 1, fast]:\" + \"\\n\",\n",
"print model.wmdistance(sent1_q508, sent2_q508)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3 Weighted analysis<a class=\"anchor\" id=\"bullet-12\"></a>\n",
"\n",
"In the example below, we can see that the words are the same except for the name of the country in question (Canada vs. Japan). However, the country name makes all the semantic difference, which we fail to capture using only cosine similarity or WM distance."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question1</th>\n",
" <th>q1_words</th>\n",
" <th>question2</th>\n",
" <th>q2_words</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?</td>\n",
" <td>[laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada]</td>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?</td>\n",
" <td>[laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan]</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question1 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada? \n",
"\n",
" q1_words \\\n",
"14 [laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada] \n",
"\n",
" question2 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan? \n",
"\n",
" q2_words \\\n",
"14 [laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan] \n",
"\n",
" is_duplicate \n",
"14 0 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Cosine angle:\n",
"0.970663125514\n",
"\n",
"WM distance:\n",
"0.332179243272\n"
]
}
],
"source": [
"display(wordsdf[14:15])\n",
"\n",
"sent1_q14 = wordsdf['q1_words'][14]\n",
"sent2_q14 = wordsdf['q2_words'][14]\n",
"\n",
"print \"\\n\" + \"Cosine angle:\" + \"\\n\",\n",
"print model.n_similarity(sent1_q14, sent2_q14)\n",
"\n",
"print \"\\n\" + \"WM distance:\" + \"\\n\",\n",
"print model.wmdistance(sent1_q14, sent2_q14)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Weighing uncommon words**<br/>\n",
"Let's assume that 'rare' words are more likely to be semantically significant. We can represent this at the word vector level by multiplying those words by a numerical weight. \n",
"\n",
"**Term frequency-inverse document frequency** (tf-idf) is a method that assigns weights to word vectors depending on how common they are to a document. The frequency of a word is measured in two ways:\n",
"\n",
"* How many documents contain the word (N)\n",
"* How many times a word appears in one document (f)\n",
"\n",
"The weight is calculated from the frequency as log(N/f), so the less frequently a word appears in some documents, the higher its weight.\n",
"\n",
"This method can be implemented via sci-kit learn's built in [Tf-idf Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which generates weights given a corpus. To save memory and computing time, I decided to simplify the premise of tf-idf for use on pairs of similar questions.\n",
"\n",
"(1) Assume that distinct words are the most important in telling the difference between question pairs."
]
},
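{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference point, here is a minimal sketch of what scikit-learn's TfidfVectorizer produces on a toy corpus. It is purely illustrative - the simplified weighting scheme described above is what the cells below actually use (`get_feature_names` is the API in the scikit-learn versions of this era):\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"toy_corpus = ['best way to learn algebra',\n",
"              'how to learn algebra fast',\n",
"              'how to learn to juggle']\n",
"\n",
"tfidf = TfidfVectorizer()\n",
"weights = tfidf.fit_transform(toy_corpus)  # sparse matrix: documents x vocabulary\n",
"print tfidf.get_feature_names()            # vocabulary discovered from the corpus\n",
"print weights.toarray().round(2)           # rarer, more distinctive words get higher weights\n",
"```"
]
},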
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question1</th>\n",
" <th>q1_words</th>\n",
" <th>q1_distinct</th>\n",
" <th>question2</th>\n",
" <th>q2_words</th>\n",
" <th>q2_distinct</th>\n",
" <th>is_duplicate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?</td>\n",
" <td>[laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada]</td>\n",
" <td>[canada]</td>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?</td>\n",
" <td>[laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan]</td>\n",
" <td>[japan]</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question1 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada? \n",
"\n",
" q1_words \\\n",
"14 [laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada] \n",
"\n",
" q1_distinct \\\n",
"14 [canada] \n",
"\n",
" question2 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan? \n",
"\n",
" q2_words \\\n",
"14 [laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan] \n",
"\n",
" q2_distinct is_duplicate \n",
"14 [japan] 0 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# get list of distinct words for each question\n",
"df['q1_distinct'] = df.apply(lambda x: word_set(x['question1'], x['question2'],\"q1_distinct\"), axis=1)\n",
"df['q2_distinct'] = df.apply(lambda x: word_set(x['question1'], x['question2'],\"q2_distinct\"), axis=1)\n",
"\n",
"distinctdf=df[['question1','q1_words','q1_distinct','question2','q2_words','q2_distinct','is_duplicate']]\n",
"display(distinctdf[14:15])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It might be useful to get features for the cosine similarity and WM distance for distinct words."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cosine similarity between distinct vectors (canada, japan):\n",
"0.482060432649\n",
"\n",
"WM distance between distinct vectors (canada, japan):\n",
"3.98616600037\n"
]
}
],
"source": [
"distinct1 = distinctdf['q1_distinct'][14]\n",
"distinct2 = distinctdf['q2_distinct'][14]\n",
"\n",
"distinct_vec1 = vectorize(distinct1).reshape(1,-1)\n",
"distinct_vec2 = vectorize(distinct2).reshape(1,-1)\n",
"\n",
"print \"Cosine similarity between distinct vectors ({0}, {1}):\".format(distinct1[0], distinct2[0]) + \"\\n\",\n",
"print cosine_similarity(distinct_vec1, distinct_vec2)[0][0]\n",
"\n",
"print \"\\n\" + \"WM distance between distinct vectors ({0}, {1}):\".format(distinct1[0], distinct2[0]) + \"\\n\",\n",
"print model.wmdistance(distinct1, distinct2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(2) Distinct words only appear in one of the two questions, so we can take N = 1. We assumed that distinct words are important, so we assign the distinct words a small frequency of 1/(number of words in the question) for a larger weight."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# modify vectorize function to add weights\n",
"def get_weight(words):\n",
" n = len(words)\n",
" weight = 1\n",
" \n",
" if n != 0:\n",
" weight = np.log(1/(1/n))\n",
" \n",
" return weight"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(3) Generate an array containing the weights for every question in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# empty arrays\n",
"q1_weights = np.zeros((df.shape[0],300))\n",
"q2_weights = np.zeros((df.shape[0],300))\n",
"\n",
"# fill arrays with weights for each question\n",
"for i, q in enumerate(df.q1_words.values):\n",
" q1_weights[i, :] = get_weight(q)\n",
" \n",
"for i, q in enumerate(df.q2_words.values):\n",
" q2_weights[i, :] = get_weight(q)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(4) Calculate the average weighted vectors. We can see how weighing distinct words translates to reduced cosine similarity."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Cosine similarity between averaged question vectors:\n",
"0.970663125278\n",
"\n",
"Cosine similiarity between weighted question vectors:\n",
"0.752143561408\n"
]
}
],
"source": [
"avg_vec1 = vectorize(sent1_q14).reshape(1,-1)\n",
"avg_vec2 = vectorize(sent2_q14).reshape(1,-1)\n",
"\n",
"print \"\\n\" + \"Cosine similarity between averaged question vectors:\" + \"\\n\",\n",
"print cosine_similarity(avg_vec1, avg_vec2)[0][0]\n",
"\n",
"w_distinct_vec1 = distinct_vec1*q1_weights[14]\n",
"w_distinct_vec2 = distinct_vec2*q2_weights[14]\n",
"\n",
"avg_weight_distinct_vec1 = np.add(avg_vec1, -(distinct_vec1), w_distinct_vec1) \n",
"avg_weight_distinct_vec2 = np.add(avg_vec2, -(distinct_vec2), w_distinct_vec2)\n",
"\n",
"print \"\\n\" + \"Cosine similiarity between weighted question vectors:\" + \"\\n\",\n",
"print cosine_similarity(avg_weight_distinct_vec1, avg_weight_distinct_vec2)[0][0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### 3.4 Feature creation<a class=\"anchor\" id=\"bullet-13\"></a> <a href=\"#footnote-1\"><sup>[2]</sup></a>\n",
"We can apply these methods to our dataset to create the following features:\n",
"\n",
"* Word mover's distance between sentence sets\n",
"* Word mover's distance between distinct word sets\n",
"* Angle between averaged sentence vectors\n",
"* Angle between averaged distinct word vectors\n",
"* Angle between weighted sentence vectors"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\IBM_ADMIN\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:13: RuntimeWarning: invalid value encountered in true_divide\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>qid1</th>\n",
" <th>qid2</th>\n",
" <th>question1</th>\n",
" <th>question2</th>\n",
" <th>is_duplicate</th>\n",
" <th>q1_tokens</th>\n",
" <th>q2_tokens</th>\n",
" <th>edit_distance</th>\n",
" <th>in_common</th>\n",
" <th>q1_words</th>\n",
" <th>q2_words</th>\n",
" <th>q1_distinct</th>\n",
" <th>q2_distinct</th>\n",
" <th>wm_dist_words</th>\n",
" <th>wm_dist_distinct</th>\n",
" <th>cos_angle_words</th>\n",
" <th>cos_angle_distinct</th>\n",
" <th>cos_angle_weighted</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>14</td>\n",
" <td>29</td>\n",
" <td>30</td>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?</td>\n",
" <td>What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?</td>\n",
" <td>0</td>\n",
" <td>canada card chang compar green immigr law law status student us visa</td>\n",
" <td>card chang compar green immigr japan law law status student us visa</td>\n",
" <td>13</td>\n",
" <td>90</td>\n",
" <td>[laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada]</td>\n",
" <td>[laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan]</td>\n",
" <td>[canada]</td>\n",
" <td>[japan]</td>\n",
" <td>0.332179</td>\n",
" <td>3.986166</td>\n",
" <td>0.970663</td>\n",
" <td>0.48206</td>\n",
" <td>0.752144</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id qid1 qid2 \\\n",
"14 14 29 30 \n",
"\n",
" question1 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada? \n",
"\n",
" question2 \\\n",
"14 What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan? \n",
"\n",
" is_duplicate \\\n",
"14 0 \n",
"\n",
" q1_tokens \\\n",
"14 canada card chang compar green immigr law law status student us visa \n",
"\n",
" q2_tokens \\\n",
"14 card chang compar green immigr japan law law status student us visa \n",
"\n",
" edit_distance in_common \\\n",
"14 13 90 \n",
"\n",
" q1_words \\\n",
"14 [laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada] \n",
"\n",
" q2_words \\\n",
"14 [laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan] \n",
"\n",
" q1_distinct q2_distinct wm_dist_words wm_dist_distinct cos_angle_words \\\n",
"14 [canada] [japan] 0.332179 3.986166 0.970663 \n",
"\n",
" cos_angle_distinct cos_angle_weighted \n",
"14 0.48206 0.752144 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# word mover's distance between sentence sets\n",
"df['wm_dist_words'] = df.apply(lambda x: model.wmdistance(x['q1_words'], x['q2_words']), axis=1)\n",
"\n",
"# word mover's distance between distinct sets\n",
"df['wm_dist_distinct'] = df.apply(lambda x: model.wmdistance(x['q1_distinct'], x['q2_distinct']), axis=1)\n",
"\n",
"# angle between averaged sentence vectors\n",
"q1_avg_vectors = np.zeros((df.shape[0], 300))\n",
"q2_avg_vectors = np.zeros((df.shape[0], 300))\n",
"\n",
"for i, q in enumerate(df.q1_words.values):\n",
" q1_avg_vectors[i, :] = vectorize(q)\n",
"\n",
"for i, q in enumerate(df.q2_words.values):\n",
" q2_avg_vectors[i, :] = vectorize(q)\n",
" \n",
" \n",
"df['cos_angle_words'] = [cosine_similarity(x.reshape(1,-1), y.reshape(1,-1))[0][0]\n",
" for (x, y) in zip(np.nan_to_num(q1_avg_vectors),\n",
" np.nan_to_num(q2_avg_vectors))]\n",
"\n",
"# angle between averaged distinct sentence vectors\n",
"q1_dist_vectors = np.zeros((df.shape[0], 300))\n",
"q2_dist_vectors = np.zeros((df.shape[0], 300))\n",
"\n",
"for i, q in enumerate(df.q1_distinct.values):\n",
" q1_dist_vectors[i, :] = vectorize(q)\n",
"\n",
"for i, q in enumerate(df.q2_distinct.values):\n",
" q2_dist_vectors[i, :] = vectorize(q)\n",
" \n",
" \n",
"df['cos_angle_distinct'] = [cosine_similarity(x.reshape(1,-1), y.reshape(1,-1))[0][0]\n",
" for (x, y) in zip(np.nan_to_num(q1_dist_vectors),\n",
" np.nan_to_num(q2_dist_vectors))]\n",
"\n",
"# get array of weighted distinct vectors\n",
"q1_weight_distinct_vec = np.multiply(q1_dist_vectors,q1_weights)\n",
"q2_weight_distinct_vec = np.multiply(q2_dist_vectors,q2_weights)\n",
"\n",
"# get sentence vectors with weights\n",
"q1_avg_weight_vectors = np.add(q1_avg_vectors, -(q1_dist_vectors), + q1_weight_distinct_vec)\n",
"q2_avg_weight_vectors = np.add(q2_avg_vectors, -(q2_dist_vectors), + q2_weight_distinct_vec)\n",
"\n",
"df['cos_angle_weighted'] = [cosine_similarity(x.reshape(1,-1), y.reshape(1,-1))[0][0]\n",
" for (x, y) in zip(np.nan_to_num(q1_avg_weight_vectors),\n",
" np.nan_to_num(q2_avg_weight_vectors))]\n",
"\n",
"df[14:15]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now export the feature engineered dataset for use with your preferred model!"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"featuredf = df.drop(['q1_tokens','q2_tokens','q1_words','q2_words','q1_distinct','q2_distinct'], axis=1)\n",
"featuredf.to_csv('/nlp/quora_features.csv', index=False)"
]
},
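{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal sketch of how the exported features could feed a baseline classifier (assumes scikit-learn >= 0.18 for `model_selection`; the feature column names match those created above, and logistic regression is just a placeholder - see the XGBoost tutorial linked below for a stronger model):\n",
"\n",
"```python\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"feature_cols = ['edit_distance', 'in_common', 'wm_dist_words', 'wm_dist_distinct',\n",
"                'cos_angle_words', 'cos_angle_distinct', 'cos_angle_weighted']\n",
"\n",
"# wm_dist_* can be infinite when a distinct word set is empty, so clean those up first\n",
"X = featuredf[feature_cols].replace([np.inf, -np.inf], np.nan).fillna(0)\n",
"y = featuredf['is_duplicate']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
"\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train, y_train)\n",
"print clf.score(X_test, y_test)  # baseline accuracy on the held-out split\n",
"```"
]
},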
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further reading\n",
"\n",
"* Follow [this tutorial](http://nbviewer.jupyter.org/gist/nllho/4496a06e2bec93f06858851b5d822298) to build an XGBoost classifier, and make predictions using our new features\n",
"* Try [Doc2Vec](https://rare-technologies.com/doc2vec-tutorial) to train a model for sentences or phrases\n",
"* Try [Tf-idf Vectorizer](http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/) to generate specific weights based on word frequency in a corpus\n",
"\n",
"\n",
"## References\n",
"\n",
"<p id=\"footnote-1\"><sup>[1]</sup> Kusner, M. J. and Sun, Y. and Kolkin, N. I. and Weinberger, K. Q. (2015) [From Word Embeddings to Document Distances](http://proceedings.mlr.press/v37/kusnerb15.pdf)\n",
"\n",
"<p id=\"footnote-1\"><sup>[2]</sup> Thakur, A. (April 2017) [Is that a duplicate Quora Question?](https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@spyderman4g63

On step 11 I keep getting: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I noticed that the function in step 10: avg_vector = V / np.sqrt((V ** 2).sum()) is returning:
[ nan nan nan ... nan ] (an array of 300 values, all nan)

Am I doing something wrong?
