thebishorup/Sentiment Classification using SpaCy for IMDB and Amazon Review Dataset.ipynb

## Sentiment Classification using SpaCy for IMDB and Amazon Review Dataset.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Data Cleaning Option\n* Case Normalization\n* Removing stop words\n* Removing punctuation or special characters\n* Lemmatization or stemming\n* Part of speech tagging\n* Entity detection\n* Bag of words\n* TF-IDF"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "import spacy\nfrom spacy import displacy",
      "execution_count": 13,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "nlp = spacy.load('en_core_web_lg')",
      "execution_count": 14,
      "outputs": []
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Tokenization"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "text = 'This is first sentence. This is second sentence. This is third senctence.'\ndoc = nlp(text)",
      "execution_count": 5,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "for token in doc:\n    print(token)",
      "execution_count": 8,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "This\nis\nfirst\nsentence\n.\nThis\nis\nsecond\nsentence\n.\nThis\nis\nthird\nsenctence\n.\n"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Create sentencizer"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "sent = nlp.create_pipe('sentencizer')\nnlp.add_pipe(sent, before = 'parser')",
      "execution_count": 9,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "doc = nlp(text)",
      "execution_count": 10,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "for sent in doc.sents:\n    print(sent)",
      "execution_count": 11,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "This is first sentence.\nThis is second sentence.\nThis is third senctence.\n"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Stop words"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "from spacy.lang.en.stop_words import STOP_WORDS",
      "execution_count": 16,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "stop_words = list(STOP_WORDS)\nprint(stop_words)",
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "stream",
          "text": "['is', 'often', 'ever', 'two', 'has', '‘ll', 'own', \"'s\", 'hundred', 'towards', 'he', 'becoming', 'of', 'another', 'fifty', 'in', 'give', 'beside', 'being', '’s', \"n't\", 'the', 'put', 'therein', 'may', 'done', 'beforehand', 'hers', 'most', 'does', 'since', 'by', 'seem', 'see', '’m', 'neither', 'there', 'forty', 'already', 'all', 'everywhere', 'due', 'thereafter', 'nevertheless', 'others', 'mostly', 'further', 'each', 'nobody', 'eleven', 'myself', 'nothing', 'various', 'make', 'using', 'enough', 'across', 'latter', 'side', 'much', 'we', 'last', 'which', 'even', 'are', 'sometimes', 'anyhow', 'did', 'now', 'perhaps', 'empty', 'become', 'himself', 'serious', 'nine', 'seems', 'herein', 'behind', 'why', 'what', 'bottom', 'show', \"'d\", 'quite', 'fifteen', 'whereupon', 'ca', 'i', 'made', 'n‘t', 'above', 'from', 'seemed', 'meanwhile', 'not', 'until', 'do', 'against', 'wherein', 'but', 'afterwards', 'alone', 'below', 'whoever', 'four', 'upon', 'a', 'something', 'themselves', 'onto', 'thereupon', 'off', 'part', 'at', 'whom', 'regarding', 'will', 'under', 'twenty', 'one', 'used', 'as', 'or', 'whose', \"'m\", 'really', 'call', 'many', 're', 'several', 'than', 'yet', 'our', 'on', 'keep', 'anyway', 'no', 'must', 'three', 'these', 'full', 'when', 'eight', 'any', 'always', 'back', 'whole', 'just', 'again', 'however', 'anyone', 'whereafter', 'him', 'your', 'were', 'elsewhere', 'former', 'too', 'indeed', 'both', '‘d', '‘re', 'say', 'whereby', 'six', 'some', 'my', \"'re\", 'hereupon', 'front', '’ve', 'toward', 'whereas', 'almost', \"'ll\", 'his', 'am', 'few', 'go', 'moreover', 'she', 'third', 'about', 'yourselves', 'noone', 'thus', 'whenever', 'into', 'well', 'seeming', '’ll', 'yourself', 'up', 'us', 'once', 'twelve', 'everyone', 'such', 'none', 'was', 'except', 'without', 'unless', 'latterly', 'least', 'whither', 'n’t', 'hereby', 'out', 'for', 'had', 'then', 'throughout', 'because', 'amongst', 'be', 'within', 'beyond', 'namely', 'would', '‘ve', 'thru', 'whence', 'move', 'somewhere', 'how', 'before', 'those', 'other', 'so', 'someone', 'became', 'over', 'more', 'herself', 'that', 'its', 'can', 'should', 'only', 'her', 'it', 'doing', 'have', \"'ve\", 'same', 'down', 'together', 'ten', 'whether', 'mine', 'nor', 'to', 'still', 'hence', 'either', 'therefore', 'their', 'take', 'never', 'around', 'during', 'cannot', 'sometime', 'you', 'nowhere', 'this', 'along', 'sixty', 'among', 'me', 'they', 'everything', 'hereafter', 'via', 'wherever', 'please', 'top', 'get', 'besides', 'and', 'whatever', '‘m', 'itself', 'five', 'somehow', 'otherwise', 'per', 'thence', 'also', 'amount', 'an', 'after', 'formerly', 'between', 'ourselves', 'less', 'yours', 'anything', 'name', 'very', 'while', 'if', 'with', 'thereby', 'could', 'every', '’re', 'been', 'ours', 'here', 'next', 'who', 'although', 'might', 'though', '‘s', 'where', 'through', 'else', 'rather', 'becomes', 'them', 'first', '’d', 'anywhere']\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "len(stop_words)",
      "execution_count": 14,
      "outputs": [
        {
          "data": {
            "text/plain": "326"
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Remove stop words (has no meaning) from sentence"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "for token in doc:\n    if token.is_stop == False:\n        print(token)",
      "execution_count": 15,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "sentence\n.\nsecond\nsentence\n.\nsenctence\n.\n"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Lemmatization"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "doc = nlp('eat eats eating eater')\nfor token in doc:\n    print(token.text, token.lemma_)",
      "execution_count": 16,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "eat eat\neats eat\neating eat\neater eater\n"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Part of Speech (POS)"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "doc = nlp('A quick jackle in running around the house.')\nfor token in doc:\n    print(token.text, token.pos_)",
      "execution_count": 17,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "A DET\nquick ADJ\njackle NOUN\nin ADP\nrunning VERB\naround ADP\nthe DET\nhouse NOUN\n. PUNCT\n"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "displacy.render(doc, style='dep')",
      "execution_count": 18,
      "outputs": [
        {
          "data": {
            "text/html": "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"eae7ce90451c403a926e3cbf43a64b38-0\" class=\"displacy\" width=\"1450\" height=\"312.0\" direction=\"ltr\" style=\"max-width: none; height: 312.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">A</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">DET</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"225\">quick</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"225\">ADJ</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"400\">jackle</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"400\">NOUN</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"575\">in</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"575\">ADP</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"750\">running</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"750\">VERB</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"925\">around</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"925\">ADP</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">the</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">DET</tspan>\n</text>\n\n<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"222.0\">\n    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1275\">house.</tspan>\n    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1275\">NOUN</tspan>\n</text>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-0\" stroke-width=\"2px\" d=\"M70,177.0 C70,2.0 400.0,2.0 400.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M70,179.0 L62,167.0 78,167.0\" fill=\"currentColor\"/>\n</g>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-1\" stroke-width=\"2px\" d=\"M245,177.0 C245,89.5 395.0,89.5 395.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M245,179.0 L237,167.0 253,167.0\" fill=\"currentColor\"/>\n</g>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-2\" stroke-width=\"2px\" d=\"M420,177.0 C420,89.5 570.0,89.5 570.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M570.0,179.0 L578.0,167.0 562.0,167.0\" fill=\"currentColor\"/>\n</g>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-3\" stroke-width=\"2px\" d=\"M595,177.0 C595,89.5 745.0,89.5 745.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pcomp</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M745.0,179.0 L753.0,167.0 737.0,167.0\" fill=\"currentColor\"/>\n</g>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-4\" stroke-width=\"2px\" d=\"M770,177.0 C770,89.5 920.0,89.5 920.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M920.0,179.0 L928.0,167.0 912.0,167.0\" fill=\"currentColor\"/>\n</g>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-5\" stroke-width=\"2px\" d=\"M1120,177.0 C1120,89.5 1270.0,89.5 1270.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M1120,179.0 L1112,167.0 1128,167.0\" fill=\"currentColor\"/>\n</g>\n\n<g class=\"displacy-arrow\">\n    <path class=\"displacy-arc\" id=\"arrow-eae7ce90451c403a926e3cbf43a64b38-0-6\" stroke-width=\"2px\" d=\"M945,177.0 C945,2.0 1275.0,2.0 1275.0,177.0\" fill=\"none\" stroke=\"currentColor\"/>\n    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n        <textPath xlink:href=\"#arrow-eae7ce90451c403a926e3cbf43a64b38-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n    </text>\n    <path class=\"displacy-arrowhead\" d=\"M1275.0,179.0 L1283.0,167.0 1267.0,167.0\" fill=\"currentColor\"/>\n</g>\n</svg></span>",
            "text/plain": "<IPython.core.display.HTML object>"
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Entity Detection"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "doc = nlp('Meanwhile, as coronavirus numbers skyrocket, some states are holding back on easing restrictions. Texas Gov. Greg Abbott paused any further phases to reopen the state on Thursday and issued an order to ensure hospital beds be available for Covid-19 patients.')",
      "execution_count": 19,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "doc",
      "execution_count": 20,
      "outputs": [
        {
          "data": {
            "text/plain": "Meanwhile, as coronavirus numbers skyrocket, some states are holding back on easing restrictions. Texas Gov. Greg Abbott paused any further phases to reopen the state on Thursday and issued an order to ensure hospital beds be available for Covid-19 patients."
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "displacy.render(doc, style='ent')",
      "execution_count": 21,
      "outputs": [
        {
          "data": {
            "text/html": "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">Meanwhile, as coronavirus numbers skyrocket, some states are holding back on easing restrictions. \n<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n    Texas\n    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n</mark>\n Gov. \n<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n    Greg Abbott\n    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n</mark>\n paused any further phases to reopen the state on \n<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n    Thursday\n    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n</mark>\n and issued an order to ensure hospital beds be available for Covid-19 patients.</div></span>",
            "text/plain": "<IPython.core.display.HTML object>"
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Text Classification"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Import necessary libraries"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "import pandas as pd\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import accuracy_score, classification_report, confusion_matrix",
      "execution_count": 1,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "data_imdb = pd.read_csv('imdb.csv')",
      "execution_count": 2,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "data_imdb.head()",
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 3,
          "data": {
            "text/plain": "                                              review sentiment\n0  One of the other reviewers has mentioned that ...  positive\n1  A wonderful little production. <br /><br />The...  positive\n2  I thought this was a wonderful way to spend ti...  positive\n3  Basically there's a family where a little boy ...  negative\n4  Petter Mattei's \"Love in the Time of Money\" is...  positive",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>review</th>\n      <th>sentiment</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>One of the other reviewers has mentioned that ...</td>\n      <td>positive</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>\n      <td>positive</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>I thought this was a wonderful way to spend ti...</td>\n      <td>positive</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Basically there's a family where a little boy ...</td>\n      <td>negative</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n      <td>positive</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "data_imdb.shape",
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 4,
          "data": {
            "text/plain": "(50000, 2)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "### Distribution of sentiment\ndata_imdb['sentiment'].value_counts()",
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 5,
          "data": {
            "text/plain": "positive    25000\nnegative    25000\nName: sentiment, dtype: int64"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "### Checking the null value\ndata_imdb.isnull().sum()",
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 6,
          "data": {
            "text/plain": "review       0\nsentiment    0\ndtype: int64"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Tokenization"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "import string",
      "execution_count": 7,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "punct =  string.punctuation",
      "execution_count": 8,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "punct",
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 9,
          "data": {
            "text/plain": "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Data cleaning function"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "def text_data_cleaning(sentence):\n    doc = nlp(sentence)\n    \n    # 1. Case Normalization\n    tokens = []\n    for token in doc:\n        # if token is not pronoun -- lower case\n        if token.lemma_ != '-PRON-':\n            temp = token.lemma_.lower().strip()\n        else:\n            temp = token.lower_\n        \n        tokens.append(temp)\n        \n    # 2. Clearning\n    cleaned_tokens = []\n    \n    for token in tokens:\n        if token not in stop_words and token not in punct:\n            cleaned_tokens.append(token)\n    \n    return cleaned_tokens",
      "execution_count": 10,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "text_data_cleaning('Hello there, I am enjoying the video lesson.')",
      "execution_count": 18,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 18,
          "data": {
            "text/plain": "['hello', 'enjoy', 'video', 'lesson']"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "markdown",
      "source": "#### Vectorization Feature Engineering (TF-IDF)"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "from sklearn.svm import LinearSVC",
      "execution_count": 19,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "tfidf = TfidfVectorizer(tokenizer=text_data_cleaning)\nclassifier = LinearSVC()",
      "execution_count": 20,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "X = data_imdb['review']\ny = data_imdb['sentiment']",
      "execution_count": 21,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)",
      "execution_count": 22,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "X_train.shape, y_train.shape",
      "execution_count": 23,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 23,
          "data": {
            "text/plain": "((40000,), (40000,))"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "X_test.shape, y_test.shape",
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 24,
          "data": {
            "text/plain": "((10000,), (10000,))"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])\nclf.fit(X_train, y_train)",
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 25,
          "data": {
            "text/plain": "Pipeline(memory=None,\n         steps=[('tfidf',\n                 TfidfVectorizer(analyzer='word', binary=False,\n                                 decode_error='strict',\n                                 dtype=<class 'numpy.float64'>,\n                                 encoding='utf-8', input='content',\n                                 lowercase=True, max_df=1.0, max_features=None,\n                                 min_df=1, ngram_range=(1, 1), norm='l2',\n                                 preprocessor=None, smooth_idf=True,\n                                 stop_words=None, strip_accents=None,\n                                 sublinear_tf=False,\n                                 token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n                                 tokenizer=<function text_data_cleaning at 0x0000015888517948>,\n                                 use_idf=True, vocabulary=None)),\n                ('clf',\n                 LinearSVC(C=1.0, class_weight=None, dual=True,\n                           fit_intercept=True, intercept_scaling=1,\n                           loss='squared_hinge', max_iter=1000,\n                           multi_class='ovr', penalty='l2', random_state=None,\n                           tol=0.0001, verbose=0))],\n         verbose=False)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "clf",
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 26,
          "data": {
            "text/plain": "Pipeline(memory=None,\n         steps=[('tfidf',\n                 TfidfVectorizer(analyzer='word', binary=False,\n                                 decode_error='strict',\n                                 dtype=<class 'numpy.float64'>,\n                                 encoding='utf-8', input='content',\n                                 lowercase=True, max_df=1.0, max_features=None,\n                                 min_df=1, ngram_range=(1, 1), norm='l2',\n                                 preprocessor=None, smooth_idf=True,\n                                 stop_words=None, strip_accents=None,\n                                 sublinear_tf=False,\n                                 token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n                                 tokenizer=<function text_data_cleaning at 0x0000015888517948>,\n                                 use_idf=True, vocabulary=None)),\n                ('clf',\n                 LinearSVC(C=1.0, class_weight=None, dual=True,\n                           fit_intercept=True, intercept_scaling=1,\n                           loss='squared_hinge', max_iter=1000,\n                           multi_class='ovr', penalty='l2', random_state=None,\n                           tol=0.0001, verbose=0))],\n         verbose=False)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "y_pred = clf.predict(X_test)",
      "execution_count": 27,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "print(classification_report(y_test, y_pred))",
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "text": "              precision    recall  f1-score   support\n\n    negative       0.90      0.88      0.89      4961\n    positive       0.88      0.91      0.89      5039\n\n    accuracy                           0.89     10000\n   macro avg       0.89      0.89      0.89     10000\nweighted avg       0.89      0.89      0.89     10000\n\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "confusion_matrix(y_test, y_pred)",
      "execution_count": 30,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 30,
          "data": {
            "text/plain": "array([[4357,  604],\n       [ 471, 4568]], dtype=int64)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Precision = TP / (TP + FP) = 4357 / (4357 + 471)\n# Recall = TP / (TP + FN) = 4357 / (4357 + 604)",
      "execution_count": 31,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "clf.predict([\"Wow, this is a stupid lession!\"])",
      "execution_count": 35,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 35,
          "data": {
            "text/plain": "array(['negative'], dtype=object)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "clf.predict(['Waste of time.'])",
      "execution_count": 36,
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 36,
          "data": {
            "text/plain": "array(['negative'], dtype=object)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "toc": {
      "nav_menu": {},
      "number_sections": true,
      "sideBar": true,
      "skip_h1_title": false,
      "base_numbering": 1,
      "title_cell": "Table of Contents",
      "title_sidebar": "Contents",
      "toc_cell": false,
      "toc_position": {},
      "toc_section_display": true,
      "toc_window_display": false
    },
    "gist": {
      "id": "",
      "data": {
        "description": "Sentiment Classification using SpaCy for IMDB and Amazon Review Dataset.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}