CateGitau/NLP.ipynb

## NLP.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Natural Language Processing (NLP)\n",
    "\n",
    "## Text Classification for Sentiment Analysis using a Naive Bayes Classifier\n",
    "\n",
    "\n",
    "<img src=\"Desktop/NLP&Sentiment Analysis/Capture5.jpg\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "In Natural Language Processing, *sentiment analysis* is the task of automatically classifying a piece of text by the opinion it expresses.\n",
    "\n",
    "Given a movie review or a tweet, a classifier assigns it to one of several categories.\n",
    "These categories are user-defined: positive and negative are the usual choice, but you can use whichever classes you want.\n",
    "\n",
    "Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it’s often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible), I’ll focus on two possible sentiment classifications: positive and negative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Prerequisites\n",
    "Basic knowledge of Python is assumed.\n",
    "\n",
    "\n",
    "\n",
    "# URL : goo.gl/AwKCn4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "This codelab uses the Python programming language together with an open source library called the Natural Language Toolkit (NLTK).\n",
    "\n",
    "NLTK includes extensive software, data, and documentation, with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, all free at http://www.nltk.org/\n",
    "\n",
    "It also ships free texts for analysis, including movie reviews, Twitter data, and works of Shakespeare.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "### Steps to install NLTK and its data:\n",
    "Install pip (run in a terminal):\n",
    "\n",
    "sudo easy_install pip\n",
    "\n",
    "Install NLTK (run in a terminal):\n",
    "\n",
    "sudo pip install -U nltk\n",
    "\n",
    "Download the NLTK data: start a Python shell (in a terminal) and run the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "# Uncomment the next line on first use to open the NLTK downloader and fetch the data\n",
    "#nltk.download()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Let's start with the basics:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<img src=\"Desktop/NLP&Sentiment Analysis/Capture1.PNG\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"Desktop/NLP&Sentiment Analysis/Capture2.PNG\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**AFINN-111** is a list of English words, each rated for valence with an integer between -5 (most negative) and +5 (most positive).\n",
    "\n",
    "<img src=\"Desktop/NLP&Sentiment Analysis/AFINN.PNG\">\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Build the sentiment dictionary: each line of the file holds a word, a tab, and its score\n",
    "sentiment_dictionary = {}\n",
    "\n",
    "with open('Desktop/NLP&Sentiment Analysis/AFINN-111.txt') as afinn_file:\n",
    "    for line in afinn_file:\n",
    "        word, score = line.split('\\t')\n",
    "        sentiment_dictionary[word] = int(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### **Tokenization**\n",
    "Tokenization is the process of breaking a stream of text into words, phrases, or symbols known as *tokens*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['i', 'hate', 'this', 'novel', '!']\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "# Lowercase first so the tokens match lowercase dictionary entries\n",
    "words = word_tokenize('I hate this novel!'.lower())\n",
    "print(words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'i'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Add up the score of every token; words missing from the dictionary count as 0\n",
    "sum(sentiment_dictionary.get(word, 0) for word in words)"
   ]
  },
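  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The lookup-and-sum step above can be wrapped in a small helper. This is only a minimal sketch: `score_text` and `toy_lexicon` are hypothetical names, the four entries are an illustrative stand-in for the full AFINN-111 file, and a plain regular expression replaces `word_tokenize` so the cell runs without any NLTK data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Tiny stand-in lexicon (assumption: real scores would be loaded from AFINN-111.txt)\n",
    "toy_lexicon = {'love': 3, 'hate': -3, 'great': 3, 'novel': 2}\n",
    "\n",
    "def score_text(text, lexicon):\n",
    "    # Lowercase the text and keep runs of letters as tokens\n",
    "    tokens = re.findall(r'[a-z]+', text.lower())\n",
    "    # Unknown words contribute 0 to the total\n",
    "    return sum(lexicon.get(token, 0) for token in tokens)\n",
    "\n",
    "print(score_text('I hate this novel!', toy_lexicon))"
   ]
  },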
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### What if the text is really long?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I love this book!\n",
      "Though I hate the beginning.\n",
      "It would be great for you.\n"
     ]
    }
   ],
   "source": [
    "# Split the text into sentences\n",
    "from nltk.tokenize import sent_tokenize\n",
    "sentences = sent_tokenize('''I love this book! Though I hate the beginning. It would be great for you.''')\n",
    "for s in sentences:\n",
    "    print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n",
      "-3\n",
      "3\n"
     ]
    }
   ],
   "source": [
    "# Compute the score for each sentence\n",
    "for sentence in sentences:\n",
    "    words = word_tokenize(sentence.lower())\n",
    "    print(sum(sentiment_dictionary.get(word, 0) for word in words))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## What about new words or domain-specific terms?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "\n",
    "<img src=\"Desktop/NLP&Sentiment Analysis/Machine-Learning.PNG\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"Desktop/NLP&Sentiment Analysis/Capture3.PNG\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"Desktop/NLP&Sentiment Analysis/Capture6.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"Desktop/NLP&Sentiment Analysis/Capture4.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Naive Bayes Algorithm\n",
    "Naive Bayes is a classification algorithm that applies Bayes' theorem to predict the most probable class of an unseen example. It is called \"naive\" because it assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.\n",
    "\n",
    "\n",
    "<img src=\"Desktop/NLP&Sentiment Analysis/bayes.jpg\">"
   ]
  },
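  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "To make the independence assumption concrete, here is a minimal from-scratch sketch on made-up data. The training sentences, the names `log_posterior` and `classify`, and the use of add-one (Laplace) smoothing are all assumptions chosen for illustration; NLTK's `NaiveBayesClassifier` works on feature dictionaries and differs in its details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "import math\n",
    "\n",
    "# Made-up training data (assumption, for illustration only)\n",
    "train = [\n",
    "    ('i love this great movie', 'positive'),\n",
    "    ('a great and fun film', 'positive'),\n",
    "    ('i hate this boring movie', 'negative'),\n",
    "    ('a boring and awful film', 'negative'),\n",
    "]\n",
    "\n",
    "# Count how often each label and each (label, word) pair occurs\n",
    "word_counts = {'positive': Counter(), 'negative': Counter()}\n",
    "label_counts = Counter()\n",
    "for text, label in train:\n",
    "    label_counts[label] += 1\n",
    "    word_counts[label].update(text.split())\n",
    "\n",
    "vocabulary = set(word for counts in word_counts.values() for word in counts)\n",
    "\n",
    "def log_posterior(text, label):\n",
    "    # log P(label) + sum of log P(word | label), treating the words as independent\n",
    "    total = sum(word_counts[label].values())\n",
    "    score = math.log(label_counts[label] / float(sum(label_counts.values())))\n",
    "    for word in text.split():\n",
    "        # Add-one smoothing keeps an unseen word from zeroing the whole product\n",
    "        score += math.log((word_counts[label][word] + 1.0) / (total + len(vocabulary)))\n",
    "    return score\n",
    "\n",
    "def classify(text):\n",
    "    # Pick the label with the highest (unnormalized) log posterior\n",
    "    return max(label_counts, key=lambda label: log_posterior(text, label))\n",
    "\n",
    "print(classify('a great movie'))"
   ]
  },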
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import nltk.classify.util  # provides the accuracy() helper\n",
    "from nltk.classify import NaiveBayesClassifier  # the Naive Bayes classifier\n",
    "from nltk.corpus import movie_reviews  # labeled movie review corpus\n",
    "from nltk.corpus import stopwords  # stopword lists\n",
    "from nltk.corpus import wordnet  # WordNet, a lexical database for the English language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'plot', u':', u'two', u'teen', u'couples', u'go', ...]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#import movie_reviews\n",
    "from nltk.corpus import movie_reviews\n",
    "\n",
    "#see words in the review\n",
    "movie_reviews.words()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'neg', u'pos']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_reviews.categories()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u',', 77717),\n",
       " (u'the', 76529),\n",
       " (u'.', 65876),\n",
       " (u'a', 38106),\n",
       " (u'and', 35576),\n",
       " (u'of', 34123),\n",
       " (u'to', 31937),\n",
       " (u\"'\", 30585),\n",
       " (u'is', 25195),\n",
       " (u'in', 21822)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#frequency distribution of words in movie review\n",
    "all_words = movie_reviews.words()\n",
    "freq_dist = nltk.FreqDist(all_words)\n",
    "freq_dist.most_common(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Stopwords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "\n",
    "Stopwords are high-frequency words that carry little or no meaning on their own, e.g. a, I, is, the.\n",
    "\n",
    "When processing language, we usually remove these words because they take up a large part of any sentence without adding much context or information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'i',\n",
       " u'me',\n",
       " u'my',\n",
       " u'myself',\n",
       " u'we',\n",
       " u'our',\n",
       " u'ours',\n",
       " u'ourselves',\n",
       " u'you',\n",
       " u'your',\n",
       " u'yours',\n",
       " u'yourself',\n",
       " u'yourselves',\n",
       " u'he',\n",
       " u'him',\n",
       " u'his']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#inbuilt list of stopwords in nltk\n",
    "stopwords.words('english')[:16]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## How do we remove stopwords?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "sent = \"the program was open to all women between the ages of 17 and 35, in good health, who had graduated from an accredited high school\"\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['program', 'open', 'women', 'ages', '17', '35', ',', 'good', 'health', ',', 'graduated', 'accredited', 'high', 'school']\n"
     ]
    }
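   ],
   "source": [
    "# A token is a word or other entity in a text\n",
    "words = word_tokenize(sent)\n",
    "# Build the stopword set once; set membership tests are fast\n",
    "stop_words = set(stopwords.words('english'))\n",
    "useful_words = [word for word in words if word not in stop_words]\n",
    "print(useful_words)"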
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Build the input the Naive Bayes classifier expects: a {word: True} dictionary\n",
    "stop_words = set(stopwords.words(\"english\"))  # built once instead of once per word\n",
    "\n",
    "def create_word_features(words):\n",
    "    useful_words = [word for word in words if word not in stop_words]\n",
    "    my_dict = dict([(word, True) for word in useful_words])\n",
    "    return my_dict\n",
    "\n",
    "# Why a dictionary? Each key is stored only once, so a word that appears\n",
    "# several times in a review still yields a single feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'brown': True, 'fox': True, 'quick': True}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "create_word_features([\"the\", \"quick\", \"brown\", \"quick\", \"a\", \"fox\"])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1000\n"
     ]
    }
   ],
   "source": [
    "neg_reviews = []  # start with an empty list\n",
    "\n",
    "# Loop over every file in the neg category and apply create_word_features\n",
    "for fileid in movie_reviews.fileids('neg'):\n",
    "    words = movie_reviews.words(fileid)\n",
    "    neg_reviews.append((create_word_features(words), \"negative\"))\n",
    "\n",
    "print(len(neg_reviews))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1000\n"
     ]
    }
   ],
   "source": [
    "# Do the same for every file in the pos category\n",
    "pos_reviews = []\n",
    "for fileid in movie_reviews.fileids('pos'):\n",
    "    words = movie_reviews.words(fileid)\n",
    "    pos_reviews.append((create_word_features(words), \"positive\"))\n",
    "\n",
    "#print(pos_reviews[0])\n",
    "print(len(pos_reviews))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Training and Testing the Naive Bayes Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "The movie reviews corpus has 1000 positive files and 1000 negative files. We’ll use 3/4 of them as the training set and the rest as the test set, which gives us 1500 training instances and 500 test instances. The classifier’s training method expects a list of labeled feature sets of the form [(feats, label)], where feats is a feature dictionary and label is the classification label. In our case, feats is of the form {word: True} and label is either ‘positive’ or ‘negative’. To evaluate accuracy, we can use nltk.classify.util.accuracy with the test set as the gold standard."
   ]
  },
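  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "The `[(feats, label)]` format described above can be tried on a handful of hand-written examples before running it on the full corpus. The four toy reviews below are made up for illustration and `toy_train` and `toy_classifier` are hypothetical names; `NaiveBayesClassifier.train`, `classify`, and `prob_classify` are the NLTK calls, and this cell should not need any corpus download."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "outputs": [],
   "source": [
    "from nltk.classify import NaiveBayesClassifier\n",
    "\n",
    "# Made-up labeled feature sets (assumption, for illustration only)\n",
    "toy_train = [\n",
    "    ({'great': True, 'fun': True}, 'positive'),\n",
    "    ({'loved': True, 'great': True}, 'positive'),\n",
    "    ({'boring': True, 'awful': True}, 'negative'),\n",
    "    ({'hated': True, 'boring': True}, 'negative'),\n",
    "]\n",
    "\n",
    "toy_classifier = NaiveBayesClassifier.train(toy_train)\n",
    "\n",
    "# Classify a new feature dictionary built the same way as the training ones\n",
    "features = {'great': True, 'story': True}\n",
    "print(toy_classifier.classify(features))\n",
    "\n",
    "# prob_classify returns a probability distribution over the labels\n",
    "dist = toy_classifier.prob_classify(features)\n",
    "for label in dist.samples():\n",
    "    print(label, round(dist.prob(label), 3))"
   ]
  },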
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1500, 500)\n"
     ]
    }
   ],
   "source": [
    "# The first 750 reviews of each class train the classifier; the remaining 250 test it\n",
    "train_set = neg_reviews[:750] + pos_reviews[:750]\n",
    "test_set = neg_reviews[750:] + pos_reviews[750:]\n",
    "print(len(train_set), len(test_set))"
   ]
  },
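  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Slicing the lists as above always puts the same files in the test set. One common variation, sketched here as an assumption rather than as part of the original codelab, is to shuffle each class with a fixed seed before splitting. `split_reviews` is a hypothetical helper, and the stand-in lists replace `neg_reviews` and `pos_reviews` so the cell runs on its own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def split_reviews(reviews, train_fraction=0.75, seed=42):\n",
    "    # Shuffle a copy so the caller's list keeps its original order\n",
    "    shuffled = list(reviews)\n",
    "    random.Random(seed).shuffle(shuffled)\n",
    "    cut = int(len(shuffled) * train_fraction)\n",
    "    return shuffled[:cut], shuffled[cut:]\n",
    "\n",
    "# Stand-in data; in the notebook these would be neg_reviews and pos_reviews\n",
    "toy_neg = [({'word%d' % i: True}, 'negative') for i in range(1000)]\n",
    "toy_pos = [({'word%d' % i: True}, 'positive') for i in range(1000)]\n",
    "\n",
    "neg_train, neg_test = split_reviews(toy_neg)\n",
    "pos_train, pos_test = split_reviews(toy_pos)\n",
    "print(len(neg_train + pos_train))\n",
    "print(len(neg_test + pos_test))"
   ]
  },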
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "#create the NaiveBayesClassifier\n",
    "classifier = NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "72.4\n"
     ]
    }
   ],
   "source": [
    "accuracy = nltk.classify.util.accuracy(classifier, test_set)\n",
    "print(accuracy * 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "review_emoji_movie = '''\n",
    "This engaging adventure triumphs because of its empowering storyline, which pays tribute to Polynesian culture, and because of its feel-good music, courtesy of Hamilton creator Lin-Manuel Miranda.\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'positive'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Tokenize the review, build its feature dictionary, and classify it\n",
    "words = word_tokenize(review_emoji_movie)\n",
    "features = create_word_features(words)\n",
    "classifier.classify(features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
	]
	},
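	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"As a minimal sketch of the `[(feats, label)]` format described above, the snippet below trains a classifier on a tiny hand-made data set. The toy reviews and every `toy_` name are made up for illustration and are not part of the movie reviews corpus; with this little data the predictions themselves mean very little.\n",
	"\n",
	"```python\n",
	"from nltk.classify import NaiveBayesClassifier\n",
	"from nltk.classify.util import accuracy\n",
	"\n",
	"# Hypothetical toy data: each item is a (feature dictionary, label) pair\n",
	"toy_train = [\n",
	"    ({'great': True, 'fun': True}, 'positive'),\n",
	"    ({'wonderful': True, 'acting': True}, 'positive'),\n",
	"    ({'boring': True, 'plot': True}, 'negative'),\n",
	"    ({'awful': True, 'acting': True}, 'negative'),\n",
	"]\n",
	"toy_test = [\n",
	"    ({'great': True, 'plot': True}, 'positive'),\n",
	"    ({'awful': True, 'boring': True}, 'negative'),\n",
	"]\n",
	"\n",
	"toy_classifier = NaiveBayesClassifier.train(toy_train)\n",
	"\n",
	"# classify() takes one feature dictionary and returns a label\n",
	"print(toy_classifier.classify({'great': True}))\n",
	"\n",
	"# accuracy() returns the fraction of test items labeled correctly\n",
	"print(accuracy(toy_classifier, toy_test))\n",
	"```"
	]
	},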
	{
	"cell_type": "code",
	"execution_count": 20,
	"metadata": {
	"collapsed": false,
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"(1500, 500)\n"
	]
	}
	],
	"source": [
	"train_set = neg_reviews[:750] + pos_reviews[:750]\n",
	"test_set = neg_reviews[750:] + pos_reviews[750:]\n",
	"print(len(train_set), len(test_set))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"metadata": {
	"collapsed": false,
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [],
	"source": [
	"#create the NaiveBayesClassifier\n",
	"classifier = NaiveBayesClassifier.train(train_set)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 22,
	"metadata": {
	"collapsed": false,
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"72.4\n"
	]
	}
	],
	"source": [
	"accuracy = nltk.classify.util.accuracy(classifier, test_set)\n",
	"print(accuracy * 100)"
	]
	},
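	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"The accuracy reported above is the share of test items whose predicted label matches the known label. A rough sketch of that computation follows; `toy_classify` is a hypothetical keyword rule standing in for the trained classifier, and the four test items are invented for illustration.\n",
	"\n",
	"```python\n",
	"def toy_classify(feats):\n",
	"    # Hypothetical stand-in for classifier.classify(): a single keyword rule\n",
	"    return 'positive' if 'good' in feats else 'negative'\n",
	"\n",
	"# Invented (feature dictionary, known label) pairs\n",
	"toy_test_set = [\n",
	"    ({'good': True, 'story': True}, 'positive'),\n",
	"    ({'bad': True, 'story': True}, 'negative'),\n",
	"    ({'slow': True, 'ending': True}, 'positive'),\n",
	"    ({'good': True, 'cast': True}, 'negative'),\n",
	"]\n",
	"\n",
	"# Count the items where the prediction agrees with the known label\n",
	"correct = sum(1 for feats, label in toy_test_set if toy_classify(feats) == label)\n",
	"toy_accuracy = float(correct) / len(toy_test_set)\n",
	"print(toy_accuracy * 100)  # 2 of the 4 items are labeled correctly\n",
	"```"
	]
	},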
	{
	"cell_type": "code",
	"execution_count": 28,
	"metadata": {
	"collapsed": false,
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [],
	"source": [
	"review_emoji_movie = '''\n",
	"This engaging adventure triumphs because of its empowering storyline, which pays tribute to Polynesian culture, and because of its feel-good music, courtesy of Hamilton creator Lin-Manuel Miranda.\n",
	"'''"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 29,
	"metadata": {
	"collapsed": false,
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"'positive'"
	]
	},
	"execution_count": 29,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"words = word_tokenize(review_emoji_movie)\n",
	"words = create_word_features(words)\n",
	"classifier.classify(words)\n",
	" "
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "-"
	}
	},
	"source": []
	}
	],
	"metadata": {
	"anaconda-cloud": {},
	"celltoolbar": "Slideshow",
	"kernelspec": {
	"display_name": "Python [Root]",
	"language": "python",
	"name": "Python [Root]"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.12"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}