{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural Language Processing and Sentiment Analysis with Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prerequisites\n",
"Basic knowledge of Python is assumed "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"This codelab is based on the Python programming language together with an open source library called (Natural Language toolkit)NLTK.\n",
"\n",
"NLTK Includes extensive software, data and documentation all free on http://www.nltk.org/ \n",
"\n",
"eg. It contains free texts for analysis from Movie reviews, Twitter data, work from shakespeare etc\n",
"\n",
"### steps to install NLTK and its data:\n",
"Install Pip: run in terminal:\n",
"\n",
"sudo easy_install pip\n",
"\n",
"Install NLTK: run in terminal :\n",
"\n",
"sudo pip install -U nltk\n",
"\n",
"Download NLTK data: run python shell (in terminal) and write the following code:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import nltk\n",
"#nltk.download()"
]
},
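{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather not use the interactive downloader, `nltk.download()` also accepts a package name. A minimal sketch that fetches just the resources used in this codelab (these are the standard NLTK package ids):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import nltk\n",
"\n",
"#download only the pieces this codelab uses, instead of the full collection\n",
"nltk.download('punkt') #tokenizer models used by word_tokenize and sent_tokenize\n",
"nltk.download('stopwords') #stopword lists\n",
"nltk.download('movie_reviews') #the labelled movie review corpus\n",
"nltk.download('wordnet') #lexical database for the English language"
]
},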
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's start with the basics:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"Desktop/NLP&Sentiment Analysis/Capture1.PNG\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"Desktop/NLP&Sentiment Analysis/Capture2.PNG\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**AFINN-111** - A list of english words rated for valence with an integer between -5 and + 5.\n",
"\n",
"<img src=\"Desktop/NLP&Sentiment Analysis/AFINN.PNG\">\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sentiment_dictionary ={}\n",
"\n",
"for line in open('Desktop/NLP&Sentiment Analysis/AFINN-111.txt'):\n",
" word, score = line.split('\\t')\n",
" sentiment_dictionary[word] = int(score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tokenization** is the process of breaking down a stream of text into words, phrases or simboles known as *tokens*"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['i', 'hate', 'this', 'novel', '!']\n"
]
}
],
"source": [
"from nltk.tokenize import word_tokenize\n",
"words =word_tokenize('I hate this novel!'.lower())\n",
"print(words)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-3"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(sentiment_dictionary.get(word,0)for word in words)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I love this book!\n",
"Though I hate the beginning.\n",
"It would be great for you.\n"
]
}
],
"source": [
"#split into sentences\n",
"from nltk.tokenize import sent_tokenize\n",
"sentences = sent_tokenize('''I love this book! Though I hate the beginning. It would be great for you.''')\n",
"for s in sentences:print(s)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3\n",
"-3\n",
"3\n"
]
}
],
"source": [
"#computer score for each sentence\n",
"for sentence in sentences:\n",
" words = word_tokenize(sentence)\n",
" print(sum(sentiment_dictionary.get(word,0)for word in words))"
]
},
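{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw score can be wrapped in a small helper that maps it to a label. This is a minimal sketch: the helper names and the choice of 0 as the neutral point are our own, not part of AFINN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def score_sentence(sentence):\n",
" #sum the AFINN valence of every token; unknown words score 0\n",
" words = word_tokenize(sentence.lower())\n",
" return sum(sentiment_dictionary.get(word, 0) for word in words)\n",
"\n",
"def label_sentence(sentence):\n",
" #map the numeric score to a coarse label, treating 0 as neutral\n",
" score = score_sentence(sentence)\n",
" if score > 0:\n",
" return 'positive'\n",
" elif score < 0:\n",
" return 'negative'\n",
" return 'neutral'\n",
"\n",
"for s in sentences:\n",
" print(s + ' -> ' + label_sentence(s))"
]
},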
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What about new words? or domain specific terms?\n",
"\n",
"\n",
"\n",
"\n",
"<img src=\"Desktop/NLP&Sentiment Analysis/Machine-Learning.PNG\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"Desktop/NLP&Sentiment Analysis/Capture3.PNG\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"Desktop/NLP&Sentiment Analysis/Capture4.PNG\">"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import nltk.classify.util #calculates accuracy\n",
"from nltk.classify import NaiveBayesClassifier #imports the classifier Naive Bayes\n",
"from nltk.corpus import movie_reviews #imports movie reviews from nltk\n",
"from nltk.corpus import stopwords #imports stopwords from nltk\n",
"from nltk.corpus import wordnet #imports wordnet(lexical database for the english language) from nltk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stopwords\n",
"\n",
"These are words that carry little or no meaning in a sentence, but are really common(High frequency words). eg a, I , is, the etc\n",
"\n",
"When doing Language processing, we need to get rid of these words since they take up a large part of any sentence without adding any context or info."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'i',\n",
" u'me',\n",
" u'my',\n",
" u'myself',\n",
" u'we',\n",
" u'our',\n",
" u'ours',\n",
" u'ourselves',\n",
" u'you',\n",
" u'your',\n",
" u'yours',\n",
" u'yourself',\n",
" u'yourselves',\n",
" u'he',\n",
" u'him',\n",
" u'his']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#inbuilt list of stopwords in nltk\n",
"stopwords.words('english')[:16]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sent = \"the program was open to all women between the ages of 17 and 35, in good health,who had graduated from an accredited high school\""
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['program', 'open', 'women', 'ages', '17', '35', ',', 'good', 'health', ',', 'graduated', 'accredited', 'high', 'school']\n"
]
}
],
"source": [
"#a token is a word or entity in a text\n",
"words = word_tokenize(sent)\n",
"useful_words = [word for word in words if word not in stopwords.words('english')]\n",
"print(useful_words)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'plot', u':', u'two', u'teen', u'couples', u'go', ...]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#import movie_reviews\n",
"from nltk.corpus import movie_reviews\n",
"\n",
"#see words in the review\n",
"movie_reviews.words()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'neg', u'pos']"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_reviews.categories()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u',', 77717),\n",
" (u'the', 76529),\n",
" (u'.', 65876),\n",
" (u'a', 38106),\n",
" (u'and', 35576),\n",
" (u'of', 34123),\n",
" (u'to', 31937),\n",
" (u\"'\", 30585),\n",
" (u'is', 25195),\n",
" (u'in', 21822)]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#frequency distribution of words in movie review\n",
"all_words = movie_reviews.words()\n",
"freq_dist = nltk.FreqDist(all_words)\n",
"freq_dist.most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of these words are stop words. When we build our sentiment analysis program, we’ll have to get rid of them."
]
},
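{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (a sketch of our own, not part of the corpus tooling), we can rebuild the frequency distribution with stopwords and punctuation removed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from string import punctuation\n",
"\n",
"#use a set for fast membership tests; the corpus has over a million tokens\n",
"stop = set(stopwords.words('english')) | set(punctuation)\n",
"content_words = [w for w in movie_reviews.words() if w.lower() not in stop]\n",
"nltk.FreqDist(content_words).most_common(10)"
]
},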
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### what is semtiment analysis ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example is going to use Naive Bayes classifier for this example. This is a simple machine learning algorithm that works mainly with probabilities"
]
},
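{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the idea: for a review with words $w_1, \\dots, w_n$, Naive Bayes picks the label that maximises\n",
"\n",
"$$P(\\text{label} \\mid w_1, \\dots, w_n) \\propto P(\\text{label}) \\prod_{i=1}^{n} P(w_i \\mid \\text{label})$$\n",
"\n",
"The \"naive\" part is the assumption that words are independent given the label; this is rarely true of real text, but it works surprisingly well in practice."
]
},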
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# This is how the Naive Bayes classifier expects the input\n",
"def create_word_features(words):\n",
" useful_words = [word for word in words if word not in stopwords.words(\"english\")]\n",
" my_dict = dict([(word, True) for word in useful_words])\n",
" return my_dict"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'I': True, 'beautiful': True, 'girl': True}"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"create_word_features([\"I\",\"am\",\"a\",\"beautiful\",\"girl\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each word, we create a dictionary with all the words and True. Why a dictionary? So that words are not repeated. If a word already exists, it won’t be added to the dictionary. This is the format the Naive Bayes classifier in nltk expects"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we create an empty list called *neg_reviews*. Next, we loop over all the files in the *neg* folder"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000\n"
]
}
],
"source": [
"neg_reviews = []\n",
"for fileid in movie_reviews.fileids('neg'):\n",
" words = movie_reviews.words(fileid)\n",
" neg_reviews.append((create_word_features(words),\"negative\")) \n",
" \n",
"print(len(neg_reviews))"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000\n"
]
}
],
"source": [
"pos_reviews = []\n",
"for fileid in movie_reviews.fileids('pos'):\n",
" words = movie_reviews.words(fileid)\n",
" pos_reviews.append((create_word_features(words), \"positive\"))\n",
" \n",
"#print(pos_reviews[0]) \n",
"print(len(pos_reviews))\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we create out test and train samples"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1500, 500)\n"
]
}
],
"source": [
"train_set = neg_reviews[:750] + pos_reviews[:750]\n",
"test_set = neg_reviews[750:] + pos_reviews[750:]\n",
"print(len(train_set), len(test_set))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#create the NaiveBayesClassifier\n",
"classifier = NaiveBayesClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"72.4\n"
]
}
],
"source": [
"accuracy = nltk.classify.util.accuracy(classifier, test_set)\n",
"print(accuracy * 100)"
]
},
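{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK's Naive Bayes classifier can also report which word features pushed it most strongly towards one label, which is a useful sanity check on what the model actually learned:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#show the 10 features with the most skewed positive/negative ratios\n",
"classifier.show_most_informative_features(10)"
]
},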
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"review_boss_baby = '''\n",
"This movie is just annoying!\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'negative'"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words = word_tokenize(review_boss_baby)\n",
"words = create_word_features(words)\n",
"classifier.classify(words)\n",
" "
]
},
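{
"cell_type": "markdown",
"metadata": {},
"source": [
"For balance, we can try an obviously positive review as well (the review text here is an illustrative example of our own):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"review_good = '''\n",
"I loved every minute of this movie. Brilliant acting and a wonderful story!\n",
"'''\n",
"\n",
"words = create_word_features(word_tokenize(review_good))\n",
"classifier.classify(words)"
]
},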
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}