@rtkgupta
Created March 6, 2016 07:02

Introduction

Hello! In this post I will walk you through my implementation of SentiWordNet 3.0 on movie reviews to find the overall sentiment of each review. The datasets and more about SentiWordNet are described below. I will be using Python 2.7 for coding, along with a few of its libraries: pandas, sklearn and nltk. NLTK has built-in modules for SentiWordNet and a POS tagger, both of which are used in the code. So let's get started!

SentiWordnet

SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity. Each of the three scores ranges over the interval [0.0, 1.0], and for each synset the three scores sum to 1.0.

SentiWordNet was built by scoring the subjectivity of all synsets according to the part of speech each term belongs to. The parts of speech represented in SentiWordNet are adjective, noun, adverb and verb, encoded respectively as 'a', 'n', 'r' and 'v'. The database has five columns: the part of speech; the offset, a numerical ID that, together with a part of speech, identifies a synset; the positive score; the negative score (both ranging from 0 to 1); and the synset terms, i.e. all terms belonging to that synset.
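The row layout described above can be sketched with a small parser. Note that the sample line below uses invented scores purely to illustrate the five-column format, and that objectivity is recovered as 1 − (positive + negative):

```python
def parse_swn_line(line):
    """Parse one tab-separated SentiWordNet row into a dict.

    Columns (per the description above): POS tag, offset ID,
    positive score, negative score, synset terms.
    """
    pos, offset, pos_score, neg_score, terms = line.rstrip("\n").split("\t")
    pos_score, neg_score = float(pos_score), float(neg_score)
    return {
        "pos": pos,                    # 'a', 'n', 'r' or 'v'
        "offset": offset,              # numeric synset ID
        "pos_score": pos_score,
        "neg_score": neg_score,
        "obj_score": 1.0 - pos_score - neg_score,  # the three scores sum to 1.0
        "terms": terms.split(),        # all terms in the synset
    }

# illustrative row with invented scores, not a real SentiWordNet entry
row = parse_swn_line("a\t00001740\t0.125\t0.0\table#1 capable#1")
```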

Dataset Used

I have tested SentiWordNet 3.0 on two datasets of movie reviews, both picked up from Kaggle competitions. First, the Rotten Tomatoes dataset available here. This dataset is a .tsv file already parsed by the Stanford Parser into phrases, labeled over 5 levels of sentiment given below:

  • 0 - negative
  • 1 - somewhat negative
  • 2 - neutral
  • 3 - somewhat positive
  • 4 - positive

The second dataset is the Large Movie Review dataset from Stanford, a labeled set of 50,000 IMDB movie reviews with binary labels of 0 and 1.
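Assuming the Rotten Tomatoes file follows the usual Kaggle layout (PhraseId, SentenceId, Phrase, Sentiment columns), the phrases and their 0–4 labels can be pulled out with the standard csv module. The two inline rows here are invented examples of the format, not real data:

```python
import csv
import io

def load_phrases(tsv_file):
    """Yield (phrase, sentiment) pairs from a Rotten Tomatoes-style .tsv."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        yield row["Phrase"], int(row["Sentiment"])

# invented sample rows that mimic the expected column layout
sample = io.StringIO(
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA gripping drama\t3\n"
    "2\t1\tA gripping\t2\n"
)
phrases = list(load_phrases(sample))
```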

Text Preprocessing

A utility class has been created for text preprocessing: removal of stop words, markup tags, extra spaces, etc. A snippet of it is given below.

from nltk.corpus import stopwords

class DataClean:
    def __init__(self, clean_list, html_clean=False, split_words=False):
        self.clean_list = clean_list      # cleaning rules supplied by the caller
        self.html_clean = html_clean      # whether to strip markup tags
        self.split_words = split_words    # whether to return a token list
        # "film" and "movie" are treated as stop words for this domain
        self.stopwords_eng = stopwords.words("english") + [u"film", u"movie"]

The complete utilities.py is available at this link.
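As a rough sketch of what such a utility does (not the author's actual utilities.py), the snippet below strips markup tags, lowercases, and drops stop words; a tiny hard-coded stop list stands in for nltk's `stopwords.words("english")` plus the domain words "film" and "movie":

```python
import re

# tiny stand-in for stopwords.words("english") + ["film", "movie"]
STOPWORDS = {"this", "is", "a", "the", "and", "of", "film", "movie"}

def clean_text(text):
    """Remove markup tags and stop words, lowercase, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML-style tags
    tokens = re.findall(r"[a-z']+", text.lower())  # keep word tokens only
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean_text("<br />This is a GREAT movie")
```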

Methodology

The main objective is to use SentiWordNet 3.0 for sentiment analysis. The approach is simple:

  • Each word is tagged using a POS tagger.
  • The lemma of each tagged word is found using the WordNet lemmatizer.
  • Scores for each lemmatized word are computed from its synsets and averaged.
  • This process is iterated over each sentence/phrase to find the sentence score.
    def compute_score(self, sentence):
        # `tagger` is a POS tagger and `swn` is nltk.corpus.sentiwordnet,
        # both set up elsewhere in the full script
        sent_score = []
        tagged = tagger.tag(sentence.split())
        wnl = nltk.WordNetLemmatizer()
        for word, tag in tagged:
            lemma = wnl.lemmatize(word)
            # map Penn Treebank tags onto SentiWordNet's POS codes
            if tag.startswith('NN'):
                newtag = 'n'
            elif tag.startswith('JJ'):
                newtag = 'a'
            elif tag.startswith('V'):
                newtag = 'v'
            elif tag.startswith('R'):
                newtag = 'r'
            else:
                newtag = ''
            if newtag:
                synsets = list(swn.senti_synsets(lemma, newtag))
                if synsets:
                    # average (positive - negative) over all senses of the word
                    score = sum(syn.pos_score() - syn.neg_score() for syn in synsets)
                    sent_score.append(score / len(synsets))
        if len(sent_score) <= 1:   # fewer than two scored words: treat as neutral
            return 0.0
        return sum(sent_score) / len(sent_score)

The complete code is available on GitHub.
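Stripped of the NLTK lookups, the method above reduces to averaging each word's mean (positive − negative) synset score. The toy lexicon below (invented values, purely illustrative) mirrors that arithmetic, including the rule that a sentence with fewer than two scored words gets 0.0:

```python
# invented per-word scores standing in for the SentiWordNet lookups
TOY_LEXICON = {"great": 0.6, "boring": -0.5, "plot": 0.1}

def toy_sentence_score(sentence):
    """Average the available word scores, as compute_score does."""
    scores = [TOY_LEXICON[w] for w in sentence.lower().split() if w in TOY_LEXICON]
    if len(scores) <= 1:   # 0 or 1 scored words -> neutral
        return 0.0
    return sum(scores) / len(scores)

s1 = toy_sentence_score("great plot")  # (0.6 + 0.1) / 2 = 0.35
s2 = toy_sentence_score("boring")      # only one scored word -> 0.0
```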

Results

I will be using the following metrics to evaluate the SentiWordNet approach:

  • Accuracy
  • Confusion matrix

You can read more about these on Data School. With 4-fold cross validation, the above code achieves 51% accuracy on the Rotten Tomatoes 5-level sentiment data and 56.73% on the binary Large IMDb Movie Review dataset.
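The actual evaluation uses sklearn (accuracy_score, confusion_matrix and cross-validation helpers), but for the binary case the two metrics are simple to sketch in plain Python, thresholding the continuous sentence score at 0. The scores and labels below are invented purely to show the bookkeeping:

```python
def evaluate(scores, labels):
    """Return (accuracy, confusion matrix) for binary labels 0/1.

    A sentence score > 0 is predicted positive (1), otherwise negative (0).
    The confusion matrix is laid out as [[TN, FP], [FN, TP]].
    """
    preds = [1 if s > 0 else 0 for s in scores]
    matrix = [[0, 0], [0, 0]]
    for true, pred in zip(labels, preds):
        matrix[true][pred] += 1
    correct = matrix[0][0] + matrix[1][1]
    return correct / len(labels), matrix

# invented scores and gold labels, purely illustrative
acc, cm = evaluate([0.35, -0.2, 0.1, -0.4], [1, 0, 0, 1])
```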
