@rtkgupta
Created March 6, 2016 07:02

Introduction

Hello! In this post I will walk you through my implementation of SentiWordNet 3.0 on movie reviews to find the overall sentiment of each review. The datasets and more about SentiWordNet are described below. I will be using Python 2.7 for coding, along with a few of its libraries: pandas, sklearn and nltk. NLTK has built-in modules for SentiWordNet and a POS tagger, both of which are used in the code. So let's get started!

SentiWordnet

SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity. Each of the three scores ranges over the interval [0.0, 1.0], and for each synset the three scores sum to 1.0.

SentiWordNet was built by scoring the subjectivity of all synsets according to the part of speech each term belongs to. The parts of speech represented in SentiWordNet are adjective, noun, adverb and verb, encoded respectively as 'a', 'n', 'r' and 'v'. The database has five columns: the part of speech; the offset, a numerical ID that, together with a part of speech, identifies a synset; the positive score; the negative score (both ranging from 0 to 1); and the synset terms, i.e. all terms belonging to that synset.
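The row layout described above can be sketched with a small parser. Note that the sample line below uses invented scores purely to illustrate the five-column format, and that objectivity is recovered as 1 − (positive + negative):

```python
def parse_swn_line(line):
    """Parse one tab-separated SentiWordNet row into a dict.

    Columns (per the description above): POS tag, offset ID,
    positive score, negative score, synset terms.
    """
    pos, offset, pos_score, neg_score, terms = line.rstrip("\n").split("\t")
    pos_score, neg_score = float(pos_score), float(neg_score)
    return {
        "pos": pos,                    # 'a', 'n', 'r' or 'v'
        "offset": offset,              # numeric synset ID
        "pos_score": pos_score,
        "neg_score": neg_score,
        "obj_score": 1.0 - pos_score - neg_score,  # the three scores sum to 1.0
        "terms": terms.split(),        # all terms in the synset
    }

# illustrative row with invented scores, not a real SentiWordNet entry
row = parse_swn_line("a\t00001740\t0.125\t0.0\table#1 capable#1")
```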

Dataset Used

I have tested SentiWordNet 3.0 on two datasets of movie reviews, both picked up from Kaggle competitions. First, the Rotten Tomatoes dataset available here. This dataset is a .tsv file already parsed by the Stanford Parser into phrases, labeled over 5 levels of sentiment given below:

  • 0 - negative
  • 1 - somewhat negative
  • 2 - neutral
  • 3 - somewhat positive
  • 4 - positive

The second dataset is the Large Movie Review dataset from Stanford, a labeled set of 50,000 IMDB movie reviews with binary labels of 0 and 1.
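Assuming the Rotten Tomatoes file follows the usual Kaggle layout (PhraseId, SentenceId, Phrase, Sentiment columns), the phrases and their 0–4 labels can be pulled out with the standard csv module. The two inline rows here are invented examples of the format, not real data:

```python
import csv
import io

def load_phrases(tsv_file):
    """Yield (phrase, sentiment) pairs from a Rotten Tomatoes-style .tsv."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        yield row["Phrase"], int(row["Sentiment"])

# invented sample rows that mimic the expected column layout
sample = io.StringIO(
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA gripping drama\t3\n"
    "2\t1\tA gripping\t2\n"
)
phrases = list(load_phrases(sample))
```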

Text Preprocessing

A utility class has been created for text preprocessing: removal of stop words, markup tags, extra spaces, etc. A snippet of it is given below.

from nltk.corpus import stopwords

class DataClean:
    def __init__(self, clean_list, html_clean=False, split_words=False):
        self.clean_list = clean_list      # cleaning rules supplied by the caller
        self.html_clean = html_clean      # whether to strip markup tags
        self.split_words = split_words    # whether to return a token list
        # "film" and "movie" are treated as stop words for this domain
        self.stopwords_eng = stopwords.words("english") + [u"film", u"movie"]

The complete utilities.py is available at this link.
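As a rough sketch of what such a utility does (not the author's actual utilities.py), the snippet below strips markup tags, lowercases, and drops stop words; a tiny hard-coded stop list stands in for nltk's `stopwords.words("english")` plus the domain words "film" and "movie":

```python
import re

# tiny stand-in for stopwords.words("english") + ["film", "movie"]
STOPWORDS = {"this", "is", "a", "the", "and", "of", "film", "movie"}

def clean_text(text):
    """Remove markup tags and stop words, lowercase, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML-style tags
    tokens = re.findall(r"[a-z']+", text.lower())  # keep word tokens only
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean_text("<br />This is a GREAT movie")
```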

Methodology

The main objective is to use SentiWordNet 3.0 for sentiment analysis. The approach is simple:

  • Each word is tagged using a POS tagger.
  • The lemma of each tagged word is found using the WordNet lemmatizer.
  • Scores for each lemmatized word are computed from its synsets and averaged.
  • This process is iterated over each sentence/phrase to find the sentence score.
    def compute_score(self, sentence):
        # `tagger` is a POS tagger and `swn` is nltk.corpus.sentiwordnet,
        # both set up elsewhere in the full script
        sent_score = []
        tagged = tagger.tag(sentence.split())
        wnl = nltk.WordNetLemmatizer()
        for word, tag in tagged:
            lemma = wnl.lemmatize(word)
            # map Penn Treebank tags onto SentiWordNet's POS codes
            if tag.startswith('NN'):
                newtag = 'n'
            elif tag.startswith('JJ'):
                newtag = 'a'
            elif tag.startswith('V'):
                newtag = 'v'
            elif tag.startswith('R'):
                newtag = 'r'
            else:
                newtag = ''
            if newtag:
                synsets = list(swn.senti_synsets(lemma, newtag))
                if synsets:
                    # average (positive - negative) over all senses of the word
                    score = sum(syn.pos_score() - syn.neg_score() for syn in synsets)
                    sent_score.append(score / len(synsets))
        if len(sent_score) <= 1:   # fewer than two scored words: treat as neutral
            return 0.0
        return sum(sent_score) / len(sent_score)

The complete code is available on GitHub.
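Stripped of the NLTK lookups, the method above reduces to averaging each word's mean (positive − negative) synset score. The toy lexicon below (invented values, purely illustrative) mirrors that arithmetic, including the rule that a sentence with fewer than two scored words gets 0.0:

```python
# invented per-word scores standing in for the SentiWordNet lookups
TOY_LEXICON = {"great": 0.6, "boring": -0.5, "plot": 0.1}

def toy_sentence_score(sentence):
    """Average the available word scores, as compute_score does."""
    scores = [TOY_LEXICON[w] for w in sentence.lower().split() if w in TOY_LEXICON]
    if len(scores) <= 1:   # 0 or 1 scored words -> neutral
        return 0.0
    return sum(scores) / len(scores)

s1 = toy_sentence_score("great plot")  # (0.6 + 0.1) / 2 = 0.35
s2 = toy_sentence_score("boring")      # only one scored word -> 0.0
```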

Results

I will be using the following metrics to evaluate the SentiWordNet approach:

  • Accuracy
  • Confusion matrix

You can read more about these on Data School. With 4-fold cross validation, the above code achieves 51% accuracy on the Rotten Tomatoes 5-level sentiment data and 56.73% on the binary Large IMDb Movie Review dataset.
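The actual evaluation uses sklearn (accuracy_score, confusion_matrix and cross-validation helpers), but for the binary case the two metrics are simple to sketch in plain Python, thresholding the continuous sentence score at 0. The scores and labels below are invented purely to show the bookkeeping:

```python
def evaluate(scores, labels):
    """Return (accuracy, confusion matrix) for binary labels 0/1.

    A sentence score > 0 is predicted positive (1), otherwise negative (0).
    The confusion matrix is laid out as [[TN, FP], [FN, TP]].
    """
    preds = [1 if s > 0 else 0 for s in scores]
    matrix = [[0, 0], [0, 0]]
    for true, pred in zip(labels, preds):
        matrix[true][pred] += 1
    correct = matrix[0][0] + matrix[1][1]
    return correct / len(labels), matrix

# invented scores and gold labels, purely illustrative
acc, cm = evaluate([0.35, -0.2, 0.1, -0.4], [1, 0, 0, 1])
```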
