@veekaybee
Created March 17, 2017 02:09
Deep-diving into NLTK corpora

# What is NLTK? 

NLTK (the Natural Language Toolkit) is a natural-language processing library written in Python. It is used for many applications, including sentiment analysis of [movie and restaurant reviews](http://crowdsourcing-class.org/assignments/downloads/pak-paroubek.pdf). 
More on that [here](https://github.com/nltk/nltk/wiki/Sentiment-Analysis).

[Examples](http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/) of how to do sentiment analysis in Python. 
Note that the tweets in these examples are hand-labelled for sentiment.


## Where do NLTK corpora come from? 

[Here's](http://www.nltk.org/nltk_data/) a list of all the data available through NLTK.

### Corpora specific to sentiment analysis: 

1) VADER [Sentiment Lexicon](https://github.com/cjhutto/vaderSentiment) 
id: vader_lexicon; size: 90486; author: C.J. Hutto and Eric Gilbert; copyright: ; license: MIT License;

2) Sentiment Polarity Dataset Version 2.0 
id: movie_reviews; size: 4004848; author: Bo Pang and Lillian Lee; copyright: Copyright (C) 2004 Bo Pang and Lillian Lee; license: Creative Commons Attribution 4.0 International;
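
Both corpora can be fetched by the `id` shown above. The following is a minimal sketch, assuming NLTK is installed and the machine has network access; the helper names `fetch_sentiment_corpora` and `summarize_movie_reviews` are made up for illustration and are not part of NLTK.

```python
# Corpus ids copied from the list above; the helper names are illustrative.
SENTIMENT_CORPORA = ("vader_lexicon", "movie_reviews")


def fetch_sentiment_corpora(ids=SENTIMENT_CORPORA):
    """Download each corpus by id and report whether each download succeeded."""
    import nltk  # imported lazily so this module loads even without NLTK

    # nltk.download returns True on success and False otherwise.
    return {corpus_id: nltk.download(corpus_id, quiet=True) for corpus_id in ids}


def summarize_movie_reviews():
    """Count the documents under each label of the movie_reviews corpus."""
    from nltk.corpus import movie_reviews

    return {
        label: len(movie_reviews.fileids(label))
        for label in movie_reviews.categories()
    }
```

Calling `fetch_sentiment_corpora()` once and then `summarize_movie_reviews()` should show how many reviews sit under each sentiment label.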

### [NLTK's Sentiment analyzer package](http://www.nltk.org/_modules/nltk/sentiment/sentiment_analyzer.html)


## How are corpora created?

[By humans](http://www.uvm.edu/pdodds/files/papers/others/2007/godbole2007a.pdf)

[First by humans, then annotated and revised](https://ijcai.org/Proceedings/15/Papers/587.pdf)

> "The annotation, manually performed, begins with a phase
> where five human annotators (two males and three woman,
> varying ages) collectively annotated a small set of data (200
> tweets), attaining a general agreement on the exploitation of
> the labels. Then, we annotated all the data producing for
> each tweet not less than two independent annotations. The
> agreement calculated at this stage, according to the Cohen’s
> κ score, was satisfactory: κ = 0.65."
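
The κ score in the quote measures how much two annotators agree beyond what chance alone would produce. A minimal sketch of Cohen's κ for two annotators follows; the function name and the toy labels are assumptions made for illustration and are not the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical): the annotators agree on 8 of 10 tweets.
annotator_1 = ["pos"] * 6 + ["neg"] * 4
annotator_2 = ["pos"] * 5 + ["neg"] * 4 + ["pos"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # prints 0.58
```

Raw agreement here is 0.8, but κ is lower because two annotators who each say "pos" 60% of the time would agree fairly often by chance.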


By machines, as described in ["Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis"](http://people.cs.pitt.edu/~wiebe/pubs/papers/emnlp05polarity.pdf):

> A typical approach to sentiment analysis is to start with a lexicon of positive and negative words and
> phrases. In these lexicons, entries are tagged with their a priori prior polarity: out of context, does the word seem to evoke something positive or something
> negative. For example, beautiful has a positive prior polarity, and horrid has a negative prior polarity.
> However, the contextual polarity of the phrase in which a word appears may be different from the
> word’s prior polarity.
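
The gap between prior and contextual polarity can be shown with a toy lexicon. This is a rough sketch only: the two lexicon entries come from the quote, while the negator list and the "flip the next sentiment word" rule are simplifying assumptions and not the method used in the paper.

```python
# Toy lexicon: only these two entries come from the quoted passage.
PRIOR_POLARITY = {"beautiful": 1, "horrid": -1}
# Hypothetical negators, used to illustrate one kind of context effect.
NEGATORS = {"not", "never", "no"}


def prior_score(tokens):
    """Sum each word's out-of-context polarity, ignoring its neighbours."""
    return sum(PRIOR_POLARITY.get(token, 0) for token in tokens)


def contextual_score(tokens):
    """Like prior_score, but a negator flips the next sentiment-bearing word."""
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        polarity = PRIOR_POLARITY.get(token, 0)
        if polarity:
            score += -polarity if negate else polarity
            negate = False  # the negator is consumed by this word
    return score


tokens = "the view was not beautiful".split()
print(prior_score(tokens), contextual_score(tokens))  # prints: 1 -1
```

The lexicon alone scores the sentence as positive because "beautiful" is positive out of context, while even this crude context rule reverses the sign.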