The paper (Turney, 2002: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews") can be found here
Put in simple words: the paper presents a way to classify text without any annotated data (i.e. unsupervised), using only minimal domain knowledge. The paper works in the domain of reviews, where the domain knowledge amounts to knowing that 'excellent' is positive while 'poor' is negative in sentiment.
Computing the semantic closeness of "important phrases" in the text to some pre-defined anchor words - here 'excellent' and 'poor' - can be used to determine the class of the review text.
"The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs."
SemanticOrientation(Phrase p) = PMI(p, 'excellent') - PMI(p, 'poor')
SemanticOrientation(Review r) = Average( SemanticOrientation(Phrase p) ) for all Phrases p in the Review r
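These two formulas are direct to code up. A minimal sketch in Python, where the PMI values are stubbed with made-up numbers purely for illustration (a real system would estimate them, as described later):

```python
def semantic_orientation(phrase, pmi):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor')."""
    return pmi(phrase, "excellent") - pmi(phrase, "poor")

def review_orientation(phrases, pmi):
    """Average SO over all extracted phrases of a review."""
    scores = [semantic_orientation(p, pmi) for p in phrases]
    return sum(scores) / len(scores)

# Illustrative stub: made-up PMI values, not real corpus statistics.
FAKE_PMI = {("very good", "excellent"): 2.0, ("very good", "poor"): 0.5,
            ("too slow", "excellent"): 0.25, ("too slow", "poor"): 1.75}
pmi = lambda phrase, word: FAKE_PMI[(phrase, word)]

print(semantic_orientation("very good", pmi))              # 1.5 (leans positive)
print(review_orientation(["very good", "too slow"], pmi))  # 0.0 (the two phrases cancel out)
```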
- PoS tagging to identify and extract phrases/bi-grams containing adverbs or adjectives.
- Compute the Semantic Orientation of each of the extracted phrases.
- Compute the Average Semantic Orientation of the review.
- Classify a review as positive if its average semantic orientation is greater than 0, else negative.
First we run a Part-of-Speech tagger to tag the various words in a review. To extract the phrases, the following patterns can be looked for, where the symbols JJ, NN, RB mean the usual PoS tags (the third word constrains the match but is not extracted):

- JJ followed by NN or NNS (third word: anything)
- RB, RBR, or RBS followed by JJ (third word: not NN nor NNS)
- JJ followed by JJ (third word: not NN nor NNS)
- NN or NNS followed by JJ (third word: not NN nor NNS)
- RB, RBR, or RBS followed by VB, VBD, VBN, or VBG (third word: anything)
PMI, also known as Pointwise Mutual Information, between two words is a measure of how strongly they co-occur.
Mathematically, it is:

PMI(word1, word2) = log2( p(word1 & word2) / ( p(word1) * p(word2) ) )
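A tiny numeric sketch of PMI, with made-up probabilities (powers of two, so the arithmetic is exact):

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits: log2( p(x,y) / (p(x) * p(y)) )."""
    return math.log2(p_xy / (p_x * p_y))

# The pair co-occurs twice as often as independence would predict: PMI = 1 bit.
print(pmi(0.25, 0.5, 0.25))   # 1.0
# The pair co-occurs exactly as often as independence predicts: PMI = 0.
print(pmi(0.125, 0.5, 0.25))  # 0.0
```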
MI (often written as I), or mutual information, between two random variables X and Y intuitively measures how much knowing X tells us about Y. Mathematically it is given as:

I(X; Y) = Σ_x Σ_y p(x, y) * log2( p(x, y) / ( p(x) * p(y) ) )
This, I guess, makes clear why pointwise mutual information has the word pointwise in it: PMI is just a single term of this sum, evaluated at one particular point (x, y).
PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases. PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents).
Note: AltaVista was used for this paper since it supports NEAR operator constraints (matching documents where the two terms occur close to each other, rather than merely anywhere in the same document).
Let hits(query) be the number of hits returned, given the query. Then the estimate of SO can be given as follows:

SO(phrase) = log2( ( hits(phrase NEAR 'excellent') * hits('poor') ) / ( hits(phrase NEAR 'poor') * hits('excellent') ) )
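A sketch of this hit-count estimate, with `hits()` stubbed by a fixed table of made-up counts (a real implementation would issue the queries to a search engine). The paper smooths the hit counts with a small constant (0.01) to avoid division by zero, which the sketch keeps:

```python
import math

# Hypothetical hit counts standing in for search-engine results; all numbers are made up.
FAKE_HITS = {
    '"excellent"': 1_000_000,
    '"poor"': 800_000,
    '"low fees" NEAR "excellent"': 500,
    '"low fees" NEAR "poor"': 100,
}

def hits(query):
    """Stub for the number of documents a search engine returns for `query`."""
    return FAKE_HITS[query]

def semantic_orientation(phrase, eps=0.01):
    # eps smooths zero hit counts, as in the paper.
    num = (hits(f'"{phrase}" NEAR "excellent"') + eps) * (hits('"poor"') + eps)
    den = (hits(f'"{phrase}" NEAR "poor"') + eps) * (hits('"excellent"') + eps)
    return math.log2(num / den)

# Positive: the phrase co-occurs far more often with "excellent" than with "poor".
print(semantic_orientation("low fees"))
```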
Unlike Hatzivassiloglou and McKeown's method, which is designed for isolated adjectives, this method uses phrases containing adjectives and adverbs. This is advantageous because, although an isolated adjective may indicate subjectivity, it lacks the context needed to truly determine the semantic orientation. (Compare 'unpredictable', which is negative in 'unpredictable steering' but positive in 'unpredictable plot'.) Hence, this method might perform better on some datasets.
The paper illustrates a very simple and straightforward strategy for classifying reviews in an unsupervised manner. The algorithm is easy to implement and may quickly give you a baseline, or even good, results for your dataset. It also doesn't face the sparsity problem that general bi-gram models face, because of its treatment of features: it doesn't have to maintain counts for the features, and instead uses a search engine to compute the PMI between two words. Can these kinds of methods be used to avoid the sparsity problem that n-gram models in general face, while still unleashing their power? Something we might look for in the future.