larsmans/README

## README
Sentiment analysis experiment using scikit-learn
================================================

The script sentiment.py reproduces the sentiment analysis approach from Pang,
Lee and Vaithyanathan (2002), who tried to classify movie reviews as positive
or negative, with three differences:

* tf-idf weighting is applied to terms
* the three-fold cross validation split is different
* regularization is tuned by cross validation

With these changes, accuracy is around 87%, compared with the 82.9%
reported by Pang et al. Only support vector machines are used, since in
their experiments those gave better results than naive Bayes and logistic
regression ("MaxEnt").
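
To make the first difference concrete, here is a minimal plain-Python sketch of tf-idf weighting with sublinear term-frequency scaling. The formulas follow scikit-learn's documented defaults for `TfidfVectorizer(sublinear_tf=True)` (smoothed idf, L2 normalization). The whitespace tokenizer and the helper name `tfidf` are simplifying assumptions for illustration, so the numbers will not match the real vectorizer exactly on arbitrary text.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        weights = {}
        for term, tf in Counter(tokens).items():
            sublinear_tf = 1.0 + math.log(tf)                   # dampens repeated terms
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed idf
            weights[term] = sublinear_tf * idf
        # L2-normalize so that document length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["a great great film", "a dull film", "great acting"]
vectors = tfidf(docs)
```

In the second toy document, "dull" receives a higher weight than "film" because it occurs in fewer documents, which is the effect tf-idf is meant to capture.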

To run:

    wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
    tar xzf review_polarity.tar.gz
    python sentiment.py
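
If `wget` and `tar` are not available, the download and extraction steps can be approximated with the Python standard library. This is a sketch rather than part of the original script: the helper names `extract_archive` and `fetch_dataset` are hypothetical, and the URL is the one given above.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = ("http://www.cs.cornell.edu/people/pabo/"
            "movie-review-data/review_polarity.tar.gz")


def extract_archive(archive_path, dest="."):
    """Unpack a .tar.gz archive into dest; return its top-level entry names."""
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted({Path(m.name).parts[0]
                        for m in tar.getmembers() if Path(m.name).parts})
        tar.extractall(dest)
    return names


def fetch_dataset(url=DATA_URL, dest="."):
    """Download the archive (skipped if already present) and unpack it."""
    archive_path = Path(dest) / "review_polarity.tar.gz"
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return extract_archive(archive_path, dest)


# Usage (needs network access on the first call):
# fetch_dataset()
```

The script below reads the reviews from the `txt_sentoken` directory, so run it from the directory into which the archive was extracted.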

## gistfile1.py
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

data = load_files('txt_sentoken')


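# Hyperparameter grid; keys have the form "<pipeline step>__<parameter>".
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams, or unigrams + bigrams
          "svc__C": [.01, .1, 1, 10, 100]}         # smaller C = stronger regularization

# tf-idf features with sublinear term-frequency scaling, fed to a linear SVM.
clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC())])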

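# cv=3 keeps the three-fold setup described in the README; recent
# scikit-learn versions default to five folds.
gs = GridSearchCV(clf, params, cv=3, verbose=2, n_jobs=-1)
gs.fit(data.data, data.target)
print(gs.best_estimator_)
print(gs.best_score_)  # mean cross-validated accuracy of the best setting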