@luisfredgs
Forked from larsmans/README
Created October 3, 2016 19:49
Sentiment analysis with scikit-learn
Sentiment analysis experiment using scikit-learn
================================================
The script sentiment.py reproduces the sentiment analysis approach of Pang,
Lee and Vaithyanathan (2002), who classified movie reviews as positive or
negative, with three differences:

* tf-idf weighting is applied to terms
* the three-fold cross-validation split is different
* regularization is tuned by cross-validation

As a result, accuracy is around 87%, rather than the 82.9% reported by Pang
et al. Only support vector machines are used, since Pang et al. found that
they gave better results than naive Bayes and logistic regression
("MaxEnt").
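
To illustrate the first difference listed above, here is a minimal sketch of how tf-idf weighting with sublinear term-frequency scaling (the setting sentiment.py passes to its vectorizer) compares with raw counts. The three toy documents are invented for illustration and are not part of the review corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for illustration (not from the review corpus).
docs = [
    "good good good film",
    "bad film",
    "good acting bad plot",
]

# Both vectorizers tokenize identically and sort their vocabulary, so a term
# has the same column index in each.
counts = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer(sublinear_tf=True).fit(docs)

raw_row = counts.transform(docs[:1]).toarray()[0]
tfidf_row = tfidf.transform(docs[:1]).toarray()[0]

# Compare the raw count and the tf-idf weight of each term in the first document.
for term, idx in sorted(counts.vocabulary_.items()):
    print(f"{term:8s} count={raw_row[idx]}  tfidf={tfidf_row[idx]:.3f}")

good = counts.vocabulary_["good"]
film = counts.vocabulary_["film"]
# "good" occurs three times, but sublinear scaling replaces tf with 1 + ln(tf),
# so its weight is about 2.1 times that of "film" rather than 3 times.
ratio = tfidf_row[good] / tfidf_row[film]
print(ratio)
```

Both terms appear in two of the three documents, so they share the same idf and the ratio isolates the effect of the sublinear scaling.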

To run:

    wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
    tar xzf review_polarity.tar.gz
    python sentiment.py
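
The script reads the extracted txt_sentoken directory with load_files, which treats each subdirectory as one class. The sketch below shows that mapping on a throwaway directory; the file names and review texts are hypothetical, and encoding="utf-8" is passed here only so the toy documents come back as strings.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy reviews, laid out as one subdirectory per class.
toy_reviews = {
    "neg": ["a dull and lifeless film", "the plot makes no sense"],
    "pos": ["a moving and funny film", "the cast is excellent"],
}

with tempfile.TemporaryDirectory() as root:
    for label, reviews in toy_reviews.items():
        folder = Path(root) / label
        folder.mkdir()
        for i, text in enumerate(reviews):
            (folder / f"review_{i}.txt").write_text(text, encoding="utf-8")
    # File contents are read eagerly, so the result outlives the directory.
    toy = load_files(root, encoding="utf-8")

# Subdirectory names become the class names; targets index into that list.
print(toy.target_names)
for text, target in zip(toy.data, toy.target):
    print(toy.target_names[target], "|", text)
```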
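# sentiment.py
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
# GridSearchCV lives in sklearn.model_selection; the old sklearn.grid_search
# module was removed in scikit-learn 0.20.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Load the reviews; the pos/ and neg/ subdirectories become the class labels.
data = load_files('txt_sentoken')

# Hyperparameter grid: unigrams vs. unigrams + bigrams, and the SVM
# regularization strength C.
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "svc__C": [.01, .1, 1, 10, 100]}

# The vectorizer sits inside the pipeline so it is refit on each training fold.
clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC())])

# cv=3 is set explicitly for the three-fold split described in the README;
# recent scikit-learn releases default to five folds.
gs = GridSearchCV(clf, params, cv=3, verbose=2, n_jobs=-1)
gs.fit(data.data, data.target)

print(gs.best_estimator_)
print(gs.best_score_)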
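
Once the search has been fitted, it can be inspected and used to classify new text. The following is a minimal sketch on a tiny invented corpus (the texts, the labels, and the reduced C grid are assumptions for illustration, so the scores it prints say nothing about the movie review data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus invented for illustration; 1 = positive, 0 = negative.
texts = [
    "a wonderful and moving film",
    "great acting and a great script",
    "funny, smart and beautifully shot",
    "one of the best films of the year",
    "a moving story with a great cast",
    "smart script and wonderful acting",
    "a dull and boring film",
    "terrible acting and a weak script",
    "boring, slow and badly shot",
    "one of the worst films of the year",
    "a weak story with a terrible cast",
    "slow script and dull acting",
]
labels = [1] * 6 + [0] * 6

clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC())])
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "svc__C": [0.1, 1, 10]}
gs = GridSearchCV(clf, params, cv=3)
gs.fit(texts, labels)

# One entry per parameter combination, with its mean accuracy across the folds.
for combo, score in zip(gs.cv_results_["params"],
                        gs.cv_results_["mean_test_score"]):
    print(combo, round(score, 3))

# With the default refit=True, the best pipeline is retrained on all the data,
# so the search object can classify raw strings directly.
new_reviews = ["a great and moving script", "a dull and boring mess"]
predictions = gs.predict(new_reviews)
print(predictions)
```

Because the tf-idf step is part of the pipeline, new reviews are passed in as plain strings and vectorized with the vocabulary learned during fitting.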