Created December 24, 2013 18:53
Sentiment analysis with scikit-learn
Sentiment analysis experiment using scikit-learn
================================================

The script sentiment.py reproduces the sentiment analysis approach from Pang,
Lee and Vaithyanathan (2002), who tried to classify movie reviews as positive
or negative, with three differences:

* tf-idf weighting is applied to terms
* the three-fold cross validation split is different
* regularization is tuned by cross validation

with the result that the accuracy is around 87%, rather than the 82.9%
reported by Pang et al. Only support vector machines are used, since those
gave better results than naive Bayes and logistic regression ("MaxEnt")
according to Pang et al.
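
To make the first difference concrete, here is a minimal pure-Python sketch of sublinear tf-idf weighting. It follows the formulas scikit-learn documents for ``smooth_idf=True`` and ``sublinear_tf=True`` with L2 normalization, as far as I understand them; the whitespace tokenization, the function name ``sublinear_tfidf`` and the toy documents are assumptions made for illustration and are not part of sentiment.py.

```python
import math
from collections import Counter


def sublinear_tfidf(docs):
    """Return one {term: weight} dict per tokenized document (illustrative helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Sublinear tf: 1 + ln(count) instead of the raw count.
        raw = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else raw)
    return rows


# Toy documents (assumed), already split on whitespace.
docs = [
    "a great great film".split(),
    "a dull film".split(),
    "great acting".split(),
]
weights = sublinear_tfidf(docs)
for row in weights:
    print(sorted(row.items()))
```

A term that occurs twice in a review is weighted 1 + ln 2 times as much as it would be with a single occurrence, not twice as much, and a term that appears in fewer documents gets a larger idf factor.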

To run:

    wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
    tar xzf review_polarity.tar.gz
    python sentiment.py

# sentiment.py
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
# GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.grid_search module was removed in 0.20.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Load the reviews; each subdirectory of txt_sentoken (pos, neg) is one class.
data = load_files('txt_sentoken')

# Grid: unigrams vs. unigrams plus bigrams, and the SVM regularization strength C.
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "svc__C": [.01, .1, 1, 10, 100]}
clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC())])

# cv=3 keeps the three-fold cross validation described in the README;
# newer scikit-learn releases default to five folds.
gs = GridSearchCV(clf, params, cv=3, verbose=2, n_jobs=-1)
gs.fit(data.data, data.target)
print(gs.best_estimator_)
print(gs.best_score_)
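
The script prints only the winning configuration. To see how each combination of n-gram range and ``C`` fared, the fitted search object also exposes ``cv_results_``. The sketch below runs the same pipeline and grid on a tiny made-up corpus so that it works without the download; the corpus, the fixed ``StratifiedKFold`` split and its ``random_state`` are assumptions for illustration, and scores on twelve short strings say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the movie reviews (1 = positive).
pos = ["great film", "wonderful acting", "loved the plot",
       "brilliant and moving", "great fun", "a wonderful surprise"]
neg = ["dull film", "terrible acting", "hated the plot",
       "boring and slow", "dull mess", "a terrible waste"]
texts = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

pipe = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                 ("svc", LinearSVC())])
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "svc__C": [.01, .1, 1, 10, 100]}

# Fix the three folds explicitly so repeated runs use the same split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
gs = GridSearchCV(pipe, params, cv=cv)
gs.fit(texts, labels)

# One entry per parameter combination, best mean accuracy first.
results = sorted(zip(gs.cv_results_["mean_test_score"],
                     gs.cv_results_["params"]),
                 key=lambda pair: pair[0], reverse=True)
for score, combo in results:
    print("%.3f  %s" % (score, combo))
```

Passing an explicit splitter instead of ``cv=3`` makes the fold assignment reproducible, which matters when comparing against a published figure obtained with a different split.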