Created December 24, 2013 18:53
Sentiment analysis with scikit-learn
Sentiment analysis experiment using scikit-learn
This script reproduces the sentiment analysis approach of Pang,
Lee and Vaithyanathan (2002), who classified movie reviews as positive
or negative. It differs from their setup in three ways:
* tf-idf weighting is applied to terms
* the three-fold cross validation split is different
* regularization is tuned by cross validation
With these changes, accuracy is around 87%, compared with the 82.9%
reported by Pang et al. Only support vector machines are used, since
Pang et al. found that they gave better results than naive Bayes and
logistic regression ("MaxEnt").
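As a rough illustration of the first difference, the sketch below compares plain and sublinear tf-idf weights for a repeated term. The three toy reviews are made up for this example and are not taken from the Pang et al. dataset; `norm=None` is set only so that the raw weights are easy to compare.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, not from the real dataset.
docs = [
    "great great great film",
    "dull film",
    "great cast, dull plot",
]

# Plain tf-idf: the term count enters the weight linearly.
plain = TfidfVectorizer(norm=None)
# Sublinear tf: a count tf is replaced by 1 + log(tf) before idf scaling.
sublinear = TfidfVectorizer(norm=None, sublinear_tf=True)

X_plain = plain.fit_transform(docs)
X_sub = sublinear.fit_transform(docs)

# Both vectorizers tokenize identically, so they share column indices.
col = plain.vocabulary_["great"]
w_plain = X_plain[0, col]
w_sub = X_sub[0, col]

print("plain weight:    ", w_plain)
print("sublinear weight:", w_sub)
print("expected ratio:  ", (1 + math.log(3)) / 3)
```

With sublinear scaling, a word repeated three times counts for less than three times a single occurrence, which dampens the effect of very frequent terms in long reviews.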
To run, first unpack the dataset in the working directory:
    tar xzf review_polarity.tar.gz
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
data = load_files('txt_sentoken')
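# Standalone vectorization of the raw reviews. X is not used by the grid
# search below, which re-vectorizes the text inside the pipeline.
vect = TfidfVectorizer()
X = vect.fit_transform(data.data)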
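# Hyperparameter grid: unigrams vs. unigrams plus bigrams, and the SVM's C.
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "svc__C": [.01, .1, 1, 10, 100]}
clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC())])
# cv=3 keeps the three-fold cross validation described above; newer
# scikit-learn releases default to five folds.
gs = GridSearchCV(clf, params, cv=3, verbose=2, n_jobs=-1)
gs.fit(data.data, data.target)
print(gs.best_params_, gs.best_score_)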
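
To see how the grid search chooses `C` and the n-gram range without downloading the dataset, the following sketch runs the same kind of pipeline on a tiny made-up corpus and lists the mean three-fold score for each parameter combination. The texts, labels, and the reduced `C` grid are assumptions for illustration only, and it assumes a scikit-learn version that provides `sklearn.model_selection`; scores on such a small sample say nothing about the real task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: six positive and six negative snippets.
texts = [
    "a wonderful and moving film",
    "great acting and a great script",
    "I loved every minute of it",
    "a brilliant, funny and touching story",
    "superb direction and a wonderful cast",
    "one of the best films of the year",
    "a dull and lifeless film",
    "terrible acting and a boring script",
    "I hated every minute of it",
    "a stupid, tedious and pointless story",
    "awful direction and a wooden cast",
    "one of the worst films of the year",
]
labels = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

pipe = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                 ("svc", LinearSVC())])
grid = {"tfidf__ngram_range": [(1, 1), (1, 2)],
        "svc__C": [0.1, 1, 10]}

# For classifiers, cv=3 uses stratified three-fold cross validation.
search = GridSearchCV(pipe, grid, cv=3)
search.fit(texts, labels)

# Mean held-out accuracy for every parameter combination.
results = search.cv_results_
for combo, score in zip(results["params"], results["mean_test_score"]):
    print(combo, round(score, 3))

print("best:", search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the selected parameters, so it can be used directly via `search.predict(...)`.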