@coreylynch
Forked from pprett/bench_rcv1.py
Created November 26, 2012 22:07
Benchmark sklearn's SGDClassifier on RCV1-ccat dataset.
"""
Benchmark sklearn's SGDClassifier on RCV1-ccat dataset.
To generate the input files, see http://leon.bottou.org/projects/sgd .
Results
-------
ACC: 0.9479
AUC: 0.9476
3 loops, best of 1: 1.21 s per loop
"""
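import svmlight_loader
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle
from sklearn import metrics

# Load the train and test sets from svmlight-format files.
X, y = svmlight_loader.load_svmlight_file('../../corpora/rcv1-ccat/train.dat', buffer_mb=500)
# Pass n_features so the test matrix has as many columns as the training matrix.
X_test, y_test = svmlight_loader.load_svmlight_file('../../corpora/rcv1-ccat/test.dat', n_features=X.shape[1], buffer_mb=500)

# Shuffle the training examples with a fixed seed, then free the unshuffled copies.
X_train, y_train = shuffle(X, y, random_state=0)
del X
del y

# Linear model trained with SGD: 5 passes over the data, regularization strength 1e-5.
clf = SGDClassifier(n_iter=5, alpha=0.00001)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Note: the AUC is computed from hard class predictions, not from decision scores.
print("ACC: %.4f" % metrics.zero_one_score(y_test, y_pred))
print("AUC: %.4f" % metrics.auc_score(y_test, y_pred))

# These two lines only print the IPython commands to run for timing.
print("%timeit clf.fit(X_train, y_train)")
print("%timeit clf.score(X_test, y_test)")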
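The script above targets the scikit-learn API of 2012. In recent releases `n_iter` has been replaced by `max_iter` and `tol`, `zero_one_score` by `accuracy_score`, and `auc_score` by `roc_auc_score`. What follows is a minimal sketch of the same benchmark for Python 3 and a recent scikit-learn, not the author's code. It swaps in scikit-learn's own `load_svmlight_file` for the `svmlight_loader` package, adds `random_state=0` to the classifier for repeatability, and measures fit time with `timeit.default_timer` instead of IPython's `%timeit`. The function names `run_benchmark` and `main` are illustrative, and the numbers it produces need not match the results reported in the docstring.

```python
import timeit

from sklearn import metrics
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle


def run_benchmark(X_train, y_train, X_test, y_test):
    """Fit an SGDClassifier and return (accuracy, AUC, fit time in seconds)."""
    X_train, y_train = shuffle(X_train, y_train, random_state=0)
    # max_iter=5 with tol=None runs exactly 5 epochs, as n_iter=5 did.
    # random_state=0 is an addition (assumption) to make runs repeatable.
    clf = SGDClassifier(max_iter=5, tol=None, alpha=0.00001, random_state=0)
    start = timeit.default_timer()
    clf.fit(X_train, y_train)
    fit_seconds = timeit.default_timer() - start
    y_pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_pred)
    # As in the original script, the AUC is computed from hard predictions.
    auc = metrics.roc_auc_score(y_test, y_pred)
    return acc, auc, fit_seconds


def main(train_path, test_path):
    """Load svmlight-format files, run the benchmark and print the results."""
    X, y = load_svmlight_file(train_path)
    # n_features keeps the test matrix as wide as the training matrix.
    X_test, y_test = load_svmlight_file(test_path, n_features=X.shape[1])
    acc, auc, fit_seconds = run_benchmark(X, y, X_test, y_test)
    print("ACC: %.4f" % acc)
    print("AUC: %.4f" % auc)
    print("fit: %.2f s" % fit_seconds)
```

With the input files in place, `main('../../corpora/rcv1-ccat/train.dat', '../../corpora/rcv1-ccat/test.dat')` runs the whole benchmark.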
CaiyiZhu commented Apr 7, 2016

Hi, where did you get the corpora in this format? I can download the dataset using sklearn, but the data is in pickle format.
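One possible route, sketched under assumptions: scikit-learn's `fetch_rcv1` returns the documents and their topic labels as sparse matrices, and `dump_svmlight_file` can write such matrices in the text format that the benchmark script reads. The helper names `ccat_labels` and `export_subset`, the output path, and the choice of one-based feature indices are illustrative. The resulting files are not guaranteed to match those produced by the scripts on Bottou's page, whose preprocessing and train/test split may differ.

```python
import numpy as np
from sklearn.datasets import dump_svmlight_file, fetch_rcv1


def ccat_labels(target, target_names):
    """Return +1 where a document is tagged CCAT and -1 elsewhere."""
    col = list(target_names).index("CCAT")
    is_ccat = target[:, col].toarray().ravel() > 0
    return np.where(is_ccat, 1, -1)


def export_subset(subset, path):
    """Fetch one RCV1 subset via scikit-learn and write it in svmlight format."""
    # subset is 'train', 'test' or 'all'; the first call downloads the data.
    rcv1 = fetch_rcv1(subset=subset)
    y = ccat_labels(rcv1.target, rcv1.target_names)
    # zero_based=False writes feature indices starting at 1 (an assumption
    # about what the downstream loader expects).
    dump_svmlight_file(rcv1.data, y, path, zero_based=False)
```

For example, `export_subset('train', 'train.dat')` would write the subset that scikit-learn labels as training data to a hypothetical `train.dat`.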
