@jakevdp
Created September 29, 2011 14:04
test code & dataset for scikit-learn issue #365
Code demonstrating the problem seen in scikit-learn issue #365.
To run the example, unpack the dataset and run the script:
tar -zxvf data.tgz
python test.py
import numpy as np
from operator import itemgetter
from sklearn.feature_extraction.text import Vectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_files
data_train = load_files('data_train')
data_test = load_files('data_test')
categories = data_train.target_names
# target labels for the training set and the test set
y_train, y_test = data_train.target, data_test.target
vectorizer = Vectorizer()
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
# vocabulary terms, ordered by their column index in the feature matrix
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary.iteritems(),
                                            key=itemgetter(1))])
knnfitted = KNeighborsClassifier(n_neighbors=1000,
                                 algorithm='brute').fit(X_train, y_train)
pred = knnfitted.predict(X_test)
# mean predicted label (the fraction predicted as class 1 if labels are 0/1)
print 1.0 * sum(pred) / len(pred)
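For readers who want to see what a brute-force nearest-neighbor vote does without running the dataset, here is a minimal pure-Python sketch (Python 3.8+). It is not scikit-learn's implementation: the `knn_predict` helper, the toy points, the Euclidean metric, and the tie-breaking rule are all assumptions made for illustration. It prints the same statistic as the script above, the mean of the predicted labels.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, X_test, n_neighbors):
    """Hypothetical helper: brute-force k-NN classification by majority vote."""
    if n_neighbors > len(X_train):
        raise ValueError("n_neighbors cannot exceed the number of training samples")
    predictions = []
    for x in X_test:
        # "Brute force": compute the distance to every training point.
        distances = [math.dist(x, t) for t in X_train]
        # Indices of the n_neighbors closest training points.
        nearest = sorted(range(len(X_train)), key=distances.__getitem__)[:n_neighbors]
        # Majority vote; ties go to the smallest label (an arbitrary choice here).
        votes = Counter(y_train[i] for i in nearest)
        best = max(votes.values())
        predictions.append(min(label for label, c in votes.items() if c == best))
    return predictions


# Toy data (made up for illustration): two points of class 0, three of class 1.
X_train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
y_train = [0, 0, 1, 1, 1]
X_test = [(0.05, 0.1), (1.0, 0.9)]

pred = knn_predict(X_train, y_train, X_test, n_neighbors=3)
# Mean predicted label, i.e. the fraction assigned to class 1 for 0/1 labels.
frac = sum(pred) / len(pred)
print(pred, frac)
```

In this sketch, raising `n_neighbors` to the full training-set size (5) makes every test point receive the majority training label, so the printed fraction stops depending on the test data at all. That is a property of the toy example, not a statement about what issue #365 reports.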