@szilard
Created April 15, 2015 22:31
Random Forest all data vs subsamples
import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import RandomForestClassifier
n = 1000
p = 100
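def genr_data(n,p):
    X = np.random.randn(n,p)
    y = np.zeros(n)
    for i in range(n):
        # +1 if the squared norm exceeds the chi-square median (p d.o.f.), else -1,
        # so the two classes are roughly balanced
        y[i] = 1 if np.sum(X[i]**2) > chi2.ppf(0.5,p) else -1
    return (X,y)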
d_train = genr_data(n,p)
X_train = d_train[0]
y_train = d_train[1]
d_test = genr_data(10000,p)
X_test = d_test[0]
y_test = d_test[1]
md = RandomForestClassifier(n_estimators = 500, n_jobs = -1)
%time md.fit(X_train, y_train)
yp = md.predict(X_test)
float(np.sum(yp!=y_test))/y_test.size
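
The snippet above fits the forest on all the training data only. The title also mentions subsamples, and that part is not shown. As a rough sketch of one possible reading (an assumption, not necessarily the author's setup), the per-tree sample size could be limited with scikit-learn's `max_samples` parameter and the test error compared across settings. The fractions, the smaller forest and test set, the fixed seed, the vectorized labelling, and the `error_rate` helper are all illustrative choices, not taken from the gist.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import RandomForestClassifier


def genr_data(n, p, rng):
    # Same labelling rule as the gist, written without the Python loop:
    # +1 if the squared norm exceeds the chi-square median, else -1.
    X = rng.standard_normal((n, p))
    y = np.where((X ** 2).sum(axis=1) > chi2.ppf(0.5, p), 1, -1)
    return X, y


def error_rate(md, X, y):
    # Hypothetical helper: fraction of misclassified rows.
    return float(np.mean(md.predict(X) != y))


rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
n, p = 1000, 100
X_train, y_train = genr_data(n, p, rng)
X_test, y_test = genr_data(2000, p, rng)

errors = {}
# None: each tree draws a bootstrap sample of size n (the default).
# 0.5 and 0.1: each tree draws that fraction of the training rows.
for frac in (None, 0.5, 0.1):
    md = RandomForestClassifier(
        n_estimators=100, max_samples=frac, n_jobs=-1, random_state=0
    )
    md.fit(X_train, y_train)
    errors[frac] = error_rate(md, X_test, y_test)

for frac, err in errors.items():
    print(frac, err)
```

`max_samples` only takes effect with `bootstrap=True`, which is the default for `RandomForestClassifier`.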