
@ericbolo
Last active September 8, 2016 21:33
ML with SciKit Learn

Introduction

This is a list of commands and tricks for fitting and evaluating machine learning models with scikit-learn.

Most of them are notes from this great video series by Kevin Markham: http://www.dataschool.io/machine-learning-with-scikit-learn/

Classification accuracy

from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
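A minimal sketch of what accuracy_score computes, using made-up labels (the values of y and y_pred below are hypothetical, not from the original notes). The manual line shows the same number as a plain fraction of matching predictions.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, made up for illustration
y = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Accuracy is the fraction of predictions that match the true labels
accuracy = metrics.accuracy_score(y, y_pred)

# Same quantity computed by hand: 6 matches out of 8
manual = sum(t == p for t, p in zip(y, y_pred)) / len(y)

print(accuracy, manual)
```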

Train-test split

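# the module was named sklearn.cross_validation in scikit-learn before 0.18
from sklearn.model_selection import train_test_split
# test_size and random_state are optional
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)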

Note: random_state guarantees the split will be the same every time.

Downside of a train-test split: it gives a high-variance estimate of out-of-sample accuracy (the result can change a lot depending on which observations end up in the test set).
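To see the variance of a single split in practice, here is a small sketch that repeats the split with different random_state values and records the testing accuracy each time. The iris dataset and the KNN model with n_neighbors=5 are assumptions chosen for illustration; any dataset and classifier would do.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a convenient built-in dataset
X, y = load_iris(return_X_y=True)

split_scores = {}
for seed in range(5):
    # A different random_state gives a different train/test partition
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    split_scores[seed] = metrics.accuracy_score(y_test, y_pred)

# The testing accuracy generally differs from one split to the next
print(split_scores)
```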

Using pandas to ingest data

import pandas as pd

data = pd.read_csv('/url/or/path/to/file.csv')
data.head()  # first 5 rows

Preparing X and y for scikit learn using pandas

feature_cols = ['TV', 'Radio', 'Newspaper']

X = data[feature_cols]
y = data['Sales']
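A self-contained sketch of the same selection, using a tiny in-memory table instead of a CSV file. The numbers are invented for illustration; only the column names follow the notes above.

```python
import pandas as pd

# Hypothetical values; only the column names match the notes
data = pd.DataFrame({
    'TV': [100.0, 200.0, 50.0, 150.0],
    'Radio': [20.0, 40.0, 10.0, 30.0],
    'Newspaper': [30.0, 10.0, 60.0, 20.0],
    'Sales': [12.0, 22.0, 7.0, 17.0],
})

feature_cols = ['TV', 'Radio', 'Newspaper']

X = data[feature_cols]  # a DataFrame: one row per observation, one column per feature
y = data['Sales']       # a Series: one response value per observation

print(X.shape, y.shape)
```

scikit-learn accepts the DataFrame and the Series directly, as long as they have the same number of rows.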

Visualizing data using seaborn

import seaborn as sns

If using a Jupyter notebook, allow plots to appear inline:
%matplotlib inline

# height and aspect are optional; height was named size in seaborn before 0.9
sns.pairplot(data, x_vars=['feature_col_1', 'feature_col_2'], y_vars='target_col', height=7, aspect=0.7)

Linear regression using scikit learn

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

linreg.fit(X_train, y_train)

print(linreg.intercept_)
print(linreg.coef_)
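A quick sanity check of what intercept_ and coef_ hold, sketched on synthetic data. The generating formula (intercept 3, coefficients 2 and -1) is an assumption made up for this example; because the data is noise-free, the fitted values should recover it almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3 + 2*x1 - 1*x2 (made up for illustration)
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(50, 2))
y_train = 3.0 + 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1]

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print(linreg.intercept_)  # estimate of the constant term
print(linreg.coef_)       # one coefficient per feature, in column order
```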

Evaluation metrics for linear regression

  • mean absolute error (MAE)
  • mean squared error (MSE)
  • root mean squared error (RMSE)

RMSE is the most popular of the three because it "punishes" larger errors (like MSE) and is also interpretable in the units of y.
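The three metrics can be sketched on a handful of made-up values (the true and pred lists below are hypothetical). RMSE is taken here as the square root of MSE using NumPy.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

# MAE: mean of the absolute errors (10 + 0 + 20 + 10) / 4
mae = metrics.mean_absolute_error(true, pred)

# MSE: mean of the squared errors (100 + 0 + 400 + 100) / 4
mse = metrics.mean_squared_error(true, pred)

# RMSE: square root of MSE, back in the units of y
rmse = np.sqrt(mse)

print(mae, mse, rmse)
```

The single error of 20 contributes 400 of the 600 total squared error, which is why MSE and RMSE are said to punish larger errors.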

Cross-validation for parameter tuning, model selection, and feature selection

  • split dataset into K equal partitions (folds)
  • use fold 1 as testing set and the union of the other folds as the training set
  • calculate testing accuracy
  • repeat steps 2 and 3 K times, using a different fold as the testing set each time
  • use the average testing accuracy as the estimate of out-of-sample accuracy

Usually, K is around 10.
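The steps above can be sketched by hand with KFold. This is an illustration under assumptions: the iris dataset, 5 folds, shuffling, and a KNN model are all choices made for the example. Note that cross_val_score uses stratified folds for classifiers by default, so its numbers may differ slightly from this plain K-fold loop.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a convenient built-in dataset
X, y = load_iris(return_X_y=True)

# Step 1: split the dataset into K partitions (here K = 5)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Step 2: one fold is the testing set, the rest is the training set
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # Step 3: testing accuracy on the held-out fold
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))

# Final step: average testing accuracy estimates out-of-sample accuracy
cv_estimate = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_estimate)
```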

Cross-validation example

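from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)

scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')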
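For parameter tuning, the same call can be repeated for each candidate value of n_neighbors, keeping the mean score of each. A sketch, assuming the iris dataset and a candidate range of 1 to 30:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a convenient built-in dataset
X, y = load_iris(return_X_y=True)

k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validation returns one accuracy per fold
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# Pick the first K with the highest mean cross-validated accuracy
best_k = k_range[k_scores.index(max(k_scores))]
print(best_k, max(k_scores))
```

Writing this loop by hand is what GridSearchCV automates.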

Using grid search for more efficient parameter tuning

from sklearn.model_selection import GridSearchCV
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

=> this creates a grid-search object that, once fit is called, cross-validates the KNN model for every value in the parameter grid

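import matplotlib.pyplot as plt
grid.fit(X, y)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
grid_mean_scores = grid.cv_results_['mean_test_score']
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated accuracy')

grid.best_score_
grid.best_params_
grid.best_estimator_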
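Putting the grid-search pieces together, here is an end-to-end sketch without the plot. The iris dataset is an assumption made so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a convenient built-in dataset
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier()
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

# 10-fold cross-validation for each of the 30 candidate values
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)

# One mean cross-validated accuracy per candidate, in grid order
mean_scores = grid.cv_results_['mean_test_score']

print(grid.best_score_)
print(grid.best_params_)
```

By default GridSearchCV refits the best model on all of the data, so grid.predict can be used directly afterwards.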

Searching multiple parameters simultaneously

k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']  # how the weights are assigned to the neighbors

param_grid=dict(n_neighbors=k_range, weights=weight_options)

grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

Note: GridSearchCV can be computationally expensive. Consider using RandomizedSearchCV, which evaluates only a fixed number of randomly sampled parameter combinations.
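A sketch of the randomized alternative on the same two parameters. The iris dataset, n_iter=10, and random_state=5 are assumptions chosen for illustration; n_iter controls how many of the 60 possible combinations are tried.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a convenient built-in dataset
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier()
param_dist = dict(
    n_neighbors=list(range(1, 31)),
    weights=['uniform', 'distance'],
)

# Sample 10 combinations at random instead of trying all 60
rand = RandomizedSearchCV(
    knn, param_dist, cv=10, scoring='accuracy',
    n_iter=10, random_state=5)
rand.fit(X, y)

tried = rand.cv_results_['params']
print(rand.best_score_)
print(rand.best_params_)
```

The trade-off is that the best combination may not be among those sampled, in exchange for a fixed computational budget.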

Evaluating a classification model

Null accuracy for a binary classification task (response values: 0s and 1s), i.e. the accuracy achieved by always predicting the most frequent class

acc = max(y_test.mean(), 1 - y_test.mean())
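Why this formula works: with 0/1 labels, the mean is the share of 1s, so the larger of the two shares is the accuracy of always predicting the majority class. A small sketch with made-up labels:

```python
import numpy as np

# Hypothetical test labels: two 1s and six 0s
y_test = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# y_test.mean() is the share of 1s, 1 - y_test.mean() the share of 0s
null_accuracy = max(y_test.mean(), 1 - y_test.mean())

# Cross-check: accuracy of a "model" that always predicts the majority class (0)
always_zero = np.zeros_like(y_test)
check = (always_zero == y_test).mean()

print(null_accuracy, check)
```

A classifier is only useful if its accuracy clearly beats this baseline.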

Confusion matrix

confusion = metrics.confusion_matrix(y_test, y_pred_class)
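A sketch of how to read the matrix for a binary problem, using made-up labels. With labels 0 and 1, scikit-learn puts true classes in rows and predicted classes in columns, so the layout is [[TN, FP], [FN, TP]].

```python
from sklearn import metrics

# Hypothetical true and predicted classes
y_test = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred_class = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

confusion = metrics.confusion_matrix(y_test, y_pred_class)

# Rows are true classes, columns are predicted classes
TN, FP, FN, TP = confusion.ravel()

sensitivity = TP / (TP + FN)  # true positive rate (recall)
specificity = TN / (TN + FP)  # true negative rate

print(confusion)
print(sensitivity, specificity)
```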

Adjusting the classification threshold

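Get the predicted probabilities instead of classes from the model (e.g. with predict_proba), then apply a custom threshold:

from sklearn.preprocessing import binarize
# binarize expects a 2D input, hence the wrapping list and the [0]
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]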
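A sketch of the effect of lowering the threshold, on made-up probabilities. binarize maps values strictly greater than the threshold to 1 and everything else to 0.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical predicted probabilities of class 1
y_pred_prob = np.array([0.10, 0.35, 0.45, 0.80, 0.25])

# Default behaviour of most classifiers: predict 1 above 0.5
default_class = binarize([y_pred_prob], threshold=0.5)[0]

# Lower threshold: more observations are predicted as class 1
lowered_class = binarize([y_pred_prob], threshold=0.3)[0]

# Equivalent NumPy expression for the lowered threshold
manual_class = (y_pred_prob > 0.3).astype(float)

print(default_class)
print(lowered_class)
```

Lowering the threshold increases sensitivity and decreases specificity, and raising it does the opposite.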

ROC Curves and Area under the curve (AUC)

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)

plt.plot(fpr, tpr)

fpr = false positive rate (x-axis)
tpr = true positive rate (y-axis)

A higher area under the ROC curve (AUC) indicates better overall classifier performance, so AUC can be used as an alternative to classification accuracy. A classifier with a very large AUC can reach both high sensitivity (true positive rate, also called recall) and high specificity (true negative rate) at some threshold.

from sklearn.model_selection import cross_val_score
# logreg is assumed to be a classifier instance such as LogisticRegression()
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
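A minimal sketch of roc_curve and roc_auc_score on four made-up observations (the labels and probabilities are hypothetical). Each threshold returned by roc_curve corresponds to one (fpr, tpr) point on the curve.

```python
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of class 1
y_test = [0, 0, 1, 1]
y_pred_prob = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) pair per candidate threshold
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)

# AUC: the fraction of (positive, negative) pairs ranked correctly.
# Here 3 of the 4 pairs are ordered correctly (0.35 < 0.4 is the miss).
auc = metrics.roc_auc_score(y_test, y_pred_prob)

print(fpr)
print(tpr)
print(auc)
```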
