@rhiever
Last active October 13, 2015 18:23
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# scikit-learn feature order bug (?)\n",
"\n",
"I ran into a possible bug in scikit-learn today: the order of the features that I pass to a tree-based classifier affects the classifier's performance. I've created a minimal working example below. As far as I can tell, it only affects decision tree and random forest classifiers. I've also tested an SVM and a logistic regression classifier below to verify this.\n",
"\n",
"First, simulate a data set with two features and a class label, then divide it into training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>class</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.374540</td>\n",
" <td>0.185133</td>\n",
" <td>1</td>\n",
" <td>training</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.950714</td>\n",
" <td>0.541901</td>\n",
" <td>0</td>\n",
" <td>testing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.731994</td>\n",
" <td>0.872946</td>\n",
" <td>0</td>\n",
" <td>training</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.598658</td>\n",
" <td>0.732225</td>\n",
" <td>0</td>\n",
" <td>training</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.156019</td>\n",
" <td>0.806561</td>\n",
" <td>1</td>\n",
" <td>training</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"          A         B  class     group\n",
"0  0.374540  0.185133      1  training\n",
"1  0.950714  0.541901      0   testing\n",
"2  0.731994  0.872946      0  training\n",
"3  0.598658  0.732225      0  training\n",
"4  0.156019  0.806561      1  training"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.cross_validation import StratifiedShuffleSplit\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"np.random.seed(42)\n",
"\n",
"test_df = pd.DataFrame({'A': np.random.random(1000),\n",
"                        'B': np.random.random(1000),\n",
"                        'class': np.random.randint(0, 2, 1000)})\n",
"\n",
"training_indices, testing_indices = next(iter(StratifiedShuffleSplit(test_df['class'].values,\n",
"                                                                     n_iter=1,\n",
"                                                                     train_size=0.75,\n",
"                                                                     test_size=0.25)))\n",
"\n",
"test_df.loc[training_indices, 'group'] = 'training'\n",
"test_df.loc[testing_indices, 'group'] = 'testing'\n",
"\n",
"test_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random forests\n",
"\n",
"Now fit a random forest classifier to the training data with a fixed random state. Note that here I pass the features ordered column 'A' then column 'B'. The printed values are the accuracy on the testing set. I repeat this procedure 10 times to confirm that the results are reproducible with the same feature order."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n",
"0.536\n"
]
}
],
"source": [
"for repeat in range(10):\n",
"    rfc = RandomForestClassifier(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    rfc.fit(training_features, training_classes)\n",
"\n",
"    print(rfc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now fit a random forest classifier with the same random state, but this time pass the features ordered column 'B' then column 'A'. Again, I repeat the procedure 10 times to confirm that the results are reproducible with the same feature order."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n",
"0.532\n"
]
}
],
"source": [
"for repeat in range(10):\n",
"    rfc = RandomForestClassifier(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    rfc.fit(training_features, training_classes)\n",
"\n",
"    print(rfc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See how the classifier performance differs (0.536 vs. 0.532) when all I changed was the order of the features? Why does the order of the features affect classification performance?\n",
"\n",
"# Decision tree classifiers"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n",
"0.476\n"
]
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"for repeat in range(10):\n",
"    dtc = DecisionTreeClassifier(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    dtc.fit(training_features, training_classes)\n",
"\n",
"    print(dtc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n"
]
}
],
"source": [
"for repeat in range(10):\n",
"    dtc = DecisionTreeClassifier(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    dtc.fit(training_features, training_classes)\n",
"\n",
"    print(dtc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SVM"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n"
]
}
],
"source": [
"from sklearn.svm import SVC\n",
"\n",
"for repeat in range(10):\n",
"    svc = SVC(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    svc.fit(training_features, training_classes)\n",
"\n",
"    print(svc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n"
]
}
],
"source": [
"for repeat in range(10):\n",
"    svc = SVC(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    svc.fit(training_features, training_classes)\n",
"\n",
"    print(svc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Logistic regression"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"for repeat in range(10):\n",
"    lrc = LogisticRegression(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    lrc.fit(training_features, training_classes)\n",
"\n",
"    print(lrc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n",
"0.512\n"
]
}
],
"source": [
"for repeat in range(10):\n",
"    lrc = LogisticRegression(random_state=42)\n",
"\n",
"    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
"    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
"\n",
"    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
"    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
"\n",
"    lrc.fit(training_features, training_classes)\n",
"\n",
"    print(lrc.score(testing_features, testing_classes))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
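
One way to put the tree-based scores printed in the notebook in context is to convert them back into counts of correct test predictions. The sketch below is not part of the original notebook. It assumes the test set holds 250 rows (25% of the 1000 simulated rows) and treats each test label as an independent fair coin flip, which approximates how the `class` column is generated. The names `N_TEST`, `scores`, `chance_std`, and `correct_count` are introduced here purely for illustration.

```python
import math

# Taken from the notebook: 1000 simulated rows, 25% held out for testing.
N_TEST = 250

# Test accuracies printed in the notebook, keyed by (classifier, feature order).
scores = {
    ("random forest", "A,B"): 0.536,
    ("random forest", "B,A"): 0.532,
    ("decision tree", "A,B"): 0.476,
    ("decision tree", "B,A"): 0.512,
}

# Assumption: every test label is an independent fair coin flip, so the number
# of correct predictions is roughly Binomial(N_TEST, 0.5) for any classifier.
# This is the standard deviation of the resulting accuracy.
chance_std = math.sqrt(0.5 * 0.5 / N_TEST)


def correct_count(accuracy, n_test=N_TEST):
    """Convert an accuracy back into a whole number of correct predictions."""
    return round(accuracy * n_test)


for (name, order), acc in scores.items():
    # Distance from chance-level accuracy (0.5) in units of chance_std.
    z = (acc - 0.5) / chance_std
    print(f"{name:13s} {order}: {correct_count(acc)} correct, {z:+.2f} std from chance")
```

Under these assumptions, the two random forest scores differ by a single test sample (134 vs. 133 correct) and the two decision tree scores by nine (119 vs. 128), while the chance-level standard deviation is about 0.032, or roughly 8 samples. This does not explain why feature order changes the fitted trees. It only indicates how small the observed differences are relative to chance on this data set.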