Feature selection using SelectFromModel
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example we will see how to **select features** through **model-based selection**.\n",
"\n",
"As a toy example, we will use data from 'Titanic: Machine Learning for Disaster', one of the most popular Kaggle competitions. However, we will not use the original data set. We will use a modified data set, which results from a [Kaggle kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I did. \n",
"\n",
"[Here](https://github.com/pmarcelino/blog/data/titanic_modified.csv) you can access to the data set used in this exercise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load the data set."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Fare</th>\n",
" <th>FamilySize</th>\n",
" <th>Imputed</th>\n",
" <th>Pclass_2</th>\n",
" <th>Pclass_3</th>\n",
" <th>Sex_male</th>\n",
" <th>Age_Child</th>\n",
" <th>Age_Elder</th>\n",
" <th>Embarked_Q</th>\n",
" <th>Embarked_S</th>\n",
" <th>Title_Miss</th>\n",
" <th>Title_Mr</th>\n",
" <th>Title_Mrs</th>\n",
" <th>Title_Other</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>71.2833</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>7.9250</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>53.1000</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Fare FamilySize Imputed Pclass_2 Pclass_3 Sex_male \\\n",
"0 0 7.2500 1 0 0 1 1 \n",
"1 1 71.2833 1 0 0 0 0 \n",
"2 1 7.9250 0 0 0 1 0 \n",
"3 1 53.1000 1 0 0 0 0 \n",
"4 0 8.0500 0 0 0 1 1 \n",
"\n",
" Age_Child Age_Elder Embarked_Q Embarked_S Title_Miss Title_Mr \\\n",
"0 0 0 0 1 0 1 \n",
"1 0 0 0 0 0 0 \n",
"2 0 0 0 1 1 0 \n",
"3 0 0 0 1 0 0 \n",
"4 0 0 0 1 0 1 \n",
"\n",
" Title_Mrs Title_Other \n",
"0 0 0 \n",
"1 1 0 \n",
"2 0 0 \n",
"3 1 0 \n",
"4 0 0 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('data/titanic_modified.csv', index_col=0)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As I mentioned, this data set results from a [kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I already did for the Titanic competition. I'll summarize each of the features to give you some context:\n",
"\n",
"* **Survived**. Target variable. It's 1 if the passenger survived and 0 if it didn't.\n",
"* **Fare**. Passenger fare. It keeps the same properties as the original feature.\n",
"* **FamilySize**. It's the sum of the original features SipSp and Parch. SipSp refers to the # of siblings / spouses aboard the Titanic. Parch refers to the # of parents / children aboard the Titanic.\n",
"* **Imputed**. Identifies instances where some missing data imputation was made.\n",
"* **Pclass**. Ticket class. The feature was [encoded](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), so we just have the features corresponding to the second and third class. The one corresponding to the first class was deleted to avoid the [dummy trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html).\n",
"* **Sex**. It was also encoded. It's 1 if the instance corresponds to a male, and 0 if it corresponds to a female.\n",
"* **Age**. Encoded to the following classes: Children, Adult, and Elder. It's an Adult if Age_Child = 0 and Age_Elder = 0.\n",
"* **Embarked**. Port of embarkation. Originally, we had three possible ports: C = Cherbourg, Q = Queenstown, S = Southampton.\n",
"* **Title**. Results from the name of the passenger. Guess what? It's also an encoded feature."
]
},
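{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the encoding described above, here is a minimal sketch of how a categorical feature can be one-hot encoded while dropping the first category to avoid the dummy trap. This is a toy illustration, not the exact code from the original kernel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: one-hot encode a toy 'Pclass' column,\n",
"# dropping the first category (class 1) to avoid the dummy variable trap\n",
"toy = pd.DataFrame({'Pclass': [1, 2, 3, 1, 3]})\n",
"pd.get_dummies(toy, columns=['Pclass'], drop_first=True)"
]
},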
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train and test data sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In feature selection, our goal is to distinguish features that are useful for prediction from features that just add noise to the prediction model.\n",
"\n",
"To test the model's performance on unseen data, we need a train and a test data set."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Create train and test set\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df.drop('Survived', axis=1) # Keep all features except 'Survived'\n",
"y = df['Survived'] # Just keep 'Survived'\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The solution we propose uses scikit-learn. You can read its documentation to know more about [feature selection](http://scikit-learn.org/stable/modules/feature_selection.html). \n",
"\n",
"Basically, you just have to use **SelectFromModel**. When you use this meta-transformer, you specify which **model** you want to use (e.g. Random Forests) and the **threshold** value to use for feature selection. This threshold value defines which features should be kept: features whose value is above the threshold are kept, features whose value is below the threshold are discarded. You can read more about SelectFromModel [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel).\n",
"\n",
"In this example, I'll show you how to do feature selection using **Random Forests** as the base model. You can use other models. The logic is the same. I usually go for Random Forests because it's a flexible model, which works well with different types of features and in different types of problems.\n",
"\n",
"Since the Titanic is a classification problem, we will use **RandomForestClassifier** as the base model."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(712, 14)\n",
"(712, 7)\n"
]
}
],
"source": [
"from sklearn.feature_selection import SelectFromModel\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=7), threshold='median')\n",
"select.fit(X_train, y_train)\n",
"X_train_selected = select.transform(X_train)\n",
"print(X_train.shape)\n",
"print(X_train_selected.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some notes:\n",
"* We reduced the number of features from 14 to 7.\n",
"* We used 'median' as the threshold value, which is the reason why we have 7 features. You can use other thresholds, like the 'mean' or even the 'mean' times a scaling factor.\n",
"* We used n_estimators = 100 for illustrative purposes. You can use a different number."
]
},
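{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the alternative thresholds mentioned above, SelectFromModel also accepts 'mean' and scaled strings such as '1.25*mean'. The number of selected features will differ from the 'median' run; this cell only illustrates the syntax."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the same selector with a scaled 'mean' threshold\n",
"select_mean = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=7),\n",
"                              threshold='1.25*mean')\n",
"select_mean.fit(X_train, y_train)\n",
"print(select_mean.transform(X_train).shape)"
]
},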
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model's performance"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CV accuracy: 0.807 +/- 0.052\n",
"[ 0 1 4 5 10 11 12]\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.model_selection import StratifiedKFold\n",
"\n",
"# Assess model performance\n",
"lr = LogisticRegression()\n",
"lr.fit(X_train_selected, y_train)\n",
"strat_kfold = StratifiedKFold(10, random_state=7)\n",
"score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=10)\n",
"print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
"\n",
"# Get features\n",
"print(select.get_support(indices=True))"
]
},
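{
"cell_type": "markdown",
"metadata": {},
"source": [
"The printed indices are positions in the columns of X_train. Here's a small sketch that maps them back to feature names, assuming X_train is the DataFrame created above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: translate the selected-feature indices into column names\n",
"selected_mask = select.get_support()\n",
"print(X_train.columns[selected_mask].tolist())"
]
},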
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we compare the results with the example we used in [univariate feature selection](), we can see that model's performance is lower with this model-based solution.\n",
"\n",
"However, we must be aware that in this case, we define the 'median' as the threshold value. Accordingly, we are restricting the model to half of its original features. \n",
"\n",
"If we compare the results achieved by this model, which has 7 features, with the models with 7 features resulting from the univariate feature selection, we will see that the model-based approach is as good or better than the univariate feature selection.\n",
"\n",
"Feature selection can then be improved using a different base model or considering a different threshold value."
]
},
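{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we created a test set at the beginning, a natural final check is to apply the fitted selector to the held-out data and score the model there. A minimal sketch (the exact score depends on the split):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: evaluate the fitted model on the held-out test set\n",
"X_test_selected = select.transform(X_test)\n",
"print('Test accuracy: %.3f' % lr.score(X_test_selected, y_test))"
]
},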
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Complete solution"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(712, 14)\n",
"(712, 7)\n",
"CV accuracy: 0.807 +/- 0.052\n",
"[ 0 1 4 5 10 11 12]\n"
]
}
],
"source": [
"from sklearn.feature_selection import SelectFromModel\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.model_selection import StratifiedKFold\n",
"\n",
"# Select features\n",
"select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=7), threshold='median')\n",
"select.fit(X_train, y_train)\n",
"X_train_selected = select.transform(X_train)\n",
"print(X_train.shape)\n",
"print(X_train_selected.shape)\n",
"\n",
"\n",
"# Assess model\n",
"lr = LogisticRegression()\n",
"lr.fit(X_train_selected, y_train)\n",
"strat_kfold = StratifiedKFold(10, random_state=7)\n",
"score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=10)\n",
"print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
"\n",
"# Get features\n",
"print(select.get_support(indices=True))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}