pmarcelino/feature_selection_univariate.ipynb

## feature_selection_univariate.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we will see how to **select features** through **univariate feature selection**. \n",
    "\n",
    "As a toy example, we will use data from 'Titanic: Machine Learning for Disaster', one of the most popular Kaggle competitions. However, we will not use the original data set. Instead, we will use a modified data set that results from a [Kaggle kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I wrote. \n",
    "\n",
    "You can access the data set used in this exercise [here](https://github.com/pmarcelino/blog/data/titanic_modified.csv)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Fare</th>\n",
       "      <th>FamilySize</th>\n",
       "      <th>Imputed</th>\n",
       "      <th>Pclass_2</th>\n",
       "      <th>Pclass_3</th>\n",
       "      <th>Sex_male</th>\n",
       "      <th>Age_Child</th>\n",
       "      <th>Age_Elder</th>\n",
       "      <th>Embarked_Q</th>\n",
       "      <th>Embarked_S</th>\n",
       "      <th>Title_Miss</th>\n",
       "      <th>Title_Mr</th>\n",
       "      <th>Title_Mrs</th>\n",
       "      <th>Title_Other</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Survived     Fare  FamilySize  Imputed  Pclass_2  Pclass_3  Sex_male  \\\n",
       "0         0   7.2500           1        0         0         1         1   \n",
       "1         1  71.2833           1        0         0         0         0   \n",
       "2         1   7.9250           0        0         0         1         0   \n",
       "3         1  53.1000           1        0         0         0         0   \n",
       "4         0   8.0500           0        0         0         1         1   \n",
       "\n",
       "   Age_Child  Age_Elder  Embarked_Q  Embarked_S  Title_Miss  Title_Mr  \\\n",
       "0          0          0           0           1           0         1   \n",
       "1          0          0           0           0           0         0   \n",
       "2          0          0           0           1           1         0   \n",
       "3          0          0           0           1           0         0   \n",
       "4          0          0           0           1           0         1   \n",
       "\n",
       "   Title_Mrs  Title_Other  \n",
       "0          0            0  \n",
       "1          1            0  \n",
       "2          0            0  \n",
       "3          1            0  \n",
       "4          0            0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('data/titanic_modified.csv', index_col=0)  # You may have to modify this \n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As I mentioned, this data set results from a [kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I wrote for the Titanic competition. I'll summarize each of the features to give you some context:\n",
    "\n",
    "* **Survived**. Target variable. It's 1 if the passenger survived and 0 if they didn't.\n",
    "* **Fare**. Passenger fare. It keeps the same properties as the original feature.\n",
    "* **FamilySize**. It's the sum of the original features SibSp and Parch. SibSp refers to the # of siblings / spouses aboard the Titanic. Parch refers to the # of parents / children aboard the Titanic.\n",
    "* **Imputed**. Identifies instances where some missing data imputation was made.\n",
    "* **Pclass**. Ticket class. The feature was [encoded](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), so we just have the features corresponding to the second and third class. The one corresponding to the first class was deleted to avoid the [dummy trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html).\n",
    "* **Sex**. It was also encoded. It's 1 if the instance corresponds to a male, and 0 if it corresponds to a female.\n",
    "* **Age**. Encoded to the following classes: Child, Adult, and Elder. It's an Adult if Age_Child = 0 and Age_Elder = 0.\n",
    "* **Embarked**. Port of embarkation. Originally, we had three possible ports: C = Cherbourg, Q = Queenstown, S = Southampton. It was also encoded, so we just have the features corresponding to Queenstown and Southampton.\n",
    "* **Title**. Results from the name of the passenger. Guess what? It's also an encoded feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Speaking of encoded features, they are a good example of why feature selection is useful. For example, when we do [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), a lot of features tend to be created: *n* features for a variable with *n* class values, or *n*-1 when one of them is dropped to avoid the dummy trap, as we did here. Accordingly, data sets with encoded features are good candidates for feature selection."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train and test data sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we will apply different feature selection methods. The usefulness of these methods depends on their capacity to improve the performance of our prediction models. \n",
    "\n",
    "To test the models' performance on unseen data, we will need to have train and test data sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create train and test set\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = df.drop('Survived', axis=1)  # Keep all features except 'Survived'\n",
    "y = df['Survived']  # Just keep 'Survived'\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Brief note on *test_size=0.20*: \n",
    "\n",
    "* It's common practice to consider 80% of your data for training and 20% for testing. However, you can test different ratios and see how it impacts the model. From my experience, you don't get much from tuning this parameter. If you want to know more about this topic, there's an interesting thread on [StackOverflow](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use scikit-learn to perform feature selection. In the [official documentation](http://scikit-learn.org/stable/modules/feature_selection.html), you have all the details about their feature selection methods. Here, I'll make it easy for you. \n",
    "\n",
    "To perform feature selection, you need:\n",
    "\n",
    "1. To define the **number of features** that you want to keep.\n",
    "1. To select the **scoring function** that will evaluate the relationship between the variables.\n",
    "\n",
    "As for the number of features, you can define it through:\n",
    "\n",
    "1. **SelectKBest**. Selects the *k* best features.\n",
    "1. **SelectPercentile**. Selects the features whose scores fall within the top percentile that you define.\n",
    "\n",
    "Regarding the scoring functions, you'll have different functions for classification and regression problems. Since we are dealing with a classification problem, I'll just list the two scoring functions that I use the most:\n",
    "\n",
    "1. **f_classif**. Based on the analysis of variance (ANOVA) F-value.\n",
    "1. **mutual_info_classif**. Based on mutual information.\n",
    "\n",
    "I won't go into the details of each scoring function, but I'll set up the system so that we can test both functions and keep the one that performs better. Brute force style.\n",
    "\n",
    "\n",
    "![](figures/meme_brute_force.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## f_classif"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To apply this feature selection method, we need to:\n",
    "\n",
    "* Decide whether we will use SelectKBest or SelectPercentile. In this case, since we have few features, I'll go for **SelectKBest** and I'll test all possible *k* values to choose the best one.\n",
    "* Select the machine learning algorithm that we want to apply. For the sake of simplicity, I'll use **logistic regression**. It runs fast and it can give us some insights about the data.\n",
    "* Choose a performance metric. Since we are solving a Kaggle competition, we will use the performance metric defined by them. In this case, the performance metric is **accuracy** (percentage of correct predictions).\n",
    "\n",
    "Now, we just need to build a loop that runs over all *k* values, applies logistic regression, and computes the cross-validated accuracy of the model. Then, we keep the *k* value that gives the highest accuracy; it tells us which features we need to keep to optimize the performance of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CV accuracy: 0.802 +/- 0.047\n",
      "CV accuracy: 0.802 +/- 0.047\n",
      "CV accuracy: 0.802 +/- 0.047\n",
      "CV accuracy: 0.802 +/- 0.047\n",
      "CV accuracy: 0.800 +/- 0.059\n",
      "CV accuracy: 0.803 +/- 0.060\n",
      "CV accuracy: 0.803 +/- 0.058\n",
      "CV accuracy: 0.812 +/- 0.051\n",
      "CV accuracy: 0.810 +/- 0.050\n",
      "CV accuracy: 0.813 +/- 0.054\n",
      "CV accuracy: 0.809 +/- 0.052\n",
      "CV accuracy: 0.834 +/- 0.051\n",
      "CV accuracy: 0.834 +/- 0.051\n",
      "CV accuracy: 0.831 +/- 0.054\n",
      "\n",
      "The optimal number of features is: 12\n",
      "The maximum score is: 0.834\n",
      "The optimal features are:  [ 0  1  2  3  4  5  6  9 10 11 12 13]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.feature_selection import f_classif\n",
    "\n",
    "lr = LogisticRegression()\n",
    "max_score = 0\n",
    "\n",
    "for i in range(1, np.shape(X_train)[1]+1):\n",
    "    # Feature selection\n",
    "    select = SelectKBest(f_classif, k=i)\n",
    "    select.fit(X_train, y_train)  # Computes the statistical relationship between the features and the output\n",
    "    X_train_selected = select.transform(X_train)  # Reduces the number of features to k\n",
    "    \n",
    "    # Model selection\n",
    "    lr.fit(X_train_selected, y_train)\n",
    "    strat_kfold = StratifiedKFold(n_splits=10)  # random_state only has an effect when shuffle=True\n",
    "    score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=strat_kfold)\n",
    "    print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
    "\n",
    "    # Optimal number of features\n",
    "    if np.mean(score) > max_score:\n",
    "        max_score = np.mean(score)\n",
    "        best_n_features = i\n",
    "        mask = select.get_support(indices=True)\n",
    "\n",
    "print('\\nThe optimal number of features is: %i' % best_n_features)\n",
    "print('The maximum score is: %.3f' % max_score)\n",
    "print('The optimal features are: ', mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Brief note about cross-validation:\n",
    "\n",
    "* When we want to estimate the performance of a model in a more stable way, we usually do [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html). It is common to use a specific cross-validation version called [k-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold). \n",
    "* When we are dealing with classification problems, I recommend using [stratified k-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold). This makes each of your *k* folds keep approximately the same proportion of class values as the full data set. In regression problems you don't have this issue."
   ]
  },
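  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat about the loop above: the features are selected on the whole training set *before* cross-validation, so every validation fold has already influenced the selection. The cell below is a minimal sketch of an alternative, not part of the original analysis: it wraps the selector and the classifier in a scikit-learn `Pipeline` and lets `GridSearchCV` search over *k*, so that the selection is refit inside each fold. It assumes `X_train`, `X_test`, `y_train`, and `y_test` from the cells above. The step names `select` and `lr`, the shuffled folds, and `max_iter=1000` are my own choices for illustration, so the scores will not match the output above exactly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.model_selection import GridSearchCV, StratifiedKFold\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Selector and classifier are fit together, fold by fold\n",
    "pipe = Pipeline([\n",
    "    ('select', SelectKBest(f_classif)),\n",
    "    ('lr', LogisticRegression(max_iter=1000)),  # max_iter is an assumption to help convergence\n",
    "])\n",
    "\n",
    "# Try every possible number of features\n",
    "param_grid = {'select__k': list(range(1, X_train.shape[1] + 1))}\n",
    "cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)\n",
    "\n",
    "search = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=cv)\n",
    "search.fit(X_train, y_train)\n",
    "\n",
    "print('Best k: %i' % search.best_params_['select__k'])\n",
    "print('Best CV accuracy: %.3f' % search.best_score_)\n",
    "print('Test accuracy: %.3f' % search.score(X_test, y_test))  # Held-out data, never seen during the search"
   ]
  },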
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## mutual_info_classif"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will follow the same logic as *f_classif*. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CV accuracy: 0.802 +/- 0.047\n",
      "CV accuracy: 0.802 +/- 0.047\n",
      "CV accuracy: 0.799 +/- 0.047\n",
      "CV accuracy: 0.800 +/- 0.048\n",
      "CV accuracy: 0.800 +/- 0.048\n",
      "CV accuracy: 0.800 +/- 0.048\n",
      "CV accuracy: 0.807 +/- 0.052\n",
      "CV accuracy: 0.836 +/- 0.052\n",
      "CV accuracy: 0.809 +/- 0.047\n",
      "CV accuracy: 0.827 +/- 0.052\n",
      "CV accuracy: 0.836 +/- 0.046\n",
      "CV accuracy: 0.831 +/- 0.053\n",
      "CV accuracy: 0.830 +/- 0.052\n",
      "CV accuracy: 0.831 +/- 0.054\n",
      "\n",
      "The optimal number of features is: 8\n",
      "The maximum score is: 0.836\n",
      "The optimal features are:  [ 0  1  4  5 10 11 12 13]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.feature_selection import mutual_info_classif\n",
    "\n",
    "lr = LogisticRegression()\n",
    "max_score = 0\n",
    "\n",
    "for i in range(1, np.shape(X_train)[1]+1):\n",
    "    # Feature selection\n",
    "    select = SelectKBest(mutual_info_classif, k=i)\n",
    "    select.fit(X_train, y_train)  # Computes the statistical relationship between the features and the output\n",
    "    X_train_selected = select.transform(X_train)  # Reduces the number of features to k\n",
    "    \n",
    "    # Model selection\n",
    "    lr.fit(X_train_selected, y_train)\n",
    "    strat_kfold = StratifiedKFold(n_splits=10)  # random_state only has an effect when shuffle=True\n",
    "    score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=strat_kfold)\n",
    "    print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
    "\n",
    "    # Optimal number of features\n",
    "    if np.mean(score) > max_score:\n",
    "        max_score = np.mean(score)\n",
    "        best_n_features = i\n",
    "        mask = select.get_support(indices=True)\n",
    "\n",
    "print('\\nThe optimal number of features is: %i' % best_n_features)\n",
    "print('The maximum score is: %.3f' % max_score)\n",
    "print('The optimal features are: ', mask)"
   ]
  },
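  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The selected features are reported as column indices, which are hard to read. As a small sketch, the cell below translates them into column names. It assumes that `mask` and `X_train` are still defined from the cells above, so it reflects whichever method was run last; `selected_names` and `excluded_names` are names introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Translate the selected column indices into column names\n",
    "selected_names = X_train.columns[mask].tolist()\n",
    "excluded_names = [name for name in X_train.columns if name not in selected_names]\n",
    "\n",
    "print('Selected features:', selected_names)\n",
    "print('Excluded features:', excluded_names)"
   ]
  },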
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two methods give us different results. *mutual_info_classif* gives us a marginally higher cross-validation accuracy (0.836 vs. 0.834) with fewer features (8 vs. 12). The accuracy gap is far smaller than the spread across folds (about 0.05), so we shouldn't read much into it; the real gain is that we get similar performance with fewer features (less computational cost and a simpler model). This suggests that *mutual_info_classif* is a reasonable choice for this problem. \n",
    "\n",
    "We can also notice that some features are excluded by both methods: the columns at indices 7 and 8, which correspond to Age_Elder and Embarked_Q. When different methods guide us to the same results, that means something. Try to use these insights to increase your knowledge about the data and the problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extra"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we finish, let's go DRY. DRY is a principle of software development which stands for 'don't repeat yourself'. This means that we should aim to reduce the repetition of code. \n",
    "\n",
    "You probably noticed that we used the same code structure to apply both methods. Here, I'll give you a code snippet that you can reuse in generic problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def select_features_univariate(X, y, scoring_function=mutual_info_classif, scoring_performance='accuracy'):\n",
    "    \"\"\"Function to select features using univariate methods\n",
    "\n",
    "    Assumptions:\n",
    "        - You want to use SelectKBest\n",
    "        - You're dealing with a classification problem\n",
    "        - You want to use f_classif or mutual_info_classif as scoring functions\n",
    "        - You believe that this can be useful :P\n",
    "\n",
    "    Args:\n",
    "        X (pd.DataFrame): Input values\n",
    "        y (pd.Series): Output value\n",
    "        scoring_function (callable): Function that scores relationships between variables (e.g. f_classif)\n",
    "        scoring_performance (str): Performance metric\n",
    "\n",
    "    Returns:\n",
    "        best_n_features: Optimal number of features\n",
    "        max_score: Maximum performance score\n",
    "        mask: Optimal set of features\n",
    "\n",
    "    \"\"\"\n",
    "    from sklearn.linear_model import LogisticRegression\n",
    "    from sklearn.model_selection import cross_val_score\n",
    "    from sklearn.model_selection import StratifiedKFold\n",
    "    from sklearn.feature_selection import SelectKBest\n",
    "    from sklearn.feature_selection import f_classif\n",
    "    from sklearn.feature_selection import mutual_info_classif\n",
    "    \n",
    "    max_score = 0\n",
    "    lr = LogisticRegression()\n",
    "\n",
    "    for i in range(1, np.shape(X)[1]+1):\n",
    "        # Feature selection\n",
    "        select = SelectKBest(scoring_function, k=i)\n",
    "        select.fit(X, y)  # Computes the statistical relationship between the features and the output\n",
    "        X_selected = select.transform(X)  # Reduces the number of features to k\n",
    "\n",
    "        # Model selection\n",
    "        lr.fit(X_selected, y)\n",
    "        strat_kfold = StratifiedKFold(n_splits=10)  # random_state only has an effect when shuffle=True\n",
    "        score = cross_val_score(lr, X_selected, y, scoring=scoring_performance, cv=strat_kfold)  # Use the y argument, not the global y_train\n",
    "        print('CV score: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
    "\n",
    "        # Optimal number of features\n",
    "        if np.mean(score) > max_score:\n",
    "            max_score = np.mean(score)\n",
    "            best_n_features = i\n",
    "            mask = select.get_support(indices=True)\n",
    "\n",
    "    print('\\nThe optimal number of features is: %i' % best_n_features)\n",
    "    print('The maximum score is: %.3f' % max_score)\n",
    "    print('The optimal features are: ', mask)\n",
    "    \n",
    "    return (best_n_features, max_score, mask)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Feature selection"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In this example we will see how to select features through univariate feature selection. \n",
	"\n",
	"As a toy example, we will use data from 'Titanic: Machine Learning for Disaster', one of the most popular Kaggle competitions. However, we will not use the original data set. Instead, we will use a modified data set that results from a [Kaggle kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I wrote. \n",
	"\n",
	"You can access the data set used in this exercise [here](https://github.com/pmarcelino/blog/data/titanic_modified.csv)."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"---"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import pandas as pd\n",
	"import numpy as np\n",
	"import matplotlib.pyplot as plt\n",
	"\n",
	"%matplotlib inline"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Load data"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's load the data set."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style>\n",
	" .dataframe thead tr:only-child th {\n",
	" text-align: right;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: left;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>Survived</th>\n",
	" <th>Fare</th>\n",
	" <th>FamilySize</th>\n",
	" <th>Imputed</th>\n",
	" <th>Pclass_2</th>\n",
	" <th>Pclass_3</th>\n",
	" <th>Sex_male</th>\n",
	" <th>Age_Child</th>\n",
	" <th>Age_Elder</th>\n",
	" <th>Embarked_Q</th>\n",
	" <th>Embarked_S</th>\n",
	" <th>Title_Miss</th>\n",
	" <th>Title_Mr</th>\n",
	" <th>Title_Mrs</th>\n",
	" <th>Title_Other</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>0</td>\n",
	" <td>7.2500</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>1</td>\n",
	" <td>71.2833</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>1</td>\n",
	" <td>7.9250</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>1</td>\n",
	" <td>53.1000</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>0</td>\n",
	" <td>8.0500</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" Survived Fare FamilySize Imputed Pclass_2 Pclass_3 Sex_male \\\n",
	"0 0 7.2500 1 0 0 1 1 \n",
	"1 1 71.2833 1 0 0 0 0 \n",
	"2 1 7.9250 0 0 0 1 0 \n",
	"3 1 53.1000 1 0 0 0 0 \n",
	"4 0 8.0500 0 0 0 1 1 \n",
	"\n",
	" Age_Child Age_Elder Embarked_Q Embarked_S Title_Miss Title_Mr \\\n",
	"0 0 0 0 1 0 1 \n",
	"1 0 0 0 0 0 0 \n",
	"2 0 0 0 1 1 0 \n",
	"3 0 0 0 1 0 0 \n",
	"4 0 0 0 1 0 1 \n",
	"\n",
	" Title_Mrs Title_Other \n",
	"0 0 0 \n",
	"1 1 0 \n",
	"2 0 0 \n",
	"3 1 0 \n",
	"4 0 0 "
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"df = pd.read_csv('data/titanic_modified.csv', index_col=0) # You may have to modify this \n",
	"df.head()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As I mentioned, this data set results from a [kernel](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) that I wrote for the Titanic competition. I'll summarize each of the features to give you some context:\n",
	"\n",
	"* Survived. Target variable. It's 1 if the passenger survived and 0 if they didn't.\n",
	"* Fare. Passenger fare. It keeps the same properties as the original feature.\n",
	"* FamilySize. It's the sum of the original features SibSp and Parch. SibSp refers to the # of siblings / spouses aboard the Titanic. Parch refers to the # of parents / children aboard the Titanic.\n",
	"* Imputed. Identifies instances where some missing data imputation was made.\n",
	"* Pclass. Ticket class. The feature was [encoded](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), so we just have the features corresponding to the second and third class. The one corresponding to the first class was deleted to avoid the [dummy trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html).\n",
	"* Sex. It was also encoded. It's 1 if the instance corresponds to a male, and 0 if it corresponds to a female.\n",
	"* Age. Encoded to the following classes: Child, Adult, and Elder. It's an Adult if Age_Child = 0 and Age_Elder = 0.\n",
	"* Embarked. Port of embarkation. Originally, we had three possible ports: C = Cherbourg, Q = Queenstown, S = Southampton. It was also encoded, so we just have the features corresponding to Queenstown and Southampton.\n",
	"* Title. Results from the name of the passenger. Guess what? It's also an encoded feature."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Speaking of encoded features, they are a good example of why feature selection is useful. For example, when we do [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), a lot of features tend to be created: n features for a variable with n class values, or n-1 when one of them is dropped to avoid the dummy trap, as we did here. Accordingly, data sets with encoded features are good candidates for feature selection."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Train and test data sets"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In this example we will apply different feature selection methods. The usefulness of these methods depends on their capacity to improve the performance of our prediction models. \n",
	"\n",
	"To test the models' performance on unseen data, we will need to have train and test data sets."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Create train and test set\n",
	"from sklearn.model_selection import train_test_split\n",
	"\n",
	"X = df.drop('Survived', axis=1) # Keep all features except 'Survived'\n",
	"y = df['Survived'] # Just keep 'Survived'\n",
	"\n",
	"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Brief note on test_size=0.20: \n",
	"\n",
	"* It's common practice to consider 80% of your data for training and 20% for testing. However, you can test different ratios and see how it impacts the model. From my experience, you don't get much from tuning this parameter. If you want to know more about this topic, there's an interesting thread on [StackOverflow](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio)."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Feature selection"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We will use scikit-learn to perform feature selection. In the [official documentation](http://scikit-learn.org/stable/modules/feature_selection.html), you have all the details about their feature selection methods. Here, I'll make it easy for you. \n",
	"\n",
	"To perform feature selection, you need:\n",
	"\n",
	"1. To define the number of features that you want to keep.\n",
	"1. To select the scoring function that will evaluate the relationship between the variables.\n",
	"\n",
	"In what concerns the number of features, you can define it through:\n",
	"\n",
	"1. SelectKBest. Selects the k best features.\n",
	"1. SelectPercentile. Selects the best features into the percentile that you define.\n",
	"\n",
	"Regarding the scoring functions, you'll have different functions for classification and regression problems. Since we are dealing with a classification problem, I'll just list the two scoring functions that I use the most:\n",
	"\n",
	"1. f_classification. Based on analysis of variance (ANOVA).\n",
	"1. mutual_info_classification. Based on mutual information.\n",
	"\n",
	"I'll not go into the details on each scoring function, but I'll set up the system so that we can test both functions and keep the one that performs better. Brute force style.\n",
	"\n",
	"\n",
	"![](figures/meme_brute_force.jpg)"
	]
	},
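	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As a minimal sketch of how the two selectors behave, the cell below applies SelectKBest and SelectPercentile with f_classif to a synthetic data set. The data (`X_toy`, `y_toy`) and the values `k=3` and `percentile=30` are assumptions chosen for illustration only; they are not part of the Titanic example."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Illustrative sketch on synthetic data (not the Titanic data set)\n",
	"from sklearn.datasets import make_classification\n",
	"from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif\n",
	"\n",
	"# Hypothetical toy data: 200 samples, 10 features, 3 of them informative\n",
	"X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=3,\n",
	"                                   n_redundant=0, random_state=7)\n",
	"\n",
	"# Keep the 3 features with the highest ANOVA F-score\n",
	"k_best = SelectKBest(score_func=f_classif, k=3).fit(X_toy, y_toy)\n",
	"print('Indices kept by SelectKBest:', k_best.get_support(indices=True))\n",
	"\n",
	"# Keep the features that fall in the top 30% of scores\n",
	"top_pct = SelectPercentile(score_func=f_classif, percentile=30).fit(X_toy, y_toy)\n",
	"print('Shape after SelectPercentile:', top_pct.transform(X_toy).shape)"
	]
	},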
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## f_classification"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"To apply this feature selection method, we need to:\n",
	"\n",
	"* Define if we will use SelectKBest or SelectPercentile. In this case, since we have few features, I'll go for SelectKBest and I'll test all possible k values to choose the best one.\n",
	"* Select the machine learning algorithm that we want to apply. For the sake of simplicity, I'll use a logistic regression. It will run fast and it can provide us some insights about the data.\n",
	"* Choose a performance metric. Since we are solving a Kaggle competition, we will use the performance metric defined by them. In this case, the performance metric is accuracy (percentage of correct predictions).\n",
	"\n",
	"Now, we just need to build a function that runs for all k values, applies logistic regression, and computes the accuracy of the model. Then, we keep the k value that tell us the features we need to keep to optimize the performance of the model."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"CV accuracy: 0.802 +/- 0.047\n",
	"CV accuracy: 0.802 +/- 0.047\n",
	"CV accuracy: 0.802 +/- 0.047\n",
	"CV accuracy: 0.802 +/- 0.047\n",
	"CV accuracy: 0.800 +/- 0.059\n",
	"CV accuracy: 0.803 +/- 0.060\n",
	"CV accuracy: 0.803 +/- 0.058\n",
	"CV accuracy: 0.812 +/- 0.051\n",
	"CV accuracy: 0.810 +/- 0.050\n",
	"CV accuracy: 0.813 +/- 0.054\n",
	"CV accuracy: 0.809 +/- 0.052\n",
	"CV accuracy: 0.834 +/- 0.051\n",
	"CV accuracy: 0.834 +/- 0.051\n",
	"CV accuracy: 0.831 +/- 0.054\n",
	"\n",
	"The optimal number of features is: 12\n",
	"The maximum score is: 0.834\n",
	"The optimal features are: [ 0 1 2 3 4 5 6 9 10 11 12 13]\n"
	]
	}
	],
	"source": [
	"from sklearn.linear_model import LogisticRegression\n",
	"from sklearn.model_selection import cross_val_score\n",
	"from sklearn.model_selection import StratifiedKFold\n",
	"from sklearn.feature_selection import SelectKBest\n",
	"from sklearn.feature_selection import f_classif\n",
	"\n",
	"lr = LogisticRegression()\n",
	"max_score = 0\n",
	"\n",
	"for i in range(1, np.shape(X_train)[1]+1):\n",
	" # Feature selection\n",
	" select = SelectKBest(f_classif, k=i)\n",
	" select.fit(X_train, y_train) # Computes the statistical relationship between the features and the output\n",
	" X_train_selected = select.transform(X_train) # Reduces the number of features to k\n",
	" \n",
	" # Model selection\n",
	" lr.fit(X_train_selected, y_train)\n",
	" strat_kfold = StratifiedKFold(10, random_state=7)\n",
	" score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=strat_kfold)\n",
	" print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
	"\n",
	" # Optimal number of features\n",
	" if np.mean(score) > max_score:\n",
	" max_score = np.mean(score)\n",
	" best_n_features = i\n",
	" mask = select.get_support(indices=True)\n",
	"\n",
	"print('\\nThe optimal number of features is: %i' % best_n_features)\n",
	"print('The maximum score is: %.3f' % max_score)\n",
	"print('The optimal features are: ', mask)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Brief note about cross-validation:\n",
	"\n",
	"* When we want to estimate the performance of a model in a more stable way, we usually do [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html). It is common to use a specific cross-validation version called [k-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold). \n",
	"* When we are dealing with classification problems, I recommend you to use [stratified k-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold). This will allow your k folds to keep the same proportion of class values. In regression problems you don't have this issue."
	]
	},
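	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"One caveat about the loop above: the selector is fitted on all of X_train before cross-validation, so every validation fold has already influenced which features were kept. A possible way to avoid this, sketched below on synthetic data, is to wrap the selector and the model in a scikit-learn pipeline so that the selection is refit inside each fold. The data (`X_toy`, `y_toy`) and the values `k=4` and `max_iter=1000` are assumptions chosen for illustration only."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Illustrative sketch on synthetic data: feature selection is refit inside each CV fold\n",
	"from sklearn.datasets import make_classification\n",
	"from sklearn.feature_selection import SelectKBest, f_classif\n",
	"from sklearn.linear_model import LogisticRegression\n",
	"from sklearn.model_selection import StratifiedKFold, cross_val_score\n",
	"from sklearn.pipeline import make_pipeline\n",
	"\n",
	"# Hypothetical toy data: 300 samples, 12 features, 4 of them informative\n",
	"X_toy, y_toy = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=7)\n",
	"\n",
	"# The pipeline is cloned and fitted per fold, so the selector only sees that fold's training part\n",
	"pipe = make_pipeline(SelectKBest(f_classif, k=4), LogisticRegression(max_iter=1000))\n",
	"cv = StratifiedKFold(n_splits=10)\n",
	"scores = cross_val_score(pipe, X_toy, y_toy, scoring='accuracy', cv=cv)\n",
	"print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))"
	]
	},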
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## mutual_info_classif"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We will follow the same logic as f_classif. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"CV accuracy: 0.802 +/- 0.047\n",
	"CV accuracy: 0.802 +/- 0.047\n",
	"CV accuracy: 0.799 +/- 0.047\n",
	"CV accuracy: 0.800 +/- 0.048\n",
	"CV accuracy: 0.800 +/- 0.048\n",
	"CV accuracy: 0.800 +/- 0.048\n",
	"CV accuracy: 0.807 +/- 0.052\n",
	"CV accuracy: 0.836 +/- 0.052\n",
	"CV accuracy: 0.809 +/- 0.047\n",
	"CV accuracy: 0.827 +/- 0.052\n",
	"CV accuracy: 0.836 +/- 0.046\n",
	"CV accuracy: 0.831 +/- 0.053\n",
	"CV accuracy: 0.830 +/- 0.052\n",
	"CV accuracy: 0.831 +/- 0.054\n",
	"\n",
	"The optimal number of features is: 8\n",
	"The maximum score is: 0.836\n",
	"The optimal features are: [ 0 1 4 5 10 11 12 13]\n"
	]
	}
	],
	"source": [
	"from sklearn.linear_model import LogisticRegression\n",
	"from sklearn.model_selection import cross_val_score\n",
	"from sklearn.model_selection import StratifiedKFold\n",
	"from sklearn.feature_selection import SelectKBest\n",
	"from sklearn.feature_selection import mutual_info_classif\n",
	"\n",
	"lr = LogisticRegression()\n",
	"max_score = 0\n",
	"\n",
	"for i in range(1, np.shape(X_train)[1]+1):\n",
	" # Feature selection\n",
	" select = SelectKBest(mutual_info_classif, k=i)\n",
	" select.fit(X_train, y_train) # Computes the statistical relationship between the features and the output\n",
	" X_train_selected = select.transform(X_train) # Reduces the number of features to k\n",
	" \n",
	" # Model selection\n",
	" lr.fit(X_train_selected, y_train)\n",
	" strat_kfold = StratifiedKFold(10, random_state=7)\n",
	" score = cross_val_score(lr, X_train_selected, y_train, scoring='accuracy', cv=10)\n",
	" print('CV accuracy: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
	"\n",
	" # Optimal number of features\n",
	" if np.mean(score) > max_score:\n",
	" max_score = np.mean(score)\n",
	" best_n_features = i\n",
	" mask = select.get_support(indices=True)\n",
	"\n",
	"print('\\nThe optimal number of features is: %i' % best_n_features)\n",
	"print('The maximum score is: %.3f' % max_score)\n",
	"print('The optimal features are: ', mask)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"collapsed": true
	},
	"source": [
	"# Conclusion"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The methods give us different results. mutual_info_classif give us a higher performance score and a lower number of features. This suggests that mutual_info_classif would be a better choice for this problem. It has a higher accuracy score (better performance) and less features (less computational cost). \n",
	"\n",
	"We can also notice that some features are excluded by both methods. When different methods guide us to the same results, that means something. Try to use these insights to increase your knowledge about the data and the problem."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Extra"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Before we finish, let's go DRY. DRY is a principle of software development which stands for 'don't repeat yourself'. This means that we should aim at reducing repetition of code. \n",
	"\n",
	"You probably noticed that we used the code structure to apply both methods. Here, I'll give you a code snippet that you can reuse in generic problems."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"def select_features_univariate(X, y, scoring_function=mutual_info_classif, scoring_performance='accuracy'):\n",
	" \"\"\"Function to select features using univariate methods\n",
	"\n",
	" Assumptions:\n",
	" - You want to use SelectKBest\n",
	" - You're dealing with a classification problem\n",
	" - You want to use f_classif or mutual_info_classif as scoring functions\n",
	" - You believe that this can be useful :P\n",
	"\n",
	" Args:\n",
	" X (pd.DataFrame): Input values\n",
	" y (pd.Series): Output value\n",
	" scoring_function (str): Function that scores relationships between variables\n",
	" scoring_performance (str): Performance metric\n",
	"\n",
	" Returns:\n",
	" best_n_features: Optimal number of features\n",
	" max_score: Maximum performance score\n",
	" mask: Optimal set of features\n",
	"\n",
	" \"\"\"\n",
	" from sklearn.linear_model import LogisticRegression\n",
	" from sklearn.model_selection import cross_val_score\n",
	" from sklearn.model_selection import StratifiedKFold\n",
	" from sklearn.feature_selection import SelectKBest\n",
	" from sklearn.feature_selection import f_classif\n",
	" from sklearn.feature_selection import mutual_info_classif\n",
	" \n",
	" max_score = 0\n",
	" lr = LogisticRegression()\n",
	"\n",
	" for i in range(1, np.shape(X)[1]+1):\n",
	" # Feature selection\n",
	" select = SelectKBest(scoring_function, k=i)\n",
	" select.fit(X, y) # Computes the statistical relationship between the features and the output\n",
	" X_selected = select.transform(X) # Reduces the number of features to k\n",
	"\n",
	" # Model selection\n",
	" lr.fit(X_selected, y)\n",
	" strat_kfold = StratifiedKFold(10, random_state=7)\n",
	" score = cross_val_score(lr, X_selected, y_train, scoring=scoring_performance, cv=strat_kfold)\n",
	" print('CV score: %.3f +/- %.3f' % (np.mean(score), np.std(score)))\n",
	"\n",
	" # Optimal number of features\n",
	" if np.mean(score) > max_score:\n",
	" max_score = np.mean(score)\n",
	" best_n_features = i\n",
	" mask = select.get_support(indices=True)\n",
	"\n",
	" print('\\nThe optimal number of features is: %i' % best_n_features)\n",
	" print('The maximum score is: %.3f' % max_score)\n",
	" print('The optimal features are: ', mask)\n",
	" \n",
	" return (best_n_features, max_score, mask)"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}