Feature Engineering

The real deal is that nobody explicitly tells you what feature engineering is; in some way, you are expected to figure out for yourself what good features are.

Feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.

(Scott Locklin, in “Neglected machine learning ideas”)

Let's try to figure out what feature engineering is.

When solving a predictive modeling problem, our goal is to get the best possible result from a model. In order to achieve that, we need to extract useful information and get the most from what we have. On one side, this means getting the best possible result from the algorithms we are employing. On the other side, it also involves getting the most out of the available data.

How do we get the most out of our data for predictive modeling?

Feature engineering tries to find an answer to this question.

Actually, the success of all Machine Learning algorithms depends on how you present the data.

(Mohammad Pezeshki, answer to “What are some general tips on feature selection and engineering that every data scientist should know?")

Feature Importance

Feature importance refers to a family of techniques that assign a score to input features based on how useful they are at predicting a target variable. These scores play an important role in predictive modeling: they usually provide useful insights into the dataset and form the basis for dimensionality reduction and feature selection.

Feature importance scores can be calculated both for regression and classification problems.

These scores can be used in a range of situations, such as:

  • Better understanding the data: the relative scores can highlight which features may be most relevant to the target and, conversely, which are least relevant. This could be a useful notion for a domain expert and could be used as a basis for gathering more or different data.

  • Better understanding a model: inspecting the importance scores provides insight into the specific model we're using and into which features matter most to the model when making a prediction.

  • Reducing the number of input features: we can use the importance scores to select the features to delete (lowest scores) and the features to keep (highest scores).

Now let's jot down a few lines of code in order to grasp this topic in a better way. To explore feature importance scores, we'll import a few test datasets directly from sklearn.

Classification Dataset

Easy peasy, we can use the make_classification() function to create a test binary classification dataset.

We can specify the number of samples and the number of features; some of them are going to be informative and the remaining ones redundant. (Tip: fix the random seed, so you'll get a reproducible result.)

https://gist.github.com/560d2e3ea48f16619652327bb0c6607a

(1000, 8) (1000,)
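
For reference, here's a minimal sketch of what that snippet presumably looks like. The sample and feature counts come from the output above; the informative/redundant split and the seed are assumptions, so they may differ from the embedded gist.

```python
# Sketch: synthetic binary classification dataset (split/seed are guesses)
from sklearn.datasets import make_classification

# 1000 samples, 8 features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)
print(X.shape, y.shape)  # expected: (1000, 8) (1000,)
```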

Regression Dataset

In a parallel fashion, we'll use the make_regression() function to create a regression dataset.

https://gist.github.com/206de6f4daf9aef5332ed089bd93db5d

(1000, 8) (1000,)
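
Along the same lines, a minimal sketch of the regression dataset. Judging by the coefficient scores later in the post, only a few of the eight features carry signal, so n_informative=3 is a guess, and so is the seed.

```python
# Sketch: synthetic regression dataset (n_informative and seed are guesses)
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       random_state=1)
print(X.shape, y.shape)  # expected: (1000, 8) (1000,)
```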

Coefficients as Feature Importance

When we think about linear machine learning algorithms, we always fit a model where the prediction is a weighted sum of the input values (e.g. linear regression, logistic regression, ridge regression, etc.).

These coefficients can be used directly as naive feature importance scores. Firstly we'll fit a model on the dataset to find the coefficients, then summarize the importance scores for each input feature and create a bar chart to get an idea of the relative importance.

Linear Regression Feature Importance

It's time to fit a LinearRegression() model on the regression dataset and retrieve the coef_ property that contains the coefficients. The only assumption is that the input variables have the same scale or have been scaled prior to fitting the model.

This same approach can be used with regularized linear models, such as Ridge and ElasticNet.

https://gist.github.com/62a8ba3a25de7fbed39ef071fa582004

Feature: 0, Score: -0.00000
Feature: 1, Score: 41.28219
Feature: 2, Score: 0.00000
Feature: 3, Score: 41.81266
Feature: 4, Score: 45.15258
Feature: 5, Score: 0.00000
Feature: 6, Score: 0.00000
Feature: 7, Score: 0.00000

(bar chart of the feature importance scores)
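
The embedded snippet likely does something along these lines: a minimal sketch assuming the regression dataset defined above (the exact scores depend on the dataset settings and seed).

```python
# Sketch: linear regression coefficients as naive importance scores
from matplotlib import pyplot
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       random_state=1)  # assumed settings, see above

model = LinearRegression()
model.fit(X, y)

# the fitted coefficients double as importance scores
importance = model.coef_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```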

Logistic Regression Feature Importance

In a similar fashion, we can fit a LogisticRegression() model.

https://gist.github.com/afbe83cb846925e4585c886a1cf5a7bf

Feature: 0, Score: -1.08328
Feature: 1, Score: 0.35669
Feature: 2, Score: -0.13472
Feature: 3, Score: 0.58331
Feature: 4, Score: -0.40560
Feature: 5, Score: -0.38912
Feature: 6, Score: 0.31175
Feature: 7, Score: -0.88263

(bar chart of the feature importance scores)
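
A minimal sketch for the classification case, with the same dataset assumptions as above. Note that for a binary problem coef_ has shape (1, n_features), hence the [0].

```python
# Sketch: logistic regression coefficients as importance scores
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings

model = LogisticRegression()
model.fit(X, y)

# coef_ has shape (1, n_features) for binary classification
importance = model.coef_[0]
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```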

Recall that this is a classification problem with classes 0 and 1 (binary). Notice that the coefficients are both positive and negative: positive scores indicate a feature that predicts class 1, while negative scores indicate a feature that predicts class 0.

Why can't we analyze a regression problem with Logistic Regression? (A pretty naive question, but try to answer it anyway.)

Decision Tree Feature Importance

Decision Tree algorithms like Classification And Regression Trees (CART) offer importance scores based on the reduction in the criterion used to select split points, like Gini or Entropy. This approach can also be used for ensembles of decision trees, such as Random Forest and Gradient Boosting algorithms.

We can directly use the CART algorithm for feature importance as implemented in Scikit-Learn by the DecisionTreeRegressor and DecisionTreeClassifier classes.

The model provides a feature_importances_ property that tells us the relative importance scores for each feature.

CART Regression Feature Importance

https://gist.github.com/012124d504f6dab7b05b879d41a1da14

Feature: 0, Score: 0.00394
Feature: 1, Score: 0.27784
Feature: 2, Score: 0.00367
Feature: 3, Score: 0.33327
Feature: 4, Score: 0.37304
Feature: 5, Score: 0.00377
Feature: 6, Score: 0.00273
Feature: 7, Score: 0.00173

(bar chart of the feature importance scores)
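
A minimal sketch of the regression case, under the dataset assumptions from above (a freshly grown tree will produce slightly different scores):

```python
# Sketch: DecisionTreeRegressor feature importance
from matplotlib import pyplot
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       random_state=1)  # assumed settings

model = DecisionTreeRegressor()
model.fit(X, y)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```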

CART Classification Feature Importance

https://gist.github.com/cb82ffdc1eb7ea2f897607a45d0cc205

Feature: 0, Score: 0.61084
Feature: 1, Score: 0.06682
Feature: 2, Score: 0.00798
Feature: 3, Score: 0.08570
Feature: 4, Score: 0.07629
Feature: 5, Score: 0.03331
Feature: 6, Score: 0.01803
Feature: 7, Score: 0.10103

(bar chart of the feature importance scores)
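
And the classification counterpart, again a sketch with assumed dataset settings:

```python
# Sketch: DecisionTreeClassifier feature importance
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings

model = DecisionTreeClassifier()
model.fit(X, y)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```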

Random Forest Feature Importance

Analogously, we can use the RandomForest algorithm for feature importance implemented in scikit-learn as the RandomForestRegressor and RandomForestClassifier.

As above, the model provides a feature_importances_ property.

Random Forest Regression Feature Importance

https://gist.github.com/a41e71397561bf2676c7e2bc96e0c138

Feature: 0, Score: 0.00488
Feature: 1, Score: 0.27663
Feature: 2, Score: 0.00440
Feature: 3, Score: 0.33057
Feature: 4, Score: 0.36924
Feature: 5, Score: 0.00467
Feature: 6, Score: 0.00460
Feature: 7, Score: 0.00501

(bar chart of the feature importance scores)
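
A minimal sketch for the random forest regressor; the number of trees is left at the scikit-learn default, which may differ from the embedded gist.

```python
# Sketch: RandomForestRegressor feature importance
from matplotlib import pyplot
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       random_state=1)  # assumed settings

model = RandomForestRegressor()
model.fit(X, y)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```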

Random Forest Classification Feature Importance

https://gist.github.com/b76f046e542f96b95a95818977975ff1

Feature: 0, Score: 0.27838
Feature: 1, Score: 0.11923
Feature: 2, Score: 0.07076
Feature: 3, Score: 0.13927
Feature: 4, Score: 0.10233
Feature: 5, Score: 0.10529
Feature: 6, Score: 0.04857
Feature: 7, Score: 0.13616

(bar chart of the feature importance scores)
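
And the classification version, with the same caveats:

```python
# Sketch: RandomForestClassifier feature importance
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings

model = RandomForestClassifier()
model.fit(X, y)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```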

XGBoost Feature Importance

XGBoost is a Python library that provides an efficient implementation of the stochastic gradient boosting algorithm. (For an introduction to Boosted Trees, you can take a look here.)

This algorithm can be integrated with Scikit-Learn via the XGBRegressor and XGBClassifier classes.

Here too, the model exposes the feature_importances_ property.

First, let's install the XGBoost library with pip:

https://gist.github.com/3984b43c7cef61159b111519a74331d7

Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.18.5)

https://gist.github.com/7d982417680d75dd240ace463eaf93f0

0.90
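
In a notebook, that boils down to something like the sketch below (the leading ! runs pip from the notebook cell; your installed version will likely differ from 0.90).

```python
# Sketch: install XGBoost and check the installed version
# in a notebook cell:  !pip install xgboost
import xgboost
print(xgboost.__version__)
```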

Now, let's take a look at an example of XGBoost for feature importance.

XGBoost Regression Feature Importance

https://gist.github.com/fc29682a856d332a6dbc56ebe5009244

[21:06:15] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Feature: 0, Score: 0.00057
Feature: 1, Score: 0.25953
Feature: 2, Score: 0.00099
Feature: 3, Score: 0.38528
Feature: 4, Score: 0.34686
Feature: 5, Score: 0.00103
Feature: 6, Score: 0.00437
Feature: 7, Score: 0.00137

(bar chart of the feature importance scores)
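
A minimal sketch with XGBRegressor and the usual dataset assumptions; default hyperparameters are a guess at what the gist uses.

```python
# Sketch: XGBRegressor feature importance
from matplotlib import pyplot
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       random_state=1)  # assumed settings

model = XGBRegressor()
model.fit(X, y)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```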

XGBoost Classification Feature Importance

https://gist.github.com/f94bc851d9e8b78c2975215f94955505

Feature: 0, Score: 0.51614
Feature: 1, Score: 0.05840
Feature: 2, Score: 0.08998
Feature: 3, Score: 0.03381
Feature: 4, Score: 0.10906
Feature: 5, Score: 0.08295
Feature: 6, Score: 0.02555
Feature: 7, Score: 0.08412

(bar chart of the feature importance scores)
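
And the classification counterpart:

```python
# Sketch: XGBClassifier feature importance
from matplotlib import pyplot
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings

model = XGBClassifier()
model.fit(X, y)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```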

Permutation Feature Importance

Permutation feature importance is a technique for calculating relative importance scores that is independent of the model used. It measures the increase in the model's prediction error after we permute a feature's values, which breaks the relationship between the feature and the true outcome.

The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature is "important" if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is "unimportant" if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.

Permutation feature importance can be used via the permutation_importance() function, which takes a fitted model, a dataset and a scoring function.

Let's try this approach with an algorithm that doesn't natively provide feature importance scores: KNN (K-Nearest Neighbors).

Permutation Feature Importance for Regression

https://gist.github.com/6ac9243bd0e0a1b9e2ca1fad73eca32a

Feature: 0, Score: 39.26371
Feature: 1, Score: 1970.76947
Feature: 2, Score: 32.07404
Feature: 3, Score: 2384.33933
Feature: 4, Score: 2505.48063
Feature: 5, Score: 53.14065
Feature: 6, Score: 64.58060
Feature: 7, Score: 44.49132

(bar chart of the feature importance scores)
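
Roughly, the embedded snippet should look like this: a KNN regressor plus permutation_importance() from sklearn.inspection. The scoring choice (neg_mean_squared_error) and the dataset settings are assumptions.

```python
# Sketch: permutation importance with a KNN regressor
from matplotlib import pyplot
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       random_state=1)  # assumed settings

model = KNeighborsRegressor()
model.fit(X, y)

# shuffle each feature and measure how much the error grows
results = permutation_importance(model, X, y,
                                 scoring='neg_mean_squared_error')
importance = results.importances_mean
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```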

Permutation Feature Importance for Classification

https://gist.github.com/d2b0189a2182c5ef45afe08467eceb23

Feature: 0, Score: 0.07960
Feature: 1, Score: 0.02720
Feature: 2, Score: 0.00660
Feature: 3, Score: 0.03100
Feature: 4, Score: 0.11100
Feature: 5, Score: 0.05320
Feature: 6, Score: 0.03720
Feature: 7, Score: 0.06260

(bar chart of the feature importance scores)
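
The classification version swaps in a KNN classifier and an accuracy-based scorer (again, an assumption about what the gist uses):

```python
# Sketch: permutation importance with a KNN classifier
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings

model = KNeighborsClassifier()
model.fit(X, y)

results = permutation_importance(model, X, y, scoring='accuracy')
importance = results.importances_mean
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))

pyplot.bar(range(len(importance)), importance)
pyplot.show()
```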

Feature Selection with Importance

Feature importance scores can be used to find useful insights and interpret the data, but they can also be used directly to help rank and select the most useful features. This procedure is usually referred to as Feature Selection, and we'll look at it in more detail soon.

In our case, we can show how it is possible to spot redundant features by using the techniques shown above.

Firstly, we can split the dataset into train and test sets, train a model on the training set, make predictions on the test set and evaluate the results by employing classification accuracy. We'll use a Logistic Regression model to fit our data.

https://gist.github.com/6f94de1e648719ef01edf3ea1611b53a

Accuracy: 86.67
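
A minimal sketch of this baseline; the test split size and seeds are assumptions, so the accuracy won't match exactly.

```python
# Sketch: baseline logistic regression on all 8 features
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)

yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))
```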

In this case, we can see that our model achieved a classification accuracy of about 86.67% using all the features in the dataset.

Let's see what happens if we select only relevant features. We could use any of the feature importance scores above, but in this case we'll use the ones provided by random forest.

We can use the SelectFromModel class to define both the model and the number of features to select.

https://gist.github.com/ce5bca68324e4d20fff414391da39ba2

This will calculate the importance scores that can be used to rank all input features. We can then apply the method as a transform to select a subset of the 5 most important features from the dataset. This transform will be applied to both the training set and the test set.

https://gist.github.com/68b702dc64f29c8164e2720a87328c50

We can wrap every piece together and get this code snippet.

https://gist.github.com/070eddd88187e7ef8308feec54d1c24a

Accuracy: 86.36
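
A minimal sketch of the whole pipeline, under the same assumptions as the baseline above. Setting threshold to minus infinity makes max_features the only selection criterion; whether the original gist does exactly this is an assumption.

```python
# Sketch: select the top-5 features with a random forest, then refit
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=3, random_state=1)  # assumed settings
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)

# rank features by random forest importance and keep the 5 best;
# threshold=-inf makes max_features the only selection criterion
fs = SelectFromModel(RandomForestClassifier(), max_features=5,
                     threshold=-float('inf'))
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)

model = LogisticRegression()
model.fit(X_train_fs, y_train)

yhat = model.predict(X_test_fs)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))
```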

In this case, we can see that the model achieves roughly the same performance on the dataset, despite using only five of the eight input features.
