
2019.10.07

A new theoretical perspective on an old measure of feature importance: deriving confidence intervals for permutation importance

Luke Merrick

[Image: odd one out — one wooden chess piece is a different color from the others.]

Introduction

In this post, we explain how a new theoretical perspective on the popular permutation feature importance technique allows us to quantify its uncertainty with confidence intervals and avoid potential pitfalls in its use.

First, let's motivate the "why" of using this technique in the first place. Imagine you just got hired onto the data science team at a major international retailer. Prior to your arrival, this team built a complex model to forecast weekly sales at each of your dozens of locations around the globe. The model takes into account a multitude of factors: geographic data (like local population density and demographics), seasonality data, weather forecast data, information about individual stores (like total square footage), and even the number of likes your company’s tweets have been getting recently. Let’s assume, too, that this model works wonders, giving the business team insight into sales patterns weeks in advance. There is just one problem. Can you guess what it is?

Nobody knows why the sales forecast model works so well.

Why is this a problem? A number of reasons. The business folks relying on the model’s predictions have no idea how reliable they would be if, say, Twitter experienced an outage and tweet likes decreased one week. On the data science team, you have little sense of what factors are most useful to the model, so you’re flying blind when it comes to identifying new signals with which to bolster your model's performance. And let's not forget other stakeholders. If a decision based on this model’s forecast were to lead to bad results for the company, the board will want to know a lot more about this model than “it just works.”

So what can we do? A great first step is to get some measure of feature importance. This means assigning a numerical score of importance to each of the factors that your model uses. These numerical scores represent how important these features are to your model’s ability to make quality predictions.

Many modeling techniques come with built-in feature importance measurements. Perhaps you can use the information-gain-based importance measure that comes by default with your xgboost model? Not so fast! As your teammates will point out, there is no guarantee that these feature importances will describe your complex ensemble, and besides, gain-based importance measures are biased [1].

So what can we do instead? We can use “randomized ablation” (aka “permutation”) feature importance measurements. Christoph Molnar offers a clear and concise description of this technique in his Interpretable ML Book [2]:

The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is “unimportant” if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.
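
To make the procedure concrete, here is a minimal sketch in Python. Assume a fitted regression model `model` with a `predict` method, a numpy feature matrix `X`, and a label vector `y` (all hypothetical stand-ins for your own pipeline), with mean squared error as the loss:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def permutation_importances(model, X, y, seed=0):
    """Increase in model MSE after shuffling each feature column in turn."""
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y, model.predict(X))
    importances = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        X_ablated = X.copy()
        # Permute (randomly ablate) feature i, leaving the others intact.
        X_ablated[:, i] = rng.permutation(X_ablated[:, i])
        importances[i] = mean_squared_error(y, model.predict(X_ablated)) - baseline
    return importances
```

A large positive score means the model leans heavily on that feature; a score near zero means shuffling it barely matters.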

Background

Where did this technique come from? Randomized ablation feature importance is certainly not new. Indeed, its inception dates back to at least 2001, when a variant of this technique was introduced as the “noising” of variables to better understand how random forest models use them [3]. Recently, however, this technique has seen a resurgence in use and variation. For example, an implementation of this technique will be included in the upcoming version 0.22 of the popular Scikit-learn library [4]. For a more theoretical example, consider the recently introduced framework of “model class reliance,” which terms a variant of randomized ablation feature importance “model reliance” and uses it as a core building block [5].

A new theoretical perspective

Working with this technique at Fiddler Labs, we have sought to develop a clear sense of what it means, theoretically, to permute a column of your features, run that through your model, and see how much the model’s error increases. This has led us to use the theoretical lens of randomized ablation, hence our new name for what is commonly called permutation feature importance.

In a recent preprint released on arXiv, we develop a clear theoretical formulation of this technique as it relates to the classic statistical learning problem statement. We find that the notion of measuring error after permuting features (or, more formally, ablating them through randomization) actually fits in quite nicely with the mathematics of risk minimization in supervised learning [6]. If you are familiar with this body of theory, we hope this connection will be as helpful to your intuition as it has been to ours.

Additionally, our reformulation provides two ways of constructing confidence intervals around randomized ablation feature importance scores, a technique practitioners can use to avoid potential pitfalls in the application of randomized ablation feature importance. To the best of our knowledge, current formulations and implementations of this technique do not include these confidence measurements.

Confidence intervals on feature importance

Consider what might happen if we were to re-run randomized ablation feature importance with a different randomized ablation (e.g. by using a different random seed), or if we run it on two different random subsets of a very large dataset (e.g. to avoid using a full dataset that would exceed our machine's memory capacity). Our feature importances might change! Ideally, we would want to use a large dataset and average over many ablations to mitigate the randomness inherent in the algorithm, but in practice we may not have enough data or compute power to do so.

There are two sources of uncertainty in randomized ablation feature importance scores: the data points we use, and the random ablation values (i.e. the permutation) we use. By running the algorithm multiple times and examining the run-to-run variance, we can construct a confidence interval (CI) that measures the uncertainty stemming from the ablation used. Similarly, by looking point-by-point at the loss increases caused by ablation (instead of just averaging loss over our dataset), we can construct a CI that measures the uncertainty stemming from our finite dataset, as the sketch below illustrates.
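
As a rough sketch of that second construction (this is one plausible reading of the point-by-point idea; the exact formulation in our paper may differ slightly), we can keep the per-point loss increases and apply a normal-approximation interval to their mean. Here `model`, `X`, and `y` are the same hypothetical stand-ins as before:

```python
import numpy as np

def importance_with_dataset_ci(model, X, y, i, seed=0, z=1.96):
    """Importance of feature i with a 95% CI capturing finite-dataset uncertainty."""
    rng = np.random.default_rng(seed)
    X_ablated = X.copy()
    X_ablated[:, i] = rng.permutation(X_ablated[:, i])
    # Per-point increase in squared-error loss caused by the ablation.
    deltas = (y - model.predict(X_ablated)) ** 2 - (y - model.predict(X)) ** 2
    sem = deltas.std() / np.sqrt(len(deltas))  # standard error of the mean
    return deltas.mean(), (deltas.mean() - z * sem, deltas.mean() + z * sem)
```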

Example: forecasting the price of a home

To demonstrate the use of randomized ablation feature importance values with CIs, let's apply the technique to a real model. To this end, I used the Ames Housing Dataset to build a complex model that estimates the sale price of houses. The full code for this example is available in a Jupyter notebook here.

To show the importance of confidence intervals, we run randomized ablation feature importance using just 100 points, with just $K=3$ repetitions. This gives us the following top-10 features by score, with a 95% confidence interval indicated by the black error bars:

[Figure: feature importance scores with wide error bars.]

As we can see from our error bars, it is uncertain which feature is actually the third most important over these 100 points. Re-running randomized ablation feature importance with $K=30$ iterations, we arrive at much tighter error bounds, and we find with confidence that a house's neighborhood actually edges out its total basement square footage in importance to our model:

[Figure: feature importance scores with narrow error bars.]

However, it turns out that a larger source of uncertainty in these feature importance scores actually stems from the small size of the dataset used, rather than the small number of ablation repetitions. This fact is uncovered by using the other CI methodology presented in our paper, which captures uncertainty resulting from both the ablation and the size of the dataset. Running this other CI technique on another 100 points of our dataset (with just $K=1$ repetition), we observe the following wide CIs:

[Figure: feature importance scores with wide error bars.]

By increasing the number of points to 500 instead of 100, our confidence improves significantly, and we become fairly confident that neighborhood is the third most important feature to our model overall (not just in our limited dataset).

[Figure: feature importance scores with narrow error bars.]

The math behind CIs

If you want to take a deep dive into the mathematics of randomized ablation feature importance and get the full picture on confidence intervals, we recommend you check out the full paper ([6]). However, we’ll also present below a simplified explanation of one of the CIs we derive in our paper (the one which leverages repeated permutations).

To set the stage, imagine that we have a model that takes in numerical features and outputs a numerical prediction. For example, consider a sales model that takes in numerical information about a store and outputs a prediction of the total sales that store will gross this week. Before diving in, we need to define some notation.

  • We'll call our model $f$, which is a function that takes in $M$ features $x_1, x_2, \ldots , x_{M}$.
  • We'll call the true label value that we want to predict $y$.
  • We'll call our dataset $S = \left((x_1^{(1)}, x_2^{(1)}, \ldots , x_{M}^{(1)}, y^{(1)}), (x_1^{(2)}, x_2^{(2)}, \ldots , x_{M}^{(2)}, y^{(2)}), \ldots, (x_1^{(N)}, x_2^{(N)}, \ldots , x_{M}^{(N)}, y^{(N)}) \right)$.
    • This is a sequence of $N$ tuples that each contain an $M$-dimensional vector of features and a label.
  • We measure the error of our model with a loss function. Let's assume we use Mean Squared Error (MSE) and use the notation $\ell (f(x_1, x_2, \ldots , x_{M}), y)$ to denote the squared error of a single prediction.
  • We call the average loss across our dataset the "risk," denoted as: $R(f, S) = \frac{1}{N} \sum_{j = 1}^N \ell(f(x_1^{(j)}, x_2^{(j)}, \ldots , x_{M}^{(j)}), y^{(j)})$

Randomized ablation feature importance measures the increase in risk of a model on a dataset after a certain feature has been hidden or "ablated" by randomization. In practice, this randomized ablation is typically carried out by shuffling (or "permuting") the corresponding column in a data table containing the dataset. Let's introduce a little more notation to express this idea mathematically.

  • We can randomly ablate the $i$th feature of our dataset $S$. We shall call the result of ablating the $i$th feature of our dataset $S_i$.
  • We'll call the randomized ablation feature importance of the $i$th feature $\Delta R_i = R(f, S_i) - R(f, S)$.
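
Translating this notation directly into code gives a minimal sketch (again with `model`, `X`, and `y` as hypothetical stand-ins, and MSE as the loss):

```python
import numpy as np

def risk(model, X, y):
    """Empirical risk R(f, S): mean squared-error loss over the dataset."""
    return np.mean((y - model.predict(X)) ** 2)

def delta_R(model, X, y, i, seed=0):
    """Randomized ablation importance of feature i: R(f, S_i) - R(f, S)."""
    rng = np.random.default_rng(seed)
    X_ablated = X.copy()  # this becomes S_i once feature i is permuted
    X_ablated[:, i] = rng.permutation(X_ablated[:, i])
    return risk(model, X_ablated, y) - risk(model, X, y)
```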

As we discussed above, we can actually create many different randomized ablations of the same feature on the same dataset (e.g. by varying the random seed). Mathematically, this means:

  • We can repeatedly randomly ablate the $i$th feature of our dataset in different ways. We shall call the sequence of randomly-ablated datasets $S_i^{(1)}, S_i^{(2)}, S_i^{(3)}, \ldots$
  • This gives us a sequence of randomly-sampled random ablation feature importance values $\Delta R_i^{(1)}, \Delta R_i^{(2)}, \Delta R_i^{(3)}, \ldots = R(f, S_i^{(1)}) - R(f, S),~ R(f, S_i^{(2)}) - R(f, S),~ R(f, S_i^{(3)}) - R(f, S), \ldots$
  • The theoretically correct feature importance value is the expected value of these samples, which we approximate by their average: $\Delta R_i = \text{average}(\Delta R_i^{(1)}, \Delta R_i^{(2)}, \Delta R_i^{(3)}, \ldots)$

This is where CIs come in. In general, when we estimate the mean of a random variable by sampling from it, we can estimate an interval within which we have a certain level of confidence (say, 95%) that the true mean of the random variable lies. This is thanks to the Central Limit Theorem, which tells us that the average of a large enough sample of a random variable is approximately normally distributed.

To construct this confidence interval, we thus repeatedly sample the random variable (for randomized ablation feature importance, we rerun the procedure repeatedly), and estimate the standard deviation of the average over this sample, called the Standard Error of the Mean (SEM). If this concept is new to you, we recommend you read through a statistics text for a full explanation (e.g. [7]).

Once we estimate the SEM, we multiply it by the "confidence coefficient," which is the number of standard deviations around the mean of a normally distributed variable within which the probability mass equals our confidence level. For example, 95% of the probability mass of a normally distributed random variable falls within 1.96 standard deviations of the mean, so the confidence coefficient for 95% confidence is 1.96. By sampling $K$ repetitions of randomized ablation feature importance, we can thus estimate a 95% confidence interval as follows: $$ \left[\hat{\mu} - 1.96 \times \sqrt{\frac{\hat{\sigma}^2}{K}}, \hat{\mu} + 1.96 \times \sqrt{\frac{\hat{\sigma}^2}{K}} \right] $$

where $$ \hat{\mu} = \frac{1}{K} \sum_{k=1}^{K} \Delta R_i^{(k)}, \qquad \hat{\sigma}^2 = \frac{1}{K} \sum_{k=1}^{K} \left( \Delta R_i^{(k)} - \hat{\mu} \right)^2 $$
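
In code, the interval follows directly (reusing the hypothetical `delta_R` helper sketched earlier, with a different seed per repetition standing in for independent random ablations):

```python
import numpy as np

def importance_with_ablation_ci(model, X, y, i, K=30, z=1.96):
    """95% CI for feature i's importance, estimated from K independent ablations."""
    samples = np.array([delta_R(model, X, y, i, seed=k) for k in range(K)])
    mu_hat = samples.mean()
    sigma_hat_sq = samples.var()     # (1/K) * sum((dR - mu_hat)^2), as above
    sem = np.sqrt(sigma_hat_sq / K)  # standard error of the mean
    return mu_hat, (mu_hat - z * sem, mu_hat + z * sem)
```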

Conclusion

Feature importance techniques are a powerful and easy way to gain valuable insight into your machine learning models. Randomized ablation feature importance, often referred to as “permutation” importance, offers a straightforward and broadly applicable way of computing feature importances. We also showed here how, through a new way of theorizing and formulating the “true” value of randomized ablation feature importance, we are able to construct confidence intervals around our feature importance measurements. These confidence intervals are a useful tool for avoiding pitfalls in practice, especially when datasets are not large.

If you want a deeper dive, be sure to check out our full paper. Don't worry, it's only four pages long! https://arxiv.org/abs/1910.00174

[1] Parr et al. Beware Default Random Forest Importances (2018). https://explained.ai/rf-importance/

[2] Molnar, Christoph. Interpretable Machine Learning (2019). https://christophm.github.io/interpretable-ml-book/feature-importance.html

[3] Breiman, Leo. Random Forests (2001). https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf

[4] Scikit-learn Contributors. Permutation feature importance (2019). https://scikit-learn.org/dev/modules/permutation_importance.html

[5] Fisher et al. Model Class Reliance (2019). https://arxiv.org/abs/1801.01489

[6] Merrick, Luke. Randomized Ablation Feature Importance (2019). https://arxiv.org/abs/1910.00174

[7] Taboga, Marco. StatLect: Digital Textbook on Probability and Statistics (2019). https://www.statlect.com/fundamentals-of-statistics/set-estimation-mean
