
Introduction

It is very important to learn how and why a machine learning model behaves a certain way while making predictions. As NLP models grow bigger and more complex, it becomes imperative that we can attribute output predictions to precise and distinct signals in the input data, especially in production environments. Model interpretability helps answer questions like:

  1. What kind of examples does my model perform poorly on?

  2. Why did my model make this prediction? Can this prediction be attributed to adversarial behavior, or to undesirable priors in the training set?

  3. Does my model behave consistently if I change things like textual style, verb tense, or pronoun gender?

Axiomatic Attribution for Deep Networks

A neural network is a mathematical function, just like any other function f(x). The output depends heavily on the input x. If someone told us that f(x) evaluated to a trillion, we would say that the input was a relatively large number. In other words, the input to the function decides the output: the large output can be attributed to a relatively large input. This kind of attribution to the input is what helps us understand a neural network's prediction. For example, when a neural network predicts that the image it was shown is a 'cat', the pixels in the image belonging to the cat contributed to the prediction. If there were a score for this, the attribution score for those pixels would be very high, given a well-trained model.

The research paper Axiomatic Attribution for Deep Networks defines axioms for the correctness of attribution methods - methods that generate attribution scores for inputs to deep networks. Axioms are nothing but desirable characteristics that we want these methods to have, so we can trust that they will do a good job of attributing the right scores to the right input features. The paper mentioned above highlights two such axioms:

  • Sensitivity
  • Implementation Invariance

We will get back to what these mean later in the post. For now, we will focus on an attribution method that satisfies both of these axioms, called Integrated Gradients. Unfortunately, most similar methods fail to satisfy at least one of the two axioms, which makes Integrated Gradients a fruitful thing to learn. Another desirable feature of Integrated Gradients is that it needs no instrumentation of the network: it can be computed with a few calls to the gradient operation, which any modern neural network framework provides, so even novice practitioners can apply the technique easily.

Integrated Gradients

Here is a simple way to understand Integrated Gradients.

An attribution method scores the input data based on the predictions the model makes, i.e. it attributes the predictions to its input signals or features, producing a score for each feature. For example, in sentiment classification for movie reviews, the input text could be 'It was a fantastic performance'. Using an attribution method, we could generate a score for each word in the input, and these scores would tell us how big a part each word played in a given prediction.

Integrated Gradients is one such method. In rough terms it amounts to (feature x gradient). The gradient is the signal that tells the neural network how much to increase or decrease each weight in the network during backpropagation, and it depends on the input features. Therefore, the gradient of the output with respect to each input feature (the partial derivative d_out/d_in) can give us a clue about how important that feature is. But there is one small problem.
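To make the idea concrete, here is a minimal sketch of the plain (feature x gradient) attribution, assuming PyTorch and a generic differentiable function `f` that returns a scalar score; the names are illustrative and not part of any library API.

```python
import torch

# Minimal sketch of the (feature x gradient) idea for a generic differentiable
# function `f` that returns a scalar score. Names here are illustrative.
def gradient_x_input(f, x):
    """Return elementwise input * d(output)/d(input)."""
    x = x.clone().detach().requires_grad_(True)
    out = f(x)                      # scalar prediction, e.g. a class score
    out.backward()                  # fills x.grad with d(out)/d(x)
    return (x * x.grad).detach()    # one attribution score per input feature

# Toy usage: a tiny linear "model" over three features.
f = lambda x: (x * torch.tensor([2.0, -1.0, 0.5])).sum()
print(gradient_x_input(f, torch.tensor([1.0, 2.0, 3.0])))
# tensor([ 2.0000, -2.0000,  1.5000])
```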

It becomes easier to formalize this problem with an example from image classification with deep learning.

[Image: a fireboat photo (left) and its per-pixel gradient map (right)]

Consider the above image. The model correctly predicts the image as a 'Fireboat'. We calculate the gradient for each feature, in this case each pixel in the image, to get the dark image on the right. But it looks like noise and comes nowhere close to identifying the fireboat. So how did the model get the prediction right? Did it simply guess that the correct prediction was a fireboat?

It turns out that the model function is flat (saturated) in the vicinity of the input for a well-trained model. To see what that means, assume the model had only one input feature and one weight value, y = f(x) = w * x. If the error in the network is zero, in other words if the goal weight has been reached as shown below, the slope at that point (the derivative) is 0.

[Image: error curve with zero slope at the goal weight]

Therefore, in the (feature x gradient) paradigm we get (feature x 0). This explains the black pixels and random noise in the gradient image. There is a way to counter this, as shown in the paper Axiomatic Attribution for Deep Networks.

We now arrive at the final important concept for understanding Integrated Gradients: the baseline. The next step is to desaturate the network in order to see the effect of the input features on the output predictions. To do this, we take the image in question and dial its brightness/intensity all the way down to black, as shown below.

[Image: the fireboat image interpolated from an all-black baseline to full intensity, with a plot of the prediction score and gradients against the interpolation factor]

Next, we slowly scale the intensity over a range of 0 to 1, making the black image look more and more like the fireboat image. We see in the graph above that around the 0.3 mark the score reaches 1.0 and the gradients level off. This is the saturated network that we discussed previously. The gradients we need to compute the attributions lie below this mark, and the black image is called the baseline.
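As a rough sketch of this saturation check, assuming a PyTorch image classifier (here called `model`, a placeholder along with `image` and `target_class`) that takes a batched image tensor and returns per-class scores, one could record the score along the interpolation path like this:

```python
import torch

# Sketch of the saturation check described above: score the classifier on images
# interpolated between the all-black baseline and the original image.
# `model`, `image`, and `target_class` are placeholders for your own network,
# input tensor, and predicted class index.
def scores_along_path(model, image, target_class, steps=20):
    baseline = torch.zeros_like(image)                 # the all-black baseline
    scores = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interpolated = baseline + alpha * (image - baseline)
        with torch.no_grad():
            out = model(interpolated.unsqueeze(0))     # assumes a [1, num_classes] output
        scores.append(out[0, target_class].item())
    return scores  # plotting these against alpha shows where the score levels off
```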

In the case of NLP models, the baseline is a zero-embedding word vector.
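Putting the pieces together, here is a minimal sketch of the Riemann-sum approximation of Integrated Gradients, assuming PyTorch and a function `f` that maps an input tensor (for NLP, the word embeddings) to a scalar score; the zero baseline below corresponds to the zero-embedding vector mentioned above.

```python
import torch

def integrated_gradients(f, x, baseline=None, steps=50):
    """Riemann-sum approximation of Integrated Gradients (a sketch, not a library API).

    f        : differentiable function mapping an input tensor to a scalar score
    x        : input tensor, e.g. word embeddings of shape [seq_len, emb_dim]
    baseline : same shape as x; defaults to all zeros (the zero-embedding baseline)
    """
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline and the actual input.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()          # accumulate d(score)/d(point)
        total_grads += point.grad
    avg_grads = total_grads / steps
    # Scale the averaged gradients by how far the input is from the baseline.
    return (x - baseline) * avg_grads
```

For text, summing the resulting attributions over the embedding dimension (e.g. `attributions.sum(dim=-1)`) gives one score per word.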

Now that we have looked at Integrated Gradients, attribution scores, and the baseline, it is a good time to revisit the two axioms discussed earlier in the post.

Sensitivity

An attribution method satisfies Sensitivity if, for every input and baseline that differ in exactly one feature but have different predictions, the differing feature is given a non-zero attribution.

Gradients violate Sensitivity. Consider the following function, a one-variable, one-ReLU network (ReLU is nothing but max(0, x)): f(x) = 1 - ReLU(1 - x).

Suppose the baseline is x = 0 and the input is x = 2. The range of this function is 0 to 1: for any input greater than or equal to 1, the function returns 1. The prediction therefore changes from f(0) = 0 to f(2) = 1, yet because the function is flat at x = 2, the gradient there is 0, so plain gradients give the differing feature a zero attribution and break Sensitivity according to the definition above. This is a reason to choose Integrated Gradients over plain gradients.
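A quick numerical check of this example (a hedged sketch in PyTorch, not code from the paper) shows the plain gradient at x = 2 being zero while a Riemann-sum approximation of Integrated Gradients recovers the full change f(2) - f(0) = 1:

```python
import torch

# The paper's one-variable example: f(x) = 1 - ReLU(1 - x),
# with baseline x = 0 and input x = 2.
f = lambda x: 1 - torch.relu(1 - x)

x = torch.tensor(2.0, requires_grad=True)
f(x).backward()
print(x.grad)        # tensor(0.)  -> the plain gradient assigns zero attribution

# Averaging gradients along the straight-line path from the baseline (0) to the
# input (2) captures the region where the function is still changing.
steps, total = 300, 0.0
for alpha in torch.linspace(0.0, 1.0, steps):
    point = (alpha * 2.0).detach().requires_grad_(True)
    f(point).backward()
    total += point.grad
print((2.0 - 0.0) * total / steps)  # roughly 1.0, i.e. f(2) - f(0)
```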

Implementation Invariance

Two networks are functionally equivalent if their outputs are equal for all inputs, despite having very different implementations; for example, a CNN and an RNN that both classify a piece of text as having positive sentiment. Attribution methods should satisfy Implementation Invariance, i.e. the attributions should always be identical for two functionally equivalent networks. To motivate this, notice that attribution can be roughly defined as assigning the blame (or credit) for the output to the input features, and such a definition does not refer to implementation details. Other popular attribution methods such as DeepLIFT and LRP break Implementation Invariance; Integrated Gradients does not.

These are some great properties of Integrated Gradients, and they make it a very good choice for explaining any differentiable model.
