Why is Bahdanau Attention Additive?

Bahdanau Attention is often called Additive Attention because of the mathematical formulation used to compute the attention scores. In contrast to Dot-Product (Multiplicative) Attention, Bahdanau Attention relies on addition and a non-linear activation function.

Let's go through the math step-by-step:

Definitions

  • $h_i$: Hidden state of the encoder for the $i$-th time step in the source sequence.
  • $s_t$: Hidden state of the decoder for the $t$-th time step in the target sequence.
  • $W_1$ and $W_2$: Weight matrices.
  • $b$: Bias term.
  • $v$: Learned weight vector that reduces the combined, activated projection to a scalar score.
  • $\text{score}(h_i, s_t)$: Attention score for $h_i$ and $s_t$.

Step 1: Projection

First, both $h_i$ and $s_t$ are linearly transformed using the weight matrices $W_1$ and $W_2$:

$$\tilde{h}_i = W_1 h_i, \qquad \tilde{s}_t = W_2 s_t$$
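
As a rough sketch of this step in NumPy (the dimensions, random weights, and names like `W1`, `W2`, and `attn_dim` are purely illustrative assumptions, not taken from the original paper or any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

enc_dim, dec_dim, attn_dim = 8, 8, 16   # illustrative sizes (assumptions)
h_i = rng.normal(size=enc_dim)          # one encoder hidden state
s_t = rng.normal(size=dec_dim)          # current decoder hidden state

W1 = rng.normal(size=(attn_dim, enc_dim))
W2 = rng.normal(size=(attn_dim, dec_dim))

# Project both states into a shared attention space
h_tilde = W1 @ h_i
s_tilde = W2 @ s_t
print(h_tilde.shape, s_tilde.shape)     # (16,) (16,)
```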

Step 2: Calculate Score

The attention score between $h_i$ and $s_t$ is calculated from these projected vectors:

$$\text{score}(h_i, s_t) = v^\top \tanh(\tilde{h}_i + \tilde{s}_t + b)$$

Here, $\tanh$ is a non-linear activation function and $v^\top$ reduces the result to a scalar. This step is the reason it is called "Additive" Attention: the transformed hidden states are combined by addition rather than by a dot product.
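
A minimal sketch of the scoring step, with made-up NumPy values standing in for the projected states:

```python
import numpy as np

rng = np.random.default_rng(0)
attn_dim = 16
h_tilde = rng.normal(size=attn_dim)     # stands in for W1 @ h_i
s_tilde = rng.normal(size=attn_dim)     # stands in for W2 @ s_t
b = np.zeros(attn_dim)                  # bias term
v = rng.normal(size=attn_dim)           # learned scoring vector

# Additive score: add the projections, apply tanh, reduce to a scalar with v
score = v @ np.tanh(h_tilde + s_tilde + b)
print(score)                            # one scalar per (i, t) pair
```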

Step 3: Calculate Weights

The scores are then normalized with a softmax function to obtain the attention weights $\alpha_{i,t}$:

$$\alpha_{i,t} = \frac{\exp(\text{score}(h_i, s_t))}{\sum_j \exp(\text{score}(h_j, s_t))}$$
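
For example, a plain NumPy softmax over some made-up scores:

```python
import numpy as np

scores = np.array([1.2, -0.3, 0.8, 2.1])   # score(h_i, s_t) for i = 0..3 (illustrative values)

# Numerically stable softmax over the source positions
exp_scores = np.exp(scores - scores.max())
alpha = exp_scores / exp_scores.sum()
print(alpha, alpha.sum())                  # the weights sum to 1
```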

Step 4: Calculate Context Vector

The context vector $c_t$ is then calculated as the weighted sum of all the encoder hidden states:

$$c_t = \sum_i \alpha_{i,t} h_i$$

This context vector is then used in the decoder's calculations.
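
A small sketch of this weighted sum, assuming four source positions and illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, enc_dim = 4, 8
H = rng.normal(size=(src_len, enc_dim))   # all encoder hidden states h_i, stacked
alpha = np.array([0.1, 0.2, 0.6, 0.1])    # attention weights for decoder step t

# Context vector: weighted sum of the encoder hidden states
c_t = alpha @ H
print(c_t.shape)                          # (8,)
```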

To summarize, the key "additive" part is in the scoring mechanism, where the transformed hidden states from both the encoder and the decoder are added together. This differs from Multiplicative (Dot-Product) Attention, where the hidden states are multiplied (via a dot product) to produce the score.
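
As a minimal side-by-side sketch (random vectors and weights, purely for illustration), the two scoring styles differ only in how $h_i$ and $s_t$ are combined:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h_i = rng.normal(size=d)
s_t = rng.normal(size=d)

# Multiplicative (dot-product) attention score: a simple inner product
dot_score = h_i @ s_t

# Additive (Bahdanau-style) score: project, add, tanh, then reduce with v
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
add_score = v @ np.tanh(W1 @ h_i + W2 @ s_t)

print(dot_score, add_score)
```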
