Last active April 25, 2022 14:37
Keras Layer that implements an Attention mechanism, with a context/query vector, for temporal data. Supports Masking. Follows the work of Yang et al. [] "Hierarchical Attention Networks for Document Classification"
from keras import backend as K
from keras import initializers, regularizers, constraints
from keras.layers import Layer


def dot_product(x, kernel):
    """
    Wrapper for dot product operation, in order to be compatible with both
    Theano and Tensorflow

    Args:
        x (): input
        kernel (): weights
    """
    if K.backend() == 'tensorflow':
        # the TensorFlow backend's K.dot fails on a 1D kernel, so add an axis
        # for the product and squeeze it away afterwards
        return K.squeeze(K.dot(x, K.expand_dims(kernel)), axis=-1)
    else:
        return K.dot(x, kernel)


class AttentionWithContext(Layer):
    """
    Attention operation, with a context/query vector, for temporal data.
    Supports Masking.
    Follows the work of Yang et al. []
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention

    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.

    How to use:
    Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
    The dimensions are inferred based on the output shape of the RNN.

    Note: The layer has been tested with Keras 2.0.6

    Example:
        model.add(LSTM(64, return_sequences=True))
        model.add(AttentionWithContext())
        # next add a Dense layer (for classification/regression) or whatever...
    """

    def __init__(self,
                 W_regularizer=None, u_regularizer=None, b_regularizer=None,
                 W_constraint=None, u_constraint=None, b_constraint=None,
                 bias=True, **kwargs):

        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.u_regularizer = regularizers.get(u_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.u_constraint = constraints.get(u_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        super(AttentionWithContext, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape is (samples, steps, features)
        assert len(input_shape) == 3

        # projection matrix applied to every timestep
        self.W = self.add_weight(shape=(input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[-1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)

        # context/query vector
        self.u = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_u'.format(self.name),
                                 regularizer=self.u_regularizer,
                                 constraint=self.u_constraint)

        super(AttentionWithContext, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        uit = dot_product(x, self.W)

        if self.bias:
            uit += self.b

        uit = K.tanh(uit)
        # dot_product (not K.dot) so that the 1D vector u also works with TensorFlow
        ait = dot_product(uit, self.u)

        a = K.exp(ait)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # (samples, steps) -> (samples, steps, 1) so the weights broadcast over features
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[-1]
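
The forward pass of the layer can also be traced outside Keras. The following is a minimal NumPy sketch, not the layer itself: the helper name `attention_with_context`, the random weights, and the `eps` default (chosen to mirror the usual `K.epsilon()` value) are assumptions made for illustration, and the projection is written as a plain matrix product.

```python
import numpy as np


def attention_with_context(x, W, b, u, mask=None, eps=1e-7):
    """NumPy sketch of the layer's forward pass (hypothetical helper).

    x: (samples, steps, features); W: (features, features);
    b, u: (features,); mask: optional (samples, steps) booleans.
    Returns the pooled output (samples, features) and the weights (samples, steps).
    """
    uit = np.tanh(x @ W + b)          # hidden representation of every timestep
    ait = uit @ u                     # similarity with the context vector u
    a = np.exp(ait)                   # unnormalized, strictly positive weights
    if mask is not None:
        a = a * mask.astype(x.dtype)  # zero out padded timesteps
    a = a / (a.sum(axis=1, keepdims=True) + eps)
    return (x * a[..., np.newaxis]).sum(axis=1), a


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))        # 2 samples, 5 steps, 4 features
W = rng.normal(size=(4, 4))
b = np.zeros(4)
u = rng.normal(size=4)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=bool)

out, weights = attention_with_context(x, W, b, u, mask)
print(out.shape, weights.shape)       # (2, 4) (2, 5)
```

Nothing in the computation depends on the number of steps being fixed; only the feature dimension has to be known when the weights are created.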

ni9elf commented May 13, 2017

Need some help in two lines.
In line 40, why is assert len(input_shape) == 3 required? What information is stored in input_shape?
In line 87, why is expand_dim being used?
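
Both questions come down to tensor shapes. As a rough illustration in NumPy (the concrete sizes are made up, and in Keras the batch entry of `input_shape` is usually `None`): `input_shape` is the shape tuple of the incoming tensor, `(samples, steps, features)`, so the assertion rejects anything that is not a 3D sequence. The attention weights hold one value per timestep and need a trailing axis before they can be broadcast against the features.

```python
import numpy as np

# Hypothetical sizes: 2 samples, 3 timesteps, 4 features.
x = np.ones((2, 3, 4))           # what the RNN emits with return_sequences=True
a = np.full((2, 3), 1.0 / 3.0)   # one attention weight per timestep

assert len(x.shape) == 3         # the same check build() performs on input_shape

try:
    x * a                        # (2, 3, 4) * (2, 3): trailing axes 4 and 3 clash
    broadcast_ok = True
except ValueError:
    broadcast_ok = False

a3 = np.expand_dims(a, axis=-1)  # (2, 3, 1): broadcasts across the feature axis
pooled = (x * a3).sum(axis=1)    # weighted sum over time -> (2, 4)
print(broadcast_ok, a3.shape, pooled.shape)
```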

Helw150 commented Jun 19, 2017

@cbaziotis My loss is still NaN despite the small epsilon. Any recommended paths for debugging?

cbaziotis commented Jul 30, 2017

Sorry for not replying sooner, but notifications for gist comments apparently don't work.

Regarding some of the errors: the layer was developed using Theano as a backend. I have updated the gist and it now also works with Tensorflow. However, I suggest using Theano, as it has better RNN performance. Please use the new version and let me know.
Also, I have not tested the layer with Keras 2, but I assume it will need only some minor syntactic changes.

@Helw150 do you mind sharing the code for your model?

shillel commented Aug 9, 2017

Thank you for this! using it in my school project.
one comment: even with your fix in line:105, I still sometimes encountered the NaN issue.
following BiMPM, I used K.max(sum(...), K.epsilon()) which turned out to be more stable.
hope this helps.
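
For comparison, here is a minimal NumPy sketch of the two guards. The score values are invented, `np.maximum` stands in for the element-wise clamp, and `EPS` is assumed to match the default `K.epsilon()`. Both guards keep an all-zero row finite, but neither helps once `exp` itself overflows, which is one possible (unconfirmed) source of the remaining NaNs. Subtracting the per-row maximum before the exponential is the standard softmax stabilization, not something the gist does.

```python
import numpy as np

EPS = 1e-7  # assumed to match the default K.epsilon()


def normalize_add_eps(a):
    # the gist's guard: add a small constant to the denominator
    return a / (a.sum(axis=1, keepdims=True) + EPS)


def normalize_clamp(a):
    # the clamp variant: never divide by less than EPS
    return a / np.maximum(a.sum(axis=1, keepdims=True), EPS)


def stable_weights(scores):
    # shift by the row maximum so exp() cannot overflow
    shifted = scores - scores.max(axis=1, keepdims=True)
    return normalize_add_eps(np.exp(shifted))


# Case 1: every timestep is masked out, so the sum is exactly zero.
zeros = np.zeros((1, 3), dtype=np.float32)
print(normalize_add_eps(zeros), normalize_clamp(zeros))  # both stay finite

# Case 2: large scores overflow float32 exp(); inf / inf is NaN with either guard.
scores = np.array([[100.0, 90.0, 95.0]], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    overflowed = normalize_add_eps(np.exp(scores))
print(np.isnan(overflowed).any(), stable_weights(scores))
```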

Copy link

Updated for Keras 2.

Copy link

Line 93: I had to replace ait = K.dot(uit, self.u) with ait = dot_product(uit, self.u) to make it work with TF

linetor commented Oct 4, 2017

@sreiling With that change there is no error, but the model's result is different. With it, AttentionWithContext's output dimension is the LSTM's hidden dim, while compute_output_shape's output dim is the input's last dim (the embedding dim). Is that right?

aryopg commented Oct 26, 2017

Really great code @cbaziotis! I've used it several times for classification problem. But, I've been wondering how to use this in a seq2seq architecture? Many thanks!

Copy link

L93 was also creating an issue for me with TensorFlow so I reused the dot_product() function like on L87

shanest commented Jan 30, 2018

Thanks so much for this terrific gist (as well as your other Attention one)!

One minor bug: on line 93, K.dot should be replaced by dot_product, so that it works with TensorFlow as backend.

Copy link

Nice work. Thanks for sharing the code.
I have a problem when I'm using the code. My sequences have varying lengths and I'm using bucketing to solve the issue. Therefore I define the LSTM input shape as (None, None, features), i.e. there are no explicit timesteps. But your code seems to need a fixed number of timesteps, so it always raises an error. As far as I know, the number of timesteps doesn't need to be fixed. Therefore I wonder if there's a way to modify the code to support that. Thanks.

Copy link

Thank you for your code.
I want to use the Layer (with some adaptive changes) in my code as part of my graduation thesis. I wonder how I should cite it.

LuisPB7 commented Apr 7, 2018

Hello everyone

I was wondering, does anyone know how to create an attention layer with a custom (fixed, or trainable) context vector? I have tried this:

def call(self, inputs, mask=None):
        x = inputs[0]
        context = inputs[1]
        uit = K.dot(x, self.W)

        if self.bias:
            uit += self.b

        uit = K.tanh(uit)
        ait = K.dot(uit, context)

        a = K.exp(ait)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

having also modified some other aspects such as the get_output_shape_for and compute_output_shape methods. Here is how I create and apply the layer with the custom context vector:

# Some other code...
context = Dropout(0.01)(dense) # 150 dimensional vector
H = TimeDistributed(Dense(150))(g2) # (None, 50, 150) tensor
sentence = AttentionWithContext()([H, context])
SentenceEncoder = Model(input_premisse, sent)

However, when attempting to run

input_premisse = Input(shape=(50,))
input_hyp = Input(shape=(50,))
input_overlap = Input(shape=(1,))
input_refuting = Input(shape=(15,))
input_polarity = Input(shape=(2,))
input_hand = Input(shape=(26,))
input_sim = Input(shape=(1,))
input_bleu = Input(shape=(1,))
input_rouge = Input(shape=(3,))
premisse_representation = SentenceEncoder(input_premisse)
hyp_representation = SentenceEncoder(input_hyp)
concat = merge([premisse_representation, hyp_representation], mode='concat')
mul = merge([premisse_representation, hyp_representation], mode='mul')
dif = merge([premisse_representation, hyp_representation], mode=lambda x: x[0] - x[1], output_shape=lambda x: x[0])
final_merge = merge([concat, mul, dif, input_overlap, input_refuting, input_polarity, input_hand, input_sim, input_bleu, input_rouge], mode='concat')

I get an error on the final_merge which says:

line 229, in <module>
  sent = AttentionWithContext()([H, context])
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 572, in __call__
  self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 635, in add_inbound_node
  Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 172, in create_node
  output_tensors = to_list(, mask=input_masks))
File "C:\Users\Luís Pedro\Desktop\", line 186, in call
  a *= K.cast(mask, K.floatx())
File "D:\Anaconda3\Lib\site-packages\keras\backend\", line 206, in cast
  return T.cast(x, dtype)
File "D:\Anaconda3\Lib\site-packages\theano\tensor\", line 1259, in cast
  _x = as_tensor_variable(x)
File "D:\Anaconda3\Lib\site-packages\theano\tensor\", line 200, in as_tensor_variable
  raise AsTensorError("Cannot convert %s to TensorType" % str_x, type(x))

theano.tensor.var.AsTensorError: ('Cannot convert [None, None] to TensorType', <class 'list'>)

and if I comment out that specific line, I instead get the error:

File "C:\Users\Luís Pedro\Desktop\", line 244, in <module>
  final_merge = merge([concat, mul, dif, input_overlap, input_refuting, input_polarity, input_hand, input_sim, input_bleu, input_rouge], mode='concat')
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 1680, in merge
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 1301, in __init__
  self.add_inbound_node(layers, node_indices, tensor_indices)
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 635, in add_inbound_node
  Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 172, in create_node
  output_tensors = to_list(, mask=input_masks))
File "D:\Anaconda3\Lib\site-packages\keras\engine\", line 1394, in call
  return K.concatenate(inputs, axis=self.concat_axis)
File "D:\Anaconda3\Lib\site-packages\keras\backend\", line 583, in concatenate
  return T.concatenate([to_dense(x) for x in tensors], axis=axis)
File "D:\Anaconda3\Lib\site-packages\theano\tensor\", line 4734, in concatenate
  return join(axis, *tensor_list)
File "D:\Anaconda3\Lib\site-packages\theano\tensor\", line 4483, in join
  return join_(axis, *tensors_list)
File "D:\Anaconda3\Lib\site-packages\theano\gof\", line 615, in __call__
  node = self.make_node(*inputs, **kwargs)
File "D:\Anaconda3\Lib\site-packages\theano\tensor\", line 4216, in make_node
  axis, tensors, as_tensor_variable_args, output_maker)
File "D:\Anaconda3\Lib\site-packages\theano\tensor\", line 4282, in _make_node_internal
  raise TypeError("Join() can only join tensors with the same "
builtins.TypeError: Join() can only join tensors with the same number of dimensions.

Any ideas? Thanks.

leocnj commented Apr 28, 2018

Just wondering whether
self.W = self.add_weight((input_shape[-1], input_shape[-1],)
is necessary. Can we simply use a vector here rather than a matrix?

ronggong commented May 30, 2018

@LeZhengThu This code works for the variable length input I think. At least it works for my case.

Copy link

I'm getting negative attention weights for some words using this. Is this supposed to happen? If so, any literature that indicates this should happen? If not, any ideas on how to fix?

Lhemamou commented Jul 2, 2018

@skywang329 Are you checking the a values or the u values? The attention weights are the a values, and normally the exponential forces these coefficients to be positive.
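
A quick NumPy check of that point, with made-up numbers: the raw scores (and the entries of `u`) can be negative, but after the exponential and the normalization the weights `a` cannot.

```python
import numpy as np

# Invented scores for one sample with four timesteps; some are negative.
ait = np.array([[-2.0, 0.5, -0.1, 1.5]])

a = np.exp(ait)                                # strictly positive
a = a / (a.sum(axis=1, keepdims=True) + 1e-7)  # same normalization as the layer

print(a.min() > 0, a.sum())
```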

Copy link
@ronggong could you provide a minimal working example? I'm using bucketing like @LeZhengThu (for efficiency reasons), so I set input_length=None since it varies from batch to batch. The output of my Bidirectional GRU layer has shape (?,?,256). When adding an AttentionWithContext layer, I get IndexError: pop index out of range.

Copy link

I fixed the error I was getting by replacing K.dot with dot_product on line 93. The error had nothing to do with the length of the input.

lzfelix commented Aug 3, 2018

Thanks for your implementation @cbaziotis! I have made some modifications to your code here in order to make it compatible with Keras 2.x and also to make it easy to recover the attention weights for visualization. By the way, have you thought about making a PR for the attention layer on keras-contrib?

Copy link

inputs = Input(shape=(100,))
embedding_layer = Embedding(maxnumber_of_tp, embedding_vecor_length, mask_zero=True)(inputs)
hidden = LSTM(64, return_sequences=True)(embedding_layer)
sentence, word_scores = Attention(return_attention=True)(hidden)
output = Dense(1, activation='sigmoid')(sentence)
model = Model(input=inputs, output=output)

I train it on a binary classification problem. My question is: how should I get at 'word_scores'?
When I do this:

attention_model = Model(input= model.input, output= model.layers[-2].output)

I get 'sentence' rather than 'word_scores'.

Does anyone know?

nectario commented Feb 3, 2019

Where is the context computed? I need to output a different sequence length than the one of the input.

IS5882 commented Mar 2, 2019

The attention layer outputs a 2D tensor of shape (None, 256). Any idea how to make it output a 3D tensor without reshaping?

I reshaped it to (None, 1, 256), so the time-distributed Dense layers that follow expect (None, 1, 15), but I need them to expect what they are actually receiving, (None, 20, 15), since 20 is my max sentence length. Any ideas?

iridiumblue commented May 25, 2019

Great work, thanks!

I've made some small updates, so that the Layer works under Tensorflow 1.13 with Eager Execution (EE is awesome, with its imperative model, makes debugging soooooo much easier.)

AttentionWithContext for TF 1.13 and Eager Execution

Copy link

Will this work for different modalities (e.g. visual and textual)?

gdbb commented Aug 31, 2019

@LuisPB7 I combine the context and key into a whole tensor as an input, then split them in the Attention class. But that needs some modification in the Attention codes (stuff like tensor calculation, input/output shape).

Copy link

Will this work for images?

Paryul10 commented May 28, 2020


I am getting this error. Can anyone please help me resolve it.

model.add(Bidirectional(LSTM(lstm_output_size, dropout_W=0.2,dropout_U=0.2, return_sequences=True)))
model.add(Bidirectional(LSTM(lstm_output_size, dropout_W=0.2,dropout_U=0.2, return_sequences=True)))
model.add(Dense(numclasses, activation='softmax'))

junyongyou commented Oct 5, 2020

Thanks a lot for the code. I have a question about using mask. Could you please explain how to define and use a mask here? If I have already used a Masking layer before LSTM, e.g., x = Masking(mask_value=0.)(x), should I still use mask here? If so, how can I define the mask? I am using masking value as 0 in the masking layer for LSTM, then the LSTM layer knows which timesteps should be ignored. However, the LSTM features will not be zeros and might be arbitrary, how to define the mask for the attention layer then? Should we use the same mask as that for LSTM? Thank you very much.
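
In Keras the mask is normally computed once from the padded inputs by the `Masking` layer and handed on from layer to layer, so a layer with `supports_masking = True` receives it in `call(x, mask)` without any extra definition. The NumPy sketch below imitates that flow under this assumption, with invented data: `lstm_features` is a stand-in for whatever the LSTM outputs (not a real LSTM), and `scores` is a placeholder for the attention scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two padded sequences, 4 timesteps, 3 input features; all-zero rows are padding.
inputs = rng.normal(size=(2, 4, 3))
inputs[0, 2:] = 0.0                    # the first sequence has only 2 real steps

# What Masking(mask_value=0.) derives: a step is kept if any feature differs from 0.
mask = np.any(inputs != 0.0, axis=-1)  # shape (samples, steps)

# Stand-in for the LSTM output: arbitrary non-zero values at every step.
lstm_features = rng.normal(size=(2, 4, 8))
scores = lstm_features.sum(axis=-1)    # placeholder for the attention scores

a = np.exp(scores) * mask              # uses the input-derived mask, not the features
a = a / (a.sum(axis=1, keepdims=True) + 1e-7)
print(mask.astype(int))
print(a.round(3))
```

The point of the sketch is that the mask depends only on the padded inputs, so arbitrary LSTM features at padded steps do not matter: their weights are zeroed before the normalization.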
