Keras Layer that implements an Attention mechanism, with a context/query vector, for temporal data. Supports Masking. Follows the work of Yang et al. [] "Hierarchical Attention Networks for Document Classification"
from keras import backend as K
from keras import initializations, regularizers, constraints
from keras.engine.topology import Layer


class AttentionWithContext(Layer):
    """
    Attention operation, with a context/query vector, for temporal data.
    Supports Masking.
    Follows the work of Yang et al. []
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention.

    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.

    How to use:
    Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with
    return_sequences=True. The dimensions are inferred based on the
    output shape of the RNN.

    Example:
        model.add(LSTM(64, return_sequences=True))
    """

    def __init__(self,
                 W_regularizer=None, u_regularizer=None, b_regularizer=None,
                 W_constraint=None, u_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializations.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.u_regularizer = regularizers.get(u_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.u_constraint = constraints.get(u_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        super(AttentionWithContext, self).__init__(**kwargs)

    def build(self, input_shape):
        # Expect the 3D output of an RNN with return_sequences=True:
        # (samples, steps, features).
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(,
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight((input_shape[-1],),
                                     initializer='zero',
                                     name='{}_b'.format(,
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)

        self.u = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_u'.format(,
                                 regularizer=self.u_regularizer,
                                 constraint=self.u_constraint)

        super(AttentionWithContext, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        uit =, self.W)

        if self.bias:
            uit += self.b

        uit = K.tanh(uit)
        ait =, self.u)

        a = K.exp(ait)

        # apply mask after the exp; will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases, especially in the early stages of training, the sum
        # may be almost zero and this results in NaNs. A workaround is to
        # add a very small positive number epsilon to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def get_output_shape_for(self, input_shape):
        # Keras 1 API
        return input_shape[0], input_shape[-1]

    def compute_output_shape(self, input_shape):
        """Shape transformation logic so Keras can infer output shape."""
        return (input_shape[0], input_shape[-1])
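To make the tensor shapes concrete, here is a minimal NumPy sketch of the layer's forward pass (random weights, no mask; the sizes n, t, d are illustrative, not from the gist):

```python
import numpy as np

# Illustrative sizes: batch of 2 samples, 5 timesteps, 8 features.
n, t, d = 2, 5, 8
rng = np.random.RandomState(0)

x = rng.randn(n, t, d)   # RNN output with return_sequences=True
W = rng.randn(d, d)      # corresponds to self.W
b = np.zeros(d)          # corresponds to self.b
u = rng.randn(d)         # corresponds to self.u, the context/query vector

uit = np.tanh(x @ W + b)                    # (n, t, d)
ait = uit @ u                               # (n, t): one score per timestep
a = np.exp(ait)
a /= a.sum(axis=1, keepdims=True) + 1e-7    # softmax over timesteps (+ epsilon)

weighted = x * a[:, :, None]                # broadcast weights over features
out = weighted.sum(axis=1)                  # (n, d): one vector per sample
```

So a 3D input `(samples, steps, features)` really does come out as a 2D `(samples, features)` tensor, with the time axis collapsed by the attention-weighted sum.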

ni9elf commented May 13, 2017

I need some help with two lines.
In line 40, why is the assert len(input_shape) == 3 required? What information is stored in input_shape?
In line 87, why is expand_dims being used?
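For what it's worth: input_shape is the static shape of the incoming tensor, so the assert just checks that the input is 3-D (samples, steps, features), i.e. an RNN output with return_sequences=True; and expand_dims turns the (samples, steps) attention weights into (samples, steps, 1) so they can broadcast against the input when multiplying. A NumPy illustration (shapes only, made-up sizes):

```python
import numpy as np

x = np.ones((2, 5, 8))   # (samples, steps, features): why the 3-D assert
a = np.ones((2, 5)) / 5  # attention weights, one per timestep

# A (2, 5) array cannot broadcast against (2, 5, 8) when multiplying;
# adding a trailing axis makes it (2, 5, 1), which is what
# K.expand_dims(a) does in call().
weighted = x * a[:, :, None]
out = weighted.sum(axis=1)
print(out.shape)  # (2, 8)
```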


ni9elf commented May 13, 2017

Thank you for releasing your code. Have you implemented the entire Hierarchical Attention Network (HAN) as well, apart from the above attention layer? Any leads on how to get the code of HAN, preferably in Keras? I have currently found these two Keras implementations: and


thomasjungblut commented May 16, 2017

Thanks for the code, I'm getting:

  File "C:\Anaconda3\lib\site-packages\keras\", line 466, in add
    output_tensor = layer(self.outputs[0])
  File "C:\Anaconda3\lib\site-packages\keras\engine\", line 585, in __call__
    output =, **kwargs)
  File "C:\Users\thomas.jungblut\git\ner-sequencelearning\", line 77, in call
    ait =, self.u)
  File "C:\Anaconda3\lib\site-packages\keras\backend\", line 928, in dot
    y_permute_dim = [y_permute_dim.pop(-2)] + y_permute_dim
IndexError: pop index out of range

Topology is quite easy:

        m = Sequential()

        m.add(LSTM(250, return_sequences=True, input_shape=(timesteps, input_dim)))


abali96 commented May 23, 2017

I'm also seeing what @thomasjungblut has been experiencing. Similar topology:

model = Sequential()
model.add(LSTM(100, input_shape=(MAX_TIMESTEPS, input_vector_size), return_sequences=True))
model.add(Dense(1, activation='sigmoid'))


rmdort commented May 26, 2017

I have adapted the code for TensorFlow and Keras 2. Here is the fork:
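As I understand the fork (sketched here as an approximation, not the fork's exact code), the crux of the fix for the IndexError above is replacing, u) with a backend-agnostic dot product: TensorFlow's matmul-based dot cannot contract a 3-D tensor with a 1-D context vector directly, so the workaround expands u to a column, does a batched matmul, and squeezes the trailing axis. The two forms are equivalent, which a NumPy check confirms:

```python
import numpy as np

x = np.random.RandomState(1).randn(2, 5, 8)  # (samples, steps, features)
u = np.random.RandomState(2).randn(8)        # context vector

# Theano-style: contract the last axis directly -> (samples, steps).
direct = x @ u

# TensorFlow-style workaround (the idea behind the fork's dot_product
# helper): expand u to (features, 1), batch-matmul, squeeze the last axis.
workaround = np.squeeze(x @ u[:, None], axis=-1)
```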


bicepjai commented Sep 10, 2017

Has the issue "IndexError: pop index out of range" been resolved?


leocnj commented Apr 28, 2018

Just tried rmdort's fork. The issue reported by abali96 disappears! He also added tensor shapes in comments, which helps in understanding what happens under the hood.


sekarpdkt commented Apr 28, 2018

I am getting

Traceback (most recent call last):
File "", line 25, in
from attention import AttentionWithContext
File "/ssd/MachineLearning/Python/NLP/SplitAndSpellSentence/", line 1, in
class AttentionWithContext(Layer):
NameError: name 'Layer' is not defined

Code is simple

    model = Sequential()  
    model.add(recurrent.GRU(hidden_neurons, input_shape=( CONFIG.max_input_wordchunk_len, len(chars)), 
                            kernel_initializer=CONFIG.initialization, activation='linear'))
    model.add(Dense(len(chars), activation='sigmoid',kernel_initializer=CONFIG.initialization))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Edit: My bad. Adding from keras.engine.topology import Layer resolved it.


sekarpdkt commented Apr 28, 2018

I don't know why, but I'm getting a dimension error.

def generate_model(output_len, chars=None):
    """Generate the model"""
    print('Building model...')
    chars = chars or CHARS

    in_out_neurons = CONFIG.max_input_len  
    hidden_neurons = CONFIG.hidden_size
    model = Sequential()  
    model.add(recurrent.GRU(512, input_shape=( 128, 100), 
                            kernel_initializer=CONFIG.initialization, activation='linear'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model

and the error is

Layer (type)                 Output Shape              Param #   
gru_1 (GRU)                  (None, 128, 512)          941568    
attention_with_context_1 (At (None, 512)               263168    
Total params: 1,204,736
Trainable params: 1,204,736
Non-trainable params: 0
Epoch 1/500
Traceback (most recent call last):
  File "", line 580, in <module>
  File "", line 482, in train_speller
  File "", line 467, in itarative_train
    class_weight=None, max_queue_size=10, workers=1)
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/legacy/", line 91, in wrapper
    return func(*args, **kwargs)
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/", line 1315, in fit_generator
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/legacy/", line 91, in wrapper
    return func(*args, **kwargs)
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/engine/", line 2230, in fit_generator
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/engine/", line 1877, in train_on_batch
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/engine/", line 1480, in _standardize_user_data
  File "/ssd/anaconda3/lib/python3.6/site-packages/keras/engine/", line 113, in _standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking target: expected attention_with_context_1 to have 2 dimensions, but got array with shape (64, 128, 100)

Any idea?

As the output shape is 3-dim anyway, I tried to change line 81 to

return (input_shape[0], input_shape[1], input_shape[2])

but then I get a different error and the model does not compile.
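That error is expected: the layer's call() sums over the time axis, so its output really is 2-D, and the targets passed to fit have to match that shape. Changing compute_output_shape to report 3-D only makes the declared shape disagree with the tensor call() actually returns, hence the new error. A quick NumPy check of what the weighted sum produces, using sizes matching the model above (GRU width 512, uniform attention weights for illustration):

```python
import numpy as np

batch, timesteps, features = 64, 128, 512         # GRU output in the model above
x = np.ones((batch, timesteps, features))
a = np.full((batch, timesteps), 1.0 / timesteps)  # attention weights

out = (x * a[:, :, None]).sum(axis=1)             # what the layer returns
print(out.shape)  # (64, 512): 2-D, so 3-D targets like (64, 128, 100) can't match
```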
