Text generator based on LSTM model with pre-trained Word2Vec embeddings in Keras
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
__author__ = 'maxim'
import numpy as np
import gensim
import string
from keras.callbacks import LambdaCallback
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Activation
from keras.models import Sequential
from keras.utils.data_utils import get_file
print('\nFetching the text...')
url = 'https://raw.githubusercontent.com/maxim5/stanford-tensorflow-tutorials/master/data/arxiv_abstracts.txt'
path = get_file('arxiv_abstracts.txt', origin=url)
print('\nPreparing the sentences...')
max_sentence_len = 40
with open(path) as file_:
  docs = file_.readlines()
sentences = [[word for word in doc.lower().translate(None, string.punctuation).split()[:max_sentence_len]] for doc in docs]
print('Num sentences:', len(sentences))
print('\nTraining word2vec...')
word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1, window=5, iter=100)
pretrained_weights = word_model.wv.syn0
vocab_size, emdedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)
print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
  most_similar = ', '.join('%s (%.2f)' % (similar, dist) for similar, dist in word_model.most_similar(word)[:8])
  print(' %s -> %s' % (word, most_similar))
def word2idx(word):
  return word_model.wv.vocab[word].index
def idx2word(idx):
  return word_model.wv.index2word[idx]
print('\nPreparing the data for LSTM...')
train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
  for t, word in enumerate(sentence[:-1]):
    train_x[i, t] = word2idx(word)
  train_y[i] = word2idx(sentence[-1])
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)
print('\nTraining LSTM...')
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, weights=[pretrained_weights]))
model.add(LSTM(units=emdedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
def sample(preds, temperature=1.0):
  if temperature <= 0:
    return np.argmax(preds)
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)
def generate_next(text, num_generated=10):
  word_idxs = [word2idx(word) for word in text.lower().split()]
  for i in range(num_generated):
    prediction = model.predict(x=np.array(word_idxs))
    idx = sample(prediction[-1], temperature=0.7)
    word_idxs.append(idx)
  return ' '.join(idx2word(idx) for idx in word_idxs)
def on_epoch_end(epoch, _):
  print('\nGenerating text after epoch: %d' % epoch)
  texts = [
    'deep convolutional',
    'simple and effective',
    'a nonconvex',
    'a',
  ]
  for text in texts:
    sample = generate_next(text)
    print('%s... -> %s' % (text, sample))
model.fit(train_x, train_y,
          batch_size=128,
          epochs=20,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])
@maxim5 maxim5 commented Jan 12, 2018

Detailed explanation can be found on StackOverflow.

@shubham0704 shubham0704 commented Feb 25, 2018

Very helpful. Thanks a lot 👍

@anjalibhavan anjalibhavan commented Apr 26, 2018

Hey! I tried training this with a bigger dataset, but it failed with the following error:
IndexError: list index out of range

I then tried a very small dataset, just two paragraphs, and got the same error.
Kindly look into this soon!

@maxim5 maxim5 commented May 21, 2018

@anjali-123b to show the result, the script uses terms specific to the selected dataset (model, network, deep, ...). If you choose another dataset, those terms will likely be out-of-vocabulary, so you need to change them (line 36 and 83-88 in rev1).
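Editor's note: as a concrete sketch of the point above (not part of the original gist), you could pick seed words straight from your own corpus so they are guaranteed to be in-vocabulary; sentences and word_model are assumed to be built from your dataset as in the script, and the helper names here are hypothetical.

from collections import Counter

# pick a few frequent corpus words so they are guaranteed to be in the Word2Vec vocabulary
counts = Counter(word for sentence in sentences for word in sentence)
seed_words = [word for word, _ in counts.most_common(4)]

# use these in place of the hard-coded terms ('model', 'network', ...)
for word in seed_words:
  print(word, word_model.most_similar(word)[:3])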

@GLambard GLambard commented May 24, 2018

Thanks a lot for your detailed script!
Also, I have a question about the configuration of the model. Why an Embedding layer first? Maybe I misunderstand, but you already have an embedding from word2vec. Why not pass the word2vec representation directly to the LSTM layer? Thank you)

UPDATE: Okay, I got it! You still need the Embedding layer to hold the pre-trained Word2Vec weights, with the option to freeze them or fine-tune them during training. Awesome!
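Editor's note: a minimal sketch of that option (illustration only; the gist's own code leaves the embeddings trainable): load the pre-trained Word2Vec matrix into the Embedding layer and freeze it with trainable=False.

model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=emdedding_size,
                    weights=[pretrained_weights],
                    trainable=False))  # set trainable=True to fine-tune the embeddings instead
model.add(LSTM(units=emdedding_size))
model.add(Dense(units=vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')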

@AlbinSou AlbinSou commented Jun 15, 2018

Hello, thank you for the script.
I am quite new to LSTMs and I don't understand why you only put one word in your train_y vector (train_y[i] = sentence[-1]). Couldn't you instead try to predict every word in each sentence given the preceding words? I.e. for "Brian is in the kitchen" you'd have
train_x = [["Brian"], ["Brian", "is"], ["Brian", "is", "in"]] with the corresponding train_y = [["is"], ["in"], ["the"]].
Thank you in advance
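Editor's note: a minimal sketch of the alternative setup described in this comment (hypothetical code, not part of the gist): every (prefix, next word) pair becomes a training example, with prefixes zero-padded to max_sentence_len.

pairs = []
for sentence in sentences:
  idxs = [word2idx(w) for w in sentence]
  for t in range(1, len(idxs)):
    pairs.append((idxs[:t], idxs[t]))  # (prefix, next word)

train_x = np.zeros([len(pairs), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(pairs)], dtype=np.int32)
for i, (prefix, target) in enumerate(pairs):
  train_x[i, :len(prefix)] = prefix  # left-aligned; the rest stays zero
  train_y[i] = target

Note that index 0 is a real word in this vocabulary, so a dedicated padding index (and mask_zero=True on the Embedding layer) would be cleaner.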

@SamsadSajid SamsadSajid commented Aug 21, 2018

In the above code, from lines 63-71 you first convert the preds to log, then back from log with an exponential, then draw from a multinomial, and you refer to that multinomial draw as the probability. Why is that? Is it because you are considering the multinomial distribution?
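Editor's note: for reference, the log/exp pair in sample() just re-weights the softmax output by a temperature before a single multinomial draw; an equivalent sketch (illustration only, with a hypothetical function name):

def sample_with_temperature(preds, temperature=1.0):
  preds = np.asarray(preds).astype('float64')
  scaled = preds ** (1.0 / temperature)           # same effect as exp(log(preds) / temperature)
  scaled /= scaled.sum()                          # re-normalize to a probability vector
  return np.random.choice(len(scaled), p=scaled)  # one draw, like np.random.multinomial(1, scaled)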

@orenisp orenisp commented Sep 6, 2018

@maxim5 Can you please explain why the shape of the prediction is [40, 1350] and not [1, 1350]?
If you want the probability of each word given the context, why are there 40 rows?

@satishpasumarthi satishpasumarthi commented Oct 1, 2018

After training the model, if we try to print the accuracy, it shows as 1. What could I be doing wrong here?

@dhineshkumar-r dhineshkumar-r commented Jan 7, 2019

Hi, thanks for the article. I have a question about how you chose the maximum sentence length. How is that number deduced?
Is it okay if I choose the average number of words per sentence across the whole corpus? If yes, how do I handle sentences with more words than the average?

Thank you.
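Editor's note: one data-driven option (a sketch, not the author's recommendation) is to derive a cutoff from the length distribution and pad or truncate every sentence to it, e.g. with Keras' pad_sequences:

from keras.preprocessing.sequence import pad_sequences

lengths = [len(s) for s in sentences]
cutoff = int(np.percentile(lengths, 95))  # e.g. the 95th percentile instead of the mean

encoded = [[word2idx(w) for w in s] for s in sentences]
train_x = pad_sequences(encoded, maxlen=cutoff, padding='post', truncating='post')
# caveat: pad_sequences pads with 0, which is a real word index in this vocabulary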

@keineahnung2345 keineahnung2345 commented Feb 17, 2019

In Python 3.x:

s.translate(None, string.punctuation)

should be replaced with:

import string
# make a translator object
translator = str.maketrans('', '', string.punctuation)
s.translate(translator)
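Editor's note: for reference, here is how that fix would slot into the sentence-preparation step of the script (a sketch assuming Python 3):

import string

translator = str.maketrans('', '', string.punctuation)
sentences = [doc.lower().translate(translator).split()[:max_sentence_len] for doc in docs]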
@poddubnyoleg poddubnyoleg commented Mar 11, 2019

Hi! What is the point of using a one-hot vector on the prediction layer?

model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))

It seems more suitable to predict the embedding vector itself with a Dense layer with linear activation:

Dense(emdedding_size, activation='linear')

Because if the network outputs the word Queen instead of King, the gradient should be smaller than if it outputs the word Apple (with one-hot predictions these gradients would be the same).
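Editor's note: a rough sketch of that alternative (illustration only, not the gist's approach): regress the target word's embedding with a linear Dense layer and a cosine loss, then map the predicted vector back to a word through the Word2Vec index.

reg_model = Sequential()
reg_model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size,
                        weights=[pretrained_weights]))
reg_model.add(LSTM(units=emdedding_size))
reg_model.add(Dense(units=emdedding_size, activation='linear'))
reg_model.compile(optimizer='adam', loss='cosine_proximity')  # loss name from older Keras versions

# targets become embedding vectors instead of word indices
train_y_vec = pretrained_weights[train_y]
reg_model.fit(train_x, train_y_vec, batch_size=128, epochs=20)

# at generation time, map the predicted vector back to the nearest word
pred_vec = reg_model.predict(train_x[:1])[0]
nearest_word, _ = word_model.wv.similar_by_vector(pred_vec, topn=1)[0]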

@shumaila96 shumaila96 commented Jul 3, 2019

@maxim5 Can you please explain why the shape of the prediction is [40, 1350] and not [1, 1350]?
If you want the probability of each word given the context, why are there 40 rows?

Hi... have you found the answer?

@sharonwoo sharonwoo commented Jul 15, 2019

@keineahnung2345, I was able to get the code working in Python 3.6 by using

s.translate(string.punctuation)

instead of

s.translate(None, string.punctuation)

@gugarosa gugarosa commented Sep 18, 2019

@maxim5 Can you please explain why the shape of the prediction is [40, 1350] and not [1, 1350]?
If you want the probability of each word given the context, why are there 40 rows?

Hi... have you found the answer?

It happens because the prediction is returned over all timesteps. Since max_sentence_len is 40, the prediction is a tensor of shape (max_sentence_len, vocabulary_size).

This usually happens when the LSTM layer argument return_sequences is True; I'm not sure why it is active in this case (I haven't run the code), since the default is False (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/LSTM).
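Editor's note (hedged): another likely factor is that generate_next() passes a flat 1-D array of word indices, which Keras expands into a batch of 40 one-word sequences, producing one softmax row per input word. Feeding an explicit (1, sequence_length) batch yields a single row:

x = np.array(word_idxs)[np.newaxis, :]        # shape (1, len(word_idxs))
prediction = model.predict(x)                 # shape (1, vocab_size)
idx = sample(prediction[0], temperature=0.7)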

@alexpro1987 alexpro1987 commented Apr 13, 2020

Awesome work!

@CealClem CealClem commented May 6, 2020

@anjali-123b to show the result, the script uses terms specific to the selected dataset (model, network, deep, ...). If you choose another dataset, those terms will likely be out-of-vocabulary, so you need to change them (line 36 and 83-88 in rev1).

Hi, what are we supposed to replace these lists with? Common words from our own dataset? Sorry, I don't quite understand...
Thanks
