# Image Captioning LSTM


name: LSTM image captioning model based on CVPR 2015 paper "Show and tell: A neural image caption generator" and code from Karpathy's NeuralTalk.



neon_version: v1.0.rc1

neon_commit: 2169b093fbba0c189021a941d286c7a98c0c6c6c

gist_id: 7e76e90664f935c6f65d

## Description

The LSTM model is trained on the flickr8k dataset using precomputed VGG features. Model details can be found in the following CVPR 2015 paper:

Show and tell: A neural image caption generator.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan.
CVPR, 2015 (arXiv:1411.4555)

The model was trained for 15 epochs, where one epoch is one pass over all 5 captions of each image. Training data was shuffled each epoch. To evaluate on the test set, download the model and weights, and run the image captioning example script:

    python image_caption.py --model_file [path_to_weights]
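
To make the epoch definition above concrete, here is a minimal sketch of how one epoch's training pairs could be assembled. The names `vgg_features` (image id to precomputed VGG feature) and `captions` (image id to its 5 captions) are illustrative, not the actual data loader:

    import random

    def epoch_pairs(vgg_features, captions, rng=random):
        """One epoch = one pass over all 5 captions of each image, reshuffled."""
        pairs = [(vgg_features[img], cap)
                 for img, caps in captions.items()
                 for cap in caps]          # 5 (feature, caption) pairs per image
        rng.shuffle(pairs)                 # training data shuffled each epoch
        return pairs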

## Performance

For testing, the model is given only the image and must predict one word at a time until a stop token is predicted. Decoding is currently greedy: the most probable word is taken at each step, as the sketch below shows. Evaluating with a BLEU score script against the 5 reference sentences per image gives the results in the table that follows.
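
The greedy loop can be sketched as follows; `predict_step` is a hypothetical function (not neon's API) that returns a probability distribution over the vocabulary given the image feature and the words emitted so far:

    import numpy as np

    def greedy_decode(image_feature, predict_step, vocab, stop_token="<eos>", max_len=20):
        """Greedy search: take the most probable word at each step until stop."""
        words = []
        for _ in range(max_len):
            probs = predict_step(image_feature, words)   # distribution over vocab
            word = vocab[int(np.argmax(probs))]          # greedy: argmax word
            if word == stop_token:
                break
            words.append(word)
        return " ".join(words)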

| BLEU | Score |
| ---- | ----- |
| B-1  | 54.2  |
| B-2  | 32.6  |
| B-3  | 19.3  |
| B-4  | 12.3  |
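
The exact evaluation script is not linked above; as an illustration only, nltk's `corpus_bleu` computes comparable cumulative scores. The `references` and `hypotheses` arguments are assumed to already hold tokenized captions:

    from nltk.translate.bleu_score import corpus_bleu

    def report_bleu(references, hypotheses):
        """references: one list of 5 tokenized reference captions per test image.
        hypotheses: one tokenized predicted caption per test image."""
        for n in range(1, 5):
            weights = tuple(1.0 / n for _ in range(n))   # uniform up to n-grams
            score = corpus_bleu(references, hypotheses, weights=weights)
            print("B-%d: %.1f" % (n, 100 * score))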

A few things that were not implemented are beam search, L2 regularization, and ensembles. With these, performance would likely be a bit better; a minimal beam search sketch is shown below for reference.
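
A beam search along these lines (again using the hypothetical `predict_step` from the greedy sketch) would keep the k highest-scoring partial captions instead of just one:

    import numpy as np

    def beam_search(image_feature, predict_step, vocab, k=5, stop_token="<eos>", max_len=20):
        """Keep the k best partial captions ranked by cumulative log-probability."""
        beams = [([], 0.0)]  # (words so far, cumulative log-prob)
        for _ in range(max_len):
            candidates = []
            for words, score in beams:
                if words and words[-1] == stop_token:
                    candidates.append((words, score))    # finished beam carries over
                    continue
                probs = predict_step(image_feature, words)
                for idx in np.argsort(probs)[-k:]:       # top-k extensions
                    candidates.append((words + [vocab[int(idx)]],
                                       score + np.log(probs[idx])))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        best = max(beams, key=lambda b: b[1])[0]
        return " ".join(w for w in best if w != stop_token)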
