Reviewer #2 comments ===========
This work experimentally compares different hyper-parameters of a seq2seq model with attention for Automatic Speech Recognition (ASR) and performs some analysis on the trained models. The title is too broad: the paper considers attention models only for ASR, and the analysis is not comprehensive enough. Among other things, it looks at the errors of the beam search, the effect of beam size, the effect of the number of LSTM layers in the encoder, the number of hidden units, and the effect of pre-training. It would be much appreciated if you focused more on deeper analysis of the models, as you do in the later sections, rather than on the structure of the networks.
The paper is easy to read, and the introduction is well-written and cites relevant works. However, going through the paper and judging only by the story-telling around the experiments, it seems that the author(s) are describing their effort at manually tuning an ASR model (which is not a bad thing in itself if presented right). If that i