Reviewer #2 comments
====================

This work experimentally compares different hyper-parameters of a seq2seq model with attention for Automatic Speech Recognition (ASR) and performs some analysis on the trained models. The title is too broad: the work considers attention models only in ASR, and the analysis is not comprehensive enough. Among other things, it looks at the errors of the beam search, the effect of the beam size, the number of LSTM layers in the encoder, the number of hidden units, and the effect of pre-training. It would be much appreciated if you focused more on a deeper analysis of the models rather than the structure of the networks, as you did in the later sections.

The paper is easy to read and the introduction is well written, citing relevant works. However, going through the paper and judging only by the narrative of the experiments, it seems that the author(s) are describing their effort at manually tuning an ASR model (which is not a bad thing by itself, if presented right). If that is not what the author(s) intended to convey, I suggest rewriting and adding new sub-sections for clarity, especially for Section 5.

Some points and questions:

  1. Related works: Distinguish your work from [1].
  2. Related works: You briefly mention the analysis of models in vision tasks. It would be more fitting to also mention and summarize NLP works, especially the ones related to attention.
  3. The RETURNN framework is built for a multi-GPU setting. Is this how you trained your networks?
  4. In Table 1, what is the difference between "attention-this work1-none-BPE 1k" and "attention-[Zeyer et al., 2018b]-none-BPE 1k"? If they are the same, where does the performance gap come from?
  5. In Table 1, "this work ^ 2" does not have a language model but is still better than the others, which is interesting. It would be even more interesting to see the same experiments with a language model. This is important because a high-performing ASR system will most likely have one.
  6. Pre-training experiments: Interestingly enough, the best result is for 4 (and not 6) layers of LSTM with no pre-training. Could this be because of the pre-training scheme?
  7. Line 124: “This might be due to more stable hyper parameters.” At this point, it is still not clear what the difference in hyper-parameters is.
  8. Line 132: “It seems that pretraining allows to train deeper model and helps for the convergence in that case…” This is not obvious from your experiments; Table 4 shows the opposite to some extent. If you are analyzing Table 5, then this conclusion should be stated a bit later.
  9. I suggest adding one row to show the baseline for all the tables.
  10. Author(s) mention an 8% relative boost in performance from their analysis. It would be interesting to compare with Random Hyperparameter Search, preferably with the same number of experiments for a fair comparison.
  11. Line 168: “Related to that, we observe a high training variance. I.e. with the same configuration but different random seeds, we get some variance in the final WER performance.” This is important information that is missing from all of the results tables!
  12. Section 6 and Table 8: For a fixed random seed, where does the non-determinism come from? An explanation would make this clear.
  13. To interpret Figure 4, one needs to see the histogram of sequence lengths. I am not sure that averaging output units over the whole dataset means much if the distribution is heavily skewed. If you know of a similar analysis in the literature, please cite it.
  14. Section 7 - Analysis of the encoder output: It is not clear how the authors found these single neurons.

[1] Prabhavalkar, Rohit, Tara N. Sainath, Bo Li, Kanishka Rao, and Navdeep Jaitly. “An Analysis of ‘Attention’ in Sequence-to-Sequence Models.” In Proc. of Interspeech, 2017.

Rating: 2: Marginally below acceptance threshold
Confidence: 2: The reviewer is fairly confident that the evaluation is correct

Rebuttal comments
=================

This work experimentally compares different hyper-parameters of a seq2seq model with attention for Automatic Speech Recognition (ASR) and performs some analysis on the trained models. The title is too broad: the work considers attention models only in ASR, and the analysis is not comprehensive enough. Among other things, it looks at the errors of the beam search, the effect of the beam size, the number of LSTM layers in the encoder, the number of hidden units, and the effect of pre-training. It would be much appreciated if you focused more on a deeper analysis of the models rather than the structure of the networks, as you did in the later sections.

TODO

The paper is easy to read and the introduction is well written, citing relevant works. However, going through the paper and judging only by the narrative of the experiments, it seems that the author(s) are describing their effort at manually tuning an ASR model (which is not a bad thing by itself, if presented right). If that is not what the author(s) intended to convey, I suggest rewriting and adding new sub-sections for clarity, especially for Section 5.

TODO

1- Related works: Distinguish your work from [1]. Prabhavalkar, Rohit, Tara N. Sainath, Bo Li, Kanishka Rao, and Navdeep Jaitly. “An Analysis of ‘Attention’ in Sequence-to-Sequence Models.” In Proc. of Interspeech, 2017.

Thank you, we have added the appropriate discussion of [1] to the paper.

2- Related works: You briefly mention the analysis of models in vision tasks. It would be more fitting to also mention and summarize NLP works, especially the ones related to attention.

This is indeed more appropriate; we have included the relevant literature.

3- The RETURNN framework is built for a multi-GPU setting. Is this how you trained your networks?

Even though the framework supports multi-GPU training, all experiments were run on a single GPU.

4- In Table 1, what is the difference between "attention-this work1-none-BPE 1k" and "attention-[Zeyer et al., 2018b]-none-BPE 1k"? If they are the same, where does the performance gap come from?

"This work^1" is the exact same baseline; we simply ran it multiple times and selected the best model.

5- In Table 1, "this work ^ 2" does not have a language model but is still better than the others, which is interesting. It would be even more interesting to see the same experiments with a language model. This is important because a high-performing ASR system will most likely have one.

TODO: do LM search for baseline

Thank you, we agree that results in combination with a language model would be fitting. Due to time constraints, we did not include this comparison in the initial submission. The relevant information has been added to the paper.

6- Pre-training experiments: Interestingly enough, the best result is for 4 (and not 6) layers of LSTM with no pre-training. Could this be because of the pre-training scheme?

We suspect so too. For Table 4, the same pretraining scheme was used for all configurations, which is probably not optimal for the different architectures. Even though we could not optimize each encoder configuration individually (due to computation/time constraints), Tables 6 and 7 show modifications to the pretraining scheme which resulted in further improvements for similar encoder configurations.
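To make explicit what we mean by the pretraining scheme, below is a rough, purely illustrative sketch (not our actual RETURNN configuration; the layer counts, step length, and function name are made up for this example) of a schedule that grows the encoder depth over the first epochs and is then reused unchanged across encoder sizes:

```python
# Rough sketch only: a layer-wise pretraining schedule that grows the
# encoder depth step-wise during the first epochs.  All numbers here are
# illustrative, not the values used in the paper.
def encoder_layers_for_epoch(epoch, start_layers=2, final_layers=6,
                             epochs_per_step=2):
    """Return how many encoder LSTM layers to use in a given (1-based) epoch."""
    grown = start_layers + (epoch - 1) // epochs_per_step
    return min(grown, final_layers)

# With these settings the depth grows as 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, ...
schedule = [encoder_layers_for_epoch(e) for e in range(1, 11)]
print(schedule)
```

Reusing one fixed schedule of this kind for both shallow and deep encoders is the kind of mismatch we suspect is behind the result in question.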

7- Line 124: “This might be due to more stable hyper parameters.” At this point, it is still not clear what the difference in hyper-parameters is.

The relevant information has been added to the paper; we meant the BPE vocabulary in particular.

8- Line 132: “It seems that pretraining allows to train deeper model and helps for the convergence in that case…” This is not obvious from your experiments; Table 4 shows the opposite to some extent. If you are analyzing Table 5, then this conclusion should be stated a bit later.

Please note the footnotes in Table 4: we had to adjust the learning rate for the models without pretraining to help them converge. We observed a trend that pretraining helps large and deep models converge without requiring fine-tuning of the learning rate or other hyper-parameters.

9- I suggest adding one row to show the baseline for all the tables.

TODO: space constraints, also: which baseline?

10- Author(s) mention an 8% relative boost in performance from their analysis. It would be interesting to compare with Random Hyperparameter Search preferably with the same number of experiments for a fair comparison.

TODO: time constraints?

11- Line 168: “Related to that, we observe a high training variance. I.e. with the same configuration but different random seeds, we get some variance in the final WER performance.” This is important information that is missing from all of the results tables!

The random seed is fixed for all experiments, unless otherwise mentioned.

12- Section 6 and Table 8: For a fixed random seed, where does the non-determinism come from? An explanation would make this clear.

We suspect this variance stems from non-determinism in the GPU computations.
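For concreteness, here is a minimal sketch of the kind of seeding involved, assuming a TensorFlow 1.x setup as RETURNN used at the time (the actual seeding happens inside the RETURNN config, not via code like this):

```python
# Minimal sketch (not the actual RETURNN setup): even with all of these
# seeds fixed, repeated GPU runs can differ slightly, because some
# CUDA/cuDNN kernels do not guarantee a fixed accumulation order for
# floating-point sums.
import random
import numpy as np
import tensorflow as tf  # TF 1.x API, as used by RETURNN around 2018

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level seed; op-level seeds are derived from it
```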

13- To interpret Figure 4, one needs to see the histogram of sequence lengths. I am not sure that averaging output units over the whole dataset means much if the distribution is heavily skewed. If you know of a similar analysis in the literature, please cite it.

To allow comparison within the dataset, we have length-normalized all sequences to have the same length. The purpose of Figure 4 is to demonstrate that individual neurons have learned to represent the relative sequence offset. Unfortunately, we have not found any directly applicable literature.
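To make the length normalization concrete, here is a minimal sketch of the kind of resampling we mean (the function names are ours, for illustration only): each sequence's per-frame activation of a single encoder unit is interpolated onto a fixed grid of relative positions before averaging over the dataset, so sequences of different lengths contribute comparably regardless of the length distribution.

```python
import numpy as np

def length_normalize(trace, num_positions=100):
    """Resample one unit's per-frame activation trace of shape (T,)
    onto a fixed grid of relative positions in [0, 1]."""
    rel_pos = np.linspace(0.0, 1.0, num=len(trace))
    grid = np.linspace(0.0, 1.0, num=num_positions)
    return np.interp(grid, rel_pos, trace)

def average_over_dataset(traces, num_positions=100):
    """Average length-normalized traces; `traces` is a list of (T_i,) arrays."""
    return np.mean([length_normalize(t, num_positions) for t in traces], axis=0)
```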

14- Section 7 - Analysis of the encoder output: It is not clear how the authors found these single neurons.

An explanation has been added to the paper, thank you.
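As an illustration only (the selection procedure described in the paper is the authoritative one; the code below is a simplified stand-in written for this rebuttal), one way to locate such position-tracking neurons is to correlate each encoder unit's activation trace with the relative frame position and rank the units by the magnitude of that correlation:

```python
import numpy as np

def rank_position_neurons(encoder_outputs, top_k=5):
    """encoder_outputs: list of (T_i, D) activation matrices, one per utterance.
    Returns indices of the units whose activations correlate most strongly
    (in absolute value, averaged over utterances) with the relative position."""
    num_units = encoder_outputs[0].shape[1]
    corr_sum = np.zeros(num_units)
    for enc in encoder_outputs:
        rel_pos = np.linspace(0.0, 1.0, num=enc.shape[0])
        for d in range(num_units):
            c = np.corrcoef(rel_pos, enc[:, d])[0, 1]
            corr_sum[d] += abs(np.nan_to_num(c))
    mean_corr = corr_sum / len(encoder_outputs)
    return np.argsort(-mean_corr)[:top_k]
```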

Rating: 2: Marginally below acceptance threshold
Confidence: 2: The reviewer is fairly confident that the evaluation is correct
