Authors: Jorge Balazs and Yutaka Matsuo

REVIEWER 1

Reviewer's Scores

               Appropriateness (1-5): 4
                       Clarity (1-5): 3
  Originality / Innovativeness (1-5): 2
       Soundness / Correctness (1-5): 2
         Meaningful Comparison (1-5): 3
                  Thoroughness (1-5): 3
    Impact of Ideas or Results (1-5): 2
                Recommendation (1-5): 2
           Reviewer Confidence (1-5): 4

Detailed Comments

This paper provides an empirical study on gating mechanisms for combining character- and word-level word representations. The paper is not well structured and is hard for most readers to understand.

One concern is that the authors emphasize background material rather than explaining their experiments properly. They devote a large portion of the paper to references yet discuss only two major related works; a better organized literature review section would help.

On the other hand, they do not explain their experiments in detail, and the experiments are not clear to readers who are unfamiliar with this particular field. Besides, they do not provide a sufficient amount of original work to be convincing for publication.

Moreover, the paper is essentially a study of existing work rather than a well-structured model. The paper fails to show a proper workflow for the experiments and is not very well written. The main concern is improving the experiments section.

REVIEWER 2

Reviewer's Scores

               Appropriateness (1-5): 5
                       Clarity (1-5): 5
  Originality / Innovativeness (1-5): 3
       Soundness / Correctness (1-5): 4
         Meaningful Comparison (1-5): 4
                  Thoroughness (1-5): 4
    Impact of Ideas or Results (1-5): 3
                Recommendation (1-5): 4
           Reviewer Confidence (1-5): 5

Detailed Comments

This work investigates different methods of combining character-level and word-level representations (concat, scalar gate, and vector gate) and evaluates their ability to represent words and sentences. Specifically, the authors train models on the SNLI and MNLI datasets and then use the trained embedding modules to produce word vectors. Those word vectors are used in both the word-level and the sentence-level evaluations. The authors empirically show that the vector gate consistently outperforms the other baselines in the word-level evaluation (word similarity/relatedness tasks), but there is no clear effect in the sentence-level evaluation (SentEval; Conneau et al., 2017).

The vector gate proposed by the authors is based solely on the word-level input, whereas Yang et al. (2017) use additional features (e.g., POS, NER) to compute the gating values. The presentation is very clear, and the paper is easy to follow. The authors extensively evaluate the gating mechanisms on various word similarity and NLI/STS datasets (SentEval), whereas previous work studied only language modeling (Miyamoto and Cho, 2016) and reading comprehension (Yang et al., 2017). The baselines are good (word only, char only, concat, scalar gate), and the results are reported with statistical significance. The visualization of gating values (Figures 2 and 3) shows that rare words tend to use more character information.
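For concreteness, here is a minimal sketch (in PyTorch, not the authors' code) of the three combination strategies discussed above. Dimension names, sizes, and the exact mixing convention are my assumptions for illustration.

```python
# Minimal sketch of concat / scalar gate / vector gate for combining a
# word-level embedding with a character-derived one. Not the authors' code;
# sizes and the (1 - g)/g mixing convention are assumptions.
import torch
import torch.nn as nn


class ScalarGate(nn.Module):
    """g = sigmoid(w . x_word + b): one gating scalar per word."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x_word, x_char):
        g = torch.sigmoid(self.linear(x_word))        # (batch, 1)
        return (1 - g) * x_word + g * x_char


class VectorGate(nn.Module):
    """g = sigmoid(W x_word + b): one gating value per dimension,
    computed solely from the word-level input (unlike Yang et al., 2017,
    who also feed features such as POS/NER into the gate)."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x_word, x_char):
        g = torch.sigmoid(self.linear(x_word))        # (batch, dim)
        return (1 - g) * x_word + g * x_char


if __name__ == "__main__":
    dim, batch = 300, 4                      # assumed embedding size
    x_word = torch.randn(batch, dim)         # word-level embedding
    x_char = torch.randn(batch, dim)         # char-derived embedding
    print(ScalarGate(dim)(x_word, x_char).shape)      # torch.Size([4, 300])
    print(VectorGate(dim)(x_word, x_char).shape)      # torch.Size([4, 300])
    print(torch.cat([x_word, x_char], dim=-1).shape)  # concat baseline
```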

A question that arises here is: is a BiLSTM with max pooling the optimal way to obtain the sentence representations? I assume that the authors follow Conneau et al. (2017), though that work uses word-level input only. It would be nice if the authors could further justify this choice. Another thing people might be interested in is when this approach is useful: given the recent development of LM-based pretraining methods (e.g., ELMo, GPT, BERT), it would be nice if the authors could discuss the use cases of their approach.
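For reference, a minimal sketch of the BiLSTM-with-max-pooling encoder in question, following Conneau et al. (2017); the hidden size here is illustrative, not necessarily the one used in the paper.

```python
# Sketch of a BiLSTM-max sentence encoder (Conneau et al., 2017 style).
# Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class BiLSTMMaxEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, word_vectors):
        # word_vectors: (batch, seq_len, emb_dim), e.g. the gated
        # word/char representations discussed above
        hidden, _ = self.lstm(word_vectors)  # (batch, seq_len, 2*hidden)
        # element-wise max over time yields a fixed-size sentence vector
        sentence, _ = hidden.max(dim=1)      # (batch, 2*hidden)
        return sentence


if __name__ == "__main__":
    enc = BiLSTMMaxEncoder()
    sents = torch.randn(2, 7, 300)           # 2 sentences, 7 tokens each
    print(enc(sents).shape)                   # torch.Size([2, 1024])
```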

REVIEWER 3

Reviewer's Scores

               Appropriateness (1-5): 5
                       Clarity (1-5): 4
  Originality / Innovativeness (1-5): 2
       Soundness / Correctness (1-5): 4
         Meaningful Comparison (1-5): 3
                  Thoroughness (1-5): 3
    Impact of Ideas or Results (1-5): 2
                Recommendation (1-5): 2
           Reviewer Confidence (1-5): 4

Detailed Comments

This paper compares representations learned from different architectures on SNLI/MNLI. The authors train a BiLSTM on top of lexical representations obtained from (a) word embeddings, (b) a BiLSTM over character embeddings (I think; this should be made clearer in the paper), (c) the concatenation of these two representations, or (d) a gate (either scalar or vector) that adds the weighted representations.
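If my reading of the character-level pathway is right, it amounts to something like the following sketch: a BiLSTM runs over a word's characters, and its final hidden states form that word's character-derived embedding. All names and sizes below are assumptions.

```python
# Sketch of a character-level BiLSTM word encoder: the concatenated final
# forward/backward hidden states serve as the word's char-derived embedding.
# Vocabulary and dimension sizes are illustrative assumptions.
import torch
import torch.nn as nn


class CharBiLSTM(nn.Module):
    def __init__(self, n_chars=100, char_dim=50, out_dim=300):
        super().__init__()
        assert out_dim % 2 == 0
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, out_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) integer character indices
        chars = self.emb(char_ids)
        _, (h_n, _) = self.lstm(chars)   # h_n: (2, n_words, out_dim // 2)
        # concatenate last forward and backward hidden states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (n_words, out_dim)


if __name__ == "__main__":
    words = torch.randint(1, 100, (5, 12))  # 5 words, up to 12 chars each
    print(CharBiLSTM()(words).shape)        # torch.Size([5, 300])
```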

They then evaluate their lexical representations on word similarity tasks, and their entire model in a transfer setting on sentence-level tasks. Vector gating has the best performance on the lexical tasks, but on the sentence-level ones there is no clear trend.

Essentially, the main contribution of the paper is that the authors propose a vector gate (a common construct in neural networks: an LSTM, for instance, uses multiple gates of this form), as opposed to the scalar gate from prior work, for combining word and character representations, and show that the resulting representations are better on word similarity tasks. This isn't very surprising to me, since this model is also the most expressive of the ones considered.

I think the paper would be improved if other models were considered. For instance, instead of concatenating embeddings (which incidentally makes the BiLSTM input larger for this model; I don't know whether this was controlled for), you could just average them. Also, what about character n-grams? "Charagram: Embedding Words and Sentences via Character n-grams" shows they are very effective on some of the datasets used in this paper.
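A minimal sketch of these two alternatives: averaging keeps the downstream BiLSTM input size fixed (unlike concatenation, which doubles it), and a Charagram-style representation sums embeddings of a word's character n-grams (Wieting et al., 2016). The hash-based n-gram indexing is my own simplification to keep the example self-contained, not part of the original method.

```python
# Sketch of (1) averaging instead of concatenating, and (2) a
# Charagram-style char n-gram word embedding. Bucket count, n-gram range,
# and hash-based indexing are illustrative assumptions.
import torch
import torch.nn as nn


def average_combine(x_word, x_char):
    # same dimensionality as either input, so the downstream BiLSTM
    # does not grow as it does with concatenation
    return 0.5 * (x_word + x_char)


class CharagramLike(nn.Module):
    def __init__(self, n_buckets=100000, dim=300, n_min=2, n_max=4):
        super().__init__()
        self.emb = nn.Embedding(n_buckets, dim)
        self.n_buckets, self.n_min, self.n_max = n_buckets, n_min, n_max

    def forward(self, word):
        padded = f"#{word}#"                  # boundary markers
        ngrams = [padded[i:i + n]
                  for n in range(self.n_min, self.n_max + 1)
                  for i in range(len(padded) - n + 1)]
        # NOTE: Python's str hash is salted per process; a fixed hash
        # function would be needed for reproducible indices.
        idx = torch.tensor([hash(g) % self.n_buckets for g in ngrams])
        return self.emb(idx).sum(dim=0)       # (dim,)


if __name__ == "__main__":
    print(CharagramLike()("gating").shape)    # torch.Size([300])
```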

Also, overall, their results on the word/sentence tasks are lackluster. A big strength of the original InferSent paper is the GloVe embeddings (see "No Training Required: Exploring Random Encoders for Sentence Classification", which shows that the GloVe embeddings alone explain much of InferSent's performance). I think that to get their numbers up to more interesting levels they could work more with pre-trained embeddings (which is the typical use-case scenario anyway): a learned character representation and a new word representation could then be used to augment the fixed GloVe embeddings.

I think the paper needs some work and added novelty. The setup is interesting, but more could be explored here.
