Authors: Jorge Balazs and Yutaka Matsuo

REVIEWER 1

Reviewer's Scores

               Appropriateness (1-5): 4
                       Clarity (1-5): 3
  Originality / Innovativeness (1-5): 2
       Soundness / Correctness (1-5): 2
         Meaningful Comparison (1-5): 3
                  Thoroughness (1-5): 3
    Impact of Ideas or Results (1-5): 2
                Recommendation (1-5): 2
           Reviewer Confidence (1-5): 4

Detailed Comments

This paper provides an empirical study on gating mechanisms for combining character- and word-level word representations. The paper is not well structured and is hard for most readers to understand.

One concern is that the authors emphasize background material rather than explaining their experiments properly. They devote a large portion of the paper to references yet discuss only two major related works; a better organized literature review section would help.

On the other hand, they do not explain their experiments in detail, and the experiments are not clear to readers who are unfamiliar with this particular field. Besides, they do not provide a sufficient amount of original work to be convincing for publication.

Moreover, the paper is essentially a study of existing work rather than a well-structured model. The paper fails to show a proper workflow for the experiments and is not very well written. The main concern is improving the experiments section.

REVIEWER 2

Reviewer's Scores

               Appropriateness (1-5): 5
                       Clarity (1-5): 5
  Originality / Innovativeness (1-5): 3
       Soundness / Correctness (1-5): 4
         Meaningful Comparison (1-5): 4
                  Thoroughness (1-5): 4
    Impact of Ideas or Results (1-5): 3
                Recommendation (1-5): 4
           Reviewer Confidence (1-5): 5

Detailed Comments

This work investigates different methods of combining character-level and word-level representations (concat, scalar gate, and vector gate) and evaluates their ability to represent words and sentences. Specifically, the authors train models on the SNLI and MNLI datasets and then use the trained embedding modules to produce word vectors. Those word vectors are used in both the word-level and the sentence-level evaluations. The authors empirically show that the vector gate consistently outperforms the other baselines in the word-level evaluation (word similarity/relatedness tasks), but there is no clear effect in the sentence-level evaluation (SentEval; Conneau et al., 2017).

The vector gate proposed by the authors is based solely on the word-level input, whereas Yang et al. (2017) use additional features (e.g., POS, NER) to compute the gating values. The presentation is very clear, and the paper is easy to follow. The authors extensively evaluate the gating mechanisms on various word similarity and NLI/STS datasets (SentEval), whereas previous work studied only language modeling (Miyamoto and Cho, 2016) and reading comprehension (Yang et al., 2017). The baselines are good (word only, char only, concat, scalar gate), and the results are reported with statistical significance. The visualization of gating values (Figures 2 and 3) shows that rare words tend to use more character information.
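For concreteness, here is a minimal sketch (in PyTorch, not the authors' code) of the three combination strategies discussed above. Dimension names, sizes, and the exact mixing convention are my assumptions for illustration.

```python
# Minimal sketch of concat / scalar gate / vector gate for combining a
# word-level embedding with a character-derived one. Not the authors' code;
# sizes and the (1 - g)/g mixing convention are assumptions.
import torch
import torch.nn as nn


class ScalarGate(nn.Module):
    """g = sigmoid(w . x_word + b): one gating scalar per word."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x_word, x_char):
        g = torch.sigmoid(self.linear(x_word))        # (batch, 1)
        return (1 - g) * x_word + g * x_char


class VectorGate(nn.Module):
    """g = sigmoid(W x_word + b): one gating value per dimension,
    computed solely from the word-level input (unlike Yang et al., 2017,
    who also feed features such as POS/NER into the gate)."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x_word, x_char):
        g = torch.sigmoid(self.linear(x_word))        # (batch, dim)
        return (1 - g) * x_word + g * x_char


if __name__ == "__main__":
    dim, batch = 300, 4                      # assumed embedding size
    x_word = torch.randn(batch, dim)         # word-level embedding
    x_char = torch.randn(batch, dim)         # char-derived embedding
    print(ScalarGate(dim)(x_word, x_char).shape)      # torch.Size([4, 300])
    print(VectorGate(dim)(x_word, x_char).shape)      # torch.Size([4, 300])
    print(torch.cat([x_word, x_char], dim=-1).shape)  # concat baseline
```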

A question that arises here is: is a BiLSTM with max pooling the optimal way to obtain the sentence representations? I assume that the authors follow Conneau et al. (2017), though that work uses word-level input only. It would be nice if the authors could further justify this choice. Another thing people might be interested in is when this approach is useful: given the recent development of LM-based pretraining methods (e.g., ELMo, GPT, BERT), it would be nice if the authors could discuss the use cases of their approach.
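For reference, a minimal sketch of the BiLSTM-with-max-pooling encoder in question, following Conneau et al. (2017); the hidden size here is illustrative, not necessarily the one used in the paper.

```python
# Sketch of a BiLSTM-max sentence encoder (Conneau et al., 2017 style).
# Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class BiLSTMMaxEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, word_vectors):
        # word_vectors: (batch, seq_len, emb_dim), e.g. the gated
        # word/char representations discussed above
        hidden, _ = self.lstm(word_vectors)  # (batch, seq_len, 2*hidden)
        # element-wise max over time yields a fixed-size sentence vector
        sentence, _ = hidden.max(dim=1)      # (batch, 2*hidden)
        return sentence


if __name__ == "__main__":
    enc = BiLSTMMaxEncoder()
    sents = torch.randn(2, 7, 300)           # 2 sentences, 7 tokens each
    print(enc(sents).shape)                   # torch.Size([2, 1024])
```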

REVIEWER 3

Reviewer's Scores

               Appropriateness (1-5): 5
                       Clarity (1-5): 4
  Originality / Innovativeness (1-5): 2
       Soundness / Correctness (1-5): 4
         Meaningful Comparison (1-5): 3
                  Thoroughness (1-5): 3
    Impact of Ideas or Results (1-5): 2
                Recommendation (1-5): 2
           Reviewer Confidence (1-5): 4

Detailed Comments

This paper compares representations learned from different architectures on SNLI/MNLI. The authors train a BiLSTM on top of lexical representations obtained from (a) word embeddings, (b) a BiLSTM over character embeddings (I think; this should be made clearer in the paper), (c) the concatenation of these two representations, or (d) a gate (either scalar or vector) that adds the weighted representations.
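If my reading of the character-level pathway is right, it amounts to something like the following sketch: a BiLSTM runs over a word's characters, and its final hidden states form that word's character-derived embedding. All names and sizes below are assumptions.

```python
# Sketch of a character-level BiLSTM word encoder: the concatenated final
# forward/backward hidden states serve as the word's char-derived embedding.
# Vocabulary and dimension sizes are illustrative assumptions.
import torch
import torch.nn as nn


class CharBiLSTM(nn.Module):
    def __init__(self, n_chars=100, char_dim=50, out_dim=300):
        super().__init__()
        assert out_dim % 2 == 0
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, out_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) integer character indices
        chars = self.emb(char_ids)
        _, (h_n, _) = self.lstm(chars)   # h_n: (2, n_words, out_dim // 2)
        # concatenate last forward and backward hidden states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (n_words, out_dim)


if __name__ == "__main__":
    words = torch.randint(1, 100, (5, 12))  # 5 words, up to 12 chars each
    print(CharBiLSTM()(words).shape)        # torch.Size([5, 300])
```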

They then evaluate their lexical representations on word similarity tasks, and their entire model in a transfer setting on sentence-level tasks. Vector gating has the best performance on the lexical tasks, but on the sentence-level ones there is no clear trend.

Essentially, the main contribution of the paper is that the authors propose a vector gate (a common construct in neural networks: an LSTM, for instance, uses multiple gates of this form), as opposed to the scalar gate from prior work, for combining word and character representations, and show that the resulting representations are better on word similarity tasks. This isn't very surprising to me, since this model is also the most expressive of the ones considered.

I think the paper would be improved if other models were considered. For instance, instead of concatenating embeddings (which incidentally makes the BiLSTM input larger for this model; I don't know whether this was controlled for), you could just average them. Also, what about character n-grams? "Charagram: Embedding Words and Sentences via Character n-grams" shows they are very effective on some of the datasets used in this paper.
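A minimal sketch of these two alternatives: averaging keeps the downstream BiLSTM input size fixed (unlike concatenation, which doubles it), and a Charagram-style representation sums embeddings of a word's character n-grams (Wieting et al., 2016). The hash-based n-gram indexing is my own simplification to keep the example self-contained, not part of the original method.

```python
# Sketch of (1) averaging instead of concatenating, and (2) a
# Charagram-style char n-gram word embedding. Bucket count, n-gram range,
# and hash-based indexing are illustrative assumptions.
import torch
import torch.nn as nn


def average_combine(x_word, x_char):
    # same dimensionality as either input, so the downstream BiLSTM
    # does not grow as it does with concatenation
    return 0.5 * (x_word + x_char)


class CharagramLike(nn.Module):
    def __init__(self, n_buckets=100000, dim=300, n_min=2, n_max=4):
        super().__init__()
        self.emb = nn.Embedding(n_buckets, dim)
        self.n_buckets, self.n_min, self.n_max = n_buckets, n_min, n_max

    def forward(self, word):
        padded = f"#{word}#"                  # boundary markers
        ngrams = [padded[i:i + n]
                  for n in range(self.n_min, self.n_max + 1)
                  for i in range(len(padded) - n + 1)]
        # NOTE: Python's str hash is salted per process; a fixed hash
        # function would be needed for reproducible indices.
        idx = torch.tensor([hash(g) % self.n_buckets for g in ngrams])
        return self.emb(idx).sum(dim=0)       # (dim,)


if __name__ == "__main__":
    print(CharagramLike()("gating").shape)    # torch.Size([300])
```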

Also, overall, their results on the word/sentence tasks are lackluster. A big strength of the original InferSent paper is the GloVe embeddings (see "No Training Required: Exploring Random Encoders for Sentence Classification", which shows that the GloVe embeddings alone explain much of InferSent's performance). I think that to get their numbers up to more interesting levels they could work more with pre-trained embeddings (which is the typical use-case scenario anyway): a learned character representation and a new word representation could then be used to augment the fixed GloVe embeddings.

I think the paper needs some work and added novelty. The setup is interesting, but more could be explored here.
