Each record below lists: title, authors, decision_raw, forum, confidence, rating, review.
title: Improving Discriminator-Generator Balance in Generative Adversarial Networks
authors: Simen Selseng and Björn Gambäck
decision_raw: Reject
forum: SyBPtQfAZ
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 3: Clear rejection
review:

The paper proposes a variety of modifications to improve GAN training and evaluates them using a variant of the Generative Adversarial Metric.

The first approach, Static Reusable Noise, samples a fixed set of latent noise vectors instead of drawing fresh ones online. It is motivated by the observation that the generator encounters different noise samples at each training iteration, while for real data the discriminator sees only a fixed number of samples. This does not seem a particularly convincing argument; one could argue just as well that it makes the discriminator's job easier, since it only has to track the finite set of samples the generator can produce instead of the full distribution. Experimentally it does not appear to help consistently.

The second approach, Image-based noise generation, replaces noise sampled from a simple distribution (such as Gaussian or uniform noise) with a downsampled and grayscaled version of training images. This does not seem particularly well motivated. Usually the latent space of a generative model is assumed to represent high-level disentangled/compositional factors of the data, not a low-level, coarse-grained representation of the image; conditional GANs make this explicit. The paper's experimental section suggests this failed to improve upon the baseline in all cases. The approach has the additional drawback that there is no longer a tractable process for drawing new samples from the model, since it depends on samples from the data distribution.

The third variant, audition-based noise selection (ABNS), selects a subset of generator samples in each batch to train on. Two variants are considered: selecting only the best-performing samples (according to the generator), or a mix of the best, worst, and random samples. This approach most directly addresses the stated problem of generator-discriminator imbalance and seems the best motivated. Under the paper's experimental methodology it also appears to be the one that potentially helps the most.

The proposed changes are motivated by the intuition that they better balance the difficulty of the generator's and discriminator's tasks. To partially validate this, the authors could check whether the training loss terms of the generator and discriminator are better balanced on average. Could the authors comment on whether they observed this to be the case?

For comparisons, the number of training epochs appears to be chosen arbitrarily per experiment: throughout the paper, comparisons are reported at epochs 8, 14, 16, 18, 20, 22, 25, 40, and 50. The graphs visualizing the reported metric over training show large oscillations between epochs, which makes it difficult to draw strong conclusions from the results. The authors could strengthen their results by evaluating with a more widely used metric, such as the Inception Score (Salimans et al., 2016), which has seen wide adoption yet is missing from the evaluation metrics discussed in Section 4.3.

Overall, the paper presents a variety of proposed changes to GAN training. Several of the proposed approaches seem ad hoc and not particularly well motivated. One of them, ABNS, is potentially promising, but the experimental methodology appears to have significant issues that make it difficult to evaluate the results and draw strong conclusions from them. The reader is left with an unclear picture of the value of the proposed approaches.
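To make the ABNS idea concrete, here is a minimal sketch of what a batch-level audition step could look like. The function names, the use of discriminator scores for ranking, and the PyTorch framing are illustrative assumptions, not the paper's implementation.

```python
import torch

def audition_select(generator, discriminator, noise, k, strategy="best"):
    """Hypothetical audition step: generate candidates from a batch of noise
    vectors and keep only a subset of them for the generator update
    (a sketch of ABNS as described in the review, not the paper's code)."""
    with torch.no_grad():
        fakes = generator(noise)                  # candidate samples
        scores = discriminator(fakes).squeeze()   # how "real" each candidate looks
    order = torch.argsort(scores, descending=True)
    if strategy == "best":
        keep = order[:k]                          # only the highest-scoring candidates
    else:  # "mixed": equal parts best, worst, and random (k assumed divisible by 3;
           # the random part may overlap with best/worst in this simple sketch)
        third = k // 3
        rand_idx = torch.randperm(len(order))[:third]
        keep = torch.cat([order[:third], order[-third:], rand_idx])
    return noise[keep]  # the generator is then updated only on these noise vectors
```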
title: Improving Discriminator-Generator Balance in Generative Adversarial Networks
authors: Simen Selseng and Björn Gambäck
decision_raw: Reject
forum: SyBPtQfAZ
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 3: Clear rejection
review:

[Overview] This paper proposes four different ways to balance the discriminator and generator during GAN training: static reusable noise, image-based noise generation, audition-based noise selection, and a generative multi-adversarial network with a historic discriminator. Based on GAM (Im et al., 2016), the authors compare the modified GAN models against DCGAN on three datasets (CelebA, CIFAR-10, and IKEA) to show the performance of the different balancing methods.

[Strengths] The paper tries multiple ways to balance the generator and discriminator. Through experiments on three datasets, the authors show that ABNS and GMAN achieve better performance than DCGAN under the GAM evaluation. The paper also shows accuracy curves across training epochs.

[Weaknesses]
1. The paper is poorly written; it is hard to follow its storyline. The content, especially the experimental part, is also very redundant, e.g. the figures showing accuracy per epoch. It would be better to show some qualitative results instead.
2. Many evaluation metrics have been proposed, as the authors themselves mention, but no persuasive reason is given for choosing GAM as the evaluation metric in this paper. Metrics such as the Inception Score are now more commonly used in related work; the authors should report the Inception Score as well in the experiments.
3. The intuitions behind the different proposed ways of balancing the discriminator and generator are not clear. The authors should explain the motivation behind the proposed methods.
4. The paper presents no mathematical explanation of the proposed methods, which makes it extremely hard to get a precise sense of them.
5. The organization of the paper is odd. The proposed methods reside in the experimental-setup section. The discussion section should be used to summarize the phenomena that emerge in the experiments, and hence should be short and compact.

[Summary] Based on the above comments, I think this paper is in poor shape in its writing, methods, and experiments. I suggest the authors first reorganize the paper and provide more justification for the proposed methods, both theoretically and empirically, before resubmitting.
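For reference, the Inception Score that the first two reviews ask for (Salimans et al., 2016) scores a generator G by applying a pretrained Inception classifier to its samples:

```latex
\mathrm{IS}(G) = \exp\!\Big( \mathbb{E}_{x \sim p_g}\big[\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \,\big] \Big),
\qquad p(y) = \mathbb{E}_{x \sim p_g}\big[\, p(y \mid x) \,\big],
```

so the score is high when individual samples receive confident class predictions (low-entropy p(y|x)) while the marginal over samples stays diverse (high-entropy p(y)).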
title: Improving Discriminator-Generator Balance in Generative Adversarial Networks
authors: Simen Selseng and Björn Gambäck
decision_raw: Reject
forum: SyBPtQfAZ
confidence: 5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature
rating: 3: Clear rejection
review:

Impossible to compare to prior work:
* No samples for CelebA and CIFAR-10 are provided.
* No accepted metrics from the literature are reported.
* Comparing DCGAN against IBNG is not a fair comparison: IBNG has information about the data distribution, DCGAN does not.

Metrics:
* There is a lot of criticism of existing metrics (which is valid), but then the authors do not report any of the standard metrics and do not display samples.
* The proposed metric does not assess sample quality or overfitting. It compares two models, but if both models fail to capture the data distribution the metric is meaningless. From the paper: "In other words, the two Discriminators must perform about equal on the test dataset for GAM to elect the better GAN." This is a very big assumption.

Missing connection to theory on discriminator versus generator power:
* No mention of Wasserstein GAN, where the discriminator can be made more powerful than the generator.
* No mention of gradient penalties (DRAGAN), a powerful regularizer which could help with this issue (sketched after this review).

Problems with the suggested approaches:
* Static reusable noise: no mention of overfitting, or of whether the generator's ability to generalize is impacted by this. The birthday paradox test (https://arxiv.org/pdf/1706.08224.pdf) could have been employed to check the effect on the support of the distribution, together with a metric that assesses sample quality.
* IBNG: makes use of data information, so it is not comparable to unconditional GAN methods.
* Audition-based noise selection: an interesting idea, but a heuristic that introduces two more hyperparameters, the selection size and the selection strategy, without any theoretical justification.

Overall comments:
* There is no strong contribution in the paper.
* The manuscript needs polishing; there is a lot of repetition and many typos in the current text.
* There is a lot of detail on existing methods, but insufficient detail on the methods employed.
* The paper is not reproducible: learning rates and other training details are missing.
* Will the IKEA dataset be made available?
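For context on the gradient-penalty remark: the DRAGAN regularizer (Kodali et al., 2017) adds a term to the discriminator loss that penalizes the gradient norm at perturbed real data points, roughly of the form (the exact perturbation scheme is simplified here):

```latex
L_{\mathrm{GP}} \;=\; \lambda \,
\mathbb{E}_{x \sim p_{\mathrm{data}},\; \delta \sim \mathcal{N}(0,\, cI)}
\Big[ \big( \| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1 \big)^2 \Big],
\qquad \hat{x} = x + \delta,
```

which constrains how sharply the discriminator can vary around the data manifold and is one standard way to keep a powerful discriminator from overwhelming the generator.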
title: Revisiting Knowledge Base Embedding as Tensor Decomposition
authors: Jiezhong Qiu, Hao Ma, Yuxiao Dong, Kuansan Wang, Jie Tang
decision_raw: Reject
forum: S1sRrN-CW
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 3: Clear rejection
review:

The paper proposes a new method to train knowledge base embeddings using a least-squares loss. For this purpose, it introduces a reweighting scheme for the entries of the original adjacency tensor, derived from an analysis of the cross-entropy loss. In addition, the paper discusses the connection between the margin and cross-entropy losses and evaluates the proposed method on WN18 and FB15k.

The paper tackles an interesting problem, as learning from knowledge bases via embedding methods has become increasingly important for tasks such as question answering, and additional insight into current methods can be an important contribution to advancing the state of the art. However, I am concerned about several aspects of the paper in its current form.

For instance, the derivation in Section 4 is unclear to me, as Eq. 4 suddenly introduces a weighted sum over expectations using the degrees of nodes. The derivation also seems to rely on a very specific negative-sampling assumption (uniform sampling without checking whether the corrupted triple is a true negative). This sampling method is not used consistently across models and also brings its own problems; see, e.g., the LCWA discussion in [4].

In addition, the semantics introduced by the weighting scheme are not clear to me either. Under the proposed method, the probability of edges between high-degree nodes is down-weighted, since the ground-truth labels are divided by the node degrees. Since these weighted labels are then fitted using a least-squares loss, this implies that links between high-degree nodes should be less likely, which seems the opposite of what the scores should look like.

With regard to the significance of the contributions: using a least-squares loss in combination with tensor methods is attractive because it enables ALS algorithms with closed-form updates that can be computed very fast. However, the proposed method still relies on SGD optimization. In this context, it is not clear to me why a tensor framework / least-squares loss would be preferable.

Further comments:
- The paper seems to equate "tensor method" with using a least-squares loss. However, this does not have to be the case; see, for instance, [1, 2], which propose logistic and Poisson tensor factorizations, respectively.
- The distinction between tensor factorization and neural methods is unclear. Tensor factorization can be interpreted simply as a particular scoring function; see [5] for a detailed discussion.
- The margin-based ranking loss was proposed earlier than (Collobert et al., 2011); see, for instance, [3].
- p1: corrupted triples are not described entirely correctly; typically only one of s or o is corrupted.
- Closed-form tensor in Table 1: should this be the least-squares loss of f(s,p,o) and log(...)?
- p6: adding the constant to the tensor as proposed in (Levy & Goldberg, 2014) can be done while gathering the minibatch and is therefore equivalent to the proposed approach.

[1] Nickel et al., Logistic Tensor Factorization for Multi-Relational Data, 2013.
[2] Chi et al., On Tensors, Sparsity, and Nonnegative Factorizations, 2012.
[3] Collobert et al., A Unified Architecture for Natural Language Processing, 2008.
[4] Dong et al., Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion, 2014.
[5] Nickel et al., A Review of Relational Machine Learning for Knowledge Graphs, 2016.
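For readers less familiar with the objectives this review contrasts, the two standard per-triple losses have the following shapes. The notation here is assumed for illustration: f(s,r,o) is the plausibility score of a triple, (s',r,o') a corrupted triple, gamma a margin, sigma the logistic sigmoid; sign conventions flip when f is a distance rather than a score.

```latex
% margin-based ranking loss over a positive triple and its corruptions
L_{\text{margin}} \;=\; \sum_{(s',r,o')} \max\!\big(0,\; \gamma + f(s',r,o') - f(s,r,o)\big)

% negative-sampling (binary cross-entropy) loss
L_{\text{NS}} \;=\; -\log \sigma\!\big(f(s,r,o)\big) \;-\; \sum_{(s',r,o')} \log \sigma\!\big(-f(s',r,o')\big)
```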
title: Revisiting Knowledge Base Embedding as Tensor Decomposition
authors: Jiezhong Qiu, Hao Ma, Yuxiao Dong, Kuansan Wang, Jie Tang
decision_raw: Reject
forum: S1sRrN-CW
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 5: Marginally below acceptance threshold
review:

This paper deals with the problem of representation learning from knowledge bases (KBs), given in the form of subject-relationship-object triplets. The paper has two main contributions: (1) showing that two commonly used loss functions, margin-based and negative-sampling-based, are closely related to each other; and (2) showing that many KB embedding approaches can be reduced to a tensor decomposition problem in which the tensor entries are a certain transformation of the original triplet values.

Contribution (1), the connection between margin-based and negative-sampling-based loss functions, is somewhat obvious in hindsight, and I am not sure whether it has gone unrecognized in prior work (I am not very well versed in this area). Regardless, even though this connection is moderately interesting, I am not sure of its practical usefulness. I would like the authors to comment on this aspect.

Contribution (2), which shows that KB embedding approaches based on popular loss functions such as margin-based or negative sampling can be cast as tensor factorization of a certain transformation of the original data, is also interesting. However, similar connections have been studied for word-embedding methods: prior work has shown that word-embedding methods that optimize losses such as negative sampling can be seen as performing implicit matrix factorization of a transformed version of the word counts. Contribution (2) therefore seems similar in spirit to that line of work.

Overall, the paper does have some interesting insights, but it is unclear whether these insights are non-trivial or surprising, and whether they have much practical utility. I would like the authors to respond to these concerns.
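The word-embedding precedent the review alludes to is Levy and Goldberg (2014), who showed that skip-gram with negative sampling implicitly factorizes a shifted pointwise mutual information matrix:

```latex
W C^{\top} \approx M, \qquad
M_{ij} \;=\; \mathrm{PMI}(w_i, c_j) - \log k
        \;=\; \log \frac{\#(w_i, c_j)\,\lvert D \rvert}{\#(w_i)\,\#(c_j)} - \log k,
```

where k is the number of negative samples and |D| the total number of observed word-context pairs; the review's point is that reducing negative-sampling objectives to factorization of a transformed co-occurrence structure is an established idea.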
title: Revisiting Knowledge Base Embedding as Tensor Decomposition
authors: Jiezhong Qiu, Hao Ma, Yuxiao Dong, Kuansan Wang, Jie Tang
decision_raw: Reject
forum: S1sRrN-CW
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 3: Clear rejection
review:

The paper proposes a unified view of multiple methods for learning knowledge base embeddings. The paper's motivations are interesting, but the execution does not meet the standard for a publication at ICLR. Main reasons:
* Section 3 does not bring much value. It is a rewriting trick that many knew about but never thought of publishing.
* Section 4.1 is either incorrect or clearly misleading. What happens to the summation terms related to the negative samples (o ≠ o' and s ≠ s') between the last equation and the two before it (on the expectations) at the bottom of page 4? They vanish even though they depend on the single triple (s, r, o), no?
* The independence assumption at the top of page 5 is clearly too strong in the case of multi-relational graphs, where triples are all interconnected.
* In 4.2, writing that both RESCAL and KBTD explain an RDF triple through a similar latent form (recalled after this review) is not an observation that could explain intrinsic similarities between the methods, but the direct consequence of the deliberate choice made for f(·) on the line before.
* The experiments are hard to use to validate the model because they are based on truly outdated baselines. Most methods in Tables 4 and 5 perform well below their best known performance.
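For reference, the bilinear latent form at issue is RESCAL's scoring function (Nickel et al., 2011), in which each triple is scored as

```latex
f(s, r, o) \;=\; \mathbf{e}_s^{\top} \mathbf{W}_r \, \mathbf{e}_o,
\qquad \mathbf{e}_s, \mathbf{e}_o \in \mathbb{R}^{d}, \quad \mathbf{W}_r \in \mathbb{R}^{d \times d},
```

so any method that deliberately adopts a similar f(·) will, by construction, "explain" a triple through the same kind of latent interaction, which is the reviewer's objection to presenting the similarity as an observation.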
title: Tree2Tree Learning with Memory Unit
authors: Ning Miao, Hengliang Wang, Ran Le, Chongyang Tao, Mingyue Shang, Rui Yan, Dongyan Zhao
decision_raw: Reject
forum: Syt0r4bRZ
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 4: Ok but not good enough - rejection
review:

This paper presents a model to encode and decode trees in distributed representations. This is not the first attempt at such encoders and decoders, yet there is no comparative evaluation against those methods. In fact, it has been demonstrated that it is possible to encode and decode trees in distributed structures without learning parameters; see "Decoding Distributed Tree Structures" and "Distributed tree kernels". The paper should present a comparison with such models.
title: Tree2Tree Learning with Memory Unit
authors: Ning Miao, Hengliang Wang, Ran Le, Chongyang Tao, Mingyue Shang, Rui Yan, Dongyan Zhao
decision_raw: Reject
forum: Syt0r4bRZ
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 5: Marginally below acceptance threshold
review:

This paper proposes a tree-to-tree model that aims to encode an input tree into an embedding and then decode it back into a tree. The contributions of the work are very limited: basic attention models, which have been shown to help model structure, are not included or compared, and method-wise the encoder is not novel while the decoder is rather straightforward. Moreover, the manuscript contains many grammatical errors. In general, it is not ready for publication.

Pros:
- Investigating the ability of distributed representations to encode input structures is interesting in general. Although there has been much previous work, this paper is along that line.

Cons:
- The contributions of the work are very limited. For example, attention, which has been widely used and shown to help capture structure in many tasks, is not included or compared in this paper.
- The evaluation is not very convincing. The baseline performance in MT is too low, and it is unclear whether the proposed model is still helpful when other components (e.g., attention) are considered.
- For the objective function defined in the paper, it may be hard to balance the "structure loss" and "content loss" across different problems; moreover, the loss function may not even be useful in real tasks (e.g., MT), which often have their own objectives (as discussed in the paper). Earlier work on tree kernels (in terms of defining tree distances) may be related to this work.
- The manuscript is full of grammatical errors; some examples: "encoder only only need to", "For for tree reconstruction task", "The Socher et al. (2011b) propose a basic form", "experiments and theroy analysis are done".
title: Tree2Tree Learning with Memory Unit
authors: Ning Miao, Hengliang Wang, Ran Le, Chongyang Tao, Mingyue Shang, Rui Yan, Dongyan Zhao
decision_raw: Reject
forum: Syt0r4bRZ
confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
rating: 2: Strong rejection
review:

Summary: the paper proposes a tree2tree architecture for NLP tasks. Both the encoder and the decoder make use of memory cells: the encoder resembles a tree-LSTM and encodes a tree bottom-up; the decoder generates a tree top-down by first predicting the number of children. The objective function is a linear mixture of the cost of generating the tree structure and the cost of generating the target sentence. The proposed architecture outperforms a recursive autoencoder on a self-to-self tree prediction task and outperforms an LSTM seq2seq model on En-Cn translation.

Comments:
- The idea of tree2tree has been around recently, but it is difficult to make it work. I thus appreciate the authors' effort; however, I wish they had done it more properly.
- The computation of the encoder and decoder is not novel. I was wondering how the encoder differs from a tree-LSTM. The decoder predicts the number of children first, but the authors neither explain why they do that nor compare this to existing tree generators.
- I don't understand the objective function (Eqs. 4 and 5). Neither L is a cross-entropy, because label and childnum are not probabilities. I also don't see why using Adam is more convenient than using SGD.
- I think Eq. 9 is incorrect, because the decoder is not Markovian. To see this, consider recurrent neural networks for language modeling: generating the current word is conditioned on the whole history, not only the previous word.
- I would expect the authors to explain more about how difficult the tasks are (e.g., some statistics about the datasets), how to choose values for lambda, and what the contribution of the new objective is.

About the writing:
- The paper has many problems with wording, e.g. articles and plurality.
- Many terms are incorrect, e.g. "dependent parsing tree" (should be "dependency tree") and "consistency parsing" (should be "constituency parsing").
- In 3.1, Socher et al. do not use an LSTM.
- I suggest the authors do some more literature review on tree generation.
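As a point of reference for the lambda question raised by the last two reviews, the mixed objective they describe has the general shape below. This is a generic illustration of a linear mixture, not the paper's exact Eqs. 4-5.

```latex
L \;=\; \lambda \, L_{\text{structure}} \;+\; (1 - \lambda)\, L_{\text{content}}, \qquad \lambda \in [0, 1],
```

where L_structure penalizes errors in the predicted tree topology (e.g., the number of children at each node) and L_content penalizes errors in the predicted labels or tokens; how to set lambda per task is exactly what the reviewers find unclear.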
title: Combining Model-based and Model-free RL via Multi-step Control Variates
authors: Tong Che, Yuchen Lu, George Tucker, Surya Bhupatiraju, Shane Gu, Sergey Levine, Yoshua Bengio
decision_raw: Reject
forum: HkPCrEZ0Z
confidence: 3: The reviewer is fairly confident that the evaluation is correct
rating: 4: Ok but not good enough - rejection
review:

This paper presents a model-based approach to variance reduction in policy gradient methods. The basic idea is to use a multi-step dynamics model as a "baseline" (more properly a control variate, the terminology the paper uses, but I think baselines are more familiar to the RL community) to reduce the variance of a policy gradient estimator while remaining unbiased. The authors also discuss how best to learn the kind of multi-step dynamics model that is well suited to this problem (essentially, using off-policy data via importance weighting), and they demonstrate the effectiveness of the approach on four continuous control tasks.

This paper presents a nice idea, and I'm sure that with some polish it will become a very nice conference submission. But right now (at least in the version I'm reviewing), the paper reads as half-finished. Several terms are introduced without being properly defined, and one of the key formalisms in the paper (the idea of "embedding" an "imaginary trajectory") remains completely opaque to me. Further, the paper seems to simply leave out some portions: the introduction claims that one of the contributions is "we show that techniques such as latent space trajectory embedding and dynamic unfolding can significantly boost the performance of the model based control variates," but I see literally no section that hints at anything like this (no mention of "dynamic unfolding" or "latent space trajectory embedding" ever occurs later in the paper).

In a bit more detail, the key idea of the paper, to the extent that I understood it, is that the authors introduce a model-based variance-reduction baseline into the policy gradient term. Because (unlike traditional baselines) introducing it alone would affect the actual estimate, they add and subtract this term and separate the policy gradient into two terms: the new policy-gradient-like term is much smaller, and the other term can be computed with less variance using model-based methods and the reparameterization trick.

But beyond this, and despite fairly reasonable familiarity with the subject, I simply don't understand other elements of the paper. The paper frequently refers to "embedding" "imaginary trajectories" into the dynamics model, and I still have no idea what this actually refers to (the definition at the start of Section 4 is completely opaque to me). I also don't really understand why something like this would be needed given the understanding above, though it is likely I'm just missing something here. Still, this borders on being an issue with the paper itself: the idea needs to be described much more clearly if it is central to the paper.

Finally, although I do find the part of the algorithm that I could follow interesting, the second issue is that the results are fairly weak as they currently stand. The improvement over TRPO is quite minor in most of the evaluated domains (other than possibly the swimmer task), even with substantial added complexity to the approach. And the experiments are described with very little detail or discussion of the experimental setup. Neither of these issues is simply due to space constraints: the paper is two pages under the soft ICLR limit, with no appendix. There is nothing wrong with short papers, but in this case both the clarity of presentation and the details are lacking. My honest impression is that this is still work in progress and that the write-up was done rather hastily. I think it will eventually become a good paper, but it is not ready yet.
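To spell out the add-and-subtract step the review describes, in generic notation assumed here (a single-step control variate, not necessarily the paper's multi-step formulation): with a model-derived action-value estimate Q-hat, the policy gradient can be rewritten exactly as

```latex
\nabla_{\theta} J(\theta)
= \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \pi_{\theta}}
  \Big[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\,
        \big( Q^{\pi}(s,a) - \hat{Q}(s,a) \big) \Big]
+ \mathbb{E}_{s \sim \rho_{\pi}}
  \Big[ \nabla_{\theta}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}
        \big[ \hat{Q}(s,a) \big] \Big],
```

where the first term has low variance when Q-hat tracks Q^pi, and the second term can be estimated with low variance through the learned model via the reparameterization trick; since the same quantity is added and subtracted, the overall estimator stays unbiased.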