@dmrd
Created April 28, 2016 20:53
Rough notes on a bunch of ICLR conference papers

Repeated topics

  • Attention
  • Memory
  • Reinforcement learning
  • Efficient NNs
  • Bayesian methods
  • Variational inference
  • Neural Turing machines
  • Transfer learning
  • Adversarial networks and examples

ICLR

Observation: many papers cite other papers at same conference

  • Semantic segmentation
  • new cnn module
    • specifically designed for dense prediction
    • dilated convolutions systematically aggregate multiscale information
  • Maintains resolution throughout model
  • Meant to be added to existing semseg models
  • convolution with dilation -> each successive layer convolves a larger area (toy sketch below)
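
A toy sketch (mine, not the paper's code) of why dilation grows the receptive field without losing resolution, assuming a simple zero-padded 1D convolution:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1D convolution whose kernel taps are `dilation` apart."""
    k = len(w)
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.random.randn(32)
w = np.ones(3) / 3.0       # 3-tap averaging filter
h = x
for d in (1, 2, 4):        # receptive field grows 3 -> 7 -> 15
    h = dilated_conv1d(h, w, d)
print(h.shape)             # resolution is preserved: (32,)
```
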
  • Variational autoencoder + priors to encourage “independence between sensitive and latent factors of variation”
  • Goal: “learn representations that are explicitly invariant with respect to some known aspect of a dataset while retaining as much remaining information as possible”
  • There are many ways to evaluate generative models
  • The 3 big ones are largely independent, so maximizing one != maximizing the others
    • average log-likelihood
    • parzen window estimates
      • Probably don’t want to use these in general
    • visual fidelity
  • Empirically evaluate lstms to recognize patterns in clinical measurement timeseries
  • establish effectiveness for clinical data
  • Trained with untimestamped diagnoses, but can add timestamps later to do early diagnoses
  • simple training strategy: replicate targets at each sequence step
  • Does a better job at diagnosis than an MLP on hand-engineered features
  • Still pretty bad
  • Replay lets RL agents remember and reuse experiences
  • Prior work samples experiences uniformly from the past
    • Does not weight “significant” experiences more heavily; sampling is just based on frequency
  • They develop framework to replay significant experiences more frequently
  • “replay” is of state transitions
    • Some transitions are more surprising than others
  • Proposal: replay transitions with high expected learning progress
    • measured by temporal-difference (TD) error magnitude
    • Avoid losing diversity by doing stochastic prioritization + importance sampling (sketch below)
  • Result: speeds up learning 2x and achieves new record on the Atari benchmark
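
A minimal sketch of proportional prioritization with importance-sampling correction. The list-backed buffer, class name, and hyperparameters are illustrative assumptions; the paper uses a sum-tree for efficient sampling.

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritization: P(i) ~ |TD error|^alpha."""
    def __init__(self, alpha=0.6, beta=0.4):
        self.alpha, self.beta = alpha, beta
        self.transitions, self.priorities = [], []

    def add(self, transition, td_error):
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        probs = p / p.sum()
        idx = np.random.choice(len(p), batch_size, p=probs)
        # importance-sampling weights correct for the non-uniform sampling
        w = (len(p) * probs[idx]) ** (-self.beta)
        w /= w.max()
        return idx, [self.transitions[i] for i in idx], w

    def update(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-6) ** self.alpha

buf = PrioritizedReplay()
for _ in range(100):
    buf.add(("s", "a", 0.0, "s2"), td_error=np.random.randn())
idx, batch, w = buf.sample(32)
buf.update(idx, np.random.randn(32))
```
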
  • VAE: pairs top-down generative network with bottom-up recognition network which approximates posterior inference
    • strong assumptions about posterior inference
      • posterior approximately factorial, parameters can be predicted from observables by nonlinear regression
    • trained to maximize variational lower bound on log-likelihood
  • contribution: train same architecture with a better log-likelihood lower bound derived from importance sampling (bound sketch below)
  • learns a richer representation
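
A rough sketch of the importance-weighted bound for a single datapoint, computed stably with log-sum-exp; the array names are my own, and k = 1 recovers the usual variational lower bound.

```python
import numpy as np

def iw_bound(log_p_xh, log_q_h):
    """log_p_xh, log_q_h: shape (k,) arrays holding log p(x, h_i) and
    log q(h_i | x) for k samples from the recognition network."""
    log_w = log_p_xh - log_q_h                       # log importance weights
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))    # log (1/k) sum_i w_i

# with k = 1 this is the standard ELBO estimate
print(iw_bound(np.array([-3.0, -2.5]), np.array([-1.0, -1.2])))
```
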
  • 3 stages of compression (sketch of the first two below)
    1. Prune network to just important connections
      • repeatedly train and prune connections
    2. quantize weights to enforce weight sharing
      • cluster weights, generate codebook, [quantize to codebook -> retrain codebook and repeat quantization]
    3. apply huffman coding
      • encode weights and indices
  • Results: compress AlexNet 35x, VGG-16 49x
    • speedup: 3-4x layerwise speedup
    • No loss in accuracy (!)
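
A hypothetical sketch of the first two stages (magnitude pruning, then k-means weight sharing) on a flat weight vector; the keep fraction and cluster count are made up for illustration, and the retrain steps are omitted.

```python
import numpy as np

def prune(w, keep_frac=0.1):
    """Keep only the largest-magnitude weights (pruning stage)."""
    thresh = np.quantile(np.abs(w), 1 - keep_frac)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize(w, n_clusters=16, iters=20):
    """Weight sharing via k-means: nonzero weights snap to codebook centroids."""
    nz = w[w != 0]
    codebook = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(iters):
        assign = np.argmin(np.abs(nz[:, None] - codebook[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                codebook[c] = nz[assign == c].mean()
    wq = w.copy()
    wq[w != 0] = codebook[assign]
    return wq, codebook          # indices + codebook would then be Huffman coded

w = np.random.randn(1000)
wq, cb = quantize(prune(w))
print(np.unique(wq).size)        # at most n_clusters + 1 distinct values
```
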
  • Augment DGPs with recognition model
  • Learn a set of small basis filters, which it then learns to combine into more complex filters
  • Comparable accuracy and less compute
    • 26% less compute and 41% fewer parameters than GoogLeNet
  • Not that impressive compared to `deep compression`
  • interesting part is learning set of basis filters. Linear combination of bases = more complex filter
  • approach
    • forward pass: stochastically binarize weights (“BinaryConnect”; sketch below)
      • benefit: Converts multiplications to computing sign changes
    • backwards pass: quantized backprop
      • quantize representations at each layer to convert remaining multiplications to binary shifts
  • performance: Tends to IMPROVE rather than damage performance
    • potential reasons
      • implicit regularization
      • low precision = harder to overfit and improved generalization
  • Promising for hardware implementation
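
A sketch of the stochastic binarization step, assuming the hard-sigmoid probability used in the BinaryConnect line of work; real training keeps full-precision weights around for the update, which this toy omits.

```python
import numpy as np

def stochastic_binarize(w):
    """Binarize weights to {-1, +1}; P(+1) is a hard sigmoid of w."""
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)          # hard sigmoid
    return np.where(np.random.rand(*w.shape) < p, 1.0, -1.0)

w = np.random.randn(4, 4) * 0.5
wb = stochastic_binarize(w)
print(wb)   # forward pass uses wb, so multiplications become sign flips
```
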
  • Proposes new regularizer: DeCov
  • Minimize the cross-covariance of hidden activations (penalty sketch below)
  • results:
    • always reduced overfitting
    • maintained or increased generalization performance
      • often improved over dropout
      • combining with dropout often worked best
  • affects all layers up to the one it is applied to
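
A sketch of a DeCov-style penalty on a batch of activations: squared off-diagonal entries of the covariance matrix, so units are decorrelated rather than driven to zero. The batch shape and weighting are illustrative assumptions.

```python
import numpy as np

def decov_penalty(h):
    """h: activations of one layer, shape (batch, units)."""
    hc = h - h.mean(axis=0, keepdims=True)
    cov = hc.T @ hc / h.shape[0]
    off_diag = cov - np.diag(np.diag(cov))   # keep only cross-covariances
    return 0.5 * np.sum(off_diag ** 2)

h = np.random.randn(128, 32)
print(decov_penalty(h))   # added to the task loss with some weight
```
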
  • creates new loss function for boundary detection
  • better than human performance on Berkeley segmentation dataset (0.808 vs. 0.803) measured by F-measure
  • Slow: 320x420 image processed in 1 second
  • steps
    • processes at 3 scales by CNN
    • fuse scales and send to normalized cuts algorithm
    • combine “spectral boundaries” (eigenvalues) with original boundary map, non-max suppression, and optional thresholding
  • interesting
    • Graduated learning scheme - originally trained on slightly easier objective function to initialize the network
  • generative process of images conditioned on captions
    • captions = sequence of consecutive words
    • images = sequence of patches drawn on a canvas
    • sequence-to-sequence framework
  • Language model: bidirectional attention RNN
  • Image model: conditional DRAW network
    • extend DRAW network to have a caption representation at each step
  • trained to maximize variational lower bound on marginal likelihood of the correct image given the input caption
  • postprocessing
    • wow what a sentence:

    “adversarial network trained on residuals of a Laplacian pyramid conditioned on the skipthought representation (Kiros et al., 2015) of the captions to sharpen the generated images,”

  • Determines if 2 sentences
    • contradict
    • do not relate
    • the first sentence (premise) entails the second sentence (hypothesis)
  • Simple prior work:
    • map each sentence into semantic space with LSTMs,
    • concatenate and feed to MLP
  • Takes both sentences (premise and hypothesis) at once
    • Different LSTMs for premise and hypothesis, but hypothesis LSTM initialized with the hidden state of the premise LSTM
  • word-by-word attention mechanism
    • Adding attention boosts 80.9->83.5% on benchmark
  • word2vec as word input representation
  • Further develop tensor decomposition technique for speeding up CNNs
  • new algorithm for computing low-rank tensor decomposition to remove redundancy in conv kernels
  • reduces forward time of VGG-16 by 50% with comparable accuracy
  • low rank constrained CNNs sometimes do better than unconstrained. regularization?
  • Simple idea: replace 4d conv kernel with 2 consecutive kernels of lower rank (idea from a 2014 paper; sketch below)
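
A generic low-rank sketch of the idea (plain SVD of a 2D weight matrix, not the paper's exact tensor decomposition): two thinner factors applied in sequence approximate the original layer at lower cost when the rank is small.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Split W (out, in) into two thinner matrices whose product approximates W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]      # (out, rank)
    B = Vt[:rank, :]                # (rank, in)
    return A, B

W = np.random.randn(256, 512)
A, B = low_rank_factorize(W, rank=32)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))   # relative approximation error
```
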

Super theoretical

  • Distillation and privileged information are techniques that enable machines to learn from other machines
    • Paper introduces combination: generalized distillation
  • Privileged information
    • Each training example is (x_i, x_i^*, y_i), i.e. (feature, additional information from “teacher” only available during training, label)
  • Distillation
    • simple machine learns complex task by imitating solution of a flexible machine
  • Content retrieval
  • contributions
    • propose compact image region representation derived from cnn layer activations
      • encode multiple image regions at once
    • ? use “generalized mean” to enable using integral images along with max-pooling. lets them do particular object localization directly in the 2d cnn activation maps
    • localization also used for image re-ranking and helps define a simple query expansion method
  • interesting idea: Maximum Activations of Convolutions
    • the representation does not encode location of activations and does a max pool over single region of size WxH. It only encodes max response of each conv filter and is translation invariant

Simple approach

  • Propose layer-sequential unit-variance (LSUV) initialization (sketch below)
    1. pre-init weights of each conv and inner product layer with orthonormal matrices
    2. proceed from 1st to last layer and normalize variance of output of each layer to be equal to 1
  • performs at least as well as standard methods
  • at least as fast as complex schemes for very deep nets like highway networks
  • state-of-the-art or close-to-state-of-the-art results
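
A toy LSUV-style sketch for a stack of linear+ReLU layers: orthonormal pre-init, then walk the layers in order rescaling each weight matrix until its output variance is ~1 on a data batch. The tolerances and layer types are chosen just for illustration.

```python
import numpy as np

def orthonormal(shape):
    """Orthonormal pre-initialization (step 1)."""
    q, _ = np.linalg.qr(np.random.randn(*shape))
    return q[:shape[0], :shape[1]]

def lsuv_init(weights, x, tol=0.05, max_iter=10):
    """Step 2: normalize each layer's output variance to 1, first to last."""
    h = x
    for i, W in enumerate(weights):
        for _ in range(max_iter):
            v = (h @ W.T).var()
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)                # rescale until unit output variance
        weights[i] = W
        h = np.maximum(h @ W.T, 0.0)       # ReLU, then move to next layer
    return weights

x = np.random.randn(256, 64)
weights = lsuv_init([orthonormal((64, 64)) for _ in range(4)], x)
```
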

Variational inference!

  • propose to combine “generative unsupervised feature learning” with a “probabilistic treatment of oracle information like triplets”
  • Joint unsupervised generative model over observations and triplet-constraints provided by an oracle
    • variational belief network, but could be used on GPs and other probabilistic models
  • learns features from images without labels, but feature learning is additionally guided by an oracle
    • provides a joint model of these components
    • oracle helps guide features to be semantically meaningful
      • allows semantic masking such as “take this face image, generate image conditioned on light from the right side”
  • try to “transfer implicit oracle knowledge into explicit parametric model”
  • Not much work on learning sentence representations that work as well across domains as word embeddings do
    • explore compositional models that can encode word sequences into vectors s.t. similar meaning = high cosine similarity
  • compare many architectures on paraphrase datasets, both with similar test/train characteristics and applied to other domains
  • Complex models such as LSTMs work best on in-domain data
  • For out of domain data, simple architectures like word averaging perform best
  • Models
    1. word averaging (sketch below)
      • only parameter it learns is the word embedding matrix
      • 2nd model adds in a bias vector
      • Can have a many layer model to get “deep word averaging”
    2. several RNNs, including LSTM rnns
  • simplest model (word averaging) worked best in most cases, especially outside of training data domain (no neural network required)
    • does better on entailment and similarity tasks
    • LSTMs better on sentiment classification
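
A toy version of the word-averaging model with cosine similarity; the embeddings here are random stand-ins, whereas the paper trains them on paraphrase pairs.

```python
import numpy as np

def sentence_vec(tokens, emb):
    """Average the word vectors of a sentence (the embedding matrix is the
    only learned parameter in the simplest model)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {w: np.random.randn(50) for w in "a dog ran the cat sat".split()}
s1 = sentence_vec("the dog ran".split(), emb)
s2 = sentence_vec("a cat sat".split(), emb)
print(cosine(s1, s2))
```
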

What is an IRNN?

  • Applied specifically to IRNNs (identity-initialized ReLU RNNs, Le et al. 2015)
    • typically suffer from exploding gradients
    • Also helps LSTMs, but not tanh-RNNs
  • penalize squared distance between successive hidden state norms (penalty sketch below)
  • prevents exponential growth of activations and helps generalizing to longer sequences
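
A sketch of the norm-stabilizer penalty as described above (squared difference of successive hidden-state norms); the tensor layout and beta value are assumptions.

```python
import numpy as np

def norm_stabilizer(hidden_states, beta=1.0):
    """beta * mean_t (||h_t|| - ||h_{t-1}||)^2 over the sequence."""
    norms = np.linalg.norm(hidden_states, axis=-1)      # (time, batch)
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)

h = np.random.randn(20, 8, 64)   # (time, batch, hidden)
print(norm_stabilizer(h))        # added to the training loss
```
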
  • Caffe on Spark & open source
  • parallelization scheme (toy sketch below)
    • split data among workers
    • in each iteration
      • broadcast params to workers
      • workers run SGD for fixed number of steps or length of time
      • send params back to master and average
    • init network by running SGD for a few iterations on master
    • with 4 GPUs per node vs. single GPU
      • 1 node = 3.5x
      • 3 nodes = 9.4x
      • 6 nodes = 11.2x
    • tolerates low-bandwidth intra-cluster communication
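
A toy, single-process version of the scheme in the notes (broadcast params, run local SGD on each worker's shard, average on the master, repeat), using linear regression as a stand-in for the Caffe workers.

```python
import numpy as np

def local_sgd(w, shard, lr=0.1, steps=20):
    """Hypothetical worker: a few SGD steps of least-squares on its shard."""
    X, y = shard
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def parallel_sgd(w, shards, rounds=10):
    """Broadcast w, run local SGD per worker, average the results."""
    for _ in range(rounds):
        w = np.mean([local_sgd(w.copy(), s) for s in shards], axis=0)
    return w

true_w = np.array([1.0, -2.0])
shards = []
for _ in range(4):                    # 4 "workers", each with its own data shard
    X = np.random.randn(100, 2)
    shards.append((X, X @ true_w))
print(parallel_sgd(np.zeros(2), shards))   # converges to roughly [1, -2]
```
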
  • Learning discriminative classifier from unlabeled/partially labeled data
  • objective function: trades off
    • mutual information between examples and predicted categorical class distribution
    • and robustness of classifier to adversarial examples
  • One view is that it extends GANs
  • Generator that is learned alongside classifier generates reasonable examples
  • Introduce test for how well models capture meaning in children’s books
  • distinguishes
    1. syntactic function words from
    2. lower frequency words with more semantic content
  • Benchmark existing models
    • models with explicit long term context/memory perform best
    • there is a sweet spot in amount of text each memory location represents
  • The test
    • 20 sentences of context S. Remove word from 21st sentence to get query Q
    • models must identify the missing word `a` among 10 candidates C
  • 4 categories:
    • named entities
    • common nouns
    • verbs
    • prepositions
  • Memory NNs and LSTMs work best
    • Memory best at named entities and nouns, LSTMs best at verbs and prepositions (better than humans at prepositions)
  • Introduce unbiased gradient estimator for stochastic networks
    • Stochastic network: computation graph with continuous/discrete sampling operations
    • First estimator that can handle both discrete and continuous
  • 2 parts:
    • deterministic term g_MF: computed by backprop through “mean-field network”
    • likelihood-ratio (LR) term g_R: accounts for residuals from deterministic part to get unbiased estimates
  • framework for unsupervised feature selection from sequential data like text
  • learns a dict of n-grams that efficiently compresses a corpus, then recursively compresses its own dictionary (hence the deep part)
  • finds useful data representation and compression using integer programming (or relaxations to LP)
  • Technically deep, but it is posed as an LP.
  • Learning networks with diverse neurons
  • reduces redundancy (and thus size) of NNs
    • samples a diverse set of neurons
    • merge remaining neurons with the selected neurons
  • Describe for feed forward nets, but is more general
  • Introduce
    • Determinantal Point Processes (DPPs) to model neuron diversity
    • fusing step that minimizes negative effects of removing neurons
      • also helps existing pruning approaches
  • Extends paper `Continuous control with deep rl` into structured (parametrized) action space
    • extend DDPG algorithm
    • bound the action space gradients suggested by critic
      • likely useful for any continuous, bounded action space
  • Apply to RoboCup soccer
    • a few discrete action types, each parametrized by continuous variables
  • Actor/critic model
    • actor takes state and outputs continuous action
    • critic takes input s and action a and outputs scalar Q-value
  • process raw visual input
  • trains itself through random interactions with collection of environments
  • resulting model used to plan goal-directed actions in new environments
  • test by playing a simulated billiards game
  • “object-centric prediction”
    • individually model future states of each of the L objects (e.g. balls) in the world
  • Architecture
    • provided
      • previous 4 “glimpses” centered at object position
      • agent’s applied forces and hidden states of LSTMs from previous timestep
    • output
      • ball displacement for next h frames
  • minimize difference between truth and predicted positions
  • Results in “visual imagination”

bAbI dataset

  • Goal: provide a set of tasks, each as a “leaf” test case independent of the others
    • as simple as possible to test an intended behavior
  • tasks
    1. single supporting fact: fact + irrelevant details -> answer question (“Mary travelled to office. Mary ate an apple. Where is Mary?”)
    2. 2-3 supporting facts: require chaining between facts to answer question
    3. 2-3 argument relations: questions that require the ability to recognize subjects and objects, so must model word order
    4. yes/no questions: single supporting fact with yes/no questions
    5. counting and lists/sets: simple counting operations, keeping track of lists (“what is Daniel holding?”)
    6. simple negation/indefinite knowledge
    7. basic coreference/conjunctions/compound coreference
    8. time reasoning
    9. basic deduction and induction
    10. positional and size reasoning
    11. path finding: given description of spaces, find path from one to another
    12. agent motivation: why did an agent perform an action?
  • Test out MemNNs. Surprise, they don’t solve everything
    • Work better than N-gram and LSTM baselines, but fail at a number of tasks
    • Structured SVM does better than vanilla MemNNs (but it has hand-built features)
  • Similar to bAbI but larger scale dataset
  • dataset:
    • factual questions about movies
    • provide personalization
    • carry short conversations about recommendations/movies
    • perform natural dialog on reddit
  • Benchmark several common models (SVD, LSTMs, …)
  • Awww… A Go AI paper =(
    • Poor FB research
  • Darkforest bot
  • next k-move prediction + monte carlo rollouts
  • features
    • our/opponent
      • liberties
      • stones/empty
      • history
      • rank
    • ko position
    • border
    • distance from center of board
    • whether position closer to us/opponent
  • Achieve 5d rank on KGS
  • Propose regularization term: local distributional smoothness (LDS); see sketch below
    • defined as KL-divergence based robustness of the model distribution against local perturbations around the datapoint
  • Virtual Adversarial Training
    • applicable to semisupervised learning
    • doesn’t use label information for getting adversarial direction
  • few hyperparameters
  • Simple model, but only outperformed by Ladder networks (which are fairly complicated)
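
A simplified LDS sketch for a linear softmax model: take one gradient step from a tiny random perturbation to approximate the most sensitive local direction (a stand-in for the paper's power-iteration procedure), then measure the KL divergence at radius eps. All names and constants here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lds(W, x, xi=1e-6, eps=1.0):
    """Local distributional smoothness for p(y|x) = softmax(W x)."""
    p = softmax(W @ x)
    d = np.random.randn(*x.shape)
    d = xi * d / np.linalg.norm(d)               # tiny random direction
    q = softmax(W @ (x + d))
    grad = W.T @ (q - p)                         # d KL / d r at r = d (analytic here)
    r_adv = eps * grad / (np.linalg.norm(grad) + 1e-12)
    q_adv = softmax(W @ (x + r_adv))
    return np.sum(p * (np.log(p + 1e-12) - np.log(q_adv + 1e-12)))

W = np.random.randn(3, 10)
x = np.random.randn(10)
print(lds(W, x))   # used as a regularizer; no label needed, so works semi-supervised
```
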
  • Examines
    1. one-to-many
    2. many-to-one
    3. many-to-many
  • Multitask learning can help the performance of attention-free seq-to-seq models
    • Learning image captioning and syntactic parsing helps performance of English -> German by +1.5 BLEU over strong single-task baseline
    • image captioning & syntactic parsing are much smaller datasets but still help
  • Unsupervised objectives
    • autoencoder: can be viewed as special case of translation
  • statistical test for relative similarity
    • goal: determine which of two models generate samples closer to real-world reference dataset
  • statistic: difference in maximum mean discrepancies (MMDs) between reference and model datasets (sketch below)
  • Provides meaningful ranking of model performance
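
A rough sketch of the relative-similarity statistic using a plain biased RBF-kernel MMD estimate; the actual test also turns the statistic into a p-value via its asymptotic distribution, which this omits.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# which model's samples are closer to the reference data?
ref     = np.random.randn(200, 2)
model_a = np.random.randn(200, 2) * 1.1       # close to the reference
model_b = np.random.randn(200, 2) + 2.0       # farther away
stat = mmd2(model_a, ref) - mmd2(model_b, ref)
print(stat)   # negative favors model A
```
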
  • “one-shot whole network compression”
  • 3 steps
    1. rank selection with variational bayesian matrix factorization
    2. tucker decomposition on kernel tensor
    3. fine-tuning to recover accumulated accuracy loss
  • Small accuracy loss, results:
    • e.g. AlexNet 2.72x runtime and 3.41x energy consumption improvement
  • Tested on physical phone
  • RNNs for recommender systems
  • Model whole sessions/user history as sequence
    • instead of just recommending similar items
  • Exactly what it sounds like
  • Recurrent variational bayes framework
    • efficient inference + strong regularization across layers
  • Main benefit: applicable even when large datasets not available

Theory

  • Desirable features of visual representation (defined in terms of minimal sufficient statistics)
    • minimal sufficiency of representation: smallest complexity representation can be stored in lieu of raw data with no loss of performance for target task
    • invariance: statistic is constant with respect to uninformative transformations in data
  • Derive analytical expressions for these representations
    • Show how common descriptors relate
  • Representations only interesting in relation to some task
    • Task could be predicting a particular pixel of some image; then such representations are less helpful
  • How to handle datasets with mislabeled examples
  • Auxiliary image regularizer
    • Adjusts weights on images
      • ideally, mislabeled images = 0 weight
    • idea: nearest neighbors to regularize fitting CNN to noisy samples.
      • “overlapping group norms”
      • Uses an outside pretrained model (on another dataset like imagenet) to aid in identifying groups of images / extract representations
  • Uses Alternating Direction Method of Multipliers
    • SGD can’t handle the loss function, so must use ADMM
  • Substantial improvements compared to standard models with noisy labels

Workshop papers

[#A] Visualizing and Understanding Recurrent Networks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment