@dmrd
Created April 28, 2016 20:53
Rough notes on a bunch of ICLR conference papers

Repeated topics

  • Attention
  • Memory
  • Reinforcement learning
  • Efficient NNs
  • Bayesian methods
  • Variational inference
  • Neural Turing machines
  • Transfer learning
  • Adversarial networks and examples

ICLR

Observation: many papers cite other papers at same conference

  • Semantic segmentation
  • new cnn module
    • specifically designed for dense prediction
    • dilated convolutions systematically aggregate multiscale information
  • Maintains resolution throughout model
  • Meant to be added to existing semseg models
  • convolution with dilation -> each successive layer convolves a larger area (toy sketch below)
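
A toy sketch (mine, not the paper's code) of why dilation grows the receptive field without losing resolution, assuming a simple zero-padded 1D convolution:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1D convolution whose kernel taps are `dilation` apart."""
    k = len(w)
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.random.randn(32)
w = np.ones(3) / 3.0       # 3-tap averaging filter
h = x
for d in (1, 2, 4):        # receptive field grows 3 -> 7 -> 15
    h = dilated_conv1d(h, w, d)
print(h.shape)             # resolution is preserved: (32,)
```
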
  • Variational autoencoder + priors to encourage “independence between sensitive and latent factors of variation”
  • Goal: “learn representations that are explicitly invariant with respect to some known aspect of a dataset while retaining as much remaining information as possible”
  • There are many ways to evaluate generative models
  • The 3 big ones are largely independent, so maximizing one != maximizing the others
    • average log-likelihood
    • parzen window estimates
      • Probably don’t want to use these in general
    • visual fidelity
  • Empirically evaluate lstms to recognize patterns in clinical measurement timeseries
  • establish effectiveness for clinical data
  • Trained with untimestamped diagnoses, but can add timestamps later to do early diagnoses
  • simple training strategy: replicate targets at each sequence step
  • Does a better job at diagnosis than an MLP on hand-engineered features
  • Still pretty bad
  • Replay lets RL agents remember and reuse experiences
  • Prior work samples experiences uniformly from the past
    • Does not weight “significant” experiences more heavily; sampling is just based on frequency
  • They develop framework to replay significant experiences more frequently
  • “replay” is of state transitions
    • Some transitions are more surprising than others
  • Proposal: replay transitions with high expected learning progress
    • measured by temporal-difference (TD) error magnitude
    • Avoid losing diversity by doing stochastic prioritization + importance sampling (sketch below)
  • Result: speeds up learning 2x and achieves new record on the Atari benchmark
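
A minimal sketch of proportional prioritization with importance-sampling correction. The list-backed buffer, class name, and hyperparameters are illustrative assumptions; the paper uses a sum-tree for efficient sampling.

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritization: P(i) ~ |TD error|^alpha."""
    def __init__(self, alpha=0.6, beta=0.4):
        self.alpha, self.beta = alpha, beta
        self.transitions, self.priorities = [], []

    def add(self, transition, td_error):
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        probs = p / p.sum()
        idx = np.random.choice(len(p), batch_size, p=probs)
        # importance-sampling weights correct for the non-uniform sampling
        w = (len(p) * probs[idx]) ** (-self.beta)
        w /= w.max()
        return idx, [self.transitions[i] for i in idx], w

    def update(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-6) ** self.alpha

buf = PrioritizedReplay()
for _ in range(100):
    buf.add(("s", "a", 0.0, "s2"), td_error=np.random.randn())
idx, batch, w = buf.sample(32)
buf.update(idx, np.random.randn(32))
```
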
  • VAE: pairs top-down generative network with bottom-up recognition network which approximates posterior inference
    • strong assumptions about posterior inference
      • posterior approximately factorial, parameters can be predicted from observables by nonlinear regression
    • trained to maximize variational lower bound on log-likelihood
  • contribution: train same architecture with a better log-likelihood lower bound derived from importance sampling (bound sketch below)
  • learns a richer representation
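
A rough sketch of the importance-weighted bound for a single datapoint, computed stably with log-sum-exp; the array names are my own, and k = 1 recovers the usual variational lower bound.

```python
import numpy as np

def iw_bound(log_p_xh, log_q_h):
    """log_p_xh, log_q_h: shape (k,) arrays holding log p(x, h_i) and
    log q(h_i | x) for k samples from the recognition network."""
    log_w = log_p_xh - log_q_h                       # log importance weights
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))    # log (1/k) sum_i w_i

# with k = 1 this is the standard ELBO estimate
print(iw_bound(np.array([-3.0, -2.5]), np.array([-1.0, -1.2])))
```
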
  • 3 stages of compression (sketch of the first two below)
    1. Prune network to just important connections
      • repeatedly train and prune connections
    2. quantize weights to enforce weight sharing
      • cluster weights, generate codebook, [quantize to codebook -> retrain codebook and repeat quantization]
    3. apply huffman coding
      • encode weights and indices
  • Results: compress AlexNet 35x, VGG-16 49x
    • speedup: 3-4x layerwise speedup
    • No loss in accuracy (!)
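
A hypothetical sketch of the first two stages (magnitude pruning, then k-means weight sharing) on a flat weight vector; the keep fraction and cluster count are made up for illustration, and the retrain steps are omitted.

```python
import numpy as np

def prune(w, keep_frac=0.1):
    """Keep only the largest-magnitude weights (pruning stage)."""
    thresh = np.quantile(np.abs(w), 1 - keep_frac)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize(w, n_clusters=16, iters=20):
    """Weight sharing via k-means: nonzero weights snap to codebook centroids."""
    nz = w[w != 0]
    codebook = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(iters):
        assign = np.argmin(np.abs(nz[:, None] - codebook[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                codebook[c] = nz[assign == c].mean()
    wq = w.copy()
    wq[w != 0] = codebook[assign]
    return wq, codebook          # indices + codebook would then be Huffman coded

w = np.random.randn(1000)
wq, cb = quantize(prune(w))
print(np.unique(wq).size)        # at most n_clusters + 1 distinct values
```
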
  • Augment DGPs with recognition model
  • Learn a set of small basis filters, which it then learns to combine into more complex filters
  • Comparable accuracy and less compute
    • 26% less compute and 41% fewer parameters than GoogLeNet
  • Not that impressive compared to `deep compression`
  • interesting part is learning set of basis filters. Linear combination of bases = more complex filter
  • approach
    • forward pass: stochastically binarize weights (“BinaryConnect”; sketch below)
      • benefit: Converts multiplications to computing sign changes
    • backwards pass: quantized backprop
      • quantize representations at each layer to convert remaining multiplications to binary shifts
  • performance: Tends to IMPROVE rather than damage performance
    • potential reasons
      • implicit regularization
      • low precision = harder to overfit and improved generalization
  • Promising for hardware implementation
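
A sketch of the stochastic binarization step, assuming the hard-sigmoid probability used in the BinaryConnect line of work; real training keeps full-precision weights around for the update, which this toy omits.

```python
import numpy as np

def stochastic_binarize(w):
    """Binarize weights to {-1, +1}; P(+1) is a hard sigmoid of w."""
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)          # hard sigmoid
    return np.where(np.random.rand(*w.shape) < p, 1.0, -1.0)

w = np.random.randn(4, 4) * 0.5
wb = stochastic_binarize(w)
print(wb)   # forward pass uses wb, so multiplications become sign flips
```
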
  • Proposes new regularizer: DeCov
  • Minimize the cross-covariance of hidden activations (penalty sketch below)
  • results:
    • always reduced overfitting
    • maintained or increased generalization performance
      • often improved over dropout
      • combining with dropout often worked best
  • affects all layers up to the one it is applied to
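
A sketch of a DeCov-style penalty on a batch of activations: squared off-diagonal entries of the covariance matrix, so units are decorrelated rather than driven to zero. The batch shape and weighting are illustrative assumptions.

```python
import numpy as np

def decov_penalty(h):
    """h: activations of one layer, shape (batch, units)."""
    hc = h - h.mean(axis=0, keepdims=True)
    cov = hc.T @ hc / h.shape[0]
    off_diag = cov - np.diag(np.diag(cov))   # keep only cross-covariances
    return 0.5 * np.sum(off_diag ** 2)

h = np.random.randn(128, 32)
print(decov_penalty(h))   # added to the task loss with some weight
```
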
  • creates new loss function for boundary detection
  • better than human performance on Berkeley segmentation dataset (0.808 vs. 0.803) measured by F-measure
  • Slow: 320x420 image processed in 1 second
  • steps
    • processes at 3 scales by CNN
    • fuse scales and send to normalized cuts algorithm
    • combine “spectral boundaries” (eigenvalues) with original boundary map, non-max suppression, and optional thresholding
  • interesting
    • Graduated learning scheme - originally trained on slightly easier objective function to initialize the network
  • generative process of images conditioned on captions
    • captions = sequence of consecutive words
    • images = sequence of patches drawn on a canvas
    • sequence-to-sequence framework
  • Language model: bidirectional attention RNN
  • Image model: conditional DRAW network
    • extend DRAW network to have a caption representation at each step
  • trained to maximize variational lower bound on marginal likelihood of the correct image given the input caption
  • postprocessing
    • wow what a sentence:

    “adversarial network trained on residuals of a Laplacian pyramid conditioned on the skipthought representation (Kiros et al., 2015) of the captions to sharpen the generated images,”

  • Determines if 2 sentences
    • contradict
    • do not relate
    • the first sentence (premise) entails the second sentence (hypothesis)
  • Simple prior work:
    • map each sentence into semantic space with LSTMs,
    • concatenate and feed to MLP
  • Takes both sentences (premise and hypothesis) at once
    • Different LSTMs for premise and hypothesis, but hypothesis LSTM initialized with the hidden state of the premise LSTM
  • word-by-word attention mechanism
    • Adding attention boosts 80.9->83.5% on benchmark
  • word2vec as word input representation
  • Further develop tensor decomposition technique for speeding up CNNs
  • new algorithm for computing low-rank tensor decomposition to remove redundancy in conv kernels
  • reduces forward time of VGG-16 by 50% with comparable accuracy
  • low rank constrained CNNs sometimes do better than unconstrained. regularization?
  • Simple idea: replace 4d conv kernel with 2 consecutive kernels of lower rank (idea from a 2014 paper; sketch below)
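
A generic low-rank sketch of the idea (plain SVD of a 2D weight matrix, not the paper's exact tensor decomposition): two thinner factors applied in sequence approximate the original layer at lower cost when the rank is small.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Split W (out, in) into two thinner matrices whose product approximates W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]      # (out, rank)
    B = Vt[:rank, :]                # (rank, in)
    return A, B

W = np.random.randn(256, 512)
A, B = low_rank_factorize(W, rank=32)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))   # relative approximation error
```
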

Super theoretical

  • Distillation and privileged information are techniques that enable machines to learn from other machines
    • Paper introduces combination: generalized distillation
  • Privileged information
    • Each training example is (x_i, x_i^*, y_i), i.e. (feature, additional information from “teacher” only available during training, label)
  • Distillation
    • simple machine learns complex task by imitating solution of a flexible machine
  • Content retrieval
  • contributions
    • propose compact image region representation derived from cnn layer activations
      • encode multiple image regions at once
    • ? use “generalized mean” to enable using integral images along with max-pooling. lets them do particular object localization directly in the 2d cnn activation maps
    • localization also used for image re-ranking and helps define a simple query expansion method
  • interesting idea: Maximum Activations of Convolutions
    • the representation does not encode location of activations and does a max pool over single region of size WxH. It only encodes max response of each conv filter and is translation invariant

Simple approach

  • Propose layer-sequential unit-variance (LSUV) initialization (sketch below)
    1. pre-init weights of each conv and inner product layer with orthonormal matrices
    2. proceed from 1st to last layer and normalize variance of output of each layer to be equal to 1
  • performs at least as well as standard methods
  • at least as fast as complex schemes for very deep nets like highway networks
  • state-of-the-art or close-to-state-of-the-art results
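
A toy LSUV-style sketch for a stack of linear+ReLU layers: orthonormal pre-init, then walk the layers in order rescaling each weight matrix until its output variance is ~1 on a data batch. The tolerances and layer types are chosen just for illustration.

```python
import numpy as np

def orthonormal(shape):
    """Orthonormal pre-initialization (step 1)."""
    q, _ = np.linalg.qr(np.random.randn(*shape))
    return q[:shape[0], :shape[1]]

def lsuv_init(weights, x, tol=0.05, max_iter=10):
    """Step 2: normalize each layer's output variance to 1, first to last."""
    h = x
    for i, W in enumerate(weights):
        for _ in range(max_iter):
            v = (h @ W.T).var()
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)                # rescale until unit output variance
        weights[i] = W
        h = np.maximum(h @ W.T, 0.0)       # ReLU, then move to next layer
    return weights

x = np.random.randn(256, 64)
weights = lsuv_init([orthonormal((64, 64)) for _ in range(4)], x)
```
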

Variational inference!

  • propose to combine “generative unsupervised feature learning” with a “probabilistic treatment of oracle information like triplets”
  • Joint unsupervised generative model over observations and triplet-constraints provided by an oracle
    • variational belief network, but could be used on GPs and other probabilistic models
  • learns features from images without labels, but feature learning is additionally guided by an oracle
    • provides a joint model of these components
    • oracle helps guide features to be semantically meaningful
      • allows semantic masking such as “take this face image, generate image conditioned on light from the right side”
  • try to “transfer implicit oracle knowledge into explicit parametric model”
  • Not much work on learning sentence representations that work as well across domains as word embeddings do
    • explore compositional models that can encode word sequences into vectors s.t. similar meaning = high cosine similarity
  • compare many architectures on paraphrase datasets, both with similar test/train characteristics and applied to other domains
  • Complex models such as LSTMs work best on in-domain data
  • For out of domain data, simple architectures like word averaging perform best
  • Models
    1. word averaging (sketch below)
      • only parameter it learns is the word embedding matrix
      • 2nd model adds in a bias vector
      • Can have a many layer model to get “deep word averaging”
    2. several RNNs, including LSTM rnns
  • simplest model (word averaging) worked best in most cases, especially outside of training data domain (no neural network required)
    • does better on entailment and similarity tasks
    • LSTMs better on sentiment classification
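
A toy version of the word-averaging model with cosine similarity; the embeddings here are random stand-ins, whereas the paper trains them on paraphrase pairs.

```python
import numpy as np

def sentence_vec(tokens, emb):
    """Average the word vectors of a sentence (the embedding matrix is the
    only learned parameter in the simplest model)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {w: np.random.randn(50) for w in "a dog ran the cat sat".split()}
s1 = sentence_vec("the dog ran".split(), emb)
s2 = sentence_vec("a cat sat".split(), emb)
print(cosine(s1, s2))
```
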

What is an IRNN?

  • Applied specifically to IRNNs (identity-initialized ReLU RNNs, Le et al. 2015)
    • typically suffer from exploding gradients
    • Also helps LSTMs, but not tanh-RNNs
  • penalize squared distance between successive hidden state norms (penalty sketch below)
  • prevents exponential growth of activations and helps generalizing to longer sequences
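
A sketch of the norm-stabilizer penalty as described above (squared difference of successive hidden-state norms); the tensor layout and beta value are assumptions.

```python
import numpy as np

def norm_stabilizer(hidden_states, beta=1.0):
    """beta * mean_t (||h_t|| - ||h_{t-1}||)^2 over the sequence."""
    norms = np.linalg.norm(hidden_states, axis=-1)      # (time, batch)
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)

h = np.random.randn(20, 8, 64)   # (time, batch, hidden)
print(norm_stabilizer(h))        # added to the training loss
```
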
  • Caffe on Spark & open source
  • parallelization scheme (toy sketch below)
    • split data among workers
    • in each iteration
      • broadcast params to workers
      • workers run SGD for fixed number of steps or length of time
      • send params back to master and average
    • init network by running SGD for a few iterations on master
    • with 4 GPUs per node vs. single GPU
      • 1 node = 3.5x
      • 3 nodes = 9.4x
      • 6 nodes = 11.2x
    • tolerates low-bandwidth intra-cluster communication
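
A toy, single-process version of the scheme in the notes (broadcast params, run local SGD on each worker's shard, average on the master, repeat), using linear regression as a stand-in for the Caffe workers.

```python
import numpy as np

def local_sgd(w, shard, lr=0.1, steps=20):
    """Hypothetical worker: a few SGD steps of least-squares on its shard."""
    X, y = shard
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def parallel_sgd(w, shards, rounds=10):
    """Broadcast w, run local SGD per worker, average the results."""
    for _ in range(rounds):
        w = np.mean([local_sgd(w.copy(), s) for s in shards], axis=0)
    return w

true_w = np.array([1.0, -2.0])
shards = []
for _ in range(4):                    # 4 "workers", each with its own data shard
    X = np.random.randn(100, 2)
    shards.append((X, X @ true_w))
print(parallel_sgd(np.zeros(2), shards))   # converges to roughly [1, -2]
```
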
  • Learning discriminative classifier from unlabeled/partially labeled data
  • objective function: trades off
    • mutual information between examples and predicted categorical class distribution
    • and robustness of classifier to adversarial examples
  • One view is that it extends GANs
  • Generator that is learned alongside classifier generates reasonable examples
  • Introduce test for how well models capture meaning in children’s books
  • distinguishes
    1. syntactic function words from
    2. lower frequency words with more semantic content
  • Benchmark existing models
    • models with explicit long term context/memory perform best
    • there is a sweet spot in amount of text each memory location represents
  • The test
    • 20 sentences of context S. Remove word from 21st sentence to get query Q
    • models must identify the missing word `a` among 10 candidates C
  • 4 categories:
    • named entities
    • common nouns
    • verbs
    • prepositions
  • Memory NNs and LSTMs work best
    • Memory best at named entities and nouns, LSTMs best at verbs and prepositions (better than humans at prepositions)
  • Introduce unbiased gradient estimator for stochastic networks
    • Stochastic network: computation graph with continuous/discrete sampling operations
    • First estimator that can handle both discrete and continuous
  • 2 parts:
    • deterministic term g_MF: computed by backprop through “mean-field network”
    • likelihood-ratio (LR) term g_R: accounts for residuals from deterministic part to get unbiased estimates
  • framework for unsupervised feature selection from sequential data like text
  • learns a dict of n-grams that efficiently compresses a corpus, then recursively compresses its own dictionary (hence the deep part)
  • finds useful data representation and compression using integer programming (or relaxations to LP)
  • Technically deep, but it is posed as an LP.
  • Learning networks with diverse neurons
  • reduces redundancy (and thus size) of NNs
    • samples a diverse set of neurons
    • merge remaining neurons with the selected neurons
  • Describe for feed forward nets, but is more general
  • Introduce
    • Determinantal Point Processes (DPPs) to model neuron diversity
    • fusing step that minimizes negative effects of removing neurons
      • also helps existing pruning approaches
  • Extends paper `Continuous control with deep rl` into structured (parametrized) action space
    • extend DDPG algorithm
    • bound the action space gradients suggested by critic
      • likely useful for any continuous, bounded action space
  • Apply to RoboCup soccer
    • a few discrete action types, each parametrized by continuous variables
  • Actor/critic model
    • actor takes state and outputs continuous action
    • critic takes input s and action a and outputs scalar Q-value
  • process raw visual input
  • trains itself through random interactions with collection of environments
  • resulting model used to plan goal-directed actions in new environments
  • test by playing a simulated billiards game
  • “object-centric prediction”
    • individually model future states of each of the L objects (e.g. balls) in the world
  • Architecture
    • provided
      • previous 4 “glimpses” centered at object position
      • agent’s applied forces and hidden states of LSTMs from previous timestep
    • output
      • ball displacement for next h frames
  • minimize difference between truth and predicted positions
  • Results in “visual imagination”

bAbI dataset

  • Goal: provide a set of tasks, each as a “leaf” test case independent of the others
    • as simple as possible to test an intended behavior
  • tasks
    1. single supporting fact: fact + irrelevant details -> answer question (“Mary travelled to office. Mary ate an apple. Where is Mary?”)
    2. 2-3 supporting facts: require chaining between facts to answer question
    3. 2-3 argument relations: questions that require the ability to recognize subjects and objects, so must model word order
    4. yes/no questions: single supporting fact with yes/no questions
    5. counting and lists/sets: simple counting operations, keeping track of lists (“what is Daniel holding?”)
    6. simple negation/indefinite knowledge
    7. basic coreference/conjunctions/compound coreference
    8. time reasoning
    9. basic deduction and induction
    10. positional and size reasoning
    11. path finding: given description of spaces, find path from one to another
    12. agent motivation: why did an agent perform an action?
  • Test out MemNNs. Surprise, they don’t solve everything
    • Work better than N-gram and LSTM baselines, but fail at a number of tasks
    • Structured SVM does better than vanilla MemNNs (but it has hand-built features)
  • Similar to bAbI but larger scale dataset
  • dataset:
    • factual questions about movies
    • provide personalization
    • carry short conversations about recommendations/movies
    • perform natural dialog on reddit
  • Benchmark several common models (SVD, LSTMs, …)
  • Awww… A Go AI paper =(
    • Poor FB research
  • Darkforest bot
  • next k-move prediction + monte carlo rollouts
  • features
    • our/opponent
      • liberties
      • stones/empty
      • history
      • rank
    • ko position
    • border
    • distance from center of board
    • whether position closer to us/opponent
  • Achieve 5d rank on KGS
  • Propose regularization term: local distributional smoothness (LDS); see sketch below
    • defined as KL-divergence based robustness of the model distribution against local perturbations around the datapoint
  • Virtual Adversarial Training
    • applicable to semisupervised learning
    • doesn’t use label information for getting adversarial direction
  • few hyperparameters
  • Simple model, but only outperformed by Ladder networks (which are fairly complicated)
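
A simplified LDS sketch for a linear softmax model: take one gradient step from a tiny random perturbation to approximate the most sensitive local direction (a stand-in for the paper's power-iteration procedure), then measure the KL divergence at radius eps. All names and constants here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lds(W, x, xi=1e-6, eps=1.0):
    """Local distributional smoothness for p(y|x) = softmax(W x)."""
    p = softmax(W @ x)
    d = np.random.randn(*x.shape)
    d = xi * d / np.linalg.norm(d)               # tiny random direction
    q = softmax(W @ (x + d))
    grad = W.T @ (q - p)                         # d KL / d r at r = d (analytic here)
    r_adv = eps * grad / (np.linalg.norm(grad) + 1e-12)
    q_adv = softmax(W @ (x + r_adv))
    return np.sum(p * (np.log(p + 1e-12) - np.log(q_adv + 1e-12)))

W = np.random.randn(3, 10)
x = np.random.randn(10)
print(lds(W, x))   # used as a regularizer; no label needed, so works semi-supervised
```
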
  • Examines
    1. one-to-many
    2. many-to-one
    3. many-to-many
  • Multitask learning can help the performance of attention-free seq-to-seq models
    • Learning image captioning and syntactic parsing helps performance of English -> German by +1.5 BLEU over strong single-task baseline
    • image captioning & syntactic parsing are much smaller datasets but still help
  • Unsupervised objectives
    • autoencoder: can be viewed as special case of translation
  • statistical test for relative similarity
    • goal: determine which of two models generate samples closer to real-world reference dataset
  • statistic: difference in maximum mean discrepancies (MMDs) between reference and model datasets (sketch below)
  • Provides meaningful ranking of model performance
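
A rough sketch of the relative-similarity statistic using a plain biased RBF-kernel MMD estimate; the actual test also turns the statistic into a p-value via its asymptotic distribution, which this omits.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# which model's samples are closer to the reference data?
ref     = np.random.randn(200, 2)
model_a = np.random.randn(200, 2) * 1.1       # close to the reference
model_b = np.random.randn(200, 2) + 2.0       # farther away
stat = mmd2(model_a, ref) - mmd2(model_b, ref)
print(stat)   # negative favors model A
```
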
  • “one-shot whole network compression”
  • 3 steps
    1. rank selection with variational bayesian matrix factorization
    2. tucker decomposition on kernel tensor
    3. fine-tuning to recover accumulated accuracy loss
  • Small accuracy loss, results:
    • e.g. AlexNet 2.72x runtime and 3.41x energy consumption improvement
  • Tested on physical phone
  • RNNs for recommender systems
  • Model whole sessions/user history as sequence
    • instead of just recommending similar items
  • Exactly what it sounds like
  • Recurrent variational bayes framework
    • efficient inference + strong regularization across layers
  • Main benefit: applicable even when large datasets not available

Theory

  • Desirable features of visual representation (defined in terms of minimal sufficient statistics)
    • minimal sufficiency of representation: smallest complexity representation can be stored in lieu of raw data with no loss of performance for target task
    • invariance: statistic is constant with respect to uninformative transformations in data
  • Derive analytical expressions for these representations
    • Show how common descriptors relate
  • Representations only interesting in relation to some task
    • Task could be predicting a particular pixel of some image; then such representations are less helpful
  • How to handle datasets with mislabeled examples
  • Auxiliary image regularizer
    • Adjusts weights on images
      • ideally, mislabeled images = 0 weight
    • idea: nearest neighbors to regularize fitting CNN to noisy samples.
      • “overlapping group norms”
      • Uses an outside pretrained model (on another dataset like imagenet) to aid in identifying groups of images / extract representations
  • Uses Alternating Direction Method of Multipliers
    • SGD can’t handle the loss function, so must use ADMM
  • Substantial improvements compared to standard models with noisy labels

Workshop papers

[#A] Visualizing and Understanding Recurrent Networks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment