Observation: many papers cite other papers at the same conference
- Semantic segmentation
- new cnn module
- specifically designed for dense prediction
- dilated convolutions systematically aggregate multiscale information
- Maintains resolution throughout model
- Meant to be added to existing semseg models
- convolution with dilation -> each successive layer convolves a larger area
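A tiny sketch (mine, not the paper's code) of why stacking 3x3 convolutions with exponentially increasing dilation covers an exponentially larger area while keeping resolution:

```python
# Receptive-field growth for stacked 3x3 convs with increasing dilation.
# Each 3x3 conv with dilation d adds 2*d pixels to the receptive field, so
# dilations 1, 2, 4, 8, ... give exponential growth without any downsampling.

def receptive_field(dilations, kernel_size=3):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(receptive_field([1]))           # 3
print(receptive_field([1, 2, 4, 8]))  # 31 -- grows exponentially with depth
```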
- Variational autoencoder + priors to encourage “independence between sensitive and latent factors of variation”
- Goal: “learn representations that are explicitly invariant with respect to some known aspect of a dataset while retaining as much remaining information as possible”
- There are many ways to evaluate generative models
- The 3 big ones are largely independent, so maximizing one != maximizing the others
- average log-likelihood
- Parzen window estimates
- Probably don’t want to use these in general
- visual fidelity
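For reference, a rough illustration of the Parzen-window estimate the paper cautions against (sklearn assumed; the bandwidth is an arbitrary choice, and the data are stand-ins): fit a KDE to model samples and score held-out real data.

```python
# Illustrative Parzen-window log-likelihood estimate: fit a KDE to samples
# from the model and evaluate average log-density of held-out real data.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
model_samples = rng.normal(size=(1000, 2))   # stand-in for samples from a generative model
held_out_data = rng.normal(size=(200, 2))    # stand-in for real test data

kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(model_samples)
avg_log_likelihood = kde.score_samples(held_out_data).mean()
print(avg_log_likelihood)
```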
- Empirically evaluate lstms to recognize patterns in clinical measurement timeseries
- establish effectiveness for clinical data
- Trained with untimestamped diagnoses, but can add timestamps later to do early diagnoses
- simple training strategy: replicate targets at each sequence step (sketch after this section)
- Does a better job at diagnosing than an MLP on hand-engineered features
- Still pretty bad
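A minimal PyTorch sketch of the target-replication idea (my reading, not the authors' code; dimensions and the mixing weight are hypothetical): apply the same label at every timestep and mix the per-step loss with the final-step loss.

```python
# Sketch of target replication for an LSTM classifier: the sequence label is
# applied at every timestep, and the per-step losses are averaged and mixed
# with the final-step loss.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=76, hidden_size=128, batch_first=True)
clf = nn.Linear(128, 2)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 48, 76)          # (batch, timesteps, clinical measurements)
y = torch.randint(0, 2, (8,))       # one diagnosis label per sequence

h, _ = lstm(x)                      # (batch, timesteps, hidden)
logits = clf(h)                     # predictions at every timestep

final_loss = criterion(logits[:, -1], y)
replicated_loss = criterion(logits.reshape(-1, 2), y.repeat_interleave(x.shape[1]))
alpha = 0.5                         # hypothetical mixing weight
loss = alpha * replicated_loss + (1 - alpha) * final_loss
```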
- Replay lets RL agents remember and reuse experiences
- Prior work samples experiences uniformly from the past
- Does not weight "significant" experiences more heavily; sampling is just based on frequency
- They develop framework to replay significant experiences more frequently
- “replay” is of state transitions
- Some transitions are more surprising than others
- Proposal: replay transitions with high expected learning progress
- measured by temporal-difference (TD) error magnitude
- Avoid losing diversity by doing stochastic prioritization + importance sampling
- Result: speeds up learning ~2x and achieves a new state of the art on the Atari benchmark
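A minimal sketch of proportional prioritization with importance-sampling correction (illustrative, not the paper's implementation; alpha and beta are just typical hyperparameter choices):

```python
# Proportional prioritized replay: sample transitions with probability
# proportional to |TD error|^alpha and correct the bias with
# importance-sampling weights.
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0); self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        # importance-sampling weights correct for the non-uniform sampling
        weights = (len(self.buffer) * p[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```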
- VAE: pairs top-down generative network with bottom-up recognition network which approximates posterior inference
- strong assumptions about posterior inference
- posterior approximately factorial, parameters can be predicted from observables by nonlinear regression
- trained to maximize variational lower bound on log-likelihood
- contribution: train same architecture with a better log-likelihood lower bound derived from importance sampling
- learns a richer representation
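The importance-weighted bound, as I understand it: with k samples from the recognition network q, the bound tightens as k grows and recovers the standard VAE ELBO at k = 1.

```latex
\mathcal{L}_k(x) \;=\; \mathbb{E}_{z_1,\dots,z_k \sim q(z \mid x)}
\left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, z_i)}{q(z_i \mid x)} \right]
\;\le\; \log p(x), \qquad \mathcal{L}_1 = \text{standard VAE ELBO}.
```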
[#A] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- 3 stages of compression
- Prune network to just important connections
- repeatedly train and prune connections
- quantize weights to enforce weight sharing
- cluster weights, generate codebook, quantize to codebook, then retrain codebook and repeat quantization (sketch below)
- apply huffman coding
- encode weights and indices
- Results: compress AlexNet 35x, VGG-16 49x
- speedup: 3-4x layerwise speedup
- No loss in accuracy (!)
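A sketch of the quantization/weight-sharing stage (not the authors' code; sklearn's KMeans is assumed and the cluster count is a hypothetical 5-bit choice):

```python
# Cluster a layer's weights with k-means, store a small codebook plus
# per-weight cluster indices, and replace each weight by its centroid.
import numpy as np
from sklearn.cluster import KMeans

weights = np.random.randn(256, 512).astype(np.float32)   # stand-in layer weights
n_clusters = 32                                           # 5-bit codebook

km = KMeans(n_clusters=n_clusters, n_init=10).fit(weights.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()                    # shared weight values
indices = km.labels_.reshape(weights.shape)               # small integer index per weight
quantized = codebook[indices]                             # reconstructed (shared) weights

# In the paper the codebook is then retrained (gradients accumulated per
# cluster) and Huffman coding is applied to the indices for further savings.
```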
- Augment DGPs with recognition model
- Learn a set of small basis filters, which it then learns to combine into more complex filters
- Comparable accuracy and less compute
- 26% less compute and 41% fewer parameters than GoogLeNet
- Not that impressive compared to `deep compression`
- interesting part is learning set of basis filters. Linear combination of bases = more complex filter
- approach
- forward pass: stochastically binarize weights. (“binary connect”)
- benefit: Converts multiplications to computing sign changes
- backwards pass: quantized backprop
- quantize representations at each layer to convert remaining multiplications to binary shifts
- performance: Tends to IMPROVE rather than damage performance
- potential reasons
- implicit regularization
- low precision = harder to overfit and improved generalization
- Promising for hardware implementation
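A rough numpy sketch of stochastic binarization in the binary-connect style (illustrative, not the authors' code):

```python
# Stochastic weight binarization: the probability of +1 comes from a hard
# sigmoid of the real-valued weight, so the forward pass only needs sign
# flips instead of multiplications.
import numpy as np

def stochastic_binarize(w, rng=np.random.default_rng()):
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)      # hard sigmoid -> P(w_b = +1)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

w_real = np.random.randn(4, 4) * 0.5
w_bin = stochastic_binarize(w_real)             # used in the forward pass
# The real-valued weights are kept and updated with the gradients computed
# using the binarized weights.
```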
- Proposes new regularizer: DeCov
- Minimize the cross-covariance of hidden activations
- results:
- always reduced overfitting
- maintained or increased generalization performance
- often improved over dropout
- combining with dropout often worked best
- affects all layers up to the one it is applied to
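A sketch of the DeCov penalty as I read it (PyTorch assumed; shapes are stand-ins): penalize the squared off-diagonal entries of the batch covariance of a layer's activations.

```python
# DeCov-style penalty: compute the batch covariance of hidden activations and
# penalize its off-diagonal entries (Frobenius norm minus the diagonal part).
import torch

def decov_loss(h):                       # h: (batch, features)
    centered = h - h.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / h.shape[0]
    frob_sq = (cov ** 2).sum()
    diag_sq = (torch.diagonal(cov) ** 2).sum()
    return 0.5 * (frob_sq - diag_sq)     # add to the task loss with a small weight

activations = torch.randn(32, 128)
loss = decov_loss(activations)
```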
- creates new loss function for boundary detection
- better than human performance on Berkeley segmentation dataset (0.808 vs. 0.803) measured by F-measure
- Slow: 320x420 image processed in 1 second
- steps
- processes at 3 scales by CNN
- fuse scales and send to normalized cuts algorithm
- combine "spectral boundaries" (eigenvalues) with original boundary map, non-max suppression, and optional thresholding
- interesting
- Graduated learning scheme - originally trained on slightly easier objective function to initialize the network
- generative process of images conditioned on captions
- captions = sequence of consecutive words
- images = sequence of patches drawn on a canvas
- sequence-to-sequence framework
- Language model: bidirectional attention RNN
- Image model: conditional DRAW network
- extend DRAW network to have a caption representation at each step
- trained to maximize variational lower bound on marginal likelihood of the correct image given the input caption
- postprocessing
- wow what a sentence:
“adversarial network trained on residuals of a Laplacian pyramid conditioned on the skipthought representation (Kiros et al., 2015) of the captions to sharpen the generated images,”
- Determines if 2 sentences
- contradict
- do not relate
- the first sentence (premise) entails the second sentence (hypothesis)
- Simple prior work:
- map each sentence into semantic space with LSTMs,
- concatenate and feed to MLP
- Takes both sentences at once
- Premise
- Hypothesis
- Different LSTMs for premise and hypothesis, but hypothesis LSTM initialized to hidden state of premise LSTM
- word-by-word attention mechanism
- Adding attention boosts 80.9->83.5% on benchmark
- word2vec as word input representation
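A distilled sketch of word-by-word attention over the premise (a simplified bilinear form, not the paper's exact parameterization; all tensors are stand-ins): at each hypothesis token, attend over every premise hidden state.

```python
# At each hypothesis step, score all premise hidden states, softmax the scores
# into attention weights, and build a weighted premise summary for that step.
import torch
import torch.nn.functional as F

d = 128
premise_states = torch.randn(9, d)       # LSTM outputs for each premise token
hypothesis_states = torch.randn(6, d)    # LSTM outputs for each hypothesis token
W = torch.randn(d, d) * 0.05             # hypothetical attention projection

attended = []
for h_t in hypothesis_states:
    scores = premise_states @ (W @ h_t)          # one score per premise token
    alpha = F.softmax(scores, dim=0)             # attention weights
    attended.append(alpha @ premise_states)      # weighted premise summary
attended = torch.stack(attended)                 # fed into the entailment classifier
```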
- Further develop tensor decomposition technique for speeding up CNNs
- new algorithm for computing low-rank tensor decomposition to remove redundancy in conv kernels
- reduces forward time of VGG-16 by 50% with comparable accuracy
- low rank constrained CNNs sometimes do better than unconstrained. regularization?
- Simple idea: replace 4d conv kernel with 2 consecutive kernels with lower rank (idea from 2014 paper)
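A PyTorch sketch of that replacement (channel counts and rank are hypothetical): a d x d convolution becomes a vertical d x 1 convolution into a few intermediate channels followed by a horizontal 1 x d convolution.

```python
# Replace one full-rank d x d conv with two consecutive low-rank convs.
import torch.nn as nn

c_in, c_out, d, rank = 64, 128, 3, 16

full = nn.Conv2d(c_in, c_out, kernel_size=d, padding=d // 2)

low_rank = nn.Sequential(
    nn.Conv2d(c_in, rank, kernel_size=(d, 1), padding=(d // 2, 0)),
    nn.Conv2d(rank, c_out, kernel_size=(1, d), padding=(0, d // 2)),
)
# Parameter count: c_in*c_out*d*d  vs  rank*(c_in + c_out)*d -- far fewer when rank is small.
```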
Super theoretical
- Distillation and privileged information are techniques that enable machines to learn from other machines
- Paper introduces combination: generalized distillation
- Privileged information
- Each training example is (x_n, x_n^*, y_n), i.e. (feature, additional information from "teacher" only available during training, label)
- Distillation
- simple machine learns a complex task by imitating the solution of a flexible machine
- Content retrieval
- contributions
- propose compact image region representation derived from cnn layer activations
- encode multiple image regions at once
- ? use “generalized mean” to enable using integral images along with max-pooling. lets them do particular object localization directly in the 2d cnn activation maps
- localization also used for image re-ranking and helps define a simple query expansion method
- propose compact image region representation derived from cnn layer activations
- interesting idea: Maximum Activations of Convolutions
- the representation does not encode location of activations and does a max pool over single region of size WxH. It only encodes max response of each conv filter and is translation invariant
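A sketch of computing a MAC descriptor from a conv feature map (PyTorch assumed; shapes are illustrative):

```python
# MAC descriptor: spatial max-pool each convolutional feature map over the
# whole W x H region, keeping only the maximum response per filter, so the
# representation ignores location and is translation invariant.
import torch

feature_map = torch.randn(1, 512, 37, 50)        # (batch, filters, H, W) from a conv layer
mac = feature_map.amax(dim=(2, 3))               # (batch, filters) -> one max per filter
mac = torch.nn.functional.normalize(mac, dim=1)  # L2-normalize before comparing images
```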
Simple approach
- Propose layer-sequential unit-variance initialization
- pre-init weights of each conv and inner product layer with orthonormal matrices
- proceed from 1st to last layer and normalize variance of output of each layer to be equal to 1
- performs at least as well as standard methods
- at least as fast as complex schemes for very deep nets like highway networks
- state-of-the-art or near state-of-the-art results
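A simplified sketch of the LSUV procedure (my paraphrase, PyTorch assumed; the tolerance and iteration cap are hypothetical):

```python
# LSUV: orthonormal pre-init, then walk layers in order and rescale each
# layer's weights until its output variance over a data batch is ~1.
import torch
import torch.nn as nn

def lsuv_init(layers, batch, tol=0.05, max_iters=10):
    x = batch
    for layer in layers:
        if hasattr(layer, "weight"):
            nn.init.orthogonal_(layer.weight)     # pre-init with orthonormal matrix
        for _ in range(max_iters):
            out = layer(x)
            std = out.std().item()
            if not hasattr(layer, "weight") or abs(std - 1.0) < tol:
                break
            layer.weight.data /= std              # normalize output std toward 1
        x = layer(x)                              # feed forward to init the next layer
    return layers

model = [nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 10)]
lsuv_init(model, torch.randn(128, 100))
```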
Variational inference!
- propose to combine "generative unsupervised feature learning" with a "probabilistic treatment of oracle information like triplets"
- Joint unsupervised generative model over observations and triplet-constraints provided by an oracle
- variational belief network, but could be used on GPs and other probabilistic models
- learns features from images without labels, but feature learning is additionally guided by an oracle
- provides a joint model of these components
- oracle helps guide features to be semantically meaningful
- allows semantic masking such as “take this face image, generate image conditioned on light from the right side”
- try to “transfer implicit oracle knowledge into explicit parametric model”
- Not much work on learning sentence representations that work as well across domains as word embeddings do
- explore compositional models that can encode word sequences into vectors s.t. similar meaning = high cosine similarity
- compare many architectures on paraphrastic datasets, both with similar test/train characteristics and applying to other domains
- Complex models such as LSTMs work best on in-domain data
- For out of domain data, simple architectures like word averaging perform best
- Models
- word averaging
- only parameter it learns is the word embedding matrix
- 2nd model adds in a bias vector
- Can have a many layer model to get “deep word averaging”
- several RNNs, including LSTM rnns
- simplest model (word averaging) worked best in most cases, especially outside of training data domain (no neural network required)
- does better on entailment and similarity tasks
- LSTMs better on sentiment classification
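A toy sketch of the word-averaging model (vocabulary and vectors are stand-ins): the sentence embedding is just the mean of the word vectors, and similarity is cosine similarity.

```python
# Word-averaging sentence embedding: the only learned parameter is the word
# embedding matrix; a sentence is the mean of its word vectors.
import numpy as np

vocab = {"a": 0, "man": 1, "is": 2, "playing": 3, "guitar": 4, "someone": 5, "plays": 6}
embeddings = np.random.randn(len(vocab), 300)     # stand-in for learned word vectors

def embed(sentence):
    vecs = [embeddings[vocab[w]] for w in sentence.split() if w in vocab]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(embed("a man is playing guitar"), embed("someone plays guitar")))
```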
What is an IRNN?
- Applied specifically to IRNNs (RNNs of rectified linear units with identity-matrix recurrent initialization)
- typically suffer from exploding gradients
- Also helps LSTMs, but not tanh-RNNs
- penalize squared distance between successive hidden state norms
- prevents exponential growth of activations and helps generalizing to longer sequences
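A sketch of the norm-stabilizer penalty (the beta value here is only a placeholder):

```python
# Norm stabilizer: penalize the squared difference between the norms of
# successive hidden states; add the result to the task loss.
import torch

def norm_stabilizer(hidden_states, beta=50.0):    # hidden_states: (time, batch, hidden)
    norms = hidden_states.norm(dim=-1)             # (time, batch)
    return beta * ((norms[1:] - norms[:-1]) ** 2).mean()

h = torch.randn(20, 8, 128)                        # e.g. outputs of an RNN over 20 steps
loss_penalty = norm_stabilizer(h)
```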
- Caffe on Spark & open source
- parallelization scheme
- split data among workers
- in each iteration
- broadcast params to workers
- workers run SGD for fixed number of steps or length of time
- send params back to master and average
- init network by running SGD for a few iterations on master
- with 4 GPUs per node vs. single GPU
- 1 node = 3.5x
- 3 node = 9.4x
- 6 node = 11.2x
- tolerates low-bandwidth intra-cluster communication
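A framework-agnostic sketch of the scheme above (not the actual Spark/Caffe code; `run_sgd` is a hypothetical worker method): broadcast parameters, let each worker run SGD locally, then average.

```python
# Broadcast-train-average loop: each worker runs SGD on its own data shard
# for a fixed number of local steps, then the master averages the parameters.
import numpy as np

def train_parallel(init_params, workers, rounds, local_steps):
    params = init_params
    for _ in range(rounds):
        # broadcast current params; each worker runs SGD on its shard
        worker_params = [w.run_sgd(params.copy(), steps=local_steps) for w in workers]
        # master averages the returned parameters
        params = np.mean(worker_params, axis=0)
    return params
```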
- Learning discriminative classifier from unlabeled/partially labeled data
- objective function: trades off
- mutual information between examples and predicted categorical class distribution
- and robustness of classifier to adversarial examples
- One view is that it extends GANs
- Generator that is learned alongside classifier generates reasonable examples
- Introduce test for how well models capture meaning in children’s books
- distinguishes
- syntactic function words from
- lower frequency words with more semantic content
- Benchmark existing models
- models with explicit long term context/memory perform best
- there is a sweet spot in amount of text each memory location represents
- The test
- 20 sentences of context S. Remove word from 21st sentence to get query Q
- models must identify the missing word `a` among 10 candidates C
- 4 categories:
- named entities
- common nouns/verbs/prepositions
- Memory NNs and LSTMs work best
- Memory best at named entities and nouns, LSTMs best at verbs and prepositions (better than humans at prepositions)
- Introduce unbiased gradient estimator for stochastic networks
- Stochastic network: computation graph with continuous/discrete sampling operations
- First estimator that can handle both discrete and continuous
- 2 parts:
- deterministic term g_MF: computed by backprop through “mean-field network”
- LR term g_R: accounts for residuals from deterministic part to get unbiased estimates
- framework for unsupervised feature selection from sequential data like text
- learns a dictionary of n-grams that efficiently compresses a corpus, then recursively compresses its own dictionary (hence the deep part)
- finds useful data representation and compression using integer programming (or LP relaxations)
- Technically deep, but it is posed as an LP.
- Learning networks with diverse neurons
- reduces redundancy (and thus size) of NNs
- samples a diverse set of neurons
- merge remaining neurons with the selected neurons
- Described for feed-forward nets, but is more general
- Introduce
- Determinantal Point Processes (DPPs) to model neuron diversity
- fusing step that minimizes negative effects of removing neurons
- also helps existing pruning approaches
- Extends paper `Continuous control with deep rl` into structured (parametrized) action space
- extend DDPG algorithm
- bound the action space gradients suggested by the critic
- likely useful for any continuous, bounded action space
- Apply to RoboCup soccer
- a few discrete action types, each parametrized by continuous variables
- Actor/critic model
- actor takes state and outputs continuous action
- critic takes input s and action a and outputs scalar Q-value
- process raw visual input
- trains itself through random interactions with collection of environments
- resulting model used to plan goal-directed actions in new environments
- test by playing simulated billiards game
- “object-centric prediction”
- individually model future states of each of the L objects (e.g. balls) in the world
- Architecture
- provided
- previous 4 “glimpses” centered at object position
- agent’s applied forces and hidden states of LSTMs from previous timestep
- output
- ball displacement for next h frames
- provided
- minimize difference between truth and predicted positions
- Results in “visual imagination”
bAbI dataset
- Goal: provide a set of tasks, each as a "leaf" test case independent of the others
- as simple as possible to test an intended behavior
- tasks
- single supporting fact: fact + irrelevant details -> answer question ("Mary travelled to office. Mary ate an apple. Where is Mary?")
- 2-3 supporting facts: require chaining between facts to answer question
- 2-3 argument relations: questions that require the ability to recognize subjects and objects, so must model word order
- yes/no questions: single supporting fact with yes/no questions
- counting and lists/sets: simple counting operations, keeping track of lists ("what is Daniel holding?")
- simple negation/indefinite knowledge
- basic coreference/conjunctions/compound coreference
- time reasoning
- basic deduction and induction
- positional and size reasoning
- path finding: given description of spaces, find path from one to another
- agent motivation: why did an agent perform an action?
- Test out MemNNs. Surprise, they don’t solve everything
- Work better than N-gram and LSTM baselines, but fail at a number of tasks
- Structured SVM does better than vanilla MemNNs (but it has hand-built features)
- Similar to bAbI but larger scale dataset
- dataset:
- factual questions about movies
- provide personalization
- carry short conversations about recommendations/movies
- perform natural dialog on reddit
- Benchmark several common models (SVD, LSTMs, …)
- Awww… A go AI paper =(
- Poor FB research
- Darkforest bot
- next k-move prediction + monte carlo rollouts
- features
- our/opponent
- liberties
- stones/empty
- history
- rank
- ko position
- border
- distance from center of board
- whether position closer to us/opponent
- Achieve 5d rank on KGS
- Propose regularization term: local distributional smoothness (LDS)
- defined as KL-divergence based robustness of the model distribution against local perturbations around the datapoint
- Virtual Adversarial Training
- applicable to semisupervised learning
- doesn’t use label information for getting adversarial direction
- few hyperparameters
- Simple model, but only outperformed by Ladder networks (which are fairly complicated)
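Roughly, as I read the definition (epsilon bounds the perturbation norm; the "virtual" adversarial direction needs no label):

```latex
\mathrm{LDS}(x) \;=\; -\,\mathrm{KL}\!\left[\, p(y \mid x, \theta) \,\big\|\, p(y \mid x + r_{\mathrm{vadv}}, \theta) \,\right],
\qquad
r_{\mathrm{vadv}} \;=\; \arg\max_{\lVert r\rVert_2 \le \epsilon} \mathrm{KL}\!\left[\, p(y \mid x, \theta) \,\big\|\, p(y \mid x + r, \theta) \,\right].
```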
- Examines
- one-to-many
- many-to-one
- many-to-many
- Multitask learning can help the performance of attention-free seq-to-seq models
- Learning image captioning and syntactic parsing helps performance of English -> German by +1.5 BLEU over a strong single-task baseline
- image captioning & syntactic parsing are much smaller datasets but still help
- Unsupervised objectives
- autoencoder: can be viewed as special case of translation
- statistical test for relative similarity
- goal: determine which of two models generate samples closer to real-world reference dataset
- statistic: difference in maximum mean discrepancies (MMDs) between reference and model datasets
- Provides meaningful ranking of model performance
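A small numpy sketch of the relative-similarity statistic using a biased MMD estimate with an RBF kernel (bandwidth and data here are stand-ins):

```python
# Compare two models by the difference of their squared MMDs to a reference
# dataset; a negative statistic means model A's samples are closer.
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 5))
model_a   = rng.normal(size=(200, 5))           # samples from model A
model_b   = rng.normal(loc=0.5, size=(200, 5))  # samples from model B

statistic = mmd2(reference, model_a) - mmd2(reference, model_b)
print(statistic)
```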
- “one-shot whole network compression”
- 3 steps
- rank selection with variational bayesian matrix factorization
- tucker decomposition on kernel tensor
- fine-tuning to recover accumulated accuracy loss
- Small accuracy loss, results:
- e.g. AlexNet 2.72x runtime and 3.41x energy consumption improvement
- Tested on physical phone
- RNNs for recommender systems
- Model whole sessions/user history as sequence
- instead of just recommending similar items
- Exactly what it sounds like
- Recurrent variational bayes framework
- efficient inference + strong regularization across layers
- Main benefit: applicable even when large datasets not available
Theory
- Desirable features of visual representation (defined in terms of minimal sufficient statistics)
- minimal sufficiency of representation: smallest complexity representation can be stored in lieu of raw data with no loss of performance for target task
- invariance: statistic is constant with respect to uninformative transformations in data
- Derive analytical expressions for these representations
- Show how common descriptors relate
- Representations only interesting in relation to some task
- Task could be a particular pixel of some image; then representations are less helpful
- How to handle datasets with mislabeled examples
- Auxiliary image regularizer
- Adjusts weights on images
- ideally, mislabeled images = 0 weight
- idea: nearest neighbors to regularize fitting CNN to noisy samples.
- “overlapping group norms”
- Uses an outside pretrained model (on another dataset like imagenet) to aid in identifying groups of images / extract representations
- Uses Alternating Direction Method of Multipliers
- SGF can’t handle the loss function, so must use ADMM
- Substantial improvements compared to standard models with noisy labels