
@ozansener
Created January 10, 2017 00:03

NIPS Notes

Tutorials

Variational Inference

GAN Tutorial

  • Very clear review of adversarial training. Relates NCE, GANs, etc. to each other with a clean table.

  • Plug and Play Generative Models

    • Combining multiple different generative models

    • Visually appealing results

Keynotes

Intelligent Biosphere / Drew Purves

  • Very interesting talk; everyone should watch it when the videos are released. He had many major points, but the most striking ones were:
    • The ML/stats community can help ecology a lot. For example, we over-produce electricity and food by more than 30% because we cannot predict the usage.
    • Intelligence emerged in a natural environment, and thinking that artificial intelligence can emerge without a natural environment sounds pretty wrong. The most striking point for me was when he put natural/artificial and simulated/real as two different axes. This is clearly the correct way to approach the transfer learning problem. We need simulated and natural environments for emergent behavior, because emergent behavior is the key point of nature.

Susan Holmes

  • There is no such thing as meta-data; the term was invented by the NSA. All meta-data is data for ML purposes.

Oral Sessions

  • Best Paper Award: Value Iteration Networks

    • Value iteration is in principle a convnet :) It is a matrix multiplication (the state transitions) followed by max-pooling (the max over future states/actions). So they simply did this and learned the entire policy directly.
    • It works great and generalizes to some other environments.
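A toy numpy sketch of the observation (my own illustration, not the authors' code): one value-iteration sweep is a linear map per action followed by a max over the action channel, i.e., convolution + max-pooling when the states form a grid.

```python
import numpy as np

def value_iteration_step(V, P, R, gamma=0.99):
    """One value-iteration sweep.

    V: (S,) current state values
    P: (A, S, S) transition probabilities, P[a, s, s'] = p(s' | s, a)
    R: (A, S) immediate rewards
    """
    Q = R + gamma * np.einsum('ast,t->as', P, V)  # linear map per action ("convolution")
    return Q.max(axis=0)                          # max over actions ("max-pooling")

# toy usage: 2 actions, 4 states
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(2, 4))        # each row is a valid distribution
R = rng.normal(size=(2, 4))
V = np.zeros(4)
for _ in range(50):
    V = value_iteration_step(V, P, R)
```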
  • Hierarchical Clustering via …

    • Hierarchical clustering does not have a clear cost function, unlike k-means and k-medoids, and this paper simply proposes one.
    • Let T be the tree of the hierarchical clustering s.t. each leaf is a data point. Then the cost is \sum_{i,j} k(i,j) |leaves(T(lca(i,j)))|
      • lca(a, b): lowest common ancestor of a and b
      • leaves(T(x)): number of leaves having x as ancestor
      • k(i, j): similarity metric of data points
    • This is NP-hard (Dasgupta 16).
    • |leaves(T(lca(i,j)))| - 1 is an ultrametric (a metric with the strong triangle inequality)
    • They propose an O(n^3) algorithm using an ILP and its LP relaxation
    • Similar result in SODA17 by Moses
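A quick sketch of that cost on toy data (my own code; the nested-tuple tree encoding and names are made up for illustration):

```python
import itertools
import numpy as np

def hc_cost(tree, K):
    """Cost of a hierarchical clustering: sum over leaf pairs of
    similarity k(i, j) times the number of leaves under lca(i, j)."""
    def leaves(t):
        return [t] if isinstance(t, int) else [x for c in t for x in leaves(c)]

    cost = 0.0
    def recurse(t):
        nonlocal cost
        if isinstance(t, int):
            return
        left, right = t
        # exactly the pairs split here have this node as their lca
        for i, j in itertools.product(leaves(left), leaves(right)):
            cost += K[i, j] * len(leaves(t))
        recurse(left)
        recurse(right)
    recurse(tree)
    return cost

K = np.ones((4, 4))                   # all points equally similar
print(hc_cost(((0, 1), (2, 3)), K))   # 4 cross pairs * 4 + 2 + 2 = 20.0
```

Each unordered pair is counted once here; counting both (i, j) and (j, i) would only rescale the cost.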
  • Same-Cluster Queries…

    • Active setting for clustering. The interaction is the question "are a and b in the same cluster?"
    • Theoretical setup: n points, k clusters, d dimensions.
    • With no extra assumptions, the learner needs to ask Ω(n) questions, so it is not practical.
    • If there is a margin between clusters, one can get an O(knd) algorithm with O(k log n) questions.
  • Time-Contrastive Learning and Nonlinear ICA

    • The main idea: take a nonlinear ICA model with Gaussian sources and design a contrastive learning scheme, i.e., divide the temporal scale into k segments and learn a supervised MLP separating the segments. They prove that the learned representation is indeed a linear ICA.
    • Although their setup is Gaussian, the variance of the Gaussians is non-stationary. They also assume the mixing function is smooth and nonlinear.
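A toy version of that contrastive setup, as I understand it (my own sketch with made-up data and sklearn, not the authors' code):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
T, k, n_src = 5000, 10, 2
seg = np.repeat(np.arange(k), T // k)              # segment label for each time step
scales = rng.uniform(0.2, 2.0, size=(k, n_src))    # non-stationary variances per segment
s = rng.normal(size=(T, n_src)) * scales[seg]      # independent sources
x = np.tanh(s @ rng.normal(size=(n_src, 4)))       # smooth nonlinear mixing

# Time-contrastive learning: classify which segment a sample came from.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000).fit(x, seg)
# The claim: the hidden representation of this classifier recovers the sources
# up to a linear transformation, i.e., a linear ICA on top finishes the job.
```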
  • Good Seedings for k-Means

    • The problem: can we have efficient seeding for k-means? k-means++ does produce good seeds, but it is not so efficient; it requires a linear pass over all points.
    • They start with k-means++; its D^2 sampling draws the next seed from p(x) ∝ d(x, old centers)^2, the squared distance to the closest already-chosen center.
    • They design a Markov chain over the data points with the same stationary distribution.
    • They require one pass over the dataset to build the initial proposal distribution used in MCMC.
    • It has good guarantees and is super easy to use: simply pip install kmc2.
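Usage really is that short; a sketch assuming the kmc2 package's kmc2() entry point returns the seed points (check the package README for the exact signature):

```python
# pip install kmc2
import numpy as np
import kmc2
from sklearn.cluster import MiniBatchKMeans

X = np.random.randn(10000, 16)
seeds = kmc2.kmc2(X, 20)                     # MCMC-based seeding, no full k-means++ passes
model = MiniBatchKMeans(n_clusters=20, init=seeds, n_init=1).fit(X)
```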
  • Using Fast Weights to Attend to the Recent Past

    • The main problem is modelling short-term memory, so basically state dynamics + short-term memory + long-term memory.
    • The main idea is very simple: add intermediate steps within the RNN. (Not so) surprisingly, this ends up being a Hopfield network.
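My rough numpy paraphrase of the update (omitting the layer norm the paper uses; the names are mine):

```python
import numpy as np

def fast_weights_step(h_prev, x, A, W, C, lam=0.95, eta=0.5, n_inner=1):
    """One RNN step with a fast weight matrix A acting as outer-product memory."""
    A = lam * A + eta * np.outer(h_prev, h_prev)   # decaying Hebbian ("fast") update
    preact = W @ h_prev + C @ x                    # the usual slow-weight path
    h = np.tanh(preact)
    for _ in range(n_inner):                       # the intermediate steps within the step
        h = np.tanh(preact + A @ h)                # attend to the recent past through A
    return h, A
```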
  • Sequential Neural Models with Stochastic Layers

    • The idea is combining RNNs with state-space models.
    • RNNs are not stochastic-friendly, but state-space models are. However, state-space models are not easy to learn. Hence, they combine them.
    • The most hacky part: they design an inverse (backward) RNN to model and learn the proposal function.
    • Mostly a very straightforward use of VAEs and RNNs, with the aforementioned little hack.
  • Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences

    • Tries to solve vanishing gradients so that very long RNNs can be learned with no issue.
    • They also handle asynchronous input sources.
    • The key idea is a chronos (time) gate that keeps track of time per neuron. Hence, the LSTM sleeps when it is not needed and the gradients do not vanish.
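A minimal sketch of the per-neuron time gate as I understood it (my own code; tau, s, and r_on stand for the learned period, phase shift, and open ratio):

```python
import numpy as np

def time_gate(t, tau, s, r_on=0.05, alpha=1e-3):
    """Periodic gate: fully open for a short part of each cycle, tiny leak otherwise."""
    phi = ((t - s) % tau) / tau                       # phase within the cycle, in [0, 1)
    return np.where(phi < 0.5 * r_on, 2.0 * phi / r_on,
           np.where(phi < r_on, 2.0 - 2.0 * phi / r_on,
                    alpha * phi))                     # "sleeping": tiny leak keeps gradients alive

# The cell and hidden states are only updated in proportion to this gate, so a
# sleeping neuron carries its state (and gradient) forward almost unchanged.
```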
  • Deep Learning w/o Poor Local Minima

    • They show that if the network is linear, there are no saddle points without a Hessian with a negative eigenvalue. So it is easy to optimize with GD.
    • If the only non-linearity is ReLU, the ReLU behaves linearly unless the activation pattern changes, hence it can be analyzed using linear NNs.
    • They have very unrealistic assumptions, similar to LeCun's COLT paper. However, they need only half of the assumptions, so it is a good direction.
  • Poking Paper

    • Random actions let you learn the relationships between actions and results. The basic idea is a combination of encoders/decoders in a predictive setup. The encoders/decoders can be in terms of images and/or actions.
  • Draw What and Where

    • They want a conditional GAN, but one where they can control location, e.g., "draw a bird right here".
    • They proposed a very hacky form of conditional GAN: ad-hoc architectures that take the location as an input.
  • Weight Normalization

    • Essentially batch norm without the bias/noise introduced by minibatch statistics.
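The reparameterization in one line (a sketch in my own code): the weight vector's direction and scale are learned separately.

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||, with g and v trained by gradient descent."""
    return g * v / np.linalg.norm(v)
```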
  • Finding features which separate distributions

  • Interpretability of data through examples

    • Prototype images do not give the full picture when looking at a dataset, because of very small clusters.
    • Prototypes + criticisms give a better picture.
    • github.com/BeenKim/MMD-critic

Symposium

Points by Yoshua Bengio

  • Unsupervised learning is about discovering high-level abstractions, and we are failing. The fact that conditioning on labels helps might indicate that sample complexity is an issue. Two important things might help:
    • Learning representations over different time-scales
    • Exploiting causality more

Papers

  • Real NVP

    • If your generative model is invertible, you can use the change-of-variables formula to relate the encoder and decoder, so you learn a single model.
    • However, an invertible transformation means the input and output dimensions are the same, so the input noise is very high-dimensional.
    • They design a specific architecture to make the Jacobian computation tractable.
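A hedged sketch of that architecture idea (my own simplification of an affine coupling layer; s_net and t_net are placeholder networks): transform only half of the dimensions, conditioned on the other half, so the Jacobian is triangular and its log-determinant is just a sum.

```python
import numpy as np

def coupling_forward(x, s_net, t_net):
    """Affine coupling: invertible, with a cheap log-det-Jacobian."""
    d = x.shape[0] // 2
    x1, x2 = x[:d], x[d:]
    s, t = s_net(x1), t_net(x1)          # scale and shift computed from the untouched half
    y2 = x2 * np.exp(s) + t
    log_det = s.sum()                    # triangular Jacobian: log|det| = sum(s)
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y, s_net, t_net):
    d = y.shape[0] // 2
    y1, y2 = y[:d], y[d:]
    s, t = s_net(y1), t_net(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])
```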
  • Conditional PixelCNN

    • They try to solve the blind spot in PixelCNN.
    • They design two stacks of filters, one horizontal and one vertical, and combine them through gating.
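The gating they use, roughly (my own sketch; each stack's convolution produces two halves of the features that get combined multiplicatively):

```python
import numpy as np

def gated_activation(features):
    """Gated unit from the PixelCNN family: tanh(W_f * x) * sigmoid(W_g * x)."""
    f, g = np.split(features, 2, axis=-1)          # the conv outputs 2x the channels
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))
```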
  • Stochastic Depth Networks

    • What we are doing is playing the game of fixing errors, and a linear chain is not a good idea for that.
    • If you analyze a stochastic-depth network, it effectively gives connections between all layers, so the key is lots of skip connections.
    • AdaNet: dense connectivity. They connect everything with the same filter size.
      • If you have pooling, first make the dense layers, then pool.
  • Counterfactual Inference

    • Causal inference from observational data

    • It is not supervised learning, because it is not trained to differentiate the influence of the treatment (y) in x, y -> success. The main idea is: what if y had been y'?

    • y_t | X; the quantity of interest is E[Y_1 - Y_0 | X], factual vs. counterfactual (the action/treatment happens only once)

    • Key idea: this is a domain adaptation problem

    • If treatments are assigned at random, train and test have the same distribution

    • The idea is to learn features s.t. the control and treatment groups look similar, and use only such points

    • Implicitly assumes there is a policy

Panel

  • Are we actually learning meaningful representations?
    • No. We are adding the structure with our own eyes. The models do not understand the difference between a mountain and a mouse in terms of scale.
    • We need to understand action as well. In the cake analogy, action is the spoon.
  • Necessity of generative models
    • Babies first learn all phonemes, then lose them after learning a language. If they are useless, maybe there is a generative model of phonemes in our brain.
    • Interpretability is not important because it is only an engineering intuition and unnecessary. In a way, we want to be in charge although we do not need to be. For example, if I am in a taxi, I do not care what the driver is thinking.
    • Yann LeCun: We are missing the basic principles behind human/animal learning. Is the brain minimizing an objective function or doing something else? What is the equivalent of Bernoulli dynamics for intelligence?
  • Interpretability vs Principles
    • Emergent behavior is the reason for non-interpretable models. For example, neurons are simple and gas dynamics are simple, yet their emergent behavior is not interpretable. Compositionality is related to interpretability in a way: at which level do you want the interpretability?
  • Humans do unsupervised learning on-line, so why are we not doing it?
    • The brain has memory, so we try to get that memory filled. So it should be on-line, but with lots of memory.
  • What is the correct measure for unsupervised learning?
    • What we have is simple toy test beds and the hope that they will carry over somewhere else. For example, MNIST was not about digit classification. So, basically, the idea is to work on actual intelligence ideas but test on simple problems.
  • What is the best way to benchmark generative modelling; is log-likelihood useful?
    • We do not have any good way, and log-likelihood is not a good one. Is the Turing test a good measure for AI? It does not include the concept of being less/more wrong; it should not be binary.
    • In the regime of large-scale data, log-probability matters. However, we are not even close to that regime, so maybe we should not care now and care later. Log-likelihood will always have leaking probability.
    • The things we care about most require a very small number of bits (like edges etc.), so log-likelihood is not a good choice.

Posters

Workshops

How to Train a GAN

  • You can even consider generating the model
  • Visual inspection seems like the only way to probe things during training
  • DCGAN is stable up to 64x64 (issues: mode dropping and underfitting)
  • Use the hacks from the Salimans et al. paper about stability
  • HACKS:
    • Normalize inputs to [-1, +1] and use tanh for the generator output
    • Use max log D instead of min log(1 - D) (flip the labels while training the generator)
    • Use spherical z, from Tom White's "Sampling Generative Networks"
    • Batch norm: use it on all-real or all-fake batches only; never combine real and fake in one batch
    • Avoid sparse gradients: use LeakyReLU instead of ReLU, average pooling / strided convolutions instead of max-pooling, and ConvTranspose2d + stride instead of PixelShuffle
    • Label smoothing and/or make the labels noisy for the discriminator with some probability
    • Use DCGAN if possible; if not, try a combination like KL+GAN (inpainting stuff) or VAE+GAN
    • Pfau & Vinyals 2016 / RL tricks help, e.g., experience replay
    • The ADAM params from DCGAN work everywhere :)
    • If the GAN is working, the loss will have a small variance but still vary. If it always decreases, it is fooling you with garbage.
    • Can you balance G and D via loss statistics? It generally does not work.
    • Improved GAN paper/code
    • Add some noise to the inputs and then decrease it over time
    • CGAN: use an embedding layer of 120 dimensions and upsample to match the image channels
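A couple of these hacks written out (my own numpy sketch, assuming the usual sigmoid-output discriminator; not from the tutorial itself):

```python
import numpy as np

def scale_images(x_uint8):
    """Normalize uint8 images to [-1, 1] to match a tanh generator output."""
    return x_uint8.astype(np.float32) / 127.5 - 1.0

def generator_loss_nonsaturating(d_fake, eps=1e-8):
    """Maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))."""
    return -np.mean(np.log(d_fake + eps))

def discriminator_labels(n, real, smooth=0.9, flip_prob=0.05, rng=np.random.default_rng()):
    """One-sided label smoothing plus occasional label flips for the discriminator."""
    labels = np.full(n, smooth if real else 0.0)
    flip = rng.random(n) < flip_prob
    labels[flip] = 1.0 - labels[flip]
    return labels
```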

Connecting Actor-Critic and GAN

  • GAN vs actor-critic: they are the same thing in the TD-error case
  • The GAN game is a POMDP, since the generator never sees the environment

Implicit Generative Models

  • Main idea: indirect inference / posterior-free inference / adversarial learning are all the same thing, so let's read each other's work and not lose time.

Conditional Image Generation

  • Same as C-GAN, but do not give the class label to the discriminator; make D estimate the class as well
  • ??? This actually makes sense if you think about it from the NCE angle

Unrolled GAN

  • GAN: min_G max_D f
  • Instead, they do max_D min_G f
  • The key is to unroll the max_D through time and take the gradient through multiple iterations of it, so the generator sees the context/algorithm of the discriminator.

Semantic Segmentation using Adversarial Networks

  • To make the loss structured, they feed the adversarial network l(x, s), where s is simply the sum of class labels * class segments. But the labels are GT.

Quantitative Evaluation of Decoder-based Generative Models

  • tonywu95/eval_gan

Suvrit Sra
