- [Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer NNs](http://proceedings.mlr.press/v97/arora19a.html)
- Analysis via the NTK: training speed depends on the projection of the label vector y onto the eigenvectors of the NTK Gram matrix.
- Projections onto the top eigenvectors (largest eigenvalues) decay faster than those onto eigenvectors with smaller eigenvalues.
- Hence the loss on correctly labeled data decreases faster than on incorrectly (randomly) labeled data (see the sketch below).
- The generalization bound depends on the NTK Gram matrix, which is data-dependent.
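A hedged sketch of the paper's key dynamics equation (roughly its Theorem 4.1, with the width/step-size conditions and constants omitted), where u(t) denotes the network predictions at step t and the NTK Gram matrix has eigendecomposition H∞ = Σᵢ λᵢ vᵢ vᵢᵀ:

```latex
% Residual dynamics of GD in the NTK regime (schematic; conditions omitted).
\| y - u(t) \|_2 \;\approx\; \sqrt{ \sum_{i=1}^{n} \left( 1 - \eta \lambda_i \right)^{2t} \left( v_i^\top y \right)^2 }
```

Components of y aligned with large-λᵢ eigenvectors shrink fastest; true labels tend to concentrate on the top eigenvectors while random labels do not, which is the training-speed gap noted above.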
- Gradient Descent Finds Global Minima of Deep Neural Networks
- Also an NTK-based analysis.
- Fully connected: global convergence at a linear rate, with the number of parameters polynomial in (# samples, inverse of the smallest eigenvalue of the NTK Gram matrix, 2^(# layers)).
- ResNets: same as above, except that the dependence on the number of layers improves from exponential to polynomial (see the sketch below).
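A hedged sketch of the shape of the convergence guarantee (my paraphrase; the exact rate, width requirement, and constants differ in the paper), with K the limiting NTK Gram matrix on the n training points and u(t) the predictions at step t:

```latex
% Schematic linear-rate convergence under sufficient over-parameterization.
\| y - u(t) \|_2^2 \;\le\; \left( 1 - \tfrac{\eta \, \lambda_{\min}(K)}{2} \right)^{t} \, \| y - u(0) \|_2^2
```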
- A Convergence Theory for Deep Learning via Over-Parameterization
- Needs to be compared with 'Gradient Descent Finds Global Minima of Deep Neural Networks' above.
- Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
- Instead of using only parameter-dependent terms, their bound depends on both parameter- and data-dependent quantities,
- e.g., the maximum norm (over the dataset) of the hidden activation vectors.
- Makes sense, as the spectral norm is a rather pessimistic quantity that might never be realized on the actual data (numerical evaluation of the bound shows this to be true).
- Novel analysis technique: the data-dependent properties are encoded as indicator variables in the loss function, and one then works with the Rademacher complexity of the augmented hypothesis class (which is still defined before seeing the data); see the schematic below.
- Possibly applicable to our paper?
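My rough schematic of the augmentation idea (the paper's actual construction is a careful Lipschitz/soft version, not a hard indicator; the activations h_j(x) and thresholds κ_j here are placeholders):

```latex
% Schematic only: fold data-dependent conditions into the loss so the augmented
% class is fixed a priori; h_j(x) = layer-j activations, \kappa_j = fixed thresholds.
\tilde{\ell}(f; x, y) \;=\; \max\Big( \ell(f; x, y),\; \max_{j} \, \mathbb{1}\big[\, \| h_j(x) \| > \kappa_j \,\big] \Big)
```

On training data where the indicators vanish, the augmented loss coincides with the original one, so a standard Rademacher bound for the augmented class translates into a bound that only involves the empirically realized quantities (e.g., maximum activation norms).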
- Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
- Gives a discrete-time (finite step size) analysis of GD in deep linear networks.
- Shows how the network recovers the principal directions of the data one by one, in distinct phases, with directions of larger singular value learned first (toy simulation below).
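A minimal two-layer linear-network simulation of this sequential-learning effect (my own toy example, not the paper's exact setting): from a small initialization, full-batch GD recovers the singular values of the target map one by one, largest first.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
target_svals = np.array([5.0, 3.0, 1.5, 0.5, 0.1])
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
T = U @ np.diag(target_svals) @ V.T           # target linear map

scale, lr = 1e-3, 1e-2                        # small init is what produces distinct phases
W1 = scale * rng.normal(size=(d, d))
W2 = scale * rng.normal(size=(d, d))

for step in range(12801):
    E = W2 @ W1 - T                           # with whitened inputs, loss = 0.5 * ||E||_F^2
    # Simultaneous GD update on both factors.
    W2, W1 = W2 - lr * (E @ W1.T), W1 - lr * (W2.T @ E)
    if step in {0, 100, 200, 400, 800, 1600, 3200, 6400, 12800}:
        svals = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(f"step {step:5d}  singular values of W2 W1: {np.round(svals, 2)}")
```

The printout shows a staircase: each singular value of the end-to-end map plateaus near zero and then rises to its target, in decreasing order of magnitude.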
- A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent
- Alternating Minimizations Converge to Second-Order Optimal Solutions
- A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
- The noise in SGD might not be Gaussian; it might instead converge to a heavy-tailed alpha-stable distribution.
- This allows the iterates to make occasional large jumps.
- It supports the classical wide-minima argument: SGD tends to end up in wide minima, since it would have jumped out of narrow ones (toy comparison below).
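A small sketch of the heavy-tail point (my own toy comparison, not from the paper): alpha-stable noise with tail index alpha < 2 occasionally produces enormous steps, whereas Gaussian noise essentially never does.

```python
import numpy as np
from scipy.stats import levy_stable, norm

rng = np.random.default_rng(0)
n = 100_000
gaussian = norm.rvs(size=n, random_state=rng)
alpha = 1.5  # tail index; alpha = 2 would recover the Gaussian case
stable = levy_stable.rvs(alpha, beta=0.0, size=n, random_state=rng)

# Compare the extreme step sizes of the two noise models.
for name, noise in [("gaussian", gaussian), ("alpha-stable (alpha=1.5)", stable)]:
    print(f"{name:25s} max |step| = {np.abs(noise).max():10.1f}, "
          f"99.9th pct = {np.quantile(np.abs(noise), 0.999):6.1f}")
```

The rare huge steps of the alpha-stable samples are the "jumps" that would kick the iterates out of narrow basins.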
- Matrix-Free Preconditioning in Online Learning
- I don't quite understand how they do it, but the idea is to avoid forming explicit matrices when preconditioning the gradients in online linear optimization (OLO).
- [Adversarial examples from computational constraints](http://proceedings.mlr.press/v97/bubeck19a.html)
- Rademacher Complexity for Adversarially Robust Generalization
- Gives tight upper and lower bounds for the adversarial Rademacher complexity.
- Shows an unavoidable dimension dependence in the adversarial Rademacher complexity.
- It is always at least as large as the standard (non-robust) Rademacher complexity.
- An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
- Derives a procedure (with code) to estimate the eigenvalue density of the Hessian.
- Shows that negative eigenvalues disappear very early in training.
- Has many other insights (a minimal matrix-free eigenvalue sketch below).
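A minimal matrix-free sketch of the underlying idea (not the paper's stochastic Lanczos quadrature density estimator): Lanczos-type solvers can extract extreme Hessian eigenvalues from Hessian-vector products alone. Here the loss is a toy least-squares problem, so the HVP is exact; for a real network the HVP would come from autodiff.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n_data, n_params = 500, 200
A = rng.normal(size=(n_data, n_params))
b = rng.normal(size=n_data)

def hessian_vector_product(v):
    # Loss L(w) = 0.5 * ||A w - b||^2 has Hessian A^T A, so H v = A^T (A v);
    # the Hessian is never formed explicitly.
    return A.T @ (A @ v)

H_op = LinearOperator((n_params, n_params), matvec=hessian_vector_product,
                      dtype=np.float64)

# ARPACK's Lanczos iteration only ever calls the matvec above.
top = eigsh(H_op, k=5, which='LA', return_eigenvectors=False)
bottom = eigsh(H_op, k=5, which='SA', return_eigenvectors=False)
print("largest Hessian eigenvalues: ", np.sort(top)[::-1])
print("smallest Hessian eigenvalues:", np.sort(bottom))
```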
- The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
- Uniform convergence may be unable to explain generalization in deep learning
- They show that quantities from recent generalization bounds, when evaluated on networks trained to a fixed margin on training sets of m samples, actually increase with increasing m.
- As expected, the actual test error decreases with increasing m, so something seems to be fundamentally wrong (or not?).
- Using Pre-Training Can Improve Model Robustness and Uncertainty.
- With pre-training, fewer epochs are required to reach the optimal training error.
- Less training, and hence lower test error.
- Early stopping helps adversarial robustness??
- The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
- Complexity of Linear Regions in Deep Networks
- On the Spectral Bias of Neural Networks
- Deep networks learn simpler (lower-frequency) functions first (toy demo below).
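A minimal numpy sketch of the spectral-bias claim (my own toy example, not from the paper): fit a 1-D target made of a low- and a high-frequency sinusoid with a small tanh network and track how much of each frequency component remains in the residual. The low-frequency part is fit far sooner.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.linspace(0.0, 1.0, n).reshape(-1, 1)
low = np.sin(2 * np.pi * 1 * x)    # low-frequency component of the target
high = np.sin(2 * np.pi * 10 * x)  # high-frequency component of the target
y = low + high

h = 256                            # hidden width
W1 = rng.normal(0.0, 1.0, (1, h))
b1 = np.zeros(h)
W2 = rng.normal(0.0, 1.0 / np.sqrt(h), (h, 1))
b2 = np.zeros(1)
lr = 1e-2

def remaining(res, component):
    """|<residual, component>| / <component, component>: ~1 unlearned, ~0 learned."""
    return abs((res * component).sum()) / (component * component).sum()

for step in range(20001):
    a = np.tanh(x @ W1 + b1)       # (n, h) hidden activations
    pred = a @ W2 + b2             # (n, 1) network output
    res = pred - y
    # Manual backprop for the loss 0.5 * mean(res ** 2).
    g_pred = res / n
    gW2, gb2 = a.T @ g_pred, g_pred.sum(axis=0)
    g_a = (g_pred @ W2.T) * (1.0 - a ** 2)
    gW1, gb1 = x.T @ g_a, g_a.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    if step % 5000 == 0:
        print(f"step {step:6d}  low-freq residual {remaining(res, low):.3f}  "
              f"high-freq residual {remaining(res, high):.3f}")
```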