ICML2019 Papers on theory and insights into deep learning

Theory

DL Theory

  1. [Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer NNs](http://proceedings.mlr.press/v97/arora19a.html)
    • NTK-based analysis: training speed depends on the projections of the label vector y onto the eigenvectors of the NTK Gram matrix (rough sketch after this list).
    • Components along the top eigenvectors decrease faster than those along eigenvectors with smaller eigenvalues.
    • Hence, the training loss on correct labels decreases faster than on incorrect ones.
    • The generalization bound depends on the NTK and is therefore data-dependent.
  2. Gradient Descent Finds Global Minima of Deep Neural Networks
    • Also an NTK-based analysis.
    • Fully connected: global convergence in linear time; # parameters polynomial in (# samples, max eigenvalue of the NTK, 2^(# layers)).
    • ResNets: same as above, except that the complexity is polynomial in the number of layers (rather than exponential).
  3. A Convergence Theory for Deep Learning via Over-Parameterization
    • Needs to be compared with #2.
  4. Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
    • Instead of using purely parameter-dependent terms, their bound depends on both the parameters and the data.
    • e.g., the maximum norm (over the dataset) of the hidden activation vectors, etc.
    • Makes sense, as the spectral norm is a rather pessimistic quantity that might never be realized (their evaluation of the bound shows this to be true).
    • Novel analysis method: encode the data-dependent properties as indicator variables inside the loss function, then work with the Rademacher complexity of the entire hypothesis class (which is defined before seeing the data).
    • Possibly applicable to our paper??
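
A rough, hedged sketch of the key dynamics in paper 1 (notation assumed, reconstructed from memory rather than quoted from the paper): gradient descent on a sufficiently wide two-layer net drives the residual along the eigendirections of the NTK Gram matrix, at rates set by the corresponding eigenvalues,

```latex
% Hedged sketch of the eigen-decomposition view in paper 1:
% H^infty is the NTK Gram matrix, eta the step size, y the label vector,
% u(t) the network's predictions on the training set at step t.
\[
  H^{\infty} = \sum_{i=1}^{n} \lambda_i\, v_i v_i^{\top},
  \qquad
  y - u(t) \;\approx\; \sum_{i=1}^{n} \bigl(1 - \eta \lambda_i\bigr)^{t}\,
    \bigl(v_i^{\top} y\bigr)\, v_i ,
\]
```

so the components of y aligned with the large eigenvalues of H^infty shrink fastest, which is why labels that align well with the top eigenvectors train faster.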

Optimization

  1. Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
    • Gives a discrete-time analysis of gradient descent in deep linear networks.
    • Shows how the network recovers the principal directions of the data one at a time, in distinct phases of training.
  2. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent
  3. Alternating Minimizations Converge to Second-Order Optimal Solutions
  4. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
    • The noise in SGD might not be Gaussian and might instead converge to a heavy-tailed alpha-stable distribution (toy illustration after this list).
    • This allows the iterates to make large jumps.
    • Supports the argument that SGD converges to classically 'wide' minima (it would have jumped out of narrow ones otherwise).
  5. Matrix-Free Preconditioning in Online Learning
    • I don't quite understand how they do it, but the idea is to precondition the gradients in OLO without maintaining explicit preconditioning matrices.
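
A toy illustration for the tail-index paper above (my own sketch, not the paper's experiment): compare Gaussian gradient noise (the alpha = 2 stable case) with heavy-tailed Cauchy noise (alpha = 1 stable) in SGD on a 1-D quadratic; the heavy-tailed iterates occasionally take very large jumps.

```python
import numpy as np

# Toy sketch: SGD on f(w) = w^2 / 2 with two kinds of gradient noise.
# Gaussian noise is the alpha = 2 stable case; standard Cauchy is alpha = 1,
# a heavy-tailed alpha-stable law under which the iterates incur large jumps.
rng = np.random.default_rng(0)
eta, steps = 0.05, 2000

def run_sgd(noise_sampler):
    w, trace = 1.0, []
    for _ in range(steps):
        grad = w + 0.1 * noise_sampler()   # exact gradient of w^2/2 plus noise
        w -= eta * grad
        trace.append(w)
    return np.array(trace)

gaussian_trace = run_sgd(rng.standard_normal)   # light-tailed noise
cauchy_trace = run_sgd(rng.standard_cauchy)     # heavy-tailed noise

print("largest single step (Gaussian):", np.abs(np.diff(gaussian_trace)).max())
print("largest single step (Cauchy)  :", np.abs(np.diff(cauchy_trace)).max())
```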

Adversarial

  1. [Adversarial examples from computational constraints](http://proceedings.mlr.press/v97/bubeck19a.html]
  2. Rademacher Complexity for Adversarially Robust Generalization
    • Tight upper and lower bounds for the adversarial Rademacher complexity (definition sketched after this list).
    • Shows an unavoidable dimension dependence in the adversarial Rademacher complexity.
    • It is always at least as large as the standard (non-robust) Rademacher complexity.
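
For reference, a hedged sketch (notation assumed, not the paper's exact statement) of the quantity in question: the adversarial Rademacher complexity is the empirical Rademacher complexity of the worst-case (robust) loss class,

```latex
% Hedged sketch: empirical Rademacher complexity of the adversarial loss class.
% sigma_i are i.i.d. Rademacher signs, epsilon is the perturbation budget,
% and the max is over allowed perturbations of each training point x_i.
\[
  \widehat{\mathfrak{R}}_S\bigl(\tilde{\ell} \circ \mathcal{F}\bigr)
  \;=\;
  \mathbb{E}_{\sigma}\!\left[
    \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i
    \max_{\lVert x_i' - x_i \rVert \le \epsilon} \ell\bigl(f(x_i'),\, y_i\bigr)
  \right].
\]
```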

Empirical Insights in DL

  1. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
    • Derives a procedure (with code) to estimate the eigenvalue density of the Hessian (a minimal Hessian-vector-product sketch follows after this list).
    • Shows that very early in training, the negative eigenvalues disappear.
    • Has many other insights.
  2. The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
  3. Uniform convergence may be unable to explain generalization in deep learning
    • They show that the quantities appearing in recent generalization bounds, evaluated on networks trained to a fixed margin on a training set of m samples, actually increase as m grows.
    • As expected, the actual test error decreases with increasing m, so something seems to be fundamentally wrong (or not?).
  4. Using Pre-Training Can Improve Model Robustness and Uncertainty
    • With pre-training, fewer epochs are required to reach the optimal training error.
    • Less training, and hence lower test error.
    • Early stopping helps adversarial robustness?
  5. The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
  6. Complexity of Linear Regions in Deep Networks
  7. On the Spectral Bias of Neural Networks
    • Deep networks learn simpler (lower-frequency) functions first.
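
Referring back to paper 1 (Hessian eigenvalue density): the paper estimates the full spectral density (via stochastic Lanczos quadrature); below is only a minimal, hedged sketch of the underlying Hessian-vector-product trick, used here with power iteration to estimate the top Hessian eigenvalue of a toy model. The model, data, and iteration counts are arbitrary placeholders.

```python
import torch

# Minimal sketch (not the paper's code): estimate the top Hessian eigenvalue of
# the training loss via Hessian-vector products and power iteration.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for Hvp

# Power iteration: repeatedly apply the Hessian H to a random unit vector v.
v = [torch.randn_like(p) for p in params]
v_norm = torch.sqrt(sum((u * u).sum() for u in v))
v = [u / v_norm for u in v]
top_eig = 0.0
for _ in range(30):
    # Hv = d/dparams <grads, v>  (the standard Hessian-vector-product trick)
    Hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    top_eig = sum((h * u).sum() for h, u in zip(Hv, v)).item()  # Rayleigh quotient
    Hv_norm = torch.sqrt(sum((h * h).sum() for h in Hv))
    v = [h / Hv_norm for h in Hv]

print(f"estimated top Hessian eigenvalue: {top_eig:.4f}")
```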

Some more papers

  1. On Certifying Non-Uniform Bounds against Adversarial Attacks