- [Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer NNs](http://proceedings.mlr.press/v97/arora19a.html)
- Analysis via the NTK: training speed depends on the projection of the label vector y onto the eigenvectors of the NTK Gram matrix.
- Projections onto the top eigenvectors (largest eigenvalues) decay faster than those onto eigenvectors with smaller eigenvalues.
- Hence the loss on correctly labeled data decreases faster than on incorrectly (randomly) labeled data (see the sketch below).
- The generalization bound depends on the NTK Gram matrix, which is data-dependent.
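A hedged sketch of the paper's key dynamics equation (roughly its Theorem 4.1, with the width/step-size conditions and constants omitted), where u(t) denotes the network predictions at step t and the NTK Gram matrix has eigendecomposition H∞ = Σᵢ λᵢ vᵢ vᵢᵀ:

```latex
% Residual dynamics of GD in the NTK regime (schematic; conditions omitted).
\| y - u(t) \|_2 \;\approx\; \sqrt{ \sum_{i=1}^{n} \left( 1 - \eta \lambda_i \right)^{2t} \left( v_i^\top y \right)^2 }
```

Components of y aligned with large-λᵢ eigenvectors shrink fastest; true labels tend to concentrate on the top eigenvectors while random labels do not, which is the training-speed gap noted above.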
- Gradient Descent Finds Global Minima of Deep Neural Networks
- Also an NTK-based analysis.
- Fully connected: global convergence at a linear rate, with the number of parameters polynomial in (# samples, inverse of the smallest eigenvalue of the NTK Gram matrix, 2^(# layers)).
- ResNets: same as above, except that the dependence on the number of layers improves from exponential to polynomial (see the sketch below).
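A hedged sketch of the shape of the convergence guarantee (my paraphrase; the exact rate, width requirement, and constants differ in the paper), with K the limiting NTK Gram matrix on the n training points and u(t) the predictions at step t:

```latex
% Schematic linear-rate convergence under sufficient over-parameterization.
\| y - u(t) \|_2^2 \;\le\; \left( 1 - \tfrac{\eta \, \lambda_{\min}(K)}{2} \right)^{t} \, \| y - u(0) \|_2^2
```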
- A Convergence Theory for Deep Learning via Over-Parameterization
- Needs to be compared with 'Gradient Descent Finds Global Minima of Deep Neural Networks' above.
- Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
- Instead of using only parameter-dependent terms, their bound depends on both parameter- and data-dependent quantities,
- e.g., the maximum norm (over the dataset) of the hidden activation vectors.
- Makes sense, as the spectral norm is a rather pessimistic quantity that might never be realized on the actual data (numerical evaluation of the bound shows this to be true).
- Novel analysis technique: the data-dependent properties are encoded as indicator variables in the loss function, and one then works with the Rademacher complexity of the augmented hypothesis class (which is still defined before seeing the data); see the schematic below.
- Possibly applicable to our paper?
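My rough schematic of the augmentation idea (the paper's actual construction is a careful Lipschitz/soft version, not a hard indicator; the activations h_j(x) and thresholds κ_j here are placeholders):

```latex
% Schematic only: fold data-dependent conditions into the loss so the augmented
% class is fixed a priori; h_j(x) = layer-j activations, \kappa_j = fixed thresholds.
\tilde{\ell}(f; x, y) \;=\; \max\Big( \ell(f; x, y),\; \max_{j} \, \mathbb{1}\big[\, \| h_j(x) \| > \kappa_j \,\big] \Big)
```

On training data where the indicators vanish, the augmented loss coincides with the original one, so a standard Rademacher bound for the augmented class translates into a bound that only involves the empirically realized quantities (e.g., maximum activation norms).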
- Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
- Gives a discrete-time (finite step size) analysis of GD in deep linear networks.
- Shows how the network recovers the principal directions of the data one by one, in distinct phases, with directions of larger singular value learned first (toy simulation below).
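A minimal two-layer linear-network simulation of this sequential-learning effect (my own toy example, not the paper's exact setting): from a small initialization, full-batch GD recovers the singular values of the target map one by one, largest first.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
target_svals = np.array([5.0, 3.0, 1.5, 0.5, 0.1])
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
T = U @ np.diag(target_svals) @ V.T           # target linear map

scale, lr = 1e-3, 1e-2                        # small init is what produces distinct phases
W1 = scale * rng.normal(size=(d, d))
W2 = scale * rng.normal(size=(d, d))

for step in range(12801):
    E = W2 @ W1 - T                           # with whitened inputs, loss = 0.5 * ||E||_F^2
    # Simultaneous GD update on both factors.
    W2, W1 = W2 - lr * (E @ W1.T), W1 - lr * (W2.T @ E)
    if step in {0, 100, 200, 400, 800, 1600, 3200, 6400, 12800}:
        svals = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(f"step {step:5d}  singular values of W2 W1: {np.round(svals, 2)}")
```

The printout shows a staircase: each singular value of the end-to-end map plateaus near zero and then rises to its target, in decreasing order of magnitude.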
- A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent
- Alternating Minimizations Converge to Second-Order Optimal Solutions
- A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
- The noise in SGD might not be Gaussian; it might instead converge to a heavy-tailed alpha-stable distribution.
- This allows the iterates to make occasional large jumps.
- It supports the classical wide-minima argument: SGD tends to end up in wide minima, since it would have jumped out of narrow ones (toy comparison below).
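A small sketch of the heavy-tail point (my own toy comparison, not from the paper): alpha-stable noise with tail index alpha < 2 occasionally produces enormous steps, whereas Gaussian noise essentially never does.

```python
import numpy as np
from scipy.stats import levy_stable, norm

rng = np.random.default_rng(0)
n = 100_000
gaussian = norm.rvs(size=n, random_state=rng)
alpha = 1.5  # tail index; alpha = 2 would recover the Gaussian case
stable = levy_stable.rvs(alpha, beta=0.0, size=n, random_state=rng)

# Compare the extreme step sizes of the two noise models.
for name, noise in [("gaussian", gaussian), ("alpha-stable (alpha=1.5)", stable)]:
    print(f"{name:25s} max |step| = {np.abs(noise).max():10.1f}, "
          f"99.9th pct = {np.quantile(np.abs(noise), 0.999):6.1f}")
```

The rare huge steps of the alpha-stable samples are the "jumps" that would kick the iterates out of narrow basins.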
- Matrix-Free Preconditioning in Online Learning
- I don't quite understand how they do it, but the idea is to avoid forming explicit matrices when preconditioning the gradients in online linear optimization (OLO).
- [Adversarial examples from computational constraints](http://proceedings.mlr.press/v97/bubeck19a.html)
- Rademacher Complexity for Adversarially Robust Generalization
- Gives tight upper and lower bounds for the adversarial Rademacher complexity.
- Shows an unavoidable dimension dependence in the adversarial Rademacher complexity.
- It is always at least as large as the standard (non-robust) Rademacher complexity.
- An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
- Derives a procedure (with code) to estimate the eigenvalue density of the Hessian.
- Shows that negative eigenvalues disappear very early in training.
- Has many other insights (a minimal matrix-free eigenvalue sketch below).
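A minimal matrix-free sketch of the underlying idea (not the paper's stochastic Lanczos quadrature density estimator): Lanczos-type solvers can extract extreme Hessian eigenvalues from Hessian-vector products alone. Here the loss is a toy least-squares problem, so the HVP is exact; for a real network the HVP would come from autodiff.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n_data, n_params = 500, 200
A = rng.normal(size=(n_data, n_params))
b = rng.normal(size=n_data)

def hessian_vector_product(v):
    # Loss L(w) = 0.5 * ||A w - b||^2 has Hessian A^T A, so H v = A^T (A v);
    # the Hessian is never formed explicitly.
    return A.T @ (A @ v)

H_op = LinearOperator((n_params, n_params), matvec=hessian_vector_product,
                      dtype=np.float64)

# ARPACK's Lanczos iteration only ever calls the matvec above.
top = eigsh(H_op, k=5, which='LA', return_eigenvectors=False)
bottom = eigsh(H_op, k=5, which='SA', return_eigenvectors=False)
print("largest Hessian eigenvalues: ", np.sort(top)[::-1])
print("smallest Hessian eigenvalues:", np.sort(bottom))
```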
- The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
- Uniform convergence may be unable to explain generalization in deep learning
- They show that quantities from recent generalization bounds, when evaluated on networks trained to a fixed margin on training sets of m samples, actually increase with increasing m.
- As expected, the actual test error decreases with increasing m, so something seems to be fundamentally wrong (or not?).
- Using Pre-Training Can Improve Model Robustness and Uncertainty.
- With pre-training, fewer epochs are required to reach the optimal training error.
- Less training, and hence lower test error.
- Early stopping helps adversarial robustness??
- The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
- Complexity of Linear Regions in Deep Networks
- On the Spectral Bias of Neural Networks
- Deep networks learn simpler (lower-frequency) functions first (toy demo below).
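A minimal numpy sketch of the spectral-bias claim (my own toy example, not from the paper): fit a 1-D target made of a low- and a high-frequency sinusoid with a small tanh network and track how much of each frequency component remains in the residual. The low-frequency part is fit far sooner.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.linspace(0.0, 1.0, n).reshape(-1, 1)
low = np.sin(2 * np.pi * 1 * x)    # low-frequency component of the target
high = np.sin(2 * np.pi * 10 * x)  # high-frequency component of the target
y = low + high

h = 256                            # hidden width
W1 = rng.normal(0.0, 1.0, (1, h))
b1 = np.zeros(h)
W2 = rng.normal(0.0, 1.0 / np.sqrt(h), (h, 1))
b2 = np.zeros(1)
lr = 1e-2

def remaining(res, component):
    """|<residual, component>| / <component, component>: ~1 unlearned, ~0 learned."""
    return abs((res * component).sum()) / (component * component).sum()

for step in range(20001):
    a = np.tanh(x @ W1 + b1)       # (n, h) hidden activations
    pred = a @ W2 + b2             # (n, 1) network output
    res = pred - y
    # Manual backprop for the loss 0.5 * mean(res ** 2).
    g_pred = res / n
    gW2, gb2 = a.T @ g_pred, g_pred.sum(axis=0)
    g_a = (g_pred @ W2.T) * (1.0 - a ** 2)
    gW1, gb1 = x.T @ g_a, g_a.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    if step % 5000 == 0:
        print(f"step {step:6d}  low-freq residual {remaining(res, low):.3f}  "
              f"high-freq residual {remaining(res, high):.3f}")
```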