@dfeng
Created April 10, 2020 13:55
  • tags: #paper
  • results:
    • we introduce a new ratio regularization penalty, $$l_1 / l_2$$ (also written $$l_{1,2}$$)
  • outline (draft 1):
    • basically, a follow-up of Arora
      • what they show is that, in the context of matrix completion, there is an interesting theory (partially verified empirically) that depth, under gradient descent, makes the singular values of the learned matrix more extreme; analyzing the gradient flow, they find a factor of $$N$$ there, corresponding to the depth (see the first sketch after this outline)
      • they show that this cannot just be explained by nuclear norm minimization/regularization
    • we make a few remarks
      • firstly, the comparison against actual nuclear norm minimization (via a convex program) is not a particularly fair one
        • due to non-convexity, they're not equivalent
        • but more importantly, these are different "algorithms": convex optimization forces a single, optimal solution, while the deep linear neural network (DLNN) is a non-convex program with many other potential properties
        • or, put another way, this is more a statement that non-convex beats convex in many settings, because we prefer the flexibility of implicit regularization across multiple "optima"
      • the more important, practical questions are:
        • is depth performing something akin to an $$l_1$$ penalty?
          • nope
        • can the effect of depth be replicated by a different penalty?
          • yes
      • maybe:
        • depth is a very crude knob, and actually only works within a narrow window (depth 3 or 4); with depth 5, the test error fluctuates much more
        • so one could argue that depth is a very blunt weapon
      • ratio penalty
        • intuition: this is a "normalized" version of the nuclear norm, which is itself a convex relaxation of the rank
          • it targets one of the weaknesses of the nuclear norm, which is that you can reduce it simply by shrinking all the entries
          • since the ratio is scale-invariant, minimizing it can only be done by changing the relative sizes (e.g., of the singular values), not by shrinking everything (see the second sketch after this outline)
        • (cute) relation to other things:
          • Adam, similarly, is a ratio of first and second moments, whereas we are dealing with $$l_p$$ norms
          • it's like an $$F$$-test, or some sort of ANOVA test
          • or perhaps it's like $$k$$-means (within vs between squares)
      • something about optimizers (?)
  • thoughts:
    • ^^I'm a little worried that we're worrying a little too much about the exact exponent of the test error, when it might just be the case that, at some level, the performance of these algorithms is ultimately at the whim of the initialization and the numerical error from floating-point representation^^
      • in other words, it might be the case that anything at 1E-5 test error is already the best we can hope for
      • though, if that were the case, then running repeated samples would give us a better indicator (see the last sketch after this outline)
        • from a quick glance, I suspect that the ratio penalty has lower variance, while the unpenalized version is much more variable (dependent on initialization)
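
A minimal sketch of the depth effect referenced in the outline (first sketch), assuming a plain deep matrix factorization setup in the spirit of Arora et al.; the dimensions, mask density, learning rate, and init scale below are illustrative assumptions, not the paper's actual settings.

```python
# Sketch: gradient descent on a depth-N factorization W = W_1 ... W_N for matrix
# completion, then inspect the singular values of the end-to-end matrix W.
# All hyperparameters are illustrative and may need tuning.
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 20, 3
target = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
mask = rng.random((d, d)) < 0.3                 # observed entries

def chain(mats):
    """Product of a list of d x d matrices (identity for the empty list)."""
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

def deep_factorization_svs(depth, steps=5000, lr=1e-3, init_scale=0.1):
    """Full-batch GD on 0.5 * ||mask * (W - target)||_F^2 with W = W_1 ... W_depth."""
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(depth)]
    for _ in range(steps):
        W = chain(Ws)
        resid = mask * (W - target)             # dL/dW for the masked squared loss
        grads = [chain(Ws[:j]).T @ resid @ chain(Ws[j + 1:]).T for j in range(depth)]
        Ws = [Wj - lr * g for Wj, g in zip(Ws, grads)]
    return np.linalg.svd(chain(Ws), compute_uv=False)

for depth in (1, 2, 3):
    # Expectation (hedged): deeper factorizations concentrate mass in the top
    # singular values, i.e., the spectrum becomes more extreme.
    print(depth, np.round(deep_factorization_svs(depth)[:6], 3))
```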
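
The ratio penalty itself (second sketch): written here as the $$l_1 / l_2$$ ratio of the singular values, i.e. nuclear norm over Frobenius norm, which matches the "normalized nuclear norm" intuition above; the exact form used in the project may differ.

```python
import numpy as np

def ratio_penalty(W, eps=1e-12):
    """l1/l2 ratio of the singular values, i.e. nuclear norm / Frobenius norm."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return sigma.sum() / (np.linalg.norm(sigma) + eps)

W = np.random.default_rng(0).standard_normal((20, 20))

# Scale invariance: unlike the nuclear norm, the ratio cannot be reduced by
# uniformly shrinking the matrix.
print(ratio_penalty(W), ratio_penalty(0.01 * W))    # essentially equal

# For r equal nonzero singular values the ratio equals sqrt(r), so it ranges from
# 1 (rank one) to sqrt(min(m, n)) (flat spectrum): a smooth, scale-free proxy for rank.
low_rank = W[:, :2] @ W[:2, :]                      # a rank-2 matrix
print(ratio_penalty(low_rank), ratio_penalty(W))    # smaller for the low-rank matrix
```

In a training loop this would simply be added to the reconstruction loss with some coefficient; the notes above don't pin that coefficient down.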
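
And the repeated-runs check from the last bullet (last sketch): rerun the same configuration over many seeds and compare the spread of test errors with and without the penalty. `train_and_eval` is a hypothetical stand-in for the actual experiment script.

```python
import numpy as np

def repeated_runs(train_and_eval, n_runs=20, **config):
    """Mean and standard deviation of the test error over n_runs random seeds."""
    errs = np.array([train_and_eval(seed=s, **config) for s in range(n_runs)])
    return errs.mean(), errs.std()

# e.g. compare repeated_runs(train_and_eval, penalty="ratio") against
# repeated_runs(train_and_eval, penalty=None); if the ratio penalty mainly
# reduces variance across initializations, that will show up in the std.
```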