@dfeng
Created April 10, 2020 13:55
  • tags: #paper
  • results:
    • we introduce a new ratio regularization penalty, $$l_1 / l_2$$ (also written $$l_{1,2}$$)
  • outline (draft 1):
    • basically, a follow-up of Arora
      • what they show is that, in the context of matrix completion, there is an interesting theory (partially verified empirically) that depth, under gradient descent, makes the singular values of the learned matrix more extreme; analyzing the gradient flow, they find a factor of $$N$$ there, corresponding to the depth (see the first sketch after this outline)
      • they show that this cannot just be explained by nuclear norm minimization/regularization
    • we make a few remarks
      • firstly, the comparison against actual nuclear norm minimization (via a convex program) is not a particularly fair one
        • due to non-convexity, they're not equivalent
        • but more importantly, these are different "algorithms": convex optimization forces a single, optimal solution, while the deep linear neural network (DLNN) is a non-convex program with many other potential properties
        • or, put another way, this is more a statement that non-convex beats convex in many settings, because we prefer the flexibility of implicit regularization across multiple "optima"
      • the more important, practical questions are:
        • is depth performing something akin to an $$l_1$$ penalty?
          • nope
        • can the effect of depth be replicated by a different penalty?
          • yes
      • maybe:
        • depth is a very crude knob, and actually only works within a narrow window (depth 3 or 4); with depth 5, the test error fluctuates much more
        • so one could argue that depth is a very blunt weapon
      • ratio penalty
        • intuition: this is a "normalized" version of the nuclear norm, which is itself a convex relaxation of the rank
          • it targets one of the weaknesses of the nuclear norm, which is that you can reduce it simply by shrinking all the entries
          • since the ratio is scale-invariant, minimizing it can only be done by changing the relative sizes (e.g., of the singular values), not by shrinking everything (see the second sketch after this outline)
        • (cute) relation to other things:
          • Adam, similarly, is a ratio of first and second moments, whereas we are dealing with $$l_p$$ norms
          • it's like an $$F$$-test, or some sort of ANOVA test
          • or perhaps it's like $$k$$-means (within vs between squares)
      • something about optimizers (?)
  • thoughts:
    • ^^I'm a little worried that we're worrying a little too much about the exact exponent of the test error, when it might just be the case that, at some level, the performance of these algorithms is ultimately at the whim of the initialization and the numerical error from floating-point representation^^
      • in other words, it might be the case that anything at 1E-5 test error is already the best we can hope for
      • though, if that were the case, then running repeated samples would give us a better indicator (see the last sketch after this outline)
        • from a quick glance, I suspect that the ratio penalty has lower variance, while the unpenalized version is much more variable (dependent on initialization)
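
A minimal sketch of the depth effect referenced in the outline (first sketch), assuming a plain deep matrix factorization setup in the spirit of Arora et al.; the dimensions, mask density, learning rate, and init scale below are illustrative assumptions, not the paper's actual settings.

```python
# Sketch: gradient descent on a depth-N factorization W = W_1 ... W_N for matrix
# completion, then inspect the singular values of the end-to-end matrix W.
# All hyperparameters are illustrative and may need tuning.
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 20, 3
target = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
mask = rng.random((d, d)) < 0.3                 # observed entries

def chain(mats):
    """Product of a list of d x d matrices (identity for the empty list)."""
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

def deep_factorization_svs(depth, steps=5000, lr=1e-3, init_scale=0.1):
    """Full-batch GD on 0.5 * ||mask * (W - target)||_F^2 with W = W_1 ... W_depth."""
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(depth)]
    for _ in range(steps):
        W = chain(Ws)
        resid = mask * (W - target)             # dL/dW for the masked squared loss
        grads = [chain(Ws[:j]).T @ resid @ chain(Ws[j + 1:]).T for j in range(depth)]
        Ws = [Wj - lr * g for Wj, g in zip(Ws, grads)]
    return np.linalg.svd(chain(Ws), compute_uv=False)

for depth in (1, 2, 3):
    # Expectation (hedged): deeper factorizations concentrate mass in the top
    # singular values, i.e., the spectrum becomes more extreme.
    print(depth, np.round(deep_factorization_svs(depth)[:6], 3))
```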
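
The ratio penalty itself (second sketch): written here as the $$l_1 / l_2$$ ratio of the singular values, i.e. nuclear norm over Frobenius norm, which matches the "normalized nuclear norm" intuition above; the exact form used in the project may differ.

```python
import numpy as np

def ratio_penalty(W, eps=1e-12):
    """l1/l2 ratio of the singular values, i.e. nuclear norm / Frobenius norm."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return sigma.sum() / (np.linalg.norm(sigma) + eps)

W = np.random.default_rng(0).standard_normal((20, 20))

# Scale invariance: unlike the nuclear norm, the ratio cannot be reduced by
# uniformly shrinking the matrix.
print(ratio_penalty(W), ratio_penalty(0.01 * W))    # essentially equal

# For r equal nonzero singular values the ratio equals sqrt(r), so it ranges from
# 1 (rank one) to sqrt(min(m, n)) (flat spectrum): a smooth, scale-free proxy for rank.
low_rank = W[:, :2] @ W[:2, :]                      # a rank-2 matrix
print(ratio_penalty(low_rank), ratio_penalty(W))    # smaller for the low-rank matrix
```

In a training loop this would simply be added to the reconstruction loss with some coefficient; the notes above don't pin that coefficient down.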
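
And the repeated-runs check from the last bullet (last sketch): rerun the same configuration over many seeds and compare the spread of test errors with and without the penalty. `train_and_eval` is a hypothetical stand-in for the actual experiment script.

```python
import numpy as np

def repeated_runs(train_and_eval, n_runs=20, **config):
    """Mean and standard deviation of the test error over n_runs random seeds."""
    errs = np.array([train_and_eval(seed=s, **config) for s in range(n_runs)])
    return errs.mean(), errs.std()

# e.g. compare repeated_runs(train_and_eval, penalty="ratio") against
# repeated_runs(train_and_eval, penalty=None); if the ratio penalty mainly
# reduces variance across initializations, that will show up in the std.
```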