Federated Multi-Task Learning

TL;DR

Mocha is a convex optimization routine that is very similar to CoCoA, but adds modifications to tolerate unreliable nodes (stragglers, failures). Each node solves its own sub-problem, but the weights on each node are constrained to be related (hence, multi-task learning). It does not yet apply to non-convex problems.
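
For reference, a sketch of the kind of multi-task objective the paper optimizes (notation approximate, from my reading of the paper): each device $t$ has its own weight vector $w_t$, the $w_t$ are stacked into $W$, and the regularizer couples the devices through a task-relationship matrix $\Omega$ that is learned jointly with $W$.

```latex
% Sketch of the federated multi-task objective (notation approximate).
% Each device t holds n_t local samples (x_t^i, y_t^i) and its own loss \ell_t;
% R(W, \Omega) ties the per-device weights together via the learned
% task-relationship matrix \Omega.
\min_{W,\,\Omega} \;
  \sum_{t=1}^{m} \sum_{i=1}^{n_t}
    \ell_t\!\left( w_t^{\top} x_t^{i},\, y_t^{i} \right)
  \;+\; \mathcal{R}(W, \Omega)
```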

Notes

  • Builds on CoCoA, which is a distributed optimization routine leveraging a primal/dual formulation that shares some similarity with ADMM
  • They propose Mocha, which, in contrast to CoCoA, aims to address problems specific to federated learning: stragglers, fault tolerance (nodes failing), non-IID data, varying numbers of samples per device, etc.
  • The training is a multi-task learning setting since each device learns its own set of weights, but the weights are constrained to be related in some way (in the paper, they propose that weights will form clusters)
  • ADMM is a special case of CoCoA, which is a special case of Mocha
  • Straggler mitigation is accomplished by allowing the worker nodes to only approximately solve the subproblem on that node, where the definition of approximately is flexible across nodes
  • The key difference from CoCoA seems to be that each node has its own knob for the level of approximation on its particular sub-problem, so some nodes have the flexibility to make very little progress on their problem if they have limited resources or they drop out (see the sketch after this list)
  • Another difference is that you can specify how the solutions to the different sub-tasks are related
  • Seems like they deal with the inherent non-IID nature of federated learning by just saying that multitask learning is good at handling this
  • They allow for dropped nodes by setting the approximation knob to 1, which means that a particular node doesn't have to make any progress whatsoever on its sub-problem
  • "While MOCHA does not apply to non-convex deep learning models in its current form, we note that there may be natural connections between this approach and “convexified” deep learning models [6, 34, 51, 56] in the context of kernelized federated multi-task learning."