Federated Multi-Task Learning

TL;DR

Mocha is a convex optimization routine that is very similar to CoCoA, but adds modifications to tolerate unreliable nodes (stragglers, failures). Each node solves its own sub-problem, but the weights on each node are constrained to be related (hence, multi-task learning). It does not yet apply to non-convex problems.
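
For reference, a sketch of the kind of multi-task objective the paper optimizes (notation approximate, from my reading of the paper): each device $t$ has its own weight vector $w_t$, the $w_t$ are stacked into $W$, and the regularizer couples the devices through a task-relationship matrix $\Omega$ that is learned jointly with $W$.

```latex
% Sketch of the federated multi-task objective (notation approximate).
% Each device t holds n_t local samples (x_t^i, y_t^i) and its own loss \ell_t;
% R(W, \Omega) ties the per-device weights together via the learned
% task-relationship matrix \Omega.
\min_{W,\,\Omega} \;
  \sum_{t=1}^{m} \sum_{i=1}^{n_t}
    \ell_t\!\left( w_t^{\top} x_t^{i},\, y_t^{i} \right)
  \;+\; \mathcal{R}(W, \Omega)
```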

Notes

  • Builds on CoCoA, which is a distributed optimization routine leveraging a primal/dual formulation that shares some similarity with ADMM
  • They propose Mocha, which, in contrast to CoCoA, aims to address problems specific to federated learning: stragglers, fault tolerance (nodes failing), non-IID data, varying numbers of samples per device, etc.
  • The training is a multi-task learning setting since each device learns its own set of weights, but the weights are constrained to be related in some way (in the paper, they propose that weights will form clusters)
  • ADMM is a special case of CoCoA, which is a special case of Mocha
  • Straggler mitigation is accomplished by allowing the worker nodes to only approximately solve the subproblem on that node, where the definition of approximately is flexible across nodes
  • The key difference from CoCoA seems to be that each node has its own knob for the level of approximation on its particular sub-problem, so some nodes have the flexibility to make very little progress on their problem if they have limited resources or they drop out (see the sketch after this list)
  • Another difference is that you can specify how the solutions to the different sub-tasks are related
  • Seems like they deal with the inherent non-IID nature of federated learning by just saying that multitask learning is good at handling this
  • They allow for dropped nodes by setting the approximation knob to 1, which means that a particular node doesn't have to make any progress whatsoever on its sub-problem
  • "While MOCHA does not apply to non-convex deep learning models in its current form, we note that there may be natural connections between this approach and “convexified” deep learning models [6, 34, 51, 56] in the context of kernelized federated multi-task learning."