One of the main reasons Mixture of Experts (MoE) models are gaining so much attention is their high degree of parallelization, which makes it possible to scale the number of parameters dramatically.
Usually this requires a lot of complex code and deep knowledge of distributed systems, but the FastMoE library gives it to us essentially for free.
First of all, we need to define our experts and specify, through the expert_dp_comm attribute, which type of gradient reduction we want to use:
- dp: gradients are reduced across the data-parallel group only, meaning they are not synchronized within the model-parallel group.
- world: gradients are synchronized across all workers, regardless of their model- or data-parallel group. This is extremely useful for shared layers such as the gate.
Let's define our MoE layer, opting for synchronization across all workers:
from fmoe.layers import FMoE
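import torch
from torch import nn

# What follows is a minimal sketch, not the definitive setup: the ExpertMLP
# module is our own illustrative expert, and the FMoE arguments
# (num_expert, d_model, top_k, expert) as well as
# mark_parallel_comm(expert_dp_comm=...) follow the FastMoE documentation;
# double-check them against the version you have installed.


class ExpertMLP(nn.Module):
    """A simple two-layer feed-forward expert (hypothetical example)."""

    def __init__(self, d_model):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model)
        self.fc2 = nn.Linear(4 * d_model, d_model)

    def forward(self, x, fwd_expert_count=None):
        # fwd_expert_count is accepted but unused, so the module works whether
        # or not FastMoE passes the per-expert token counts to the expert.
        return self.fc2(torch.relu(self.fc1(x)))


moe = FMoE(
    num_expert=4,        # number of experts hosted on each worker
    d_model=512,         # hidden size of the incoming token vectors
    top_k=2,             # each token is routed to its 2 best experts
    expert=ExpertMLP,    # callable that builds one expert given d_model
)

# Opt for "world": expert gradients are synchronized across all workers,
# instead of only within the data-parallel group ("dp").
moe.mark_parallel_comm(expert_dp_comm="world")

# Note: the forward pass, e.g. moe(tokens) with tokens of shape
# [batch, d_model], requires a GPU build of FastMoE, and a distributed
# launch is needed for the cross-worker reduction to matter (world_size > 1).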