- Setup WSL
-
Install WSL:
wsl --install -d Ubuntu
-
Run PowerShell as Administrator and enter:
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
wsl --set-default-version 2
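After a restart you can verify the setup; the following command lists the installed distributions and the WSL version they run on:
# Ubuntu should appear in the list and run as WSL 2
wsl -l -v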
-
One of the main reasons Mixture of Experts (MoE) models are gaining so much attention is their high degree of parallelization, which allows the number of parameters to be scaled up massively. Usually this requires a lot of complex code and deep knowledge of distributed systems, but we can get it for free with the FastMoE library.
First of all, we need to define our experts and specify, through the expert_dp_comm attribute, which type of gradient reduction we would like to use:
- dp: gradients are reduced across the data-parallel group, which means they are not synchronized within the model-parallel group.
- world: gradients are synchronized across all workers, regardless of their model- or data-parallel group. This is extremely useful for shared layers like the gate.
Let's define our MoE layer by opting for synchronization across all workers:
from fmoe.layers import FMoE
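As a minimal sketch of what this can look like (the sizes below are illustrative; FMoETransformerMLP is FastMoE's ready-made feed-forward MoE built on top of the FMoE class imported above, and it passes expert_dp_comm on to mark_parallel_comm):
from fmoe.transformer import FMoETransformerMLP

# expert_dp_comm="world" marks the expert parameters so that their gradients
# are synchronized across all workers; "dp" would restrict the reduction to
# the data-parallel group.
moe_layer = FMoETransformerMLP(
    num_expert=4,       # experts hosted on each worker
    d_model=400,        # input/output feature size
    d_hidden=512,       # hidden size inside each expert's MLP
    expert_dp_comm="world",
)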
In this tutorial we consider a simple model in which we replace the MLP with a MoE. The starting model is defined like this:
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)   # 3 input channels, 6 output channels, 5x5 kernel
        self.pool = nn.MaxPool2d(2, 2)    # 2x2 max pooling
        self.conv2 = nn.Conv2d(6, 16, 5)  # 6 input channels, 16 output channels, 5x5 kernel
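With the convolutional part in place, a hedged sketch of the final model, with the MLP head swapped for the MoE layer defined above, could look like this (assuming 32x32 RGB inputs, so the flattened feature size is 16 * 5 * 5 = 400; the 10-class output head is illustrative):
import torch
import torch.nn as nn
import torch.nn.functional as F
from fmoe.transformer import FMoETransformerMLP

class MoENet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # The MoE block replaces the plain MLP; expert gradients are
        # synchronized across all workers ("world").
        self.moe = FMoETransformerMLP(num_expert=4, d_model=16 * 5 * 5,
                                      d_hidden=512, expert_dp_comm="world")
        self.fc_out = nn.Linear(16 * 5 * 5, 10)  # illustrative 10-class head

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> (batch, 6, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))   # -> (batch, 16, 5, 5)
        x = torch.flatten(x, 1)                # -> (batch, 400)
        x = self.moe(x)                        # MoE keeps the feature size (d_model)
        return self.fc_out(x)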
Step-by-step tutorial to install FastMoE on your local machine:
- First of all you'll need to check your torch and NCCL versions, and make sure your installed CUDA version is compatible with the one torch was compiled against (in general, the latest torch version also works with the latest CUDA):
# run this command in a terminal; the output should look something like this:
python -c 'import torch; print(torch.__version__); print(torch.cuda.nccl.version())'
>>> 2.0.1+cu117
>>> (2, 14, 3) # -> this means version 2.14.3
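Once the torch/CUDA/NCCL stack checks out, FastMoE itself is installed from source; this is a sketch based on FastMoE's public README (USE_NCCL=1 is only needed if you want the distributed expert feature, which requires NCCL with P2P support):
# clone the repository and build/install the extension
git clone https://github.com/laekov/fastmoe.git
cd fastmoe
USE_NCCL=1 python setup.py install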
To launch distributed training in torch with mpirun we have to:
- Configure a passwordless ssh connection with the nodes
- Set up the distributed environment inside the training script, in this case train.py (a sketch follows the SSH step below)
- Launch the training from the MASTER node with mpirun
For the first step, this is the pipeline:
# generate a public/private ssh key and make sure to NOT insert a passphrase
ssh-keygen -t rsa
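For the second step, here is a minimal sketch of what the setup inside train.py could look like, assuming mpirun is OpenMPI (which exposes OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_LOCAL_RANK) and using placeholder values for the MASTER address and port:
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # OpenMPI exposes rank and world size through these environment variables
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    # rendezvous point: address/port of the MASTER node (placeholders here)
    os.environ.setdefault("MASTER_ADDR", "192.168.1.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

For the third step, the job would then be launched from the MASTER node with something like mpirun -np 2 -H node1,node2 python train.py (one process per node; hostnames and process counts are placeholders).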
In this tutorial we assume we are launching a distributed training job on 2 nodes using DeepSpeed with the OpenMPI launcher.
- First of all, DeepSpeed needs a passwordless ssh connection to all the nodes, MASTER included:
# generate a public/private ssh key and make sure to NOT insert a passphrase
ssh-keygen -t rsa
# copy the public key 'id_rsa.pub' to the MASTER and SLAVE
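The copy itself can be done with ssh-copy-id (the user and hostnames below are placeholders):
# appends id_rsa.pub to ~/.ssh/authorized_keys on each node
ssh-copy-id user@master-node
ssh-copy-id user@slave-node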