This is a speed benchmark for distributed training.
- Ubuntu xxx
- CUDA xxx
- NCCL xxx
- Autobot xxx
- TensorFlow xxx
- PyTorch xxx
- MXNet xxx
- cProfile
- NVIDIA Nsight Systems
- Profile tools provided by each framework
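
As a rough illustration of how cProfile from the list above could be wrapped around a training loop, here is a minimal sketch. `train_step` and `profile_steps` are hypothetical names, and the loop body is a CPU stand-in rather than a real framework call. cProfile only sees Python-level function calls, so GPU kernel time would still need NVIDIA Nsight Systems or the framework's own profiler.

```python
import cProfile
import io
import pstats
import time
from typing import Tuple


def train_step(batch_size: int) -> int:
    """Hypothetical stand-in for one training iteration (assumption, not real framework code)."""
    total = 0
    for i in range(batch_size * 1000):
        total += i % 7
    return total


def profile_steps(num_steps: int, batch_size: int) -> Tuple[float, str]:
    """Profile num_steps iterations and return (samples per second, text report)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    for _ in range(num_steps):
        train_step(batch_size)
    profiler.disable()
    elapsed = time.perf_counter() - start

    # Render the five most expensive entries, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)

    throughput = num_steps * batch_size / elapsed
    return throughput, buffer.getvalue()


if __name__ == "__main__":
    samples_per_sec, report = profile_steps(num_steps=10, batch_size=32)
    print(f"throughput: {samples_per_sec:.1f} samples/s")
    print(report)
```

The measured throughput includes profiler overhead, so a speed number meant for comparison would probably be taken from a separate run without the profiler enabled.
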
- Image Classification: ResNet50, VGG16
- Translation: GNMT-16
- Video Captioning: S2VT
Each experiment below should be run on each of the four deep learning frameworks.
- Different GPU placements (e.g., 4 GPUs on different nodes)
- With or without Horovod; our Horovod vs. official Horovod
- RDMA or socket
- Different parallel architectures
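
One way to keep track of the runs implied by the list above is to enumerate them programmatically. The sketch below assumes a full cross-product of every factor, which may be more runs than intended if the factors are meant to be varied one at a time. The factor labels (`same_node`, `arch_a`, and so on) and the function `build_matrix` are hypothetical placeholders; the framework names are passed in by the caller.

```python
from itertools import product

# Models named in this benchmark.
MODELS = ["ResNet50", "VGG16", "GNMT-16", "S2VT"]

# Hypothetical labels for each experiment factor (assumptions for illustration).
FACTORS = {
    "gpu_placement": ["same_node", "different_nodes"],
    "horovod": ["none", "ours", "official"],
    "transport": ["rdma", "socket"],
    "parallel_arch": ["arch_a", "arch_b"],
}


def build_matrix(frameworks, models=MODELS, factors=FACTORS):
    """Return one dict per run: every framework x model x factor combination."""
    keys = list(factors)
    runs = []
    for framework, model in product(frameworks, models):
        for values in product(*(factors[k] for k in keys)):
            run = {"framework": framework, "model": model}
            run.update(zip(keys, values))
            runs.append(run)
    return runs


if __name__ == "__main__":
    # Framework names here are placeholders supplied by the caller.
    runs = build_matrix(["framework_a", "framework_b"])
    print(len(runs), "runs")
    print(runs[0])
```

Each run dict can then serve as the key for a results table, so a missing configuration is easy to spot.
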