Throughput benchmark of DeepSpeed-based LLM training code.

We train an LLM with this code and report the training speed under different settings (see the table below). The machine has eight A800 GPUs, 1 TB of CPU memory, and two Intel Xeon Platinum 8358 CPUs. For the software stack, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
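
For reference, the settings varied in the table map onto a DeepSpeed setup roughly as sketched below. This is a minimal sketch assuming the standard transformers + deepspeed APIs; the model checkpoint name, batch numbers, and config values are illustrative assumptions, not the exact script used here.

```python
# Minimal sketch of a DeepSpeed training setup of the kind benchmarked here.
# NOTE: the model name, batch settings, and config values are illustrative
# assumptions, not the author's exact script.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # "BS" numerator (per device per iteration)
    "gradient_accumulation_steps": 8,     # 1 micro-batch x 8 GPUs x 8 steps = 64 global
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # "Zero Stage" column
        "offload_optimizer": {"device": "cpu"},  # "Optim. Off." column
        # Stage-3-only parameter offload ("Param. Off." column):
        # "offload_param": {"device": "cpu"},
        # ZeRO++ switches ("Zero++" column, see footnote 4):
        # "zero_quantized_weights": True,
        # "zero_hpz_partition_size": 8,
        # "zero_quantized_gradients": True,
    },
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()  # "Ckpt." column

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

def train_step(batch):
    # Forward/backward/step; gradient accumulation is handled by the engine.
    loss = engine(**batch).loss
    engine.backward(loss)
    engine.step()
    return loss
```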

Table: Benchmark of LLaMA-7B training using the DeepSpeed-based training code. The sequence length is 4096.

| Zero Stage | Ckpt.¹ | Optim. Off.² | Param. Off.³ | Zero++⁴ | BS⁵   | CPU Mem. (GB)⁶ | GPU Mem. (GB)⁷ | Throughput |
|------------|--------|--------------|--------------|---------|-------|----------------|----------------|------------|
| 2          | ×      | ×            | ×            | ×       | 1/64  | 320.1          | 19.4/44.8      | 5.33       |
| 2          | ✓      | ×            | ×            | ×       | 1/64  | 320.0          | 19.4/23.5      | 4.19       |
| 2          | ✓      | ✓            | ×            | ×       | 1/64  | 361.3          | 13.4/16.9      | 1.81       |
| 2          | ✓      | ×            | ×            | ×       | 4/64  | 320.4          | 27.2/38.6      | 4.69       |
| 3          | ×      | ×            | ×            | ×       | 2/64  | 319.5          | 14.8/75.7      | 4.95       |
| 3          | ✓      | ×            | ×            | ×       | 2/64  | 319.6          | 14.8/20.4      | 4.45       |
| 3          | ✓      | ✓            | ×            | ×       | 2/64  | 387.4          | 3.8/9.4        | 2.05       |
| 3          | ✓      | ✓            | ✓            | ×       | 4/64  | 398.9          | 2.2/7.9        | 2.06       |
| 3          | ✓      | ✓            | ✓            | ✓       | 4/64  | 411.1          | 2.2/7.9        | 1.85       |
| 3          | ✓      | ×            | ×            | ×       | 8/64  | 319.6          | 17.7/39.1      | 4.73       |
| 3          | ✓      | ×            | ×            | ×       | 8/128 | 319.9          | 21.4/63.9      | 4.32       |

Footnotes

  1. Ckpt. indicates whether Hugging Face gradient checkpointing is enabled for the model.

  2. Optim. Off. indicates whether offload_optimizer is enabled in the DeepSpeed zero_optimization config.

  3. Param. Off. indicates whether offload_param is enabled in the DeepSpeed zero_optimization config.

  4. Zero++ refers to the techniques described at https://www.deepspeed.ai/tutorials/zeropp/

  5. BS represents the micro-batch size per device per iteration / the total batch size for one gradient descent step.

  6. CPU Mem. denotes psutil.virtual_memory().used

  7. GPU Mem. represents torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated(); a measurement sketch follows this list.
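
Footnotes 6 and 7 name the exact calls behind the memory columns. Below is a hedged sketch of how the reported metrics could be collected each step; the helper name is hypothetical, and the throughput unit (samples/s per device) is an assumption since the table does not state one.

```python
# Sketch of per-step metric collection matching footnotes 6-7; the function
# name and the throughput unit (samples/s per device) are assumptions.
import time

import psutil
import torch

def log_step_metrics(step_start: float, samples_per_step: int) -> None:
    cpu_used_gb = psutil.virtual_memory().used / 2**30       # "CPU Mem." column
    gpu_alloc_gb = torch.cuda.memory_allocated() / 2**30     # "GPU Mem." (current)
    gpu_peak_gb = torch.cuda.max_memory_allocated() / 2**30  # "GPU Mem." (peak)
    throughput = samples_per_step / (time.time() - step_start)
    print(f"cpu={cpu_used_gb:.1f}GB "
          f"gpu={gpu_alloc_gb:.1f}/{gpu_peak_gb:.1f}GB "
          f"throughput={throughput:.2f}")
```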
