Throughput benchmark of DeepSpeed-based LLM training code.

We train an LLM with this code and report the training speed under different settings (see the table below). The machine has eight A800 GPUs, 1 TB of CPU memory, and two Intel Xeon Platinum 8358 CPUs. For the software stack, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
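
For reference, the settings varied in the table map onto a DeepSpeed setup roughly as sketched below. This is a minimal sketch assuming the standard transformers + deepspeed APIs; the model checkpoint name, batch numbers, and config values are illustrative assumptions, not the exact script used here.

```python
# Minimal sketch of a DeepSpeed training setup of the kind benchmarked here.
# NOTE: the model name, batch settings, and config values are illustrative
# assumptions, not the author's exact script.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # "BS" numerator (per device per iteration)
    "gradient_accumulation_steps": 8,     # 1 micro-batch x 8 GPUs x 8 steps = 64 global
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # "Zero Stage" column
        "offload_optimizer": {"device": "cpu"},  # "Optim. Off." column
        # Stage-3-only parameter offload ("Param. Off." column):
        # "offload_param": {"device": "cpu"},
        # ZeRO++ switches ("Zero++" column, see footnote 4):
        # "zero_quantized_weights": True,
        # "zero_hpz_partition_size": 8,
        # "zero_quantized_gradients": True,
    },
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()  # "Ckpt." column

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

def train_step(batch):
    # Forward/backward/step; gradient accumulation is handled by the engine.
    loss = engine(**batch).loss
    engine.backward(loss)
    engine.step()
    return loss
```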

Table: Benchmark of LLaMA-7B training using the DeepSpeed-based training code. The sequence length is 4096.

| Zero Stage | Ckpt.¹ | Optim. Off.² | Param. Off.³ | Zero++⁴ | BS⁵   | CPU Mem. (GB)⁶ | GPU Mem. (GB)⁷ | Throughput |
|------------|--------|--------------|--------------|---------|-------|----------------|----------------|------------|
| 2          | ×      | ×            | ×            | ×       | 1/64  | 320.1          | 19.4/44.8      | 5.33       |
| 2          | ✓      | ×            | ×            | ×       | 1/64  | 320.0          | 19.4/23.5      | 4.19       |
| 2          | ✓      | ✓            | ×            | ×       | 1/64  | 361.3          | 13.4/16.9      | 1.81       |
| 2          | ✓      | ×            | ×            | ×       | 4/64  | 320.4          | 27.2/38.6      | 4.69       |
| 3          | ×      | ×            | ×            | ×       | 2/64  | 319.5          | 14.8/75.7      | 4.95       |
| 3          | ✓      | ×            | ×            | ×       | 2/64  | 319.6          | 14.8/20.4      | 4.45       |
| 3          | ✓      | ✓            | ×            | ×       | 2/64  | 387.4          | 3.8/9.4        | 2.05       |
| 3          | ✓      | ✓            | ✓            | ×       | 4/64  | 398.9          | 2.2/7.9        | 2.06       |
| 3          | ✓      | ✓            | ✓            | ✓       | 4/64  | 411.1          | 2.2/7.9        | 1.85       |
| 3          | ✓      | ×            | ×            | ×       | 8/64  | 319.6          | 17.7/39.1      | 4.73       |
| 3          | ✓      | ×            | ×            | ×       | 8/128 | 319.9          | 21.4/63.9      | 4.32       |

Footnotes

  1. Ckpt. indicates whether Hugging Face gradient checkpointing is enabled for the model.

  2. Optim. Off. indicates whether offload_optimizer is enabled in the DeepSpeed zero_optimization config.

  3. Param. Off. indicates whether offload_param is enabled in the DeepSpeed zero_optimization config.

  4. Zero++ refers to the techniques described at https://www.deepspeed.ai/tutorials/zeropp/

  5. BS represents the micro-batch size per device per iteration / the total batch size for one gradient descent step.

  6. CPU Mem. denotes psutil.virtual_memory().used

  7. GPU Mem. represents torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated(); a measurement sketch follows this list.
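
Footnotes 6 and 7 name the exact calls behind the memory columns. Below is a hedged sketch of how the reported metrics could be collected each step; the helper name is hypothetical, and the throughput unit (samples/s per device) is an assumption since the table does not state one.

```python
# Sketch of per-step metric collection matching footnotes 6-7; the function
# name and the throughput unit (samples/s per device) are assumptions.
import time

import psutil
import torch

def log_step_metrics(step_start: float, samples_per_step: int) -> None:
    cpu_used_gb = psutil.virtual_memory().used / 2**30       # "CPU Mem." column
    gpu_alloc_gb = torch.cuda.memory_allocated() / 2**30     # "GPU Mem." (current)
    gpu_peak_gb = torch.cuda.max_memory_allocated() / 2**30  # "GPU Mem." (peak)
    throughput = samples_per_step / (time.time() - step_start)
    print(f"cpu={cpu_used_gb:.1f}GB "
          f"gpu={gpu_alloc_gb:.1f}/{gpu_peak_gb:.1f}GB "
          f"throughput={throughput:.2f}")
```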
