
@HamidShojanazeri
Last active September 15, 2023 18:55
optimizer_overlap.logs
(llama-package) hamidnazeri@a100-st-p4d24xlarge-2:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --num_epochs 2 --pure_bf16 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder with_new_config --optimizer_overlap True
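The `--optimizer_overlap True` flag (visible as "setting up optimizer overlap" in the log below) runs the optimizer update for each parameter as soon as its gradient is ready during the backward pass, instead of a single `optimizer.step()` afterwards; PyTorch exposes a version of this as `torch.distributed.optim._apply_optimizer_in_backward`. The following is a minimal pure-Python sketch of the idea only, not the llama-recipes or PyTorch implementation; `Param`, `sgd_step`, `apply_optimizer_in_backward`, and `backward` are all illustrative names, not real APIs.

```python
# Conceptual sketch of "optimizer overlap": each parameter is updated by a
# hook the moment its gradient becomes ready, overlapping the update with
# the remainder of the backward pass and letting the gradient be freed early.

class Param:
    def __init__(self, value):
        self.value = value
        self.grad = None
        self.hooks = []          # callbacks fired when this param's grad is ready

def sgd_step(p, lr=0.1):
    """Per-parameter SGD update, run inside the backward pass."""
    p.value -= lr * p.grad
    p.grad = None                # gradient memory can be released immediately

def apply_optimizer_in_backward(params, lr=0.1):
    """Register a per-parameter optimizer step as a gradient-ready hook."""
    for p in params:
        p.hooks.append(lambda p=p: sgd_step(p, lr))

def backward(params, grads):
    # Gradients become ready one parameter at a time (as in autograd);
    # each ready-hook updates its parameter right away.
    for p, g in zip(params, grads):
        p.grad = g
        for hook in p.hooks:
            hook()

params = [Param(1.0), Param(2.0)]
apply_optimizer_in_backward(params, lr=0.1)
backward(params, grads=[0.5, 1.0])
print([p.value for p in params])  # → [0.95, 1.9]
```

The practical benefit, reflected in the memory numbers later in this log, is that per-parameter gradients never all coexist with optimizer state at step time.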
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING]
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] *****************************************
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.03s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.11s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.90s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 13.05s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
setting up optimizer overlap
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]--> Training Set Length = 1555
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]--> Validation Set Length = 84
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/2, step 96/97 completed (loss: 1.6897648572921753): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:53<00:00, 4.26s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.619846224784851): 100%|███████████████████████████████████████████████████████████████████████████| 97/97 [06:56<00:00, 4.30s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.6953213214874268): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:54<00:00, 4.27s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.6714438199996948): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:56<00:00, 4.29s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 28 GB
Peak active CUDA memory was 23 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 1 GB
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
eval_ppl=tensor(5.3825, device='cuda:0') eval_epoch_loss=tensor(1.6832, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-09-15 16:59:45,444 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 41.2271
best eval loss on epoch 1 is 1.683154582977295
Epoch 1: train_perplexity=6.5870, train_epoch_loss=1.8851, epoch time 414.2397181370761s
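The perplexity figures reported above are simply the exponential of the corresponding cross-entropy loss, which is a quick sanity check on the log: `eval_ppl` 5.3825 from `eval_epoch_loss` 1.6832, and `train_perplexity` 6.5870 from `train_epoch_loss` 1.8851.

```python
import math

# Perplexity = exp(cross-entropy loss); verify against the logged pairs.
for loss, ppl in [(1.6832, 5.3825), (1.8851, 6.5870)]:
    assert abs(math.exp(loss) - ppl) < 0.01
```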
Training Epoch: 2/2, step 96/97 completed (loss: 1.4305708408355713): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.425536036491394): 100%|███████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.3905612230300903): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.4353440999984741): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 23 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
eval_ppl=tensor(5.3765, device='cuda:0') eval_epoch_loss=tensor(1.6820, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-09-15 17:07:38,388 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 35.9966
best eval loss on epoch 2 is 1.6820366382598877
Epoch 2: train_perplexity=4.4708, train_epoch_loss=1.4976, epoch time 410.0324876109371s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 5.528891563415527
Key: avg_train_loss, Value: 1.6913299560546875
Key: avg_eval_prep, Value: 5.379501819610596
Key: avg_eval_loss, Value: 1.6825956106185913
Key: avg_epoch_time, Value: 412.1361028740066
Key: avg_checkpoint_time, Value: 38.64281301997835
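The summary keys above are means over the two epochs: `avg_epoch_time` and `avg_eval_loss` match the per-epoch values logged earlier exactly (the checkpoint times shown earlier are rounded, so `avg_checkpoint_time` only matches approximately).

```python
# Cross-check the reported averages against the per-epoch log lines.
epoch_times = [414.2397181370761, 410.0324876109371]   # "epoch time" lines
eval_losses = [1.683154582977295, 1.6820366382598877]  # "best eval loss" lines

assert abs(sum(epoch_times) / 2 - 412.1361028740066) < 1e-9
assert abs(sum(eval_losses) / 2 - 1.6825956106185913) < 1e-9
```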