optimizer_overlap.logs
(llama-package) hamidnazeri@a100-st-p4d24xlarge-2:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --num_epochs 2 --pure_bf16 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder with_new_config --optimizer_overlap True
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING]
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] *****************************************
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.03s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.11s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.90s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 13.05s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
bFloat16 enabled for mixed precision - using bfSixteen policy
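[editor's note] The "bfSixteen policy" named above is an FSDP MixedPrecision configuration that keeps parameters, gradient reductions, and buffers in bfloat16. A minimal sketch of such a policy (the exact definition inside llama-recipes may differ slightly):

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

    # bfSixteen-style policy: params, reduce-scatter comms, and buffers all in bf16.
    bfSixteen = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=bfSixteen)  # model: the loaded HF module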
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
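[editor's note] Each rank then wraps its transformer blocks for activation checkpointing. A sketch of how this is typically done with PyTorch's checkpoint_wrapper utilities (targeting LlamaDecoderLayer is an assumption based on the model being fine-tuned here):

    from functools import partial
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        CheckpointImpl,
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    # Recompute each decoder layer's activations in backward instead of storing them.
    non_reentrant_wrapper = partial(
        checkpoint_wrapper,
        checkpoint_impl=CheckpointImpl.NO_REENTRANT,
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda submodule: isinstance(submodule, LlamaDecoderLayer),
    )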
setting up optimizer overlap
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]--> Training Set Length = 1555
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]--> Validation Set Length = 84
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
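[editor's note] The "setting up optimizer overlap" messages correspond to the --optimizer_overlap flag. One way PyTorch implements this overlap is the private _apply_optimizer_in_backward hook, which steps the optimizer for each parameter as soon as its gradient is ready, hiding optimizer compute inside the backward pass. A hedged sketch under that assumption (optimizer class and hyperparameters are illustrative, not taken from this run):

    import torch
    from torch.distributed.optim import _apply_optimizer_in_backward

    # Fuse the optimizer step into backward: each parameter is updated the
    # moment its gradient is produced, so no separate optimizer.step() pass runs.
    _apply_optimizer_in_backward(
        optimizer_class=torch.optim.AdamW,
        params=model.parameters(),
        optimizer_kwargs={"lr": 1e-4, "weight_decay": 0.0},
    )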
Training Epoch: 1/2, step 96/97 completed (loss: 1.6897648572921753): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:53<00:00, 4.26s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.619846224784851): 100%|███████████████████████████████████████████████████████████████████████████| 97/97 [06:56<00:00, 4.30s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.6953213214874268): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:54<00:00, 4.27s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.6714438199996948): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:56<00:00, 4.29s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 28 GB
Peak active CUDA memory was 23 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 1 GB
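[editor's note] A sketch of how memory figures like the ones above are typically read back from torch.cuda after an epoch (the recipe's exact reporting code may differ):

    import torch

    gib = 1024 ** 3
    stats = torch.cuda.memory_stats()
    print(f"Max CUDA memory allocated was {torch.cuda.max_memory_allocated() / gib:.0f} GB")
    print(f"Max CUDA memory reserved was {torch.cuda.max_memory_reserved() / gib:.0f} GB")
    print(f"Peak active CUDA memory was {stats['active_bytes.all.peak'] / gib:.0f} GB")
    print(f"Cuda Malloc retries : {stats['num_alloc_retries']}")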
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
eval_ppl=tensor(5.3825, device='cuda:0') eval_epoch_loss=tensor(1.6832, device='cuda:0')
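[editor's note] The reported eval perplexity is simply the exponential of the mean eval loss, which the numbers above confirm:

    import math
    math.exp(1.6832)  # ≈ 5.383, matching eval_ppl ≈ 5.3825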
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-09-15 16:59:45,444 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 41.2271
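[editor's note] The SHARDED_STATE_DICT save path writes one set of shards per rank via torch.distributed.checkpoint rather than gathering the full model on rank 0. A sketch of the typical pattern (variable names are illustrative):

    import torch.distributed.checkpoint as dist_cp
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

    # Each rank contributes only its own shards; no full model is materialized.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
    dist_cp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dist_cp.FileSystemWriter(save_dir),
    )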
best eval loss on epoch 1 is 1.683154582977295
Epoch 1: train_perplexity=6.5870, train_epoch_loss=1.8851, epoch time 414.2397181370761s
Training Epoch: 2/2, step 96/97 completed (loss: 1.4305708408355713): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.425536036491394): 100%|███████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.3905612230300903): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.4353440999984741): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 23 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
eval_ppl=tensor(5.3765, device='cuda:0') eval_epoch_loss=tensor(1.6820, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-09-15 17:07:38,388 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 35.9966
best eval loss on epoch 2 is 1.6820366382598877
Epoch 2: train_perplexity=4.4708, train_epoch_loss=1.4976, epoch time 410.0324876109371s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 5.528891563415527
Key: avg_train_loss, Value: 1.6913299560546875
Key: avg_eval_prep, Value: 5.379501819610596
Key: avg_eval_loss, Value: 1.6825956106185913
Key: avg_epoch_time, Value: 412.1361028740066
Key: avg_checkpoint_time, Value: 38.64281301997835
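[editor's note] The summary keys are per-epoch means, as a quick check against the two epochs above shows:

    (6.5870 + 4.4708) / 2      # = 5.5289   -> avg_train_prep
    (414.2397 + 410.0325) / 2  # ≈ 412.1361 s -> avg_epoch_time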