#### without overlap ######
(llama-recipe-package) hamidnazeri@a100-st-p4d24xlarge-27:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --pure_bf16 --dist_checkpoint_root_folder dist-root --dist_checkpoint_folder fsdp-ft-checkpoints --optimizer anyprecision
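For context, `--optimizer anyprecision` selects the AnyPrecisionAdamW optimizer that ships with llama-recipes, which keeps the Adam state in bf16 and is what makes `--pure_bf16` training bf16 end to end. A minimal sketch of how it might be constructed, assuming the `llama_recipes.policies.AnyPrecisionAdamW` import path and an illustrative learning rate:

```python
import torch
from llama_recipes.policies import AnyPrecisionAdamW  # assumed import path

# Keep Adam momentum/variance states in bf16 so optimizer memory shrinks too;
# lr is illustrative, not taken from this run. `model` is the FSDP-wrapped model.
optimizer = AnyPrecisionAdamW(
    model.parameters(),
    lr=1e-4,
    momentum_dtype=torch.bfloat16,
    variance_dtype=torch.bfloat16,
    use_kahan_summation=False,  # Kahan compensation can be enabled for extra accuracy
)
```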
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING]
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING] *****************************************
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING] *****************************************
Warning: unknown parameter pure_bf16
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Warning: unknown parameter optimizer
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.32s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.75s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.89s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.75s/it]
bFloat16 enabled for mixed precision - using bfSixteen policy
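The `bfSixteen` policy named above is, in llama-recipes, an FSDP `MixedPrecision` configuration with everything in bf16. A sketch of what such a policy looks like (field values assumed from the policy's name):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# All-bf16 mixed-precision policy: parameters, gradient reductions, and
# buffers are all handled in bfloat16.
bfSixteen = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
```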
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
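Each rank prints the activation-checkpointing line once. A sketch of how FSDP activation checkpointing is typically applied to a Llama model, mirroring the non-reentrant wrapper approach used in llama-recipes (treat the details as an illustration, not a verbatim copy):

```python
import functools

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Wrap every decoder layer so its activations are recomputed during backward
# instead of being kept alive through the whole forward pass.
non_reentrant_wrapper = functools.partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
apply_activation_checkpointing(
    model,  # the FSDP-wrapped model from the run above
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda submodule: isinstance(submodule, LlamaDecoderLayer),
)
```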
--> Training Set Length = 14732
--> Validation Set Length = 818
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:13<00:00, 1085.71it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:00<00:00, 1046.10it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:13<00:00, 1055.47it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1045.99it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:01<00:00, 783.08it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1013.72it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:01<00:00, 780.62it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:01<00:00, 693.65it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/1, step 5/775 completed (loss: 1.5907225608825684): 1%|▍ | 6/775 [00:24<52:04, 4.06s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.5579142570495605): 1%|▍ | 6/775 [00:23<49:18, 3.85s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.6977660655975342): 1%|▍ | 6/775 [00:20<43:35, 3.40s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.799242615699768): 1%|▍ | 6/775 [00:21<46:48, 3.65s/it]
Max CUDA memory allocated was 14 GB
Max CUDA memory reserved was 17 GB
Peak active CUDA memory was 16 GB
Cuda Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
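The GPU memory summary above can be reproduced with public `torch.cuda` counters; a sketch, using the standard `torch.cuda.memory_stats()` key names:

```python
import torch

GB = 1024 ** 3
stats = torch.cuda.memory_stats()
print(f"Max CUDA memory allocated was {torch.cuda.max_memory_allocated() / GB:.0f} GB")
print(f"Max CUDA memory reserved was {torch.cuda.max_memory_reserved() / GB:.0f} GB")
print(f"Peak active CUDA memory was {stats['active_bytes.all.peak'] / GB:.0f} GB")
print(f"Cuda Malloc retries : {stats['num_alloc_retries']}")
```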
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:54, 1.45it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:57, 1.37it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.42it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.42it/s]
eval_ppl=tensor(1.1320, device='cuda:0') eval_epoch_loss=tensor(0.1240, device='cuda:0')
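The reported perplexities are simply the exponential of the mean loss: exp(0.1240) ≈ 1.1320 here, and likewise exp(0.0169) ≈ 1.0171 for the training perplexity further down. For example:

```python
import torch

eval_epoch_loss = torch.tensor(0.1240)
eval_ppl = torch.exp(eval_epoch_loss)  # tensor(1.1320)
```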
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-10-30 04:38:33,299 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 48.2229
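A sketch of saving an FSDP SHARDED_STATE_DICT checkpoint with the torch.distributed.checkpoint API of this era (PyTorch 2.1); llama-recipes does something close to this, but `save_dir` and the state-dict layout here are illustrative assumptions:

```python
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

# Each rank contributes its local shards; dist_cp writes one sharded
# checkpoint folder that can be re-sharded on load.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    dist_cp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dist_cp.FileSystemWriter(save_dir),  # save_dir: the folder printed above
    )
```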
best eval loss on epoch 1 is 0.12400449812412262
Epoch 1: train_perplexity=1.0171, train_epoch_loss=0.0169, epoch time 24.710794023936614s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 1.0170575380325317
Key: avg_train_loss, Value: 0.016913706436753273
Key: avg_eval_prep, Value: 1.1320209503173828
Key: avg_eval_loss, Value: 0.12400449812412262
Key: avg_epoch_time, Value: 24.710794023936614
Key: avg_checkpoint_time, Value: 48.24225016287528
#### with overlap ######
(llama-recipe-package) hamidnazeri@a100-st-p4d24xlarge-27:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --pure_bf16 --dist_checkpoint_root_folder dist-root --dist_checkpoint_folder fsdp-ft-checkpoints --optimizer_overlap --optimizer anyprecision
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING]
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING] *****************************************
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING] *****************************************
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.07s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.21s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.50s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.60s/it]
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> Training Set Length = 14732
--> applying fsdp activation checkpointing...
--> Validation Set Length = 818
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:13<00:00, 1057.85it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:00<00:00, 1028.55it/s]
setting up optimizer overlap
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1007.20it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1018.21it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1033.62it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:01<00:00, 724.64it/s]
setting up optimizer overlap
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:01<00:00, 605.19it/s]
setting up optimizer overlap
Preprocessing dataset: 100%|██████████████████████████████████████████████████| 818/818 [00:01<00:00, 567.60it/s]
setting up optimizer overlap
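The log does not show the code behind "setting up optimizer overlap". One mechanism that produces this behavior in PyTorch 2.1 is the experimental `_apply_optimizer_in_backward` hook, which runs each parameter's optimizer step as soon as its gradient is ready during backward and frees the gradient immediately; that is consistent with the lower peak memory reported below. A hedged sketch, with illustrative optimizer and values:

```python
import torch
from torch.distributed.optim import _apply_optimizer_in_backward  # experimental, private API

# Register a per-parameter optimizer that steps inside backward. After this,
# the training loop calls loss.backward() but no separate optimizer.step().
_apply_optimizer_in_backward(
    optimizer_class=torch.optim.AdamW,  # illustrative; this run used anyprecision
    params=model.parameters(),          # `model` is the FSDP-wrapped model
    optimizer_kwargs={"lr": 1e-4},      # illustrative value
)
```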
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/1, step 5/775 completed (loss: 1.799242615699768): 1%|▍ | 6/775 [00:22<49:00, 3.82s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.5579142570495605): 1%|▍ | 6/775 [00:25<55:17, 4.31s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.5907225608825684): 1%|▍ | 6/775 [00:21<46:57, 3.66s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.6977660655975342): 1%|▍ | 6/775 [00:24<51:25, 4.01s/it]
Max CUDA memory allocated was 11 GB
Max CUDA memory reserved was 14 GB
Peak active CUDA memory was 13 GB
Cuda Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.44it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:57, 1.38it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.41it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:56, 1.40it/s]
eval_ppl=tensor(1.1320, device='cuda:0') eval_epoch_loss=tensor(0.1240, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-10-30 04:27:20,478 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 45.2486
best eval loss on epoch 1 is 0.12400449812412262
Epoch 1: train_perplexity=1.0171, train_epoch_loss=0.0169, epoch time 27.255307742860168s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 1.0170575380325317
Key: avg_train_loss, Value: 0.016913706436753273
Key: avg_eval_prep, Value: 1.1320209503173828
Key: avg_eval_loss, Value: 0.12400449812412262
Key: avg_epoch_time, Value: 27.255307742860168
Key: avg_checkpoint_time, Value: 45.27225174685009
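Net effect in these logs: the two runs converge identically (same per-rank losses, eval_ppl 1.1320 and eval loss 0.1240 in both), while the overlap run peaks about 3 GB lower on GPU memory (max allocated 14 GB vs 11 GB, reserved 17 GB vs 14 GB, peak active 16 GB vs 13 GB). In this short run the overlap epoch took slightly longer (27.26 s vs 24.71 s) and checkpointing slightly less time (45.25 s vs 48.22 s).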