
@HamidShojanazeri
Created October 30, 2023 16:31
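The two runs below fine-tune meta-llama/Llama-2-7b-chat-hf with FSDP on a single node with 4 GPUs (torchrun --nnodes 1 --nproc_per_node 4 on a p4d.24xlarge A100 host), once without and once with optimizer-step/backward overlap, to compare peak CUDA memory and step time.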
#### without overlap ######
(llama-recipe-package) hamidnazeri@a100-st-p4d24xlarge-27:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --pure_bf16 --dist_checkpoint_root_folder dist-root --dist_checkpoint_folder fsdp-ft-checkpoints --optimizer anyprecision
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING]
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING] *****************************************
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-30 04:33:37,161] torch.distributed.run: [WARNING] *****************************************
Warning: unknown parameter pure_bf16
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Warning: unknown parameter optimizer
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.32s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.75s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.89s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.75s/it]
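The "Million params" figure above is just the raw parameter count divided by 1e6. A minimal sketch of how to reproduce it (illustrative, not the recipe's own code; assumes access to the gated checkpoint):

# Reproduce the "Million params" line from the log.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
total_params = sum(p.numel() for p in model.parameters())
print(f"--> meta-llama/Llama-2-7b-chat-hf has {total_params / 1e6} Million params")
# 6738415616 / 1e6 == 6738.415616, matching the log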
bFloat16 enabled for mixed precision - using bfSixteen policy
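The "bfSixteen" policy named above is a bf16 MixedPrecision config handed to FSDP. A sketch of what such a policy typically looks like (field values assumed, not copied from the repo):

import torch
from torch.distributed.fsdp import MixedPrecision

# bf16 for parameters, gradient reduction, and buffers; passed to FSDP as
# FSDP(model, mixed_precision=bfSixteen, ...).
bfSixteen = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)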
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
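"applying fsdp activation checkpointing" is printed once per rank. It typically corresponds to wrapping each decoder layer in a non-reentrant checkpoint wrapper, roughly as below (a sketch; assumes model is the FSDP-wrapped LlamaForCausalLM and LlamaDecoderLayer is the wrap target):

from functools import partial
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

non_reentrant_wrapper = partial(
    checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
)
# Recompute each decoder layer's activations in backward instead of storing them.
apply_activation_checkpointing(
    model,  # the FSDP-wrapped model (assumed)
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)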
--> Training Set Length = 14732
Preprocessing dataset: 7%|███████▏ | 1064/14732 [00:00<00:06, 2086.78it/s]
--> Validation Set Length = 818
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:13<00:00, 1085.71it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 1046.10it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:13<00:00, 1055.47it/s]
Preprocessing dataset: 70%|███████████████████████████████████████████████████████████████████████▎ | 572/818 [00:00<00:00, 1077.76it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1045.99it/s]
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 783.08it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1013.72it/s]
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 780.62it/s]
Preprocessing dataset: 80%|██████████████████████████████████████████████████████████████████████████████████ | 658/818 [00:00<00:00, 1050.31it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 693.65it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
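The FutureWarning repeated across ranks only flags a deprecated alias inside the memory tracker; the non-deprecated call, should you reset peak counters yourself, is:

import torch

# Resets /all/ peak stats (allocated, reserved, active) for the device,
# which is exactly what the warning says the old alias now does.
torch.cuda.reset_peak_memory_stats(torch.cuda.current_device())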
Training Epoch: 1: 0%| | 0/775 [00:00<?, ?it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/1, step 5/775 completed (loss: 1.5907225608825684): 1%|▍ | 6/775 [00:24<52:04, 4.06s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.5579142570495605): 1%|▍ | 6/775 [00:23<49:18, 3.85s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.6977660655975342): 1%|▍ | 6/775 [00:20<43:35, 3.40s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.799242615699768): 1%|▍ | 6/775 [00:21<46:48, 3.65s/it]
Max CUDA memory allocated was 14 GB
Max CUDA memory reserved was 17 GB
Peak active CUDA memory was 16 GB
Cuda Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
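A memory report like the five lines above can be assembled from torch.cuda counters; a sketch (the recipe's actual helper isn't shown in this log):

import torch

gib = 1024 ** 3
stats = torch.cuda.memory_stats()
print(f"Max CUDA memory allocated was {torch.cuda.max_memory_allocated() // gib} GB")
print(f"Max CUDA memory reserved was {torch.cuda.max_memory_reserved() // gib} GB")
print(f"Peak active CUDA memory was {stats['active_bytes.all.peak'] // gib} GB")
print(f"Cuda Malloc retries : {stats['num_alloc_retries']}")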
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:54, 1.45it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:57, 1.37it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.42it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.42it/s]
eval_ppl=tensor(1.1320, device='cuda:0') eval_epoch_loss=tensor(0.1240, device='cuda:0')
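eval_ppl is simply exp of the mean eval loss, which checks out here: exp(0.1240) ≈ 1.1320.

import torch

eval_epoch_loss = torch.tensor(0.1240)
print(torch.exp(eval_epoch_loss))  # tensor(1.1320), matching the log line above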
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-10-30 04:38:33,299 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 48.2229
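SHARDED_STATE_DICT saving means every rank writes its own shards through torch.distributed.checkpoint instead of gathering a full state dict on rank 0, which is why all four ranks print the banner. A minimal sketch (assumes model is FSDP-wrapped and the process group is initialized; the path mirrors the log):

import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

save_dir = "dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf"
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    # Each rank writes only its own shards; no rank-0 gather of the full model.
    dist_cp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dist_cp.FileSystemWriter(save_dir),
    )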
best eval loss on epoch 1 is 0.12400449812412262
Epoch 1: train_perplexity=1.0171, train_epoch_loss=0.0169, epoch time 24.710794023936614s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 1.0170575380325317
Key: avg_train_loss, Value: 0.016913706436753273
Key: avg_eval_prep, Value: 1.1320209503173828
Key: avg_eval_loss, Value: 0.12400449812412262
Key: avg_epoch_time, Value: 24.710794023936614
Key: avg_checkpoint_time, Value: 48.24225016287528
#### with overlap ######
(llama-recipe-package) hamidnazeri@a100-st-p4d24xlarge-27:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --pure_bf16 --dist_checkpoint_root_folder dist-root --dist_checkpoint_folder fsdp-ft-checkpoints --optimizer_overlap --optimizer anyprecision
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING]
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING] *****************************************
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-30 04:22:17,520] torch.distributed.run: [WARNING] *****************************************
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Warning: unknown parameter pure_bf16
Warning: unknown parameter pure_bf16
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Warning: unknown parameter optimizer_overlap
Warning: unknown parameter optimizer
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.07s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.21s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.50s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.60s/it]
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> Training Set Length = 14732
--> applying fsdp activation checkpointing...
--> Validation Set Length = 818
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:13<00:00, 1057.85it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 1028.55it/s]
setting up optimizer overlap
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1007.20it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1018.21it/s]
Preprocessing dataset: 99%|█████████████████████████████████████████████████████████████████████████████████████████████████ | 14584/14732 [00:14<00:00, 1030.55it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:14<00:00, 1033.62it/s]
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 724.64it/s]
setting up optimizer overlap
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 605.19it/s]
setting up optimizer overlap
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 567.60it/s]
setting up optimizer overlap
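"setting up optimizer overlap" refers to fusing the optimizer step into backward: each parameter is updated as soon as its gradient arrives, and the gradient is freed immediately instead of persisting until a global optimizer.step(). That is consistent with the ~3 GB lower peak memory reported below. A sketch using PyTorch's prototype hook (the recipe's actual wiring, and its anyprecision optimizer, may differ; AdamW and the learning rate are stand-ins):

import torch
from torch.distributed.optim import _apply_optimizer_in_backward

# Registers a per-parameter optimizer that steps inside backward; the training
# loop must then NOT call optimizer.step()/zero_grad() for these params.
_apply_optimizer_in_backward(
    optimizer_class=torch.optim.AdamW,
    params=model.parameters(),  # model: the FSDP-wrapped module (assumed)
    optimizer_kwargs={"lr": 1e-4},
)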
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/775 [00:00<?, ?it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/775 [00:00<?, ?it/s]
/data/home/hamidnazeri/miniconda/envs/llama-recipe-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/1, step 5/775 completed (loss: 1.799242615699768): 1%|▍ | 6/775 [00:22<49:00, 3.82s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.5579142570495605): 1%|▍ | 6/775 [00:25<55:17, 4.31s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.5907225608825684): 1%|▍ | 6/775 [00:21<46:57, 3.66s/it]
Training Epoch: 1/1, step 5/775 completed (loss: 1.6977660655975342): 1%|▍ | 6/775 [00:24<51:25, 4.01s/it]
Max CUDA memory allocated was 11 GB
Max CUDA memory reserved was 14 GB
Peak active CUDA memory was 13 GB
Cuda Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.44it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:57, 1.38it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:55, 1.41it/s]
evaluating Epoch: 7%|███████▉ | 6/85 [00:04<00:56, 1.40it/s]
eval_ppl=tensor(1.1320, device='cuda:0') eval_epoch_loss=tensor(0.1240, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-10-30 04:27:20,478 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 45.2486
best eval loss on epoch 1 is 0.12400449812412262
Epoch 1: train_perplexity=1.0171, train_epoch_loss=0.0169, epoch time 27.255307742860168s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/dist-root/fsdp-ft-checkpoints-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 1.0170575380325317
Key: avg_train_loss, Value: 0.016913706436753273
Key: avg_eval_prep, Value: 1.1320209503173828
Key: avg_eval_loss, Value: 0.12400449812412262
Key: avg_epoch_time, Value: 27.255307742860168
Key: avg_checkpoint_time, Value: 45.27225174685009
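Net effect of overlap in this short run: peak allocated CUDA memory drops from 14 GB to 11 GB (reserved 17 GB to 14 GB, peak active 16 GB to 13 GB) with identical per-rank losses and eval_ppl, while the measured epoch time rises from ~24.7 s to ~27.3 s and checkpoint time falls slightly (~48.2 s to ~45.2 s).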