optimizer_overlap.logs
(llama-package) hamidnazeri@a100-st-p4d24xlarge-2:~/llama-package/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --num_epochs 2 --pure_bf16 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder with_new_config --optimizer_overlap True
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING]
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] *****************************************
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-09-15 16:48:13,916] torch.distributed.run: [WARNING] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.03s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.11s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:25<00:00, 12.90s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 13.05s/it]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
bFloat16 enabled for mixed precision - using bfSixteen policy
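[editor's note] The "bfSixteen policy" named above is an FSDP MixedPrecision configuration that keeps parameters, gradient reductions, and buffers in bfloat16. A minimal sketch of such a policy (the exact definition inside llama-recipes may differ slightly):

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

    # bfSixteen-style policy: params, reduce-scatter comms, and buffers all in bf16.
    bfSixteen = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=bfSixteen)  # model: the loaded HF module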
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
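[editor's note] Each rank then wraps its transformer blocks for activation checkpointing. A sketch of how this is typically done with PyTorch's checkpoint_wrapper utilities (targeting LlamaDecoderLayer is an assumption based on the model being fine-tuned here):

    from functools import partial
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        CheckpointImpl,
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    # Recompute each decoder layer's activations in backward instead of storing them.
    non_reentrant_wrapper = partial(
        checkpoint_wrapper,
        checkpoint_impl=CheckpointImpl.NO_REENTRANT,
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda submodule: isinstance(submodule, LlamaDecoderLayer),
    )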
setting up optimizer overlap
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]--> Training Set Length = 1555
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/97 [00:00<?, ?it/s]--> Validation Set Length = 84
setting up optimizer overlap
/data/home/hamidnazeri/miniconda/envs/llama-package/lib/python3.10/site-packages/torch/cuda/memory.py:329: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
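[editor's note] The "setting up optimizer overlap" messages correspond to the --optimizer_overlap flag. One way PyTorch implements this overlap is the private _apply_optimizer_in_backward hook, which steps the optimizer for each parameter as soon as its gradient is ready, hiding optimizer compute inside the backward pass. A hedged sketch under that assumption (optimizer class and hyperparameters are illustrative, not taken from this run):

    import torch
    from torch.distributed.optim import _apply_optimizer_in_backward

    # Fuse the optimizer step into backward: each parameter is updated the
    # moment its gradient is produced, so no separate optimizer.step() pass runs.
    _apply_optimizer_in_backward(
        optimizer_class=torch.optim.AdamW,
        params=model.parameters(),
        optimizer_kwargs={"lr": 1e-4, "weight_decay": 0.0},
    )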
Training Epoch: 1/2, step 96/97 completed (loss: 1.6897648572921753): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:53<00:00, 4.26s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.619846224784851): 100%|███████████████████████████████████████████████████████████████████████████| 97/97 [06:56<00:00, 4.30s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.6953213214874268): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:54<00:00, 4.27s/it]
Training Epoch: 1/2, step 96/97 completed (loss: 1.6714438199996948): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:56<00:00, 4.29s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 28 GB
Peak active CUDA memory was 23 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 1 GB
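[editor's note] A sketch of how memory figures like the ones above are typically read back from torch.cuda after an epoch (the recipe's exact reporting code may differ):

    import torch

    gib = 1024 ** 3
    stats = torch.cuda.memory_stats()
    print(f"Max CUDA memory allocated was {torch.cuda.max_memory_allocated() / gib:.0f} GB")
    print(f"Max CUDA memory reserved was {torch.cuda.max_memory_reserved() / gib:.0f} GB")
    print(f"Peak active CUDA memory was {stats['active_bytes.all.peak'] / gib:.0f} GB")
    print(f"Cuda Malloc retries : {stats['num_alloc_retries']}")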
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.07s/it]
eval_ppl=tensor(5.3825, device='cuda:0') eval_epoch_loss=tensor(1.6832, device='cuda:0')
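[editor's note] The reported eval perplexity is simply the exponential of the mean eval loss, which the numbers above confirm:

    import math
    math.exp(1.6832)  # ≈ 5.383, matching eval_ppl ≈ 5.3825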
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-09-15 16:59:45,444 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 41.2271
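[editor's note] The SHARDED_STATE_DICT save path writes one set of shards per rank via torch.distributed.checkpoint rather than gathering the full model on rank 0. A sketch of the typical pattern (variable names are illustrative):

    import torch.distributed.checkpoint as dist_cp
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

    # Each rank contributes only its own shards; no full model is materialized.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
    dist_cp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dist_cp.FileSystemWriter(save_dir),
    )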
best eval loss on epoch 1 is 1.683154582977295
Epoch 1: train_perplexity=6.5870, train_epoch_loss=1.8851, epoch time 414.2397181370761s
Training Epoch: 2/2, step 96/97 completed (loss: 1.4305708408355713): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.425536036491394): 100%|███████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.3905612230300903): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Training Epoch: 2/2, step 96/97 completed (loss: 1.4353440999984741): 100%|██████████████████████████████████████████████████████████████████████████| 97/97 [06:49<00:00, 4.22s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 23 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
evaluating Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00, 1.08s/it]
eval_ppl=tensor(5.3765, device='cuda:0') eval_epoch_loss=tensor(1.6820, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
=====================================================
Saving model to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
2023-09-15 17:07:38,388 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 35.9966
best eval loss on epoch 2 is 1.6820366382598877
Epoch 2: train_perplexity=4.4708, train_epoch_loss=1.4976, epoch time 410.0324876109371s
training params are saved in /opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/llama-package/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 5.528891563415527
Key: avg_train_loss, Value: 1.6913299560546875
Key: avg_eval_prep, Value: 5.379501819610596
Key: avg_eval_loss, Value: 1.6825956106185913
Key: avg_epoch_time, Value: 412.1361028740066
Key: avg_checkpoint_time, Value: 38.64281301997835
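[editor's note] The summary keys are per-epoch means, as a quick check against the two epochs above shows:

    (6.5870 + 4.4708) / 2      # = 5.5289   -> avg_train_prep
    (414.2397 + 410.0325) / 2  # ≈ 412.1361 s -> avg_epoch_time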