@HamidShojanazeri
Created August 4, 2023 20:18
7b-bt.log
(pippy_tp) hamidnazeri@a100-st-p4d24xlarge-21:~/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --num_epochs 2 --pure_bf16 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder with_new_config
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING]
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING] *****************************************
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING] *****************************************
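The command above launches four worker processes on one node (--nnodes 1 --nproc_per_node 4). As a minimal sketch, the per-process setup a script like llama_finetuning.py performs looks like the following, using the environment variables torchrun exports (standard torchrun variables; the exact code in llama-recipes may differ):

    import os
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin each worker to its own GPU
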
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
[banner printed once per rank (4x); shown once]
bin /data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /data/home/hamidnazeri/miniconda/envs/pippy_tp did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('1'), PosixPath('/usr/share/modules/$MODULE_VERSION/modulefiles')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/run/user/2336/vscode-git-3b65b17622.sock')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('"/data/home/hamidnazeri/.vscode-server/bin/695af097c7bd098fbf017ce3ac85e09bbc5dda06/extensions/git/dist/git-editor.sh"')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/modulefiles')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/run/user/2336/vscode-ipc-108e87c6-bbba-41ad-9306-fece523e8ba5.sock')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_gaa2pj7b/none_tw70ocve/attempt_0/1/error.json')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
[The same bitsandbytes CUDA-setup sequence was printed by the other three ranks; the repeats differ only in the per-rank torchelastic error.json path (/tmp/torchelastic_gaa2pj7b/none_tw70ocve/attempt_0/{0,2,3}/error.json) and in which of the duplicate libcudart files each rank happened to pick. All four ranks detected compute capability 8.0 and CUDA version 113, and loaded libbitsandbytes_cuda113.so.]
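The warnings above come from bitsandbytes' CUDA discovery and are benign here: it ultimately found the runtime under /usr/local/cuda/lib64, detected CUDA 11.3, and loaded the matching cuda113 binary. A generic sanity check (not part of the recipe) to cross-check the toolkit and compute capability the log reports:

    import torch

    print(torch.version.cuda)                   # CUDA toolkit PyTorch was built against, e.g. '11.3'
    print(torch.cuda.get_device_capability(0))  # (8, 0) on the A100s this log shows
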
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.47s/it]
[shards loaded on all four ranks; per-rank times ranged from 5.47 to 6.54 s/it]
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
[warning printed once per rank (4x); shown once]
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
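The repeated warning comes from converting the model with Optimum's BetterTransformer, which swaps in fused attention kernels that ignore attention masks. A minimal sketch of that conversion, assuming optimum is installed (llama-recipes gates this behind a config option):

    from transformers import AutoModelForCausalLM
    from optimum.bettertransformer import BetterTransformer

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    # The fused kernels do not support attention masks, hence the padding warning above.
    model = BetterTransformer.transform(model)
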
bFloat16 enabled for mixed precision - using bfSixteen policy
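The bfSixteen policy named here is an FSDP MixedPrecision configuration that keeps parameters, gradient communication, and buffers in bfloat16. A sketch of such a policy, with field values read off this log line (the exact definition in llama-recipes' policies module may differ):

    import torch
    from torch.distributed.fsdp import MixedPrecision

    bfSixteen = MixedPrecision(
        param_dtype=torch.bfloat16,    # model params held in bf16
        reduce_dtype=torch.bfloat16,   # gradient reduce-scatter/all-reduce in bf16
        buffer_dtype=torch.bfloat16,   # buffers in bf16
    )
    # Passed as FSDP(..., mixed_precision=bfSixteen) for this --pure_bf16 run.
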
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> Training Set Length = 1555
--> Validation Set Length = 84
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/torch/cuda/memory.py:306: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
[FutureWarning printed once per rank (4x); shown once]
Training Epoch0:   0%|          | 0/97 [00:00<?, ?it/s]
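Each rank wraps the decoder layers with activation checkpointing, which recomputes activations during the backward pass to trade compute for memory. A sketch of how this is typically applied per LlamaDecoderLayer (helper names are real torch.distributed APIs; `model` is assumed from the setup above, and the exact call in llama-recipes may differ):

    from functools import partial
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        CheckpointImpl,
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    non_reentrant_wrapper = partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
    apply_activation_checkpointing(
        model,  # the FSDP-wrapped model (assumed from earlier setup)
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
    )
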
Training Epoch0: 1%|█▎ | 1/97 [00:24<39:29, 24.68s/it]
step 0 is completed and loss is 2.4715213775634766
Training Epoch0: 2%|██▌ | 2/97 [00:32<21:41, 13.70s/it]
step 1 is completed and loss is 5.230667591094971
Training Epoch0: 3%|███▊ | 3/97 [00:30<12:01, 7.68s/it]
step 2 is completed and loss is 5.525859832763672
Training Epoch0: 4%|█████ | 4/97 [00:33<08:54, 5.74s/it]
step 3 is completed and loss is 3.961235761642456
Training Epoch0: 5%|██████▎ | 5/97 [00:40<07:46, 5.07s/it]
step 4 is completed and loss is 2.868651866912842
Training Epoch0: 6%|███████▌ | 6/97 [00:38<06:09, 4.06s/it]
step 5 is completed and loss is 2.596735954284668
Training Epoch0: 7%|████████▉ | 7/97 [00:32<04:58, 3.32s/it]
step 6 is completed and loss is 2.3276379108428955
Training Epoch0: 8%|██████████▏ | 8/97 [00:48<05:10, 3.49s/it]
step 7 is completed and loss is 2.192718982696533
Training Epoch0: 9%|███████████▍ | 9/97 [00:51<04:48, 3.28s/it]
step 8 is completed and loss is 2.1461222171783447
Training Epoch0: 10%|████████████▌ | 10/97 [00:49<04:27, 3.08s/it]
step 9 is completed and loss is 2.0069150924682617
Training Epoch0: 11%|█████████████▊ | 11/97 [00:52<04:18, 3.00s/it]
step 10 is completed and loss is 1.9500118494033813
Training Epoch0: 11%|█████████████▊ | 11/97 [00:44<05:47, 4.04s/it]
Training Epoch0: 11%|█████████████▊ | 11/97 [00:52<06:53, 4.81s/it]
Training Epoch0: 11%|█████████████▊ | 11/97 [00:57<07:29, 5.22s/it]
Training Epoch0: 11%|█████████████▊ | 11/97 [00:52<06:49, 4.76s/it]
Max CUDA memory allocated was 19 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 20 GB
Cuda Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 5 GB
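These per-run memory figures map directly onto torch.cuda's allocator statistics. A sketch of how such numbers are typically gathered (standard PyTorch calls; the exact reporting code in llama-recipes may differ):

    import torch

    gb = 1024 ** 3
    stats = torch.cuda.memory_stats()
    print(f"Max CUDA memory allocated was {torch.cuda.max_memory_allocated() // gb} GB")
    print(f"Max CUDA memory reserved was {torch.cuda.max_memory_reserved() // gb} GB")
    print(f"Peak active CUDA memory was {stats['active_bytes.all.peak'] // gb} GB")
    print(f"Cuda Malloc retries : {stats['num_alloc_retries']}")
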
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.59it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.59it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.54it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.56it/s]
eval_ppl=tensor(1.7574, device='cuda:0') eval_epoch_loss=tensor(0.5639, device='cuda:0')
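The perplexity figures throughout the log are the exponential of the corresponding mean cross-entropy loss: exp(0.5639) ≈ 1.7574 here, and likewise exp(0.3443) ≈ 1.4111 for the epoch 1 training perplexity below. A one-line check (the tiny discrepancy is rounding of the printed loss):

    import torch
    print(torch.exp(torch.tensor(0.5639)))  # tensor(1.7575), matching eval_ppl above
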
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
[banner printed once per rank (4x); shown once]
Saving model to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
2023-08-04 20:12:31,107 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 46.2713
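SHARDED_STATE_DICT saves each rank's local shards through torch.distributed.checkpoint, so no single rank has to materialize the full 7B state dict. A sketch of that save path using PyTorch 2.0-era APIs (`model` and the folder name are taken from this run; the exact helper in llama-recipes may differ):

    import torch.distributed.checkpoint as dist_cp
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

    save_dir = "model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf"
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
        dist_cp.save_state_dict(
            state_dict=state_dict,
            storage_writer=dist_cp.FileSystemWriter(save_dir),
        )
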
best eval loss on epoch 0 is 0.5638577938079834
Epoch 1: train_perplexity=1.4111, train_epoch_loss=0.3443, epoch time 53.10582085000351s
Training Epoch1: 1%|█▎ | 1/97 [00:03<05:08, 3.22s/it]
step 0 is completed and loss is 1.5544689893722534
Training Epoch1: 2%|██▌ | 2/97 [00:05<04:40, 2.95s/it]
step 1 is completed and loss is 1.915114164352417
Training Epoch1: 3%|███▊ | 3/97 [00:08<04:28, 2.86s/it]
step 2 is completed and loss is 1.932603359222412
Training Epoch1: 4%|█████ | 4/97 [00:11<04:24, 2.85s/it]
step 3 is completed and loss is 1.9791666269302368
Training Epoch1: 5%|██████▎ | 5/97 [00:14<04:19, 2.82s/it]
step 4 is completed and loss is 1.7437982559204102
Training Epoch1: 6%|███████▌ | 6/97 [00:17<04:15, 2.80s/it]
step 5 is completed and loss is 1.7695246934890747
Training Epoch1: 7%|████████▉ | 7/97 [00:19<04:11, 2.80s/it]
step 6 is completed and loss is 1.7550206184387207
Training Epoch1: 8%|██████████▏ | 8/97 [00:22<04:07, 2.78s/it]
step 7 is completed and loss is 1.746717095375061
Training Epoch1: 9%|███████████▍ | 9/97 [00:25<04:04, 2.78s/it]
step 8 is completed and loss is 1.7609888315200806
Training Epoch1: 10%|████████████▌ | 10/97 [00:28<04:02, 2.79s/it]
step 9 is completed and loss is 1.6778242588043213
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:00, 2.80s/it]
step 10 is completed and loss is 1.6500493288040161
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:03, 2.84s/it]
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:04, 2.84s/it]
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:04, 2.84s/it]
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:04, 2.85s/it]
Max CUDA memory allocated was 19 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 20 GB
Cuda Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 6 GB
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.57it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.52it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.54it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.50it/s]
eval_ppl=tensor(1.6894, device='cuda:0') eval_epoch_loss=tensor(0.5244, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
[banner printed once per rank (4x); shown once]
Saving model to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
2023-08-04 20:13:47,365 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 32.6997
best eval loss on epoch 1 is 0.5243545174598694
Epoch 2: train_perplexity=1.2238, train_epoch_loss=0.2020, epoch time 31.780189700890332s
training params are saved in /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 1.317427396774292
Key: avg_train_loss, Value: 0.27314862608909607
Key: avg_eval_prep, Value: 1.7234036922454834
Key: avg_eval_loss, Value: 0.544106125831604
Key: avg_epoch_time, Value: 42.44300527544692
Key: avg_checkpoint_time, Value: 39.49147454963531
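The summary keys are plain means over the two epochs, which is easy to verify from the per-epoch lines above:

    avg_train_loss = (0.3443 + 0.2020) / 2                         # 0.27315, matching 0.27314862... above
    avg_epoch_time = (53.10582085000351 + 31.780189700890332) / 2  # 42.44300527544692, an exact match
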