7b-bt.log
(pippy_tp) hamidnazeri@a100-st-p4d24xlarge-21:~/llama-recipes$ torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --model_name meta-llama/Llama-2-7b-chat-hf --num_epochs 2 --pure_bf16 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder with_new_config
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING]
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING] *****************************************
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-08-04 20:09:14,458] torch.distributed.run: [WARNING] *****************************************
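Note: torchrun launches 4 worker processes here (one per A100) and, as the warning above says, defaults OMP_NUM_THREADS to 1 per process. A minimal sketch of the per-rank distributed setup such a launch implies, using only standard torch.distributed calls; this is not the literal llama_finetuning.py code:

```python
import os
import torch
import torch.distributed as dist

# torchrun exports RANK / LOCAL_RANK / WORLD_SIZE for each of the 4 processes
local_rank = int(os.environ["LOCAL_RANK"])

dist.init_process_group(backend="nccl")  # one NCCL process group across the 4 GPUs
torch.cuda.set_device(local_rank)        # pin this rank to its own GPU before FSDP wrapping
```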
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /data/home/hamidnazeri/miniconda/envs/pippy_tp did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('1'), PosixPath('/usr/share/modules/$MODULE_VERSION/modulefiles')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/run/user/2336/vscode-git-3b65b17622.sock')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('"/data/home/hamidnazeri/.vscode-server/bin/695af097c7bd098fbf017ce3ac85e09bbc5dda06/extensions/git/dist/git-editor.sh"')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/modulefiles')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/run/user/2336/vscode-ipc-108e87c6-bbba-41ad-9306-fece523e8ba5.sock')}
warn(msg)
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_gaa2pj7b/none_tw70ocve/attempt_0/1/error.json')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
(the bitsandbytes BUG REPORT banner and CUDA setup warnings above are printed once by each of the 4 ranks, differing only in the per-rank /tmp/torchelastic_gaa2pj7b/.../error.json path and in whether /usr/local/cuda/lib64/libcudart.so or libcudart.so.11.0 is picked)
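The CUDA setup lines report compute capability 8.0 (A100) and CUDA 11.3, and warn about duplicate libcudart files on the search path. A quick way to cross-check what PyTorch itself resolves; these are standard torch calls and not part of the original run:

```python
import torch

print(torch.version.cuda)                   # CUDA version this PyTorch build targets, e.g. "11.3"
print(torch.cuda.get_device_capability(0))  # (8, 0) on an A100
print(torch.cuda.device_count())            # 4 GPUs visible on this node
```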
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.47s/it]
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.36s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.40s/it]
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model meta-llama/Llama-2-7b-chat-hf
--> meta-llama/Llama-2-7b-chat-hf has 6738.415616 Million params
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:13<00:00, 6.54s/it]
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/torch/cuda/memory.py:306: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch0: 0%|          | 0/97 [00:00<?, ?it/s]
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> Training Set Length = 1555
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/torch/cuda/memory.py:306: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch0: 0%|          | 0/97 [00:00<?, ?it/s]
--> Validation Set Length = 84
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/torch/cuda/memory.py:306: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch0: 0%|          | 0/97 [00:00<?, ?it/s]
--> applying fsdp activation checkpointing...
/data/home/hamidnazeri/miniconda/envs/pippy_tp/lib/python3.10/site-packages/torch/cuda/memory.py:306: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
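The run reports "bFloat16 enabled for mixed precision - using bfSixteen policy" and "applying fsdp activation checkpointing". A minimal sketch of what those two steps typically look like with the stock PyTorch FSDP APIs; the exact llama-recipes implementation may differ, and the helper function name below is illustrative:

```python
import functools
import torch
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# bf16 for parameters, gradient reduction, and buffers (a "pure bf16" style policy)
bfSixteen = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

def apply_fsdp_activation_checkpointing(model):
    """Wrap every LlamaDecoderLayer in non-reentrant activation checkpointing."""
    wrapper = functools.partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=wrapper,
        check_fn=lambda submodule: isinstance(submodule, LlamaDecoderLayer),
    )
```

The bfSixteen policy would then be passed to the FSDP constructor as mixed_precision=bfSixteen.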
Training Epoch0: 1%|█▎ | 1/97 [00:24<39:29, 24.68s/it]
step 0 is completed and loss is 2.4715213775634766
Training Epoch0: 2%|██▌ | 2/97 [00:32<21:41, 13.70s/it]
step 1 is completed and loss is 5.230667591094971
Training Epoch0: 3%|███▊ | 3/97 [00:30<12:01, 7.68s/it]
step 2 is completed and loss is 5.525859832763672
Training Epoch0: 4%|█████ | 4/97 [00:33<08:54, 5.74s/it]
step 3 is completed and loss is 3.961235761642456
Training Epoch0: 5%|██████▎ | 5/97 [00:40<07:46, 5.07s/it]
step 4 is completed and loss is 2.868651866912842
Training Epoch0: 6%|███████▌ | 6/97 [00:38<06:09, 4.06s/it]
step 5 is completed and loss is 2.596735954284668
Training Epoch0: 7%|████████▉ | 7/97 [00:32<04:58, 3.32s/it]
step 6 is completed and loss is 2.3276379108428955
Training Epoch0: 8%|██████████▏ | 8/97 [00:48<05:10, 3.49s/it]
step 7 is completed and loss is 2.192718982696533
Training Epoch0: 9%|███████████▍ | 9/97 [00:51<04:48, 3.28s/it]
step 8 is completed and loss is 2.1461222171783447
Training Epoch0: 10%|████████████▌ | 10/97 [00:49<04:27, 3.08s/it]
step 9 is completed and loss is 2.0069150924682617
Training Epoch0: 11%|█████████████▊ | 11/97 [00:52<04:18, 3.00s/it]
step 10 is completed and loss is 1.9500118494033813
Training Epoch0: 11%|█████████████▊ | 11/97 [00:44<05:47, 4.04s/it]
Training Epoch0: 11%|█████████████▊ | 11/97 [00:52<06:53, 4.81s/it]
Training Epoch0: 11%|█████████████▊ | 11/97 [00:57<07:29, 5.22s/it]
Training Epoch0: 11%|█████████████▊ | 11/97 [00:52<06:49, 4.76s/it]
Max CUDA memory allocated was 19 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 20 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 5 GB
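The memory summary above (max allocated, max reserved, peak active, malloc retries) corresponds to the standard torch.cuda memory counters; a hedged sketch of how such numbers can be read out:

```python
import torch

GB = 1024 ** 3
stats = torch.cuda.memory_stats()

max_allocated = torch.cuda.max_memory_allocated() / GB  # "Max CUDA memory allocated"
max_reserved = torch.cuda.max_memory_reserved() / GB    # "Max CUDA memory reserved"
peak_active = stats["active_bytes.all.peak"] / GB       # "Peak active CUDA memory"
alloc_retries = stats["num_alloc_retries"]               # "Cuda Malloc retires"
```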
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.59it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.59it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.54it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.56it/s]
eval_ppl=tensor(1.7574, device='cuda:0') eval_epoch_loss=tensor(0.5639, device='cuda:0')
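The reported eval_ppl is consistent with perplexity being the exponential of the mean eval loss; a quick check:

```python
import torch

eval_epoch_loss = torch.tensor(0.5638577938079834)  # "best eval loss on epoch 0" from the log
eval_ppl = torch.exp(eval_epoch_loss)                # ≈ 1.7574, matching eval_ppl above
```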
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
==========================================================================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
2023-08-04 20:12:31,107 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 46.2713
best eval loss on epoch 0 is 0.5638577938079834
Epoch 1: train_perplexity=1.4111, train_epoch_loss=0.3443, epcoh time 53.10582085000351s
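The checkpoint above is written with the FSDP SHARDED_STATE_DICT path, where each rank saves its own shard via torch.distributed.checkpoint. A minimal sketch of that pattern; the actual llama-recipes checkpoint helper may differ, and the function name here is illustrative:

```python
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

def save_sharded_checkpoint(model, save_dir):
    # Each rank contributes its local shard; FileSystemWriter lays the shard files out in save_dir.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
        dist_cp.save_state_dict(
            state_dict=state_dict,
            storage_writer=dist_cp.FileSystemWriter(save_dir),
        )
```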
Training Epoch1: 1%|█▎ | 1/97 [00:03<05:08, 3.22s/it]
step 0 is completed and loss is 1.5544689893722534
Training Epoch1: 2%|██▌ | 2/97 [00:05<04:40, 2.95s/it]
step 1 is completed and loss is 1.915114164352417
Training Epoch1: 3%|███▊ | 3/97 [00:08<04:28, 2.86s/it]
step 2 is completed and loss is 1.932603359222412
Training Epoch1: 4%|█████ | 4/97 [00:11<04:24, 2.85s/it]
step 3 is completed and loss is 1.9791666269302368
Training Epoch1: 5%|██████▎ | 5/97 [00:14<04:19, 2.82s/it]
step 4 is completed and loss is 1.7437982559204102
Training Epoch1: 6%|███████▌ | 6/97 [00:17<04:15, 2.80s/it]
step 5 is completed and loss is 1.7695246934890747
Training Epoch1: 7%|████████▉ | 7/97 [00:19<04:11, 2.80s/it]
step 6 is completed and loss is 1.7550206184387207
Training Epoch1: 8%|██████████▏ | 8/97 [00:22<04:07, 2.78s/it]
step 7 is completed and loss is 1.746717095375061
Training Epoch1: 9%|███████████▍ | 9/97 [00:25<04:04, 2.78s/it]
step 8 is completed and loss is 1.7609888315200806
Training Epoch1: 10%|████████████▌ | 10/97 [00:28<04:02, 2.79s/it]
step 9 is completed and loss is 1.6778242588043213
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:00, 2.80s/it]
step 10 is completed and loss is 1.6500493288040161
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:03, 2.84s/it]
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:04, 2.84s/it]
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:04, 2.84s/it]
Training Epoch1: 11%|█████████████▊ | 11/97 [00:31<04:04, 2.85s/it]
Max CUDA memory allocated was 19 GB
Max CUDA memory reserved was 27 GB
Peak active CUDA memory was 20 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 6 GB
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.57it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.52it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.54it/s]
evaluating Epoch: 29%|██████████████████████████████████▊ | 6/21 [00:03<00:09, 1.50it/s]
eval_ppl=tensor(1.6894, device='cuda:0') eval_epoch_loss=tensor(0.5244, device='cuda:0')
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
=====================================================
Saving the FSDP model checkpoints using SHARDED_STATE_DICT
=====================================================
Saving model to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
2023-08-04 20:13:47,365 _dedup_tensors.py:44 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Sharded state checkpoint saved to /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf
Checkpoint Time = 32.6997
best eval loss on epoch 1 is 0.5243545174598694
Epoch 2: train_perplexity=1.2238, train_epoch_loss=0.2020, epcoh time 31.780189700890332s
training params are saved in /data/home/hamidnazeri/llama-recipes/model_checkpoints/with_new_config-meta-llama/Llama-2-7b-chat-hf/train_params.yaml
Key: avg_train_prep, Value: 1.317427396774292
Key: avg_train_loss, Value: 0.27314862608909607
Key: avg_eval_prep, Value: 1.7234036922454834
Key: avg_eval_loss, Value: 0.544106125831604
Key: avg_epoch_time, Value: 42.44300527544692
Key: avg_checkpoint_time, Value: 39.49147454963531
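The final avg_* values are consistent with plain means over the two epochs; for example:

```python
# epoch-level numbers taken from the log above
train_loss = [0.3443, 0.2020]
eval_loss = [0.5638577938079834, 0.5243545174598694]

avg_train_loss = sum(train_loss) / len(train_loss)  # ≈ 0.2731, matching avg_train_loss
avg_eval_loss = sum(eval_loss) / len(eval_loss)     # ≈ 0.5441, matching avg_eval_loss
```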