Created August 29, 2023 21:34
/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/bin/deepspeed:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('deepspeed==0.10.2+c69bd1f7')
[2023-08-30 00:25:45,359] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 00:25:46,642] [WARNING] [runner.py:201:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7
[2023-08-30 00:25:46,643] [INFO] [runner.py:567:main] cmd = /truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/bin/python3.11 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path meta-llama/Llama-2-7b-hf --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 4 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --output_dir ./output_step1_llama2_7b
[2023-08-30 00:25:47,849] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 00:25:49,082] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-08-30 00:25:49,083] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-08-30 00:25:49,083] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-08-30 00:25:49,083] [INFO] [launch.py:163:main] dist_world_size=8
[2023-08-30 00:25:49,083] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-08-30 00:25:51,144] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)  (×8, one line per rank)
[2023-08-30 00:25:55,754] [INFO] [comm.py:637:init_distributed] cdb=None  (×8, one line per rank)
[2023-08-30 00:25:55,755] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using pad_token, but it is not set yet.  (×8, one line per rank)
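This warning comes from the Hugging Face tokenizer: Llama-2 ships without a pad_token. A minimal sketch of how the pad token is typically set explicitly to avoid it (illustrative; DeepSpeed-Chat's own utils may handle this differently):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    if tokenizer.pad_token is None:
        # Option 1: reuse an existing special token.
        tokenizer.pad_token = tokenizer.eos_token
        # Option 2: add a dedicated pad token (grows the vocab, which then
        # requires resizing the model's embeddings; see the warning further down).
        # tokenizer.add_special_tokens({"pad_token": "[PAD]"})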
[2023-08-30 00:26:03,311] [INFO] [partition_parameters.py:340:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00, 1.03s/it]  (tqdm progress bars from all 8 ranks, each loading 2 shards in about 2 s)
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 32008. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc  (×8, one line per rank)
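transformers prints this whenever resize_token_embeddings is called without pad_to_multiple_of; here the resized vocab is 32008. A minimal sketch of the suggested call (model and tokenizer as loaded above):

    # Passing pad_to_multiple_of keeps the embedding rows on a Tensor-Core-friendly
    # multiple and silences the warning; 32008 already happens to be a multiple of 8.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)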
[2023-08-30 00:26:05,918] [INFO] [partition_parameters.py:340:__exit__] finished initializing model - num_params = 292, num_elems = 6.87B
/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op  (×8, one warning per rank)
  warnings.warn("Initializing zero-element tensors is a no-op")
[2023-08-30 00:26:05,926] [INFO] [partition_parameters.py:340:__exit__] finished initializing model - num_params = 293, num_elems = 6.87B
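Under ZeRO stage 3 every parameter is partitioned across the 8 ranks as the model is constructed, so each rank locally holds zero-element placeholder tensors for most weights; that is what the no-op init warnings refer to. A minimal sketch of the mechanism (DeepSpeed-Chat drives this through transformers' HfDeepSpeedConfig integration, so the direct call below is illustrative only):

    import deepspeed
    from transformers import AutoModelForCausalLM

    # Assumption: ds_config is the ZeRO-3 JSON printed later in this log, loaded as a dict.
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        # Each rank materializes only its shard of every parameter; the local
        # tensors can have zero elements, hence the warnings above.
        model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")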
Installed CUDA version 11.7 does not match the version torch was compiled with 11.2 but since the APIs are compatible, accepting this combination  (×8, one line per rank)
Using /dev/shm/.cache/torch_extensions/py311_cu112 as PyTorch extensions root...  (×8, one line per rank)
Detected CUDA files, patching ldflags
Emitting ninja build file /dev/shm/.cache/torch_extensions/py311_cu112/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08058977127075195 seconds
[2023-08-30 00:26:11,214] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2+c69bd1f7, git-hash=c69bd1f7, git-branch=master
[2023-08-30 00:26:11,215] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 0.08 to 0.30 seconds  (remaining ranks; one load per rank)
[2023-08-30 00:26:11,914] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-08-30 00:26:11,915] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-08-30 00:26:11,915] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-08-30 00:26:11,924] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-08-30 00:26:11,924] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2023-08-30 00:26:11,924] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2023-08-30 00:26:11,924] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2023-08-30 00:26:12,219] [INFO] [utils.py:803:see_memory_usage] Stage 3 initialize beginning
[2023-08-30 00:26:12,219] [INFO] [utils.py:804:see_memory_usage] MA 1.67 GB Max_MA 2.41 GB CA 3.45 GB Max_CA 3 GB
[2023-08-30 00:26:12,219] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.85 GB, percent = 7.9%
[2023-08-30 00:26:12,221] [INFO] [stage3.py:126:__init__] Reduce bucket size 500,000,000
[2023-08-30 00:26:12,221] [INFO] [stage3.py:127:__init__] Prefetch bucket size 30000000
[2023-08-30 00:26:12,514] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-08-30 00:26:12,516] [INFO] [utils.py:804:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 3.45 GB Max_CA 3 GB
[2023-08-30 00:26:12,516] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.85 GB, percent = 7.9%
Parameter Offload: Total persistent parameters: 266240 in 66 params
[2023-08-30 00:26:12,825] [INFO] [utils.py:803:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-08-30 00:26:12,826] [INFO] [utils.py:804:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 3.45 GB Max_CA 3 GB
[2023-08-30 00:26:12,826] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.85 GB, percent = 7.9%
[2023-08-30 00:26:13,124] [INFO] [utils.py:803:see_memory_usage] Before creating fp16 partitions
[2023-08-30 00:26:13,125] [INFO] [utils.py:804:see_memory_usage] MA 1.67 GB Max_MA 1.67 GB CA 3.45 GB Max_CA 3 GB
[2023-08-30 00:26:13,125] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.85 GB, percent = 7.9%
[2023-08-30 00:26:14,972] [INFO] [utils.py:803:see_memory_usage] After creating fp16 partitions: 1
[2023-08-30 00:26:14,989] [INFO] [utils.py:804:see_memory_usage] MA 1.66 GB Max_MA 1.67 GB CA 2.18 GB Max_CA 3 GB
[2023-08-30 00:26:14,989] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 89.24 GB, percent = 8.9%
[2023-08-30 00:26:15,304] [INFO] [utils.py:803:see_memory_usage] Before creating fp32 partitions
[2023-08-30 00:26:15,306] [INFO] [utils.py:804:see_memory_usage] MA 1.66 GB Max_MA 1.66 GB CA 2.18 GB Max_CA 2 GB
[2023-08-30 00:26:15,306] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 90.03 GB, percent = 8.9%
[2023-08-30 00:26:15,699] [INFO] [utils.py:803:see_memory_usage] After creating fp32 partitions
[2023-08-30 00:26:15,700] [INFO] [utils.py:804:see_memory_usage] MA 4.74 GB Max_MA 6.28 GB CA 6.79 GB Max_CA 7 GB
[2023-08-30 00:26:15,700] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 90.63 GB, percent = 9.0%
[2023-08-30 00:26:16,919] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states
[2023-08-30 00:26:16,921] [INFO] [utils.py:804:see_memory_usage] MA 4.74 GB Max_MA 4.74 GB CA 6.79 GB Max_CA 7 GB
[2023-08-30 00:26:16,921] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.9 GB, percent = 7.9%
[2023-08-30 00:26:17,230] [INFO] [utils.py:803:see_memory_usage] After initializing optimizer states
[2023-08-30 00:26:17,230] [INFO] [utils.py:804:see_memory_usage] MA 10.89 GB Max_MA 13.97 GB CA 16.03 GB Max_CA 16 GB
[2023-08-30 00:26:17,231] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.9 GB, percent = 7.9%
[2023-08-30 00:26:17,231] [INFO] [stage3.py:445:_setup_for_real_optimizer] optimizer state initialized
[2023-08-30 00:26:18,135] [INFO] [utils.py:803:see_memory_usage] After initializing ZeRO optimizer
[2023-08-30 00:26:18,137] [INFO] [utils.py:804:see_memory_usage] MA 13.36 GB Max_MA 13.85 GB CA 16.45 GB Max_CA 16 GB
[2023-08-30 00:26:18,137] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 79.9 GB, percent = 7.9%
[2023-08-30 00:26:18,137] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-08-30 00:26:18,137] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-08-30 00:26:18,137] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x14b817def010>
[2023-08-30 00:26:18,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06], mom=[(0.9, 0.95)]
[2023-08-30 00:26:18,138] [INFO] [config.py:963:print] DeepSpeedEngine configuration:
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] amp_enabled .................. False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] amp_params ................... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] bfloat16_enabled ............. False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x14b8306d7fd0>
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] communication_data_type ...... None
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] curriculum_params_legacy ..... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] data_efficiency_enabled ...... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] dataloader_drop_last ......... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] disable_allgather ............ False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] dump_state ................... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_enabled ........... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] eigenvalue_verbose ........... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] elasticity_enabled ........... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] fp16_auto_cast ............... False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] fp16_enabled ................. True
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] global_rank .................. 0
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] grad_accum_dtype ............. None
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] gradient_clipping ............ 1.0
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] initial_dynamic_scale ........ 65536
[2023-08-30 00:26:18,139] [INFO] [config.py:967:print] load_universal_checkpoint .... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] loss_scale ................... 0
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] memory_breakdown ............. False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] mics_hierarchial_params_gather False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] mics_shard_size .............. -1
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] optimizer_name ............... None
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] optimizer_params ............. None
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] pld_enabled .................. False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] pld_params ................... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] prescale_gradients ........... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] scheduler_name ............... None
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] scheduler_params ............. None
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] sparse_attention ............. None
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] steps_per_print .............. 10
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] train_batch_size ............. 32
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 4
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] use_node_local_storage ....... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] wall_clock_breakdown ......... False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] world_size ................... 8
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] zero_allow_untested_optimizer False
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] zero_enabled ................. True
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True
[2023-08-30 00:26:18,140] [INFO] [config.py:967:print] zero_optimization_stage ...... 3
[2023-08-30 00:26:18,140] [INFO] [config.py:953:print_user_config] json = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "none"
        },
        "offload_optimizer": {
            "device": "none"
        },
        "stage3_param_persistence_threshold": 1.000000e+04,
        "stage3_max_live_parameters": 3.000000e+07,
        "stage3_prefetch_bucket_size": 3.000000e+07,
        "memory_efficient_linear": false
    },
    "fp16": {
        "enabled": true,
        "loss_scale_window": 100
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false,
    "hybrid_engine": {
        "enabled": false,
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": false,
        "pin_parameters": true,
        "tp_gather_partition_size": 8
    },
    "tensorboard": {
        "enabled": false,
        "output_path": "step1_tensorboard/ds_tensorboard_logs/",
        "job_name": "step1_model_tensorboard"
    }
}
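A user config like the JSON above is what the training script hands to deepspeed.initialize, which wraps the model in the DeepSpeedEngine whose forward appears in the traceback below. A minimal sketch under that assumption (variable names are illustrative, not DeepSpeed-Chat's exact code):

    import deepspeed

    # ds_config: the dict printed above (ZeRO-3, fp16, micro-batch 4 on 8 GPUs,
    # giving the logged train_batch_size of 32).
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,                # the resized Llama-2-7b model
        optimizer=optimizer,        # the FusedAdam instance logged above
        lr_scheduler=lr_scheduler,  # the client LambdaLR logged above
        config=ds_config,
    )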
***** Running training *****
***** Evaluating perplexity, Epoch 0/4 *****
Traceback (most recent call last):
  File "/truba/home/dyuret/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 366, in <module>
    main()
  File "/truba/home/dyuret/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 317, in main
    perplexity = evaluation(model, eval_dataloader)
  File "/truba/home/dyuret/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 268, in evaluation
    outputs = model(**batch)
  File "/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
    logits = self.lm_head(hidden_states)
  File "/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 504, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/truba/home/dyuret/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
(identical traceback raised concurrently by all 8 ranks)
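Reading the assertion: parameter id 292 is a persistent weight whose full shape is (0, 4096), i.e. it was created with zero rows (consistent with the zero-element-tensor warnings during the embedding resize earlier), so the ZeRO-3 prefetch coordinator can never mark it AVAILABLE for the lm_head forward. A small diagnostic sketch for locating such parameters before evaluation (ds_* are the attributes DeepSpeed attaches to ZeRO-3 parameters; this only reports the problem, it is not a fix):

    # Run on any rank after model setup: list ZeRO-3 parameters whose
    # full (unpartitioned) size is zero, like param id 292 in the assertion.
    for name, param in model.named_parameters():
        if hasattr(param, "ds_status") and param.ds_numel == 0:
            print(f"{name}: status={param.ds_status}, ds_shape={param.ds_shape}")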
[2023-08-30 00:26:22,090] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512099
[2023-08-30 00:26:22,134] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512100
[2023-08-30 00:26:22,160] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512101
[2023-08-30 00:26:22,160] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512102
[2023-08-30 00:26:22,318] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512103
[2023-08-30 00:26:22,345] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512104
[2023-08-30 00:26:22,369] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512105
[2023-08-30 00:26:22,605] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3512106
[2023-08-30 00:26:22,631] [ERROR] [launch.py:321:sigkill_handler] ['/truba/home/dyuret/.julia/conda/3/x86_64/envs/llm/bin/python3.11', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'meta-llama/Llama-2-7b-hf', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '4', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--deepspeed', '--output_dir', './output_step1_llama2_7b'] exits with return code = 1