@chauhang
Created March 27, 2024 02:57
torchtrain checkpoint save error (1B model)
CONFIG_FILE=./train_configs/llama_1b.toml ./run_llama_train.sh
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gchauhan/local/torchtrain
+ NGPU=8
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama_1b.toml
+ torchrun --nproc_per_node=8 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama_1b.toml
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757]
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757] *****************************************
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-26 18:44:05,682 - root - INFO - Starting job: LLaMA 1B training
[rank0]:2024-03-26 18:44:05,683 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-26 18:44:05,686 - root - INFO - Building 1-D device mesh with ['dp'], [8]
[rank0]:2024-03-26 18:44:05,689 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-26 18:44:05,698 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-26 18:44:05,698 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-26 18:44:07,994 - root - INFO - Building llama 1B with ModelArgs(dim=2048, n_layers=18, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-26 18:44:08,078 - root - INFO - Model llama 1B size: 1,055,991,808 total parameters
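Note: the parameter count reported above can be reproduced from the ModelArgs on the previous line. The sketch below assumes the standard Llama-style parameterization (untied input and output embeddings, a SwiGLU FFN whose hidden size is int(2*4*dim/3) rounded up to multiple_of, and RMSNorm weight vectors only); it is an illustrative check, not torchtrain's own counting code.

dim, n_layers, vocab_size, multiple_of = 2048, 18, 32000, 256

ffn_hidden = int(2 * (4 * dim) / 3)                                         # SwiGLU shrink: 5461
ffn_hidden = multiple_of * ((ffn_hidden + multiple_of - 1) // multiple_of)  # round up: 5632

attn = 4 * dim * dim               # wq, wk, wv, wo (n_kv_heads=None -> full-size k/v)
ffn = 3 * dim * ffn_hidden         # w1, w2, w3
norms = 2 * dim                    # attention_norm, ffn_norm (RMSNorm weights)
per_layer = attn + ffn + norms

embed = vocab_size * dim           # token embedding
output = vocab_size * dim          # untied output projection
total = n_layers * per_layer + embed + output + dim  # + final norm

print(f"{total:,}")                # 1,055,991,808 -- matches the log line above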
[rank0]:2024-03-26 18:44:08,079 - root - INFO - GPU capacity: AMD Instinct MI250X / MI250 (0) with 63.98GiB memory
[rank0]:2024-03-26 18:44:08,358 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-26 18:44:08,358 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-26 18:44:08,386 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-26 18:44:08,387 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-26 18:44:08,387 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs1b/tb/20240326-1844
[rank0]:2024-03-26 18:44:08,835 - root - INFO - Profiling active. Traces will be saved at ./outputs1b/profiling/traces
[rank0]:/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]: warnings.warn(
[rank0]:2024-03-26 18:44:17,066 - root - INFO - step: 1 loss: 10.9107 memory: 26.81GiB(41.90%) wps: 1,991 mfu: 4.62%
[rank0]:2024-03-26 18:44:17,066 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
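Note: the line above shrinks every ProcessGroup timeout to 5 seconds once training is underway, so from this point on any collective that takes longer than 5 s trips the NCCL watchdog; that appears to be relevant to the checkpoint-save failure at step 1000 further down. For reference, a minimal sketch of how a more generous timeout is normally chosen at process-group creation (illustrative only; torchtrain drives this from its job config rather than a hard-coded call like this):

from datetime import timedelta

import torch.distributed as dist

# The timeout passed here bounds every collective issued on the default
# process group; checkpoint saves that gather/scatter metadata across ranks
# must finish within it or the watchdog aborts the job.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))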
[rank0]:2024-03-26 18:44:52,854 - root - INFO - step: 11 loss: 8.7908 memory: 26.84GiB(41.95%) wps: 4,578 mfu: 10.63%
[rank0]:2024-03-26 18:45:28,646 - root - INFO - step: 21 loss: 6.1142 memory: 26.84GiB(41.95%) wps: 4,578 mfu: 10.63%
[rank0]:2024-03-26 18:46:04,433 - root - INFO - step: 31 loss: 5.3807 memory: 26.84GiB(41.95%) wps: 4,578 mfu: 10.63%
[rank0]:2024-03-26 18:46:40,243 - root - INFO - step: 41 loss: 4.7834 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:2024-03-26 18:47:08,919 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:47:16,056 - root - INFO - step: 51 loss: 4.2560 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:2024-03-26 18:47:51,871 - root - INFO - step: 61 loss: 3.9028 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:2024-03-26 18:48:27,672 - root - INFO - step: 71 loss: 3.6946 memory: 26.85GiB(41.97%) wps: 4,576 mfu: 10.62%
[rank0]:2024-03-26 18:49:03,476 - root - INFO - step: 81 loss: 3.4869 memory: 26.85GiB(41.97%) wps: 4,576 mfu: 10.62%
[rank0]:2024-03-26 18:49:39,289 - root - INFO - step: 91 loss: 3.3043 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:[rank0]:[W326 18:50:07.325342063 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-26 18:50:07,966 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:[rank0]:[W326 18:50:11.007441554 collection.cpp:1042] Warning: ROCTracer produced duplicate flow start: 4288 (function operator())
[rank0]:2024-03-26 18:50:16,359 - root - INFO - step: 101 loss: 3.1451 memory: 26.85GiB(41.97%) wps: 4,420 mfu: 10.26%
[rank0]:2024-03-26 18:50:52,180 - root - INFO - step: 111 loss: 3.0083 memory: 26.85GiB(41.97%) wps: 4,574 mfu: 10.62%
[rank0]:2024-03-26 18:51:28,004 - root - INFO - step: 121 loss: 2.9166 memory: 26.85GiB(41.97%) wps: 4,574 mfu: 10.62%
[rank0]:2024-03-26 18:52:03,845 - root - INFO - step: 131 loss: 2.8386 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 18:52:39,680 - root - INFO - step: 141 loss: 2.7286 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:53:08,386 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:53:15,552 - root - INFO - step: 151 loss: 2.6372 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 18:53:51,395 - root - INFO - step: 161 loss: 2.5634 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 18:54:27,231 - root - INFO - step: 171 loss: 2.4823 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:55:03,058 - root - INFO - step: 181 loss: 2.4181 memory: 26.85GiB(41.97%) wps: 4,573 mfu: 10.61%
[rank0]:2024-03-26 18:55:38,897 - root - INFO - step: 191 loss: 2.3751 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:56:04,075 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:56:16,253 - root - INFO - step: 201 loss: 2.2876 memory: 26.85GiB(41.97%) wps: 4,386 mfu: 10.18%
[rank0]:2024-03-26 18:56:52,089 - root - INFO - step: 211 loss: 2.1803 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:57:27,947 - root - INFO - step: 221 loss: 2.1132 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 18:58:03,817 - root - INFO - step: 231 loss: 2.0633 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 18:58:39,664 - root - INFO - step: 241 loss: 1.9935 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 18:59:04,789 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:59:15,544 - root - INFO - step: 251 loss: 1.9305 memory: 26.85GiB(41.97%) wps: 4,566 mfu: 10.60%
[rank0]:2024-03-26 18:59:51,396 - root - INFO - step: 261 loss: 1.8470 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:00:27,237 - root - INFO - step: 271 loss: 1.7749 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:01:03,085 - root - INFO - step: 281 loss: 1.7331 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:01:38,937 - root - INFO - step: 291 loss: 1.6711 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:02:04,095 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:02:16,170 - root - INFO - step: 301 loss: 1.6143 memory: 26.85GiB(41.97%) wps: 4,400 mfu: 10.21%
[rank0]:2024-03-26 19:02:52,038 - root - INFO - step: 311 loss: 1.5341 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:03:27,902 - root - INFO - step: 321 loss: 1.4734 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:04:03,751 - root - INFO - step: 331 loss: 1.4322 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:04:39,600 - root - INFO - step: 341 loss: 1.3775 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:05:04,707 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:05:15,487 - root - INFO - step: 351 loss: 1.3271 memory: 26.85GiB(41.97%) wps: 4,565 mfu: 10.60%
[rank0]:2024-03-26 19:05:51,333 - root - INFO - step: 361 loss: 1.2502 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:06:27,182 - root - INFO - step: 371 loss: 1.1916 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:07:03,031 - root - INFO - step: 381 loss: 1.1476 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:07:38,878 - root - INFO - step: 391 loss: 1.1054 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:08:00,427 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:08:16,177 - root - INFO - step: 401 loss: 1.0486 memory: 26.85GiB(41.97%) wps: 4,393 mfu: 10.20%
[rank0]:2024-03-26 19:08:52,013 - root - INFO - step: 411 loss: 0.9800 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 19:09:27,860 - root - INFO - step: 421 loss: 0.9288 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:10:03,713 - root - INFO - step: 431 loss: 0.8843 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:10:39,565 - root - INFO - step: 441 loss: 0.8435 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:11:01,108 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:11:15,464 - root - INFO - step: 451 loss: 0.7902 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:11:51,347 - root - INFO - step: 461 loss: 0.7334 memory: 26.85GiB(41.97%) wps: 4,566 mfu: 10.60%
[rank0]:2024-03-26 19:12:27,191 - root - INFO - step: 471 loss: 0.6900 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:13:03,050 - root - INFO - step: 481 loss: 0.6489 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:13:38,909 - root - INFO - step: 491 loss: 0.6124 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:14:00,448 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:14:16,291 - root - INFO - step: 501 loss: 0.5700 memory: 26.85GiB(41.97%) wps: 4,383 mfu: 10.17%
[rank0]:2024-03-26 19:14:52,165 - root - INFO - step: 511 loss: 0.5230 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:15:28,038 - root - INFO - step: 521 loss: 0.4927 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:16:03,904 - root - INFO - step: 531 loss: 0.4569 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:16:39,768 - root - INFO - step: 541 loss: 0.4269 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:17:01,292 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:17:15,666 - root - INFO - step: 551 loss: 0.3989 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:17:51,578 - root - INFO - step: 561 loss: 0.3640 memory: 26.85GiB(41.97%) wps: 4,562 mfu: 10.59%
[rank0]:2024-03-26 19:18:27,445 - root - INFO - step: 571 loss: 0.3443 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:19:03,296 - root - INFO - step: 581 loss: 0.3173 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:19:39,133 - root - INFO - step: 591 loss: 0.2979 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 19:19:57,099 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:20:16,560 - root - INFO - step: 601 loss: 0.2751 memory: 26.85GiB(41.97%) wps: 4,378 mfu: 10.16%
[rank0]:2024-03-26 19:20:52,421 - root - INFO - step: 611 loss: 0.2531 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:21:28,279 - root - INFO - step: 621 loss: 0.2405 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:22:04,130 - root - INFO - step: 631 loss: 0.2216 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:22:39,981 - root - INFO - step: 641 loss: 0.2065 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:22:57,943 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:23:15,891 - root - INFO - step: 651 loss: 0.1955 memory: 26.85GiB(41.97%) wps: 4,563 mfu: 10.59%
[rank0]:2024-03-26 19:23:51,762 - root - INFO - step: 661 loss: 0.1805 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:24:27,625 - root - INFO - step: 671 loss: 0.1703 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:25:03,493 - root - INFO - step: 681 loss: 0.1565 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:25:39,352 - root - INFO - step: 691 loss: 0.1462 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:25:57,299 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:26:16,693 - root - INFO - step: 701 loss: 0.1370 memory: 26.85GiB(41.97%) wps: 4,388 mfu: 10.18%
[rank0]:2024-03-26 19:26:52,546 - root - INFO - step: 711 loss: 0.1275 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:27:28,407 - root - INFO - step: 721 loss: 0.1209 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:28:04,267 - root - INFO - step: 731 loss: 0.1133 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:28:40,126 - root - INFO - step: 741 loss: 0.1089 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:28:58,065 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:29:16,029 - root - INFO - step: 751 loss: 0.1019 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:29:51,889 - root - INFO - step: 761 loss: 0.0948 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:30:27,746 - root - INFO - step: 771 loss: 0.0898 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:31:03,597 - root - INFO - step: 781 loss: 0.0864 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:31:39,449 - root - INFO - step: 791 loss: 0.0802 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:31:53,831 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:32:16,827 - root - INFO - step: 801 loss: 0.0754 memory: 26.85GiB(41.97%) wps: 4,384 mfu: 10.17%
[rank0]:2024-03-26 19:32:52,704 - root - INFO - step: 811 loss: 0.0700 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:33:28,556 - root - INFO - step: 821 loss: 0.0633 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:34:04,404 - root - INFO - step: 831 loss: 0.0613 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:34:40,270 - root - INFO - step: 841 loss: 0.0570 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:34:54,649 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:35:16,170 - root - INFO - step: 851 loss: 0.0533 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:35:52,037 - root - INFO - step: 861 loss: 0.0500 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:36:27,903 - root - INFO - step: 871 loss: 0.0459 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:37:03,775 - root - INFO - step: 881 loss: 0.0443 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:37:39,643 - root - INFO - step: 891 loss: 0.0422 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:37:54,005 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:38:16,983 - root - INFO - step: 901 loss: 0.0396 memory: 26.85GiB(41.97%) wps: 4,388 mfu: 10.18%
[rank0]:2024-03-26 19:38:52,834 - root - INFO - step: 911 loss: 0.0380 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:39:28,690 - root - INFO - step: 921 loss: 0.0364 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:40:04,559 - root - INFO - step: 931 loss: 0.0351 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:40:40,412 - root - INFO - step: 941 loss: 0.0334 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:40:54,765 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:41:16,326 - root - INFO - step: 951 loss: 0.0320 memory: 26.85GiB(41.97%) wps: 4,562 mfu: 10.59%
[rank0]:2024-03-26 19:41:52,188 - root - INFO - step: 961 loss: 0.0324 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:42:28,056 - root - INFO - step: 971 loss: 0.0314 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:43:03,927 - root - INFO - step: 981 loss: 0.0299 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:43:39,788 - root - INFO - step: 991 loss: 0.0299 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:43:50,584 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:44:12,171 - root - INFO - Saving a checkpoint at step 1000
[rank0]:[rank0]:[E326 19:44:37.537349911 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0] Received a global timeout from another rank and will start to dump the debug info. Last enqueued NCCL work: 57301, last completed NCCL work: 57301.
[rank0]:[rank0]:[E326 19:44:37.537555088 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[rank0]:[F326 19:44:37.686925669 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout on some other rank and notified current rank. This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. We tried our best to dump the debug info into the storage to help you debug the issue.
[rank0]:Fatal Python error: Aborted
[rank0]:
[rank0]:Thread 0x00007f01c17fa640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c0ff9640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c1ffb640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c27fc640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c2ffd640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c37fe640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c3fff640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f070cdfc640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f07a6ffd640 (most recent call first):
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/threading.py", line 324 in wait
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/queue.py", line 180 in get
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/threading.py", line 995 in _bootstrap
[rank0]:
[rank0]:Thread 0x00007f13c65f5400 (most recent call first):
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3230 in scatter
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75 in wrapper
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2770 in scatter_object_list
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75 in wrapper
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 133 in scatter_object
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 188 in reduce_scatter
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 274 in _save_state_dict
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 145 in save
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 427 in inner_func
[rank0]: File "/home/gchauhan/meta/torchtrain/torchtrain/checkpoint.py", line 114 in save
[rank0]: File "/home/gchauhan/meta/torchtrain/train.py", line 368 in main
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
[rank0]: File "/home/gchauhan/meta/torchtrain/train.py", line 389 in <module>
[rank0]:
[rank0]:Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, zstandard.backend_c, charset_normalizer.md, yaml._yaml, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, google._upb._message, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils (total: 111)
[W326 19:44:39.322816983 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:40.538868957 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
W0326 19:44:41.794000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161714 closing signal SIGTERM
W0326 19:44:41.795000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161715 closing signal SIGTERM
W0326 19:44:41.798000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161716 closing signal SIGTERM
W0326 19:44:41.801000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161717 closing signal SIGTERM
W0326 19:44:41.803000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161718 closing signal SIGTERM
W0326 19:44:41.805000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161722 closing signal SIGTERM
[W326 19:44:42.390774369 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.416323971 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.429864504 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.437377152 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.440171941 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.440343134 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
E0326 19:44:42.186000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 161713) of binary: /home/gchauhan/my_envs/llm-amd/bin/python
Traceback (most recent call last):
  File "/home/gchauhan/my_envs/llm-amd/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-26_19:44:41
  host      : devgpu002.snc8.facebook.com
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 161719)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 161719
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-26_19:44:41
  host      : devgpu002.snc8.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 161713)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 161713
=======================================================
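Note on the failure: the rank-0 stack trace above is blocked in scatter_object_list, called from torch.distributed.checkpoint's _save_state_dict while "Saving a checkpoint at step 1000". That is consistent with the ProcessGroup timeout having been reduced to 5 seconds after step 1: the watchdog on another rank reported a collective timeout during the save, and every rank was then aborted with SIGABRT (exit code -6). A minimal sketch of the same call path, with illustrative paths and a plain state dict (the real torchtrain/checkpoint.py wraps FSDP state-dict handling that is not shown here):

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

def save_checkpoint(model, optimizer, step, folder="./outputs1b/checkpoint"):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    # dcp.save coordinates a save plan across ranks (the scatter_object_list
    # seen in the traceback) before each rank writes its shards; all of that
    # coordination runs on the default process group and is therefore bound
    # by the 5-second timeout set after step 1.
    dcp.save(state_dict, storage_writer=FileSystemWriter(f"{folder}/step-{step}"))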