@chauhang
Created March 27, 2024 02:57
torchtrain checkpoint save error (1B model)
CONFIG_FILE=./train_configs/llama_1b.toml ./run_llama_train.sh
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gchauhan/local/torchtrain
+ NGPU=8
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama_1b.toml
+ torchrun --nproc_per_node=8 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama_1b.toml
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757]
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757] *****************************************
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0326 18:44:01.559000 140707157537792 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-26 18:44:05,682 - root - INFO - Starting job: LLaMA 1B training
[rank0]:2024-03-26 18:44:05,683 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-26 18:44:05,686 - root - INFO - Building 1-D device mesh with ['dp'], [8]
[rank0]:2024-03-26 18:44:05,689 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-26 18:44:05,698 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-26 18:44:05,698 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-26 18:44:07,994 - root - INFO - Building llama 1B with ModelArgs(dim=2048, n_layers=18, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-26 18:44:08,078 - root - INFO - Model llama 1B size: 1,055,991,808 total parameters
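Note: the parameter count reported above can be reproduced from the ModelArgs on the previous line. The sketch below assumes the standard Llama-style parameterization (untied input and output embeddings, a SwiGLU FFN whose hidden size is int(2*4*dim/3) rounded up to multiple_of, and RMSNorm weight vectors only); it is an illustrative check, not torchtrain's own counting code.

dim, n_layers, vocab_size, multiple_of = 2048, 18, 32000, 256

ffn_hidden = int(2 * (4 * dim) / 3)                                         # SwiGLU shrink: 5461
ffn_hidden = multiple_of * ((ffn_hidden + multiple_of - 1) // multiple_of)  # round up: 5632

attn = 4 * dim * dim               # wq, wk, wv, wo (n_kv_heads=None -> full-size k/v)
ffn = 3 * dim * ffn_hidden         # w1, w2, w3
norms = 2 * dim                    # attention_norm, ffn_norm (RMSNorm weights)
per_layer = attn + ffn + norms

embed = vocab_size * dim           # token embedding
output = vocab_size * dim          # untied output projection
total = n_layers * per_layer + embed + output + dim  # + final norm

print(f"{total:,}")                # 1,055,991,808 -- matches the log line above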
[rank0]:2024-03-26 18:44:08,079 - root - INFO - GPU capacity: AMD Instinct MI250X / MI250 (0) with 63.98GiB memory
[rank0]:2024-03-26 18:44:08,358 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-26 18:44:08,358 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-26 18:44:08,386 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-26 18:44:08,387 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-26 18:44:08,387 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs1b/tb/20240326-1844
[rank0]:2024-03-26 18:44:08,835 - root - INFO - Profiling active. Traces will be saved at ./outputs1b/profiling/traces
[rank0]:/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]: warnings.warn(
[rank0]:2024-03-26 18:44:17,066 - root - INFO - step: 1 loss: 10.9107 memory: 26.81GiB(41.90%) wps: 1,991 mfu: 4.62%
[rank0]:2024-03-26 18:44:17,066 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
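Note: the line above shrinks every ProcessGroup timeout to 5 seconds once training is underway, so from this point on any collective that takes longer than 5 s trips the NCCL watchdog; that appears to be relevant to the checkpoint-save failure at step 1000 further down. For reference, a minimal sketch of how a more generous timeout is normally chosen at process-group creation (illustrative only; torchtrain drives this from its job config rather than a hard-coded call like this):

from datetime import timedelta

import torch.distributed as dist

# The timeout passed here bounds every collective issued on the default
# process group; checkpoint saves that gather/scatter metadata across ranks
# must finish within it or the watchdog aborts the job.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))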
[rank0]:2024-03-26 18:44:52,854 - root - INFO - step: 11 loss: 8.7908 memory: 26.84GiB(41.95%) wps: 4,578 mfu: 10.63%
[rank0]:2024-03-26 18:45:28,646 - root - INFO - step: 21 loss: 6.1142 memory: 26.84GiB(41.95%) wps: 4,578 mfu: 10.63%
[rank0]:2024-03-26 18:46:04,433 - root - INFO - step: 31 loss: 5.3807 memory: 26.84GiB(41.95%) wps: 4,578 mfu: 10.63%
[rank0]:2024-03-26 18:46:40,243 - root - INFO - step: 41 loss: 4.7834 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:2024-03-26 18:47:08,919 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:47:16,056 - root - INFO - step: 51 loss: 4.2560 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:2024-03-26 18:47:51,871 - root - INFO - step: 61 loss: 3.9028 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:2024-03-26 18:48:27,672 - root - INFO - step: 71 loss: 3.6946 memory: 26.85GiB(41.97%) wps: 4,576 mfu: 10.62%
[rank0]:2024-03-26 18:49:03,476 - root - INFO - step: 81 loss: 3.4869 memory: 26.85GiB(41.97%) wps: 4,576 mfu: 10.62%
[rank0]:2024-03-26 18:49:39,289 - root - INFO - step: 91 loss: 3.3043 memory: 26.85GiB(41.97%) wps: 4,575 mfu: 10.62%
[rank0]:[rank0]:[W326 18:50:07.325342063 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-26 18:50:07,966 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:[rank0]:[W326 18:50:11.007441554 collection.cpp:1042] Warning: ROCTracer produced duplicate flow start: 4288 (function operator())
[rank0]:2024-03-26 18:50:16,359 - root - INFO - step: 101 loss: 3.1451 memory: 26.85GiB(41.97%) wps: 4,420 mfu: 10.26%
[rank0]:2024-03-26 18:50:52,180 - root - INFO - step: 111 loss: 3.0083 memory: 26.85GiB(41.97%) wps: 4,574 mfu: 10.62%
[rank0]:2024-03-26 18:51:28,004 - root - INFO - step: 121 loss: 2.9166 memory: 26.85GiB(41.97%) wps: 4,574 mfu: 10.62%
[rank0]:2024-03-26 18:52:03,845 - root - INFO - step: 131 loss: 2.8386 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 18:52:39,680 - root - INFO - step: 141 loss: 2.7286 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:53:08,386 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:53:15,552 - root - INFO - step: 151 loss: 2.6372 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 18:53:51,395 - root - INFO - step: 161 loss: 2.5634 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 18:54:27,231 - root - INFO - step: 171 loss: 2.4823 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:55:03,058 - root - INFO - step: 181 loss: 2.4181 memory: 26.85GiB(41.97%) wps: 4,573 mfu: 10.61%
[rank0]:2024-03-26 18:55:38,897 - root - INFO - step: 191 loss: 2.3751 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:56:04,075 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:56:16,253 - root - INFO - step: 201 loss: 2.2876 memory: 26.85GiB(41.97%) wps: 4,386 mfu: 10.18%
[rank0]:2024-03-26 18:56:52,089 - root - INFO - step: 211 loss: 2.1803 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 18:57:27,947 - root - INFO - step: 221 loss: 2.1132 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 18:58:03,817 - root - INFO - step: 231 loss: 2.0633 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 18:58:39,664 - root - INFO - step: 241 loss: 1.9935 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 18:59:04,789 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 18:59:15,544 - root - INFO - step: 251 loss: 1.9305 memory: 26.85GiB(41.97%) wps: 4,566 mfu: 10.60%
[rank0]:2024-03-26 18:59:51,396 - root - INFO - step: 261 loss: 1.8470 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:00:27,237 - root - INFO - step: 271 loss: 1.7749 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:01:03,085 - root - INFO - step: 281 loss: 1.7331 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:01:38,937 - root - INFO - step: 291 loss: 1.6711 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:02:04,095 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:02:16,170 - root - INFO - step: 301 loss: 1.6143 memory: 26.85GiB(41.97%) wps: 4,400 mfu: 10.21%
[rank0]:2024-03-26 19:02:52,038 - root - INFO - step: 311 loss: 1.5341 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:03:27,902 - root - INFO - step: 321 loss: 1.4734 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:04:03,751 - root - INFO - step: 331 loss: 1.4322 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:04:39,600 - root - INFO - step: 341 loss: 1.3775 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:05:04,707 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:05:15,487 - root - INFO - step: 351 loss: 1.3271 memory: 26.85GiB(41.97%) wps: 4,565 mfu: 10.60%
[rank0]:2024-03-26 19:05:51,333 - root - INFO - step: 361 loss: 1.2502 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:06:27,182 - root - INFO - step: 371 loss: 1.1916 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:07:03,031 - root - INFO - step: 381 loss: 1.1476 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:07:38,878 - root - INFO - step: 391 loss: 1.1054 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:08:00,427 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:08:16,177 - root - INFO - step: 401 loss: 1.0486 memory: 26.85GiB(41.97%) wps: 4,393 mfu: 10.20%
[rank0]:2024-03-26 19:08:52,013 - root - INFO - step: 411 loss: 0.9800 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 19:09:27,860 - root - INFO - step: 421 loss: 0.9288 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:10:03,713 - root - INFO - step: 431 loss: 0.8843 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:10:39,565 - root - INFO - step: 441 loss: 0.8435 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:11:01,108 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:11:15,464 - root - INFO - step: 451 loss: 0.7902 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:11:51,347 - root - INFO - step: 461 loss: 0.7334 memory: 26.85GiB(41.97%) wps: 4,566 mfu: 10.60%
[rank0]:2024-03-26 19:12:27,191 - root - INFO - step: 471 loss: 0.6900 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:13:03,050 - root - INFO - step: 481 loss: 0.6489 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:13:38,909 - root - INFO - step: 491 loss: 0.6124 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:14:00,448 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:14:16,291 - root - INFO - step: 501 loss: 0.5700 memory: 26.85GiB(41.97%) wps: 4,383 mfu: 10.17%
[rank0]:2024-03-26 19:14:52,165 - root - INFO - step: 511 loss: 0.5230 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:15:28,038 - root - INFO - step: 521 loss: 0.4927 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:16:03,904 - root - INFO - step: 531 loss: 0.4569 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:16:39,768 - root - INFO - step: 541 loss: 0.4269 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:17:01,292 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:17:15,666 - root - INFO - step: 551 loss: 0.3989 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:17:51,578 - root - INFO - step: 561 loss: 0.3640 memory: 26.85GiB(41.97%) wps: 4,562 mfu: 10.59%
[rank0]:2024-03-26 19:18:27,445 - root - INFO - step: 571 loss: 0.3443 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:19:03,296 - root - INFO - step: 581 loss: 0.3173 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:19:39,133 - root - INFO - step: 591 loss: 0.2979 memory: 26.85GiB(41.97%) wps: 4,572 mfu: 10.61%
[rank0]:2024-03-26 19:19:57,099 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:20:16,560 - root - INFO - step: 601 loss: 0.2751 memory: 26.85GiB(41.97%) wps: 4,378 mfu: 10.16%
[rank0]:2024-03-26 19:20:52,421 - root - INFO - step: 611 loss: 0.2531 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:21:28,279 - root - INFO - step: 621 loss: 0.2405 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:22:04,130 - root - INFO - step: 631 loss: 0.2216 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:22:39,981 - root - INFO - step: 641 loss: 0.2065 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:22:57,943 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:23:15,891 - root - INFO - step: 651 loss: 0.1955 memory: 26.85GiB(41.97%) wps: 4,563 mfu: 10.59%
[rank0]:2024-03-26 19:23:51,762 - root - INFO - step: 661 loss: 0.1805 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:24:27,625 - root - INFO - step: 671 loss: 0.1703 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:25:03,493 - root - INFO - step: 681 loss: 0.1565 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:25:39,352 - root - INFO - step: 691 loss: 0.1462 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:25:57,299 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:26:16,693 - root - INFO - step: 701 loss: 0.1370 memory: 26.85GiB(41.97%) wps: 4,388 mfu: 10.18%
[rank0]:2024-03-26 19:26:52,546 - root - INFO - step: 711 loss: 0.1275 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:27:28,407 - root - INFO - step: 721 loss: 0.1209 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:28:04,267 - root - INFO - step: 731 loss: 0.1133 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:28:40,126 - root - INFO - step: 741 loss: 0.1089 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:28:58,065 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:29:16,029 - root - INFO - step: 751 loss: 0.1019 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:29:51,889 - root - INFO - step: 761 loss: 0.0948 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:30:27,746 - root - INFO - step: 771 loss: 0.0898 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:31:03,597 - root - INFO - step: 781 loss: 0.0864 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:31:39,449 - root - INFO - step: 791 loss: 0.0802 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:31:53,831 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:32:16,827 - root - INFO - step: 801 loss: 0.0754 memory: 26.85GiB(41.97%) wps: 4,384 mfu: 10.17%
[rank0]:2024-03-26 19:32:52,704 - root - INFO - step: 811 loss: 0.0700 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:33:28,556 - root - INFO - step: 821 loss: 0.0633 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:34:04,404 - root - INFO - step: 831 loss: 0.0613 memory: 26.85GiB(41.97%) wps: 4,571 mfu: 10.61%
[rank0]:2024-03-26 19:34:40,270 - root - INFO - step: 841 loss: 0.0570 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:34:54,649 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:35:16,170 - root - INFO - step: 851 loss: 0.0533 memory: 26.85GiB(41.97%) wps: 4,564 mfu: 10.59%
[rank0]:2024-03-26 19:35:52,037 - root - INFO - step: 861 loss: 0.0500 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:36:27,903 - root - INFO - step: 871 loss: 0.0459 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:37:03,775 - root - INFO - step: 881 loss: 0.0443 memory: 26.85GiB(41.97%) wps: 4,567 mfu: 10.60%
[rank0]:2024-03-26 19:37:39,643 - root - INFO - step: 891 loss: 0.0422 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:37:54,005 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:38:16,983 - root - INFO - step: 901 loss: 0.0396 memory: 26.85GiB(41.97%) wps: 4,388 mfu: 10.18%
[rank0]:2024-03-26 19:38:52,834 - root - INFO - step: 911 loss: 0.0380 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:39:28,690 - root - INFO - step: 921 loss: 0.0364 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.61%
[rank0]:2024-03-26 19:40:04,559 - root - INFO - step: 931 loss: 0.0351 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:40:40,412 - root - INFO - step: 941 loss: 0.0334 memory: 26.85GiB(41.97%) wps: 4,570 mfu: 10.61%
[rank0]:2024-03-26 19:40:54,765 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:41:16,326 - root - INFO - step: 951 loss: 0.0320 memory: 26.85GiB(41.97%) wps: 4,562 mfu: 10.59%
[rank0]:2024-03-26 19:41:52,188 - root - INFO - step: 961 loss: 0.0324 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:42:28,056 - root - INFO - step: 971 loss: 0.0314 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:43:03,927 - root - INFO - step: 981 loss: 0.0299 memory: 26.85GiB(41.97%) wps: 4,568 mfu: 10.60%
[rank0]:2024-03-26 19:43:39,788 - root - INFO - step: 991 loss: 0.0299 memory: 26.85GiB(41.97%) wps: 4,569 mfu: 10.60%
[rank0]:2024-03-26 19:43:50,584 - root - WARNING - Dataset alpaca is being re-looped. Loss related metrics might be misleading.
[rank0]:2024-03-26 19:44:12,171 - root - INFO - Saving a checkpoint at step 1000
[rank0]:[rank0]:[E326 19:44:37.537349911 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0] Received a global timeout from another rank and will start to dump the debug info. Last enqueued NCCL work: 57301, last completed NCCL work: 57301.
[rank0]:[rank0]:[E326 19:44:37.537555088 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[rank0]:[F326 19:44:37.686925669 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout on some other rank and notified current rank. This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. We tried our best to dump the debug info into the storage to help you debug the issue.
[rank0]:Fatal Python error: Aborted
[rank0]:
[rank0]:Thread 0x00007f01c17fa640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c0ff9640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c1ffb640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c27fc640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c2ffd640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c37fe640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f01c3fff640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f070cdfc640 (most recent call first):
[rank0]: <no Python frame>
[rank0]:
[rank0]:Thread 0x00007f07a6ffd640 (most recent call first):
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/threading.py", line 324 in wait
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/queue.py", line 180 in get
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/threading.py", line 995 in _bootstrap
[rank0]:
[rank0]:Thread 0x00007f13c65f5400 (most recent call first):
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3230 in scatter
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75 in wrapper
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2770 in scatter_object_list
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75 in wrapper
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 133 in scatter_object
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 188 in reduce_scatter
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 274 in _save_state_dict
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 145 in save
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 427 in inner_func
[rank0]: File "/home/gchauhan/meta/torchtrain/torchtrain/checkpoint.py", line 114 in save
[rank0]: File "/home/gchauhan/meta/torchtrain/train.py", line 368 in main
[rank0]: File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
[rank0]: File "/home/gchauhan/meta/torchtrain/train.py", line 389 in <module>
[rank0]:
[rank0]:Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, zstandard.backend_c, charset_normalizer.md, yaml._yaml, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, google._upb._message, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils (total: 111)
[W326 19:44:39.322816983 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:40.538868957 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
W0326 19:44:41.794000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161714 closing signal SIGTERM
W0326 19:44:41.795000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161715 closing signal SIGTERM
W0326 19:44:41.798000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161716 closing signal SIGTERM
W0326 19:44:41.801000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161717 closing signal SIGTERM
W0326 19:44:41.803000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161718 closing signal SIGTERM
W0326 19:44:41.805000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 161722 closing signal SIGTERM
[W326 19:44:42.390774369 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.416323971 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.429864504 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.437377152 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.440171941 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W326 19:44:42.440343134 Module.cpp:168] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
E0326 19:44:42.186000 140707157537792 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 161713) of binary: /home/gchauhan/my_envs/llm-amd/bin/python
Traceback (most recent call last):
  File "/home/gchauhan/my_envs/llm-amd/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gchauhan/my_envs/llm-amd/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-26_19:44:41
  host      : devgpu002.snc8.facebook.com
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 161719)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 161719
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-26_19:44:41
  host      : devgpu002.snc8.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 161713)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 161713
=======================================================
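Note on the failure: the rank-0 stack trace above is blocked in scatter_object_list, called from torch.distributed.checkpoint's _save_state_dict while "Saving a checkpoint at step 1000". That is consistent with the ProcessGroup timeout having been reduced to 5 seconds after step 1: the watchdog on another rank reported a collective timeout during the save, and every rank was then aborted with SIGABRT (exit code -6). A minimal sketch of the same call path, with illustrative paths and a plain state dict (the real torchtrain/checkpoint.py wraps FSDP state-dict handling that is not shown here):

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

def save_checkpoint(model, optimizer, step, folder="./outputs1b/checkpoint"):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    # dcp.save coordinates a save plan across ranks (the scatter_object_list
    # seen in the traceback) before each rank writes its shards; all of that
    # coordination runs on the default process group and is therefore bound
    # by the 5-second timeout set after step 1.
    dcp.save(state_dict, storage_writer=FileSystemWriter(f"{folder}/step-{step}"))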