@ethanabrooks
Created June 21, 2024 19:39
NCCL error: collective operation watchdog timeout (ALLREDUCE / _ALLGATHER_BASE) across ranks during distributed training
0%| | 0/4484 [00:00<?, ?it/s][rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600157 milliseconds before timing out.
[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f03c09a3897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f03c1c7cc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f03c1c81a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f03c1c82dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f040d736e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f040e8f2ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f040e6c3a6f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f03c09a3897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f03c1c7cc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f03c1c81a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f03c1c82dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f040d736e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f040e8f2ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f040e6c3a6f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f03c09a3897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f03c1906119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f040d736e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x7ea7 (0x7f040e8f2ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7f040e6c3a6f in /lib/x86_64-linux-gnu/libc.so.6)
[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679.
[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc466710897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc4679e9c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc4679eea80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc4679efdcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fc4b34a3e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7fc4b465fea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7fc4b4430a6f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc466710897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc4679e9c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc4679eea80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc4679efdcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fc4b34a3e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7fc4b465fea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7fc4b4430a6f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc466710897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7fc467673119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7fc4b34a3e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x7ea7 (0x7fc4b465fea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7fc4b4430a6f in /lib/x86_64-linux-gnu/libc.so.6)
[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 5] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679.
[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f3baa5897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3cd7ec62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f3cd83a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3cd84dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f8f88838e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f8f899f4ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f8f897c5a6f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f3baa5897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3cd7ec62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f3cd83a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3cd84dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f8f88838e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f8f899f4ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f8f897c5a6f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f3baa5897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f8f3ca08119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f8f88838e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x7ea7 (0x7f8f899f4ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7f8f897c5a6f in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7b412897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4a7c6ebc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a7c6f0a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4a7c6f1dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f4ac81a5e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f4ac9361ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f4ac9132a6f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7b412897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4a7c6ebc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a7c6f0a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4a7c6f1dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f4ac81a5e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f4ac9361ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f4ac9132a6f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7b412897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f4a7c375119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f4ac81a5e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x7ea7 (0x7f4ac9361ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7f4ac9132a6f in /lib/x86_64-linux-gnu/libc.so.6)
[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 4] Timeout at NCCL work: 680, last enqueued NCCL work: 682, last completed NCCL work: 679.
[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c1cd8897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f59c2fb1c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f59c2fb6a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c2fb7dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f5a0ea6be95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f5a0fc27ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5a0f9f8a6f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c1cd8897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f59c2fb1c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f59c2fb6a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c2fb7dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f5a0ea6be95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f5a0fc27ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5a0f9f8a6f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c1cd8897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f59c2c3b119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f5a0ea6be95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x7ea7 (0x7f5a0fc27ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7f5a0f9f8a6f in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 678, last enqueued NCCL work: 680, last completed NCCL work: 677.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600157 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92cae6a897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f92cc143c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f92cc148a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f92cc149dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f9317bfde95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f9318db9ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f9318b8aa6f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600157 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92cae6a897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f92cc143c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f92cc148a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f92cc149dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f9317bfde95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea7 (0x7f9318db9ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f9318b8aa6f in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92cae6a897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f92cbdcd119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f9317bfde95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x7ea7 (0x7f9318db9ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x3f (0x7f9318b8aa6f in /lib/x86_64-linux-gnu/libc.so.6)
W0621 19:37:31.172000 140614666856256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1121510 closing signal SIGTERM
W0621 19:37:31.172000 140614666856256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1121517 closing signal SIGTERM
E0621 19:37:32.302000 140614666856256 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 1121511) of binary: /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/bin/python
Traceback (most recent call last):
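A note on the log above: most ranks timed out on ALLREDUCE SeqNum=680 while rank 3 timed out on _ALLGATHER_BASE SeqNum=678, which suggests the ranks desynchronized (one rank fell behind or hung) rather than a plain network failure. All timeouts hit the 600000 ms (10 minute) default. A minimal, hedged sketch of one common mitigation, raising the watchdog timeout when a legitimately long step (e.g. checkpointing or data loading on one rank) is expected; `init_process_group` does accept a `timeout` argument, but the 30-minute value here is an illustrative choice, not a recommendation from this log:

```python
from datetime import timedelta

# The log shows Timeout(ms)=600000, i.e. torch's 10-minute default.
# Raising it only masks genuine desyncs, so pair it with NCCL_DEBUG=INFO
# when investigating; this value is an assumption for illustration.
NCCL_TIMEOUT = timedelta(minutes=30)


def init_distributed(timeout: timedelta = NCCL_TIMEOUT) -> None:
    """Initialize the default process group with a longer NCCL watchdog timeout."""
    import torch.distributed as dist  # deferred so the constant is importable without torch

    dist.init_process_group(backend="nccl", timeout=timeout)
```

Setting `NCCL_DEBUG=INFO` (and, on recent PyTorch, `TORCH_NCCL_TRACE_BUFFER_SIZE` style flight-recorder options) in the environment before launch can also help identify which rank stalled first.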