@yukunlin
Created April 19, 2022 06:06
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : fairseq_train_wrapped
min_nodes : 2
max_nodes : 2
nproc_per_node : 8
run_id : foobar
rdzv_backend : c10d
rdzv_endpoint : 10.0.0.115:29500
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
restart_count=0
master_addr=ip-10-0-0-115.us-west-2.compute.internal
master_port=38441
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/7/error.json
[0]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[1]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[2]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[3]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[4]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[5]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[7]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[6]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[0]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
[1]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
[4]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 4): env://
[4]:2022-04-19 05:52:28 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
[2]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
[7]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 7): env://
[3]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
[5]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 5): env://
[6]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 6): env://
[6]:2022-04-19 05:52:28 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
[0]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
[0]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[0]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 0
[1]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
[1]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[1]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 1
[4]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[4]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 4
[2]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
[2]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[2]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 2
[3]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
[3]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[3]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 3
[5]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
[5]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[5]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 5
[7]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
[7]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[7]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 7
[6]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[6]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 6
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Selected Provider is efa
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO Using network AWS Libfabric
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Selected Provider is efa
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO Using network AWS Libfabric
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Selected Provider is efa
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO Using network AWS Libfabric
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Selected Provider is efa
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO Using network AWS Libfabric
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Selected Provider is efa
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO Using network AWS Libfabric
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Selected Provider is efa
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO Using network AWS Libfabric
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Selected Provider is efa
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO Using network AWS Libfabric
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[7]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[2]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[1]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[4]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[3]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
[5]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 78 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 80 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 81 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 74) of binary: /usr/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.22238445281982422 seconds
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
store_util.barrier(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
agent_data = get_all(store, key_prefix, world_size)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
data = store.get(f"{prefix}{idx}")
RuntimeError: Stop_waiting response is expected
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'ip-10-0-0-115.us-west-2.compute.internal_1_0' has failed to send a keep-alive heartbeat to the rendezvous 'foobar' due to an error of type RendezvousStateError.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
fairseq_train_wrapped FAILED
------------------------------------------------------
Failures:
[1]:
time : 2022-04-19_05:52:40
host : ip-10-0-0-115.us-west-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 75)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 75
[2]:
time : 2022-04-19_05:52:40
host : ip-10-0-0-115.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 77)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 77
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-04-19_05:52:40
host : ip-10-0-0-115.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 74)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 74
======================================================
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC