Created
April 19, 2022 06:06
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : fairseq_train_wrapped
min_nodes : 2
max_nodes : 2
nproc_per_node : 8
run_id : foobar
rdzv_backend : c10d
rdzv_endpoint : 10.0.0.115:29500
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu | |
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python | |
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group | |
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result: | |
restart_count=0 | |
master_addr=ip-10-0-0-115.us-west-2.compute.internal | |
master_port=38441 | |
group_rank=0 | |
group_world_size=2 | |
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16] | |
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16] | |
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/0/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/1/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/2/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/3/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/4/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/5/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/6/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_ld8yse9z/foobar_xotrmbnu/attempt_0/7/error.json | |
[0]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[1]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[2]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[3]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[4]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[5]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[7]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[6]:2022-04-19 05:52:27 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[0]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 0): env:// | |
[1]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 1): env:// | |
[4]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 4): env:// | |
[4]:2022-04-19 05:52:28 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4 | |
[2]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 2): env:// | |
[7]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 7): env:// | |
[3]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 3): env:// | |
[5]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 5): env:// | |
[6]:2022-04-19 05:52:28 | INFO | fairseq.distributed.utils | distributed init (rank 6): env:// | |
[6]:2022-04-19 05:52:28 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6 | |
[0]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0 | |
[0]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[0]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 0 | |
[1]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1 | |
[1]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[1]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 1 | |
[4]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[4]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 4 | |
[2]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2 | |
[2]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[2]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 2 | |
[3]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3 | |
[3]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[3]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 3 | |
[5]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5 | |
[5]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[5]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 5 | |
[7]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7 | |
[7]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[7]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 7 | |
[6]:2022-04-19 05:52:29 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[6]:2022-04-19 05:52:29 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 6 | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Selected Provider is efa | |
[4]:ip-10-0-0-115:78:78 [4] NCCL INFO Using network AWS Libfabric | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa | |
[0]:ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric | |
[0]:NCCL version 2.10.3+cuda11.3 | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Selected Provider is efa | |
[1]:ip-10-0-0-115:75:75 [1] NCCL INFO Using network AWS Libfabric | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Selected Provider is efa | |
[2]:ip-10-0-0-115:76:76 [2] NCCL INFO Using network AWS Libfabric | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Selected Provider is efa | |
[7]:ip-10-0-0-115:81:81 [7] NCCL INFO Using network AWS Libfabric | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Selected Provider is efa | |
[3]:ip-10-0-0-115:77:77 [3] NCCL INFO Using network AWS Libfabric | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Selected Provider is efa | |
[5]:ip-10-0-0-115:79:79 [5] NCCL INFO Using network AWS Libfabric | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Selected Provider is efa | |
[6]:ip-10-0-0-115:80:80 [6] NCCL INFO Using network AWS Libfabric | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[7]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[2]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[1]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[4]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[3]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
[5]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993 | |
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76 closing signal SIGTERM | |
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 78 closing signal SIGTERM | |
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79 closing signal SIGTERM | |
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 80 closing signal SIGTERM | |
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 81 closing signal SIGTERM | |
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 74) of binary: /usr/bin/python | |
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish | |
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.22238445281982422 seconds | |
Traceback (most recent call last): | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier | |
store_util.barrier( | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier | |
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout) | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize | |
agent_data = get_all(store, key_prefix, world_size) | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all | |
data = store.get(f"{prefix}{idx}") | |
RuntimeError: Stop_waiting response is expected | |
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'ip-10-0-0-115.us-west-2.compute.internal_1_0' has failed to send a keep-alive heartbeat to the rendezvous 'foobar' due to an error of type RendezvousStateError. | |
Traceback (most recent call last): | |
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main | |
return _run_code(code, main_globals, None, | |
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code | |
exec(code, run_globals) | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module> | |
main() | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main | |
launch(args) | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch | |
run(args) | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run | |
elastic_launch( | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__ | |
return launch_agent(self._config, self._entrypoint, list(args)) | |
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent | |
raise ChildFailedError( | |
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
fairseq_train_wrapped FAILED
------------------------------------------------------
Failures:
[1]:
time : 2022-04-19_05:52:40
host : ip-10-0-0-115.us-west-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 75)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 75
[2]:
time : 2022-04-19_05:52:40
host : ip-10-0-0-115.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 77)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 77
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-04-19_05:52:40
host : ip-10-0-0-115.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 74)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 74
======================================================
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
[7]:ip-10-0-0-115:81:153 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
[5]:ip-10-0-0-115:79:149 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
[6]:ip-10-0-0-115:80:150 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
[4]:ip-10-0-0-115:78:152 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC | |
[1]:ip-10-0-0-115:75:155 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC | |
[2]:ip-10-0-0-115:76:154 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC | |
[0]:ip-10-0-0-115:74:148 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC | |
[3]:ip-10-0-0-115:77:151 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC |
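
The FutureWarning at the top of this log asks scripts to read the local rank from `os.environ['LOCAL_RANK']` rather than from a `--local_rank` argument. A minimal sketch of that change follows. The helper name `resolve_local_rank`, the fallback to the old flag, and the default of 0 are assumptions made for illustration; they are not taken from fairseq or from the script that produced this log.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    Hypothetical helper: falls back to a legacy --local_rank flag, then to 0.
    """
    environ = os.environ if environ is None else environ

    # Tolerate the legacy flag so the script still starts under older launchers.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    # The environment variable wins when both sources are present.
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    # Assumed default for a single-process run with no launcher.
    return 0
```

Calling `resolve_local_rank()` with no arguments reads `sys.argv` and `os.environ`, so the same script can be started with either launcher while it is being migrated.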