/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: | |
entrypoint : fairseq_train_wrapped | |
min_nodes : 2 | |
max_nodes : 2 | |
nproc_per_node : 8 | |
run_id : foobar | |
rdzv_backend : c10d | |
rdzv_endpoint : 10.0.0.213:29500 | |
rdzv_configs : {'timeout': 900} | |
max_restarts : 0 | |
monitor_interval : 5 | |
log_dir : None | |
metrics_cfg : {} | |
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_ | |
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python | |
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group | |
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result: | |
restart_count=0 | |
master_addr=ip-10-0-0-175.us-west-2.compute.internal | |
master_port=39177 | |
group_rank=0 | |
group_world_size=2 | |
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16] | |
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16] | |
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/0/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/1/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/2/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/3/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/4/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/5/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/6/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/7/error.json | |
[1]:2022-04-18 23:27:56 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[7]:2022-04-18 23:27:56 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[2]:2022-04-18 23:27:56 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[3]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[0]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[5]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[4]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[6]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
[1]:2022-04-18 23:28:01 | INFO | fairseq.distributed.utils | distributed init (rank 1): env:// | |
[7]:2022-04-18 23:28:01 | INFO | fairseq.distributed.utils | distributed init (rank 7): env:// | |
[2]:2022-04-18 23:28:01 | INFO | fairseq.distributed.utils | distributed init (rank 2): env:// | |
[5]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 5): env:// | |
[3]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 3): env:// | |
[4]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 4): env:// | |
[0]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 0): env:// | |
[6]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 6): env:// | |
[6]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6 | |
[1]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1 | |
[7]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7 | |
[2]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2 | |
[5]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5 | |
[3]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3 | |
[3]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[3]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 3 | |
[7]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[7]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 7 | |
[4]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4 | |
[4]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[4]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 4 | |
[0]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0 | |
[0]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[0]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 0 | |
[5]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[5]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 5 | |
[6]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[6]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 6 | |
[2]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[2]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 2 | |
[1]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
[1]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 1 | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Selected Provider is efa | |
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO Using network AWS Libfabric | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Selected Provider is efa | |
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO Using network AWS Libfabric | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric | |
[0]:NCCL version 2.10.3+cuda11.3 | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Selected Provider is efa | |
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO Using network AWS Libfabric | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Selected Provider is efa | |
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO Using network AWS Libfabric | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Selected Provider is efa | |
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO Using network AWS Libfabric | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Selected Provider is efa | |
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO Using network AWS Libfabric | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0> | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Selected Provider is efa | |
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO Using network AWS Libfabric | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Connected all rings | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Connected all rings | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Connected all rings | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Connected all rings | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Connected all rings | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Connected all rings | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Connected all rings | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Connected all rings | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Connected all trees | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160] | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170] | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0] | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Connected all trees | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0] | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0] | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190] | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Connected all trees | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170] | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0] | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0] | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Connected all trees | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0] | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0] | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Connected all trees | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0] | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170] | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170] | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Connected all trees | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0] | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0] | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180] | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Connected all trees | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160] | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170] | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0] | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Connected all trees | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160] | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0] | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190] | |
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO comm 0x7f7244002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE | |
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO comm 0x7f2b7c002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE | |
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO comm 0x7f8d04002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE | |
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Launch mode Parallel | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160] | |
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO comm 0x7fc8d0002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE | |
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO comm 0x7f36f4002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE | |
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO comm 0x7fb71c002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE | |
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO comm 0x7fc958002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE | |
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO comm 0x7f39ec002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE | |
[0]:2022-04-18 23:28:10 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 16, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': 
False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 2048, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 2048, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 
'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm', 'activation_fn': relu, 'dropout': 0.1, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 512, 'decoder_output_dim': 512, 'decoder_input_dim': 512, 'decoder_ffn_embed_dim': 2048, 'decoder_layers': 6, 'decoder_attention_heads': 8, 'decoder_normalize_before': False, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 
'character_embeddings': False, 'character_filters': '[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': False, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 1, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False, 'add_bos_token': False, 'tokens_per_sample': 512, 'max_target_positions': None, 'tpu': False}, 'task': {'_name': 'language_modeling', 'data': '/job/fairseq/data-bin/wikitext-103', 'sample_break_mode': none, 'tokens_per_sample': 512, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': none, 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 
'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}} | |
[0]:2022-04-18 23:28:11 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types | |
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | TransformerLanguageModel( | |
[0]:  (decoder): TransformerDecoder( | |
[0]:    (dropout_module): FairseqDropout() | |
[0]:    (embed_tokens): Embedding(267744, 512, padding_idx=1) | |
[0]:    (embed_positions): SinusoidalPositionalEmbedding() | |
[0]:    (layers): ModuleList( | |
[0]:      (0): TransformerDecoderLayerBase( | |
[0]:        (dropout_module): FairseqDropout() | |
[0]:        (self_attn): MultiheadAttention( | |
[0]:          (dropout_module): FairseqDropout() | |
[0]:          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:        ) | |
[0]:        (activation_dropout_module): FairseqDropout() | |
[0]:        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
[0]:        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
[0]:        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:      ) | |
[0]:      (1): TransformerDecoderLayerBase( | |
[0]:        (dropout_module): FairseqDropout() | |
[0]:        (self_attn): MultiheadAttention( | |
[0]:          (dropout_module): FairseqDropout() | |
[0]:          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:        ) | |
[0]:        (activation_dropout_module): FairseqDropout() | |
[0]:        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
[0]:        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
[0]:        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:      ) | |
[0]:      (2): TransformerDecoderLayerBase( | |
[0]:        (dropout_module): FairseqDropout() | |
[0]:        (self_attn): MultiheadAttention( | |
[0]:          (dropout_module): FairseqDropout() | |
[0]:          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:        ) | |
[0]:        (activation_dropout_module): FairseqDropout() | |
[0]:        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
[0]:        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
[0]:        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:      ) | |
[0]:      (3): TransformerDecoderLayerBase( | |
[0]:        (dropout_module): FairseqDropout() | |
[0]:        (self_attn): MultiheadAttention( | |
[0]:          (dropout_module): FairseqDropout() | |
[0]:          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:        ) | |
[0]:        (activation_dropout_module): FairseqDropout() | |
[0]:        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
[0]:        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
[0]:        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:      ) | |
[0]:      (4): TransformerDecoderLayerBase( | |
[0]:        (dropout_module): FairseqDropout() | |
[0]:        (self_attn): MultiheadAttention( | |
[0]:          (dropout_module): FairseqDropout() | |
[0]:          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:        ) | |
[0]:        (activation_dropout_module): FairseqDropout() | |
[0]:        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
[0]:        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
[0]:        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:      ) | |
[0]:      (5): TransformerDecoderLayerBase( | |
[0]:        (dropout_module): FairseqDropout() | |
[0]:        (self_attn): MultiheadAttention( | |
[0]:          (dropout_module): FairseqDropout() | |
[0]:          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
[0]:        ) | |
[0]:        (activation_dropout_module): FairseqDropout() | |
[0]:        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
[0]:        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
[0]:        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
[0]:      ) | |
[0]:    ) | |
[0]:    (output_projection): Linear(in_features=512, out_features=267744, bias=False) | |
[0]:  ) | |
[0]:) | |
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | task: LanguageModelingTask | |
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | model: TransformerLanguageModel | |
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion | |
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | num. shared model params: 155,999,232 (num. trained: 155,999,232) | |
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0) | |
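The shared-parameter count reported above (155,999,232) can be reproduced from the layer shapes in the printed module tree. The following is a minimal sketch, not fairseq's own counting code. It rests on two assumptions: that `SinusoidalPositionalEmbedding` contributes no trainable parameters, and that `output_projection` reuses the `embed_tokens` weight, which is consistent with `'share_decoder_input_output_embed': True` in the config dump. The helper names `linear_params` and `layer_norm_params` are hypothetical and exist only for this illustration.

```python
# Sketch: recompute the parameter count from the shapes printed in the log.
# Helper names are illustrative only; they are not part of fairseq or PyTorch.

def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights (n_in * n_out) plus an optional bias vector (n_out)."""
    return n_in * n_out + (n_out if bias else 0)

def layer_norm_params(dim: int) -> int:
    """Elementwise-affine layer norm: one scale and one shift per feature."""
    return 2 * dim

VOCAB, DIM, FFN, LAYERS = 267744, 512, 2048, 6  # values taken from the log

# Token embedding; assumed to be shared with output_projection (bias=False),
# so the output projection adds nothing on top of it.
embed = VOCAB * DIM

# One decoder layer: k/v/q/out projections, fc1 + fc2, and two layer norms.
attn = 4 * linear_params(DIM, DIM)
ffn = linear_params(DIM, FFN) + linear_params(FFN, DIM)
norms = 2 * layer_norm_params(DIM)
per_layer = attn + ffn + norms

total = embed + LAYERS * per_layer
print(f"embedding: {embed:,}")
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the total matches the figure in the log, and roughly 88% of the parameters sit in the 267,744-entry embedding table rather than in the six decoder layers.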
[0]:2022-04-18 23:28:16 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: /job/fairseq/data-bin/wikitext-103/valid | |
[0]:2022-04-18 23:28:16 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[0]:2022-04-18 23:28:16 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes. | |
[0]:2022-04-18 23:28:16 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6 | |
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7 | |
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4 | |