@yukunlin
Created April 18, 2022 23:54
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : fairseq_train_wrapped
min_nodes : 2
max_nodes : 2
nproc_per_node : 8
run_id : foobar
rdzv_backend : c10d
rdzv_endpoint : 10.0.0.213:29500
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
restart_count=0
master_addr=ip-10-0-0-175.us-west-2.compute.internal
master_port=39177
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_kj8v2v6p/foobar_rfbse5k_/attempt_0/7/error.json
[1]:2022-04-18 23:27:56 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[7]:2022-04-18 23:27:56 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[2]:2022-04-18 23:27:56 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[3]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[0]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[5]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[4]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[6]:2022-04-18 23:27:57 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[1]:2022-04-18 23:28:01 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
[7]:2022-04-18 23:28:01 | INFO | fairseq.distributed.utils | distributed init (rank 7): env://
[2]:2022-04-18 23:28:01 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
[5]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 5): env://
[3]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
[4]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 4): env://
[0]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
[6]:2022-04-18 23:28:02 | INFO | fairseq.distributed.utils | distributed init (rank 6): env://
[6]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
[1]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
[7]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
[2]:2022-04-18 23:28:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
[5]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
[3]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
[3]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[3]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 3
[7]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[7]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 7
[4]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
[4]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[4]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 4
[0]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
[0]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[0]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 0
[5]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[5]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 5
[6]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[6]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 6
[2]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[2]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 2
[1]:2022-04-18 23:28:03 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[1]:2022-04-18 23:28:03 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 1
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO NET/OFI Selected Provider is efa
[7]:ip-10-0-0-175:23:23 [7] NCCL INFO Using network AWS Libfabric
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO NET/OFI Selected Provider is efa
[4]:ip-10-0-0-175:20:20 [4] NCCL INFO Using network AWS Libfabric
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO NET/OFI Selected Provider is efa
[5]:ip-10-0-0-175:21:21 [5] NCCL INFO Using network AWS Libfabric
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO NET/OFI Selected Provider is efa
[6]:ip-10-0-0-175:22:22 [6] NCCL INFO Using network AWS Libfabric
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO NET/OFI Selected Provider is efa
[2]:ip-10-0-0-175:18:18 [2] NCCL INFO Using network AWS Libfabric
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO NET/OFI Selected Provider is efa
[1]:ip-10-0-0-175:17:17 [1] NCCL INFO Using network AWS Libfabric
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO NET/OFI Selected Provider is efa
[3]:ip-10-0-0-175:19:19 [3] NCCL INFO Using network AWS Libfabric
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Connected all rings
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Connected all rings
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Connected all rings
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Connected all rings
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Connected all rings
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Connected all rings
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Connected all rings
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Connected all rings
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Connected all trees
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Connected all trees
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Connected all trees
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Connected all trees
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Connected all trees
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Connected all trees
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Connected all trees
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Connected all trees
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
[3]:ip-10-0-0-175:19:97 [3] NCCL INFO comm 0x7f7244002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
[7]:ip-10-0-0-175:23:95 [7] NCCL INFO comm 0x7f2b7c002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
[0]:ip-10-0-0-175:16:90 [0] NCCL INFO comm 0x7f8d04002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Launch mode Parallel
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
[4]:ip-10-0-0-175:20:93 [4] NCCL INFO comm 0x7fc8d0002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
[5]:ip-10-0-0-175:21:94 [5] NCCL INFO comm 0x7f36f4002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
[6]:ip-10-0-0-175:22:96 [6] NCCL INFO comm 0x7fb71c002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
[2]:ip-10-0-0-175:18:91 [2] NCCL INFO comm 0x7fc958002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
[1]:ip-10-0-0-175:17:92 [1] NCCL INFO comm 0x7f39ec002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
[0]:2022-04-18 23:28:10 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 16, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': 
False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 2048, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 2048, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 
'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm', 'activation_fn': relu, 'dropout': 0.1, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 512, 'decoder_output_dim': 512, 'decoder_input_dim': 512, 'decoder_ffn_embed_dim': 2048, 'decoder_layers': 6, 'decoder_attention_heads': 8, 'decoder_normalize_before': False, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 
'character_embeddings': False, 'character_filters': '[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': False, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 1, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False, 'add_bos_token': False, 'tokens_per_sample': 512, 'max_target_positions': None, 'tpu': False}, 'task': {'_name': 'language_modeling', 'data': '/job/fairseq/data-bin/wikitext-103', 'sample_break_mode': none, 'tokens_per_sample': 512, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': none, 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 
'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
[0]:2022-04-18 23:28:11 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | TransformerLanguageModel(
[0]: (decoder): TransformerDecoder(
[0]: (dropout_module): FairseqDropout()
[0]: (embed_tokens): Embedding(267744, 512, padding_idx=1)
[0]: (embed_positions): SinusoidalPositionalEmbedding()
[0]: (layers): ModuleList(
[0]: (0): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (1): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (2): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (3): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (4): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (5): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: )
[0]: (output_projection): Linear(in_features=512, out_features=267744, bias=False)
[0]: )
[0]:)
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | task: LanguageModelingTask
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | model: TransformerLanguageModel
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | num. shared model params: 155,999,232 (num. trained: 155,999,232)
[0]:2022-04-18 23:28:16 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
[0]:2022-04-18 23:28:16 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: /job/fairseq/data-bin/wikitext-103/valid
[0]:2022-04-18 23:28:16 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[0]:2022-04-18 23:28:16 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes.
[0]:2022-04-18 23:28:16 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Connected all rings
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Connected all rings
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Connected all rings
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Connected all rings
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Connected all rings
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Connected all rings
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Connected all rings
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Connected all rings
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Connected all trees
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Connected all trees
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Connected all trees
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Connected all trees
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
[7]:ip-10-0-0-175:23:135 [7] NCCL INFO comm 0x7f2b20002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:21:133 [5] NCCL INFO comm 0x7f3694002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Connected all trees
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
[0]:ip-10-0-0-175:16:131 [0] NCCL INFO comm 0x7f8cb0002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Launch mode Parallel
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Connected all trees
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
[6]:ip-10-0-0-175:22:132 [6] NCCL INFO comm 0x7fb6bc002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Connected all trees
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
[2]:ip-10-0-0-175:18:138 [2] NCCL INFO comm 0x7fc8f8002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Connected all trees
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
[4]:ip-10-0-0-175:20:136 [4] NCCL INFO comm 0x7fc878002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
[1]:ip-10-0-0-175:17:134 [1] NCCL INFO comm 0x7f3998002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
[3]:ip-10-0-0-175:19:137 [3] NCCL INFO comm 0x7f71e8002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:28:20 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
[0]:2022-04-18 23:28:20 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
[0]:2022-04-18 23:28:20 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None
[0]:2022-04-18 23:28:20 | INFO | fairseq.trainer | Preparing to load checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2/checkpoint_last.pt
[0]:2022-04-18 23:28:31 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
[0]:2022-04-18 23:28:31 | INFO | fairseq.optim.adam | using FusedAdam
[0]:2022-04-18 23:28:33 | INFO | fairseq.trainer | Loaded checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2/checkpoint_last.pt (epoch 8 @ 22057 updates)
[0]:2022-04-18 23:28:33 | INFO | fairseq.trainer | loading train data for epoch 8
[0]:2022-04-18 23:28:34 | INFO | fairseq.data.data_utils | loaded 1,801,350 examples from: /job/fairseq/data-bin/wikitext-103/train
[0]:2022-04-18 23:28:34 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151
[0]:2022-04-18 23:28:34 | INFO | fairseq.trainer | begin training epoch 8
[0]:2022-04-18 23:28:34 | INFO | fairseq_cli.train | Start iterating over samples
[0]:2022-04-18 23:28:36 | INFO | root | Reducer buckets have been rebuilt in this iteration.
[0]:2022-04-18 23:28:56 | INFO | train_inner | epoch 008: 43 / 3151 loss=5.215, ppl=37.15, wps=66122.9, ups=2.02, wpb=32768, bsz=64, num_updates=22100, lr=0.000212718, gnorm=0.682, train_wall=22, gb_free=20.6, wall=36
[0]:2022-04-18 23:29:46 | INFO | train_inner | epoch 008: 143 / 3151 loss=5.236, ppl=37.69, wps=66408.7, ups=2.03, wpb=32768, bsz=64, num_updates=22200, lr=0.000212238, gnorm=0.679, train_wall=49, gb_free=20.6, wall=86
[0]:2022-04-18 23:30:35 | INFO | train_inner | epoch 008: 243 / 3151 loss=5.24, ppl=37.8, wps=65885.3, ups=2.01, wpb=32768, bsz=64, num_updates=22300, lr=0.000211762, gnorm=0.645, train_wall=49, gb_free=20.6, wall=135
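
The `train_inner` lines above follow a regular `key=value` layout, so they can be pulled into structured records for plotting or comparison across runs. The following is a minimal sketch, not part of fairseq: the helper name `parse_train_inner` and the regular expression are hypothetical, and the check that `ppl` is roughly `2 ** loss` is an assumption that happens to agree with the three lines logged here.

```python
import re

# Hypothetical pattern for the "epoch NNN: step / total key=value, ..." tail
# of a fairseq train_inner line; it is inferred from the log above only.
LINE_RE = re.compile(r"epoch (\d+):\s+(\d+) / (\d+)\s+(.*)$")


def parse_train_inner(line):
    """Return epoch, step, total and all key=value metrics, or None."""
    if "| train_inner |" not in line:
        return None
    match = LINE_RE.search(line)
    if match is None:
        return None
    epoch, step, total, rest = match.groups()
    metrics = {}
    for pair in rest.split(", "):
        key, _, value = pair.partition("=")
        metrics[key.strip()] = float(value)
    return {"epoch": int(epoch), "step": int(step), "total": int(total), **metrics}


# First train_inner line from the log above, copied verbatim.
sample = (
    "[0]:2022-04-18 23:28:56 | INFO | train_inner | epoch 008: 43 / 3151 "
    "loss=5.215, ppl=37.15, wps=66122.9, ups=2.02, wpb=32768, bsz=64, "
    "num_updates=22100, lr=0.000212718, gnorm=0.682, train_wall=22, "
    "gb_free=20.6, wall=36"
)

record = parse_train_inner(sample)
# Assumption: loss is reported in base 2, so 2 ** loss should be close to ppl.
print(record["num_updates"], record["loss"], record["ppl"], 2 ** record["loss"])
```

Lines from other loggers (NCCL, `fairseq.utils`, and so on) return `None`, so the helper can be mapped over the whole log and the `None` results filtered out.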