@kiukchung
Created April 6, 2022 17:40
Bare metal (NCCL_ALGO=TREE, NCCL_PROTO=simple)
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : /home/ubuntu/anaconda3/envs/pytorch_p38/bin/fairseq-train
min_nodes : 2
max_nodes : 2
nproc_per_node : 8
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.0.0.163:12345
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_hebha6nc/none_rp5ooft0
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.0.0.163
master_port=12345
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/7/error.json
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 6): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 1): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 2): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 7): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 3): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 4): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 0): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 5): env://
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 0
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 5
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 2
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 3
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 6
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 1
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 7
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 4
ip-10-0-0-163:35459:35459 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35459:35459 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35459:35459 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35459:35459 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35459:35459 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.1
ip-10-0-0-163:35461:35461 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35466:35466 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35464:35464 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35462:35462 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35460:35460 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35464:35464 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35466:35466 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35462:35462 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35461:35461 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35460:35460 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35462:35462 [3] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35461:35461 [2] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35464:35464 [5] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35466:35466 [7] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35464:35464 [5] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35461:35461 [2] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35466:35466 [7] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35462:35462 [3] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35466:35466 [7] NCCL INFO Using network Socket
ip-10-0-0-163:35464:35464 [5] NCCL INFO Using network Socket
ip-10-0-0-163:35461:35461 [2] NCCL INFO Using network Socket
ip-10-0-0-163:35462:35462 [3] NCCL INFO Using network Socket
ip-10-0-0-163:35460:35460 [1] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35460:35460 [1] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35460:35460 [1] NCCL INFO Using network Socket
ip-10-0-0-163:35465:35465 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35465:35465 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35465:35465 [6] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35465:35465 [6] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35465:35465 [6] NCCL INFO Using network Socket
ip-10-0-0-163:35463:35463 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0>
ip-10-0-0-163:35463:35463 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol.
ip-10-0-0-163:35463:35463 [4] NCCL INFO NET/IB : No device found.
ip-10-0-0-163:35463:35463 [4] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0>
ip-10-0-0-163:35463:35463 [4] NCCL INFO Using network Socket
ip-10-0-0-163:35463:35585 [4] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35466:35579 [7] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35464:35580 [5] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35459:35578 [0] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35461:35582 [2] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35462:35581 [3] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35460:35583 [1] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35465:35584 [6] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35459:35578 [0] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35459:35578 [0] NCCL INFO ==========================================
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35459:35578 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35459:35578 [0] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35459:35578 [0] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35463:35585 [4] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35463:35585 [4] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35463:35585 [4] NCCL INFO ==========================================
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35463:35585 [4] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35463:35585 [4] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35463:35585 [4] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35463:35585 [4] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35463:35585 [4] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35463:35585 [4] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60
ip-10-0-0-163:35466:35579 [7] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35466:35579 [7] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35466:35579 [7] NCCL INFO ==========================================
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35466:35579 [7] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35466:35579 [7] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35466:35579 [7] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35466:35579 [7] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35466:35579 [7] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35466:35579 [7] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35464:35580 [5] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60
ip-10-0-0-163:35464:35580 [5] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35464:35580 [5] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35464:35580 [5] NCCL INFO ==========================================
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35464:35580 [5] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35464:35580 [5] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35464:35580 [5] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35464:35580 [5] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35464:35580 [5] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35464:35580 [5] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35460:35583 [1] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35461:35582 [2] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35460:35583 [1] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35460:35583 [1] NCCL INFO ==========================================
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35460:35583 [1] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35461:35582 [2] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60
ip-10-0-0-163:35462:35581 [3] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35462:35581 [3] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35462:35581 [3] NCCL INFO ==========================================
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35462:35581 [3] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35460:35583 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35460:35583 [1] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35460:35583 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35460:35583 [1] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35460:35583 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35461:35582 [2] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35461:35582 [2] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35461:35582 [2] NCCL INFO ==========================================
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35461:35582 [2] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35462:35581 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35462:35581 [3] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35462:35581 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35462:35581 [3] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35462:35581 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35461:35582 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35461:35582 [2] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35461:35582 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35461:35582 [2] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35461:35582 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35465:35584 [6] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps.
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO Attribute coll of node net not found
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60
ip-10-0-0-163:35465:35584 [6] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 ===
ip-10-0-0-163:35465:35584 [6] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/160 (0)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/170 (1)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/180 (2)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/190
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/160
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/190 (3)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/180
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/170
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1A0 (4)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/160
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1C0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1B0 (5)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/170
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1D0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1C0 (6)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1D0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1B0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1A0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/180
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1D0 (7)
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1A0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1C0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/190
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1B0
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-0-163:35465:35584 [6] NCCL INFO ==========================================
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-0-163:35465:35584 [6] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-0-163:35465:35584 [6] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35465:35584 [6] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35465:35584 [6] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-0-163:35465:35584 [6] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0
ip-10-0-0-163:35465:35584 [6] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
ip-10-0-0-163:35459:35578 [0] NCCL INFO Tree 0 : -1 -> 0 -> 3/8/-1
ip-10-0-0-163:35459:35578 [0] NCCL INFO Tree 1 : 8 -> 0 -> 3/-1/-1
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-163:35459:35578 [0] NCCL INFO Ring 00 : 12 -> 0 -> 3
ip-10-0-0-163:35459:35578 [0] NCCL INFO Ring 01 : 12 -> 0 -> 3
ip-10-0-0-163:35459:35578 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 3/-1/-1->0->8
ip-10-0-0-163:35460:35583 [1] NCCL INFO Ring 00 : 2 -> 1 -> 5
ip-10-0-0-163:35461:35582 [2] NCCL INFO Ring 00 : 3 -> 2 -> 1
ip-10-0-0-163:35460:35583 [1] NCCL INFO Ring 01 : 2 -> 1 -> 5
ip-10-0-0-163:35462:35581 [3] NCCL INFO Tree 0 : 0 -> 3 -> 2/-1/-1
ip-10-0-0-163:35463:35585 [4] NCCL INFO Ring 00 : 7 -> 4 -> 8
ip-10-0-0-163:35460:35583 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2
ip-10-0-0-163:35463:35585 [4] NCCL INFO Ring 01 : 7 -> 4 -> 8
ip-10-0-0-163:35462:35581 [3] NCCL INFO Tree 1 : 0 -> 3 -> 2/-1/-1
ip-10-0-0-163:35461:35582 [2] NCCL INFO Ring 01 : 3 -> 2 -> 1
ip-10-0-0-163:35463:35585 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7
ip-10-0-0-163:35461:35582 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3
ip-10-0-0-163:35462:35581 [3] NCCL INFO Ring 00 : 0 -> 3 -> 2
ip-10-0-0-163:35462:35581 [3] NCCL INFO Ring 01 : 0 -> 3 -> 2
ip-10-0-0-163:35462:35581 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0
ip-10-0-0-163:35464:35580 [5] NCCL INFO Ring 00 : 1 -> 5 -> 6
ip-10-0-0-163:35464:35580 [5] NCCL INFO Ring 01 : 1 -> 5 -> 6
ip-10-0-0-163:35464:35580 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1
ip-10-0-0-163:35466:35579 [7] NCCL INFO Ring 00 : 6 -> 7 -> 4
ip-10-0-0-163:35466:35579 [7] NCCL INFO Ring 01 : 6 -> 7 -> 4
ip-10-0-0-163:35466:35579 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6
ip-10-0-0-163:35465:35584 [6] NCCL INFO Ring 00 : 5 -> 6 -> 7
ip-10-0-0-163:35465:35584 [6] NCCL INFO Ring 01 : 5 -> 6 -> 7
ip-10-0-0-163:35465:35584 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-163:35465:35584 [6] NCCL INFO Connected all rings
ip-10-0-0-163:35462:35581 [3] NCCL INFO Connected all rings
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-163:35464:35580 [5] NCCL INFO Connected all rings
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-163:35460:35583 [1] NCCL INFO Connected all rings
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-163:35459:35578 [0] NCCL INFO Connected all rings
ip-10-0-0-163:35461:35582 [2] NCCL INFO Connected all rings
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-163:35461:35582 [2] NCCL INFO Connected all trees
ip-10-0-0-163:35461:35582 [2] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35461:35582 [2] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35461:35582 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35461:35582 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-163:35466:35579 [7] NCCL INFO Connected all rings
ip-10-0-0-163:35463:35585 [4] NCCL INFO Connected all rings
ip-10-0-0-163:35460:35583 [1] NCCL INFO Connected all trees
ip-10-0-0-163:35460:35583 [1] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35460:35583 [1] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35460:35583 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35460:35583 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 8[160] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-163:35463:35585 [4] NCCL INFO Connected all trees
ip-10-0-0-163:35463:35585 [4] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35463:35585 [4] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35463:35585 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35463:35585 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35464:35580 [5] NCCL INFO Connected all trees
ip-10-0-0-163:35464:35580 [5] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35464:35580 [5] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35464:35580 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35464:35580 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 00 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-163:35466:35579 [7] NCCL INFO Connected all trees
ip-10-0-0-163:35466:35579 [7] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35466:35579 [7] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35466:35579 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35466:35579 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35465:35584 [6] NCCL INFO Connected all trees
ip-10-0-0-163:35465:35584 [6] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35465:35584 [6] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35465:35584 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35465:35584 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-163:35459:35578 [0] NCCL INFO Connected all trees
ip-10-0-0-163:35459:35578 [0] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35459:35578 [0] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35459:35578 [0] NCCL INFO Latency/AlgBw | Tree/ LL | Tree/ LL128 | Tree/Simple | Ring/ LL | Ring/ LL128 | Ring/Simple | CollNet/ LL | CollNet/ LL128 | CollNet/Simple |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Max NThreads | 512 | 640 | 512 | 512 | 640 | 256 | 512 | 640 | 512 |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Broadcast | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 6.3/ 0.0 | 14.0/ 0.0 | 18.0/ 1.2 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Reduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 6.3/ 0.0 | 14.0/ 0.0 | 18.0/ 1.2 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
ip-10-0-0-163:35459:35578 [0] NCCL INFO AllGather | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 18.1/ 0.0 | 40.6/ 0.0 | 65.6/ 1.3 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
ip-10-0-0-163:35459:35578 [0] NCCL INFO ReduceScatter | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 18.1/ 0.0 | 40.6/ 0.0 | 65.6/ 1.3 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 |
ip-10-0-0-163:35459:35578 [0] NCCL INFO AllReduce | 27.5/ 0.0 | 38.9/ 0.0 | 448.0/ 1.1 | 32.7/ 0.0 | 71.2/ 0.0 | 122.8/ 0.0 | 18.2/ 0.0 | 18.8/ 0.0 | 33.7/ 0.0 |
ip-10-0-0-163:35459:35578 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35459:35578 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35462:35581 [3] NCCL INFO Connected all trees
ip-10-0-0-163:35462:35581 [3] NCCL INFO NCCL_PROTO set by environment to simple
ip-10-0-0-163:35462:35581 [3] NCCL INFO NCCL_ALGO set by environment to TREE
ip-10-0-0-163:35462:35581 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-163:35462:35581 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 00 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 00 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 00 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 00 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 00 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 00 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
ip-10-0-0-163:35465:35584 [6] NCCL INFO comm 0x7f36d0002f70 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
ip-10-0-0-163:35460:35583 [1] NCCL INFO comm 0x7f8d14002f70 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
ip-10-0-0-163:35464:35580 [5] NCCL INFO comm 0x7fb9c8002f70 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
ip-10-0-0-163:35466:35579 [7] NCCL INFO comm 0x7f4fe8002f70 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
ip-10-0-0-163:35462:35581 [3] NCCL INFO comm 0x7f7dd8002f70 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
ip-10-0-0-163:35461:35582 [2] NCCL INFO comm 0x7fb590002f70 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
ip-10-0-0-163:35463:35585 [4] NCCL INFO comm 0x7f4840002f70 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
ip-10-0-0-163:35459:35578 [0] NCCL INFO comm 0x7f9fcc002f70 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
ip-10-0-0-163:35459:35459 [0] NCCL INFO Launch mode Parallel
2022-04-05 23:03:08 | INFO | fairseq_cli.train | Namespace(activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm', attention_dropout=0.0, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, character_embeddings=False, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data-bin/wikitext-103', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='env://', distributed_no_spawn=True, distributed_port=-1, distributed_rank=0, distributed_world_size=16, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, gen_subset='test', keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, localsgd_frequency=3, log_format=None, log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_target_positions=None, max_tokens=2048, max_tokens_valid=2048, max_update=50000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, 
no_decoder_final_norm=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', output_dictionary_size=-1, past_target=False, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode='none', save_dir='checkpoints/transformer_wikitext-103', save_interval=1, save_interval_updates=0, scoring='bleu', seed=1, self_target=False, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, shorten_data_split_list='', shorten_method='none', skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_time_hours=0, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=512, tpu=False, train_subset='train', update_freq=[1], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.01, zero_sharding='none')
2022-04-05 23:03:08 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types
2022-04-05 23:03:08 | INFO | fairseq.data.data_utils | loaded 3760 examples from: data-bin/wikitext-103/valid
2022-04-05 23:03:12 | INFO | fairseq_cli.train | TransformerLanguageModel(
(decoder): TransformerDecoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(267744, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(output_projection): Linear(in_features=512, out_features=267744, bias=False)
)
)
2022-04-05 23:03:12 | INFO | fairseq_cli.train | task: language_modeling (LanguageModelingTask)
2022-04-05 23:03:12 | INFO | fairseq_cli.train | model: transformer_lm (TransformerLanguageModel)
2022-04-05 23:03:12 | INFO | fairseq_cli.train | criterion: cross_entropy (CrossEntropyCriterion)
2022-04-05 23:03:12 | INFO | fairseq_cli.train | num. model params: 156000256 (num. trained: 156000256)
2022-04-05 23:03:12 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
2022-04-05 23:03:12 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-05 23:03:12 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2022-04-05 23:03:12 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
2022-04-05 23:03:12 | INFO | fairseq_cli.train | max tokens per GPU = 2048 and max sentences per GPU = None
2022-04-05 23:03:12 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/transformer_wikitext-103/checkpoint_last.pt
2022-04-05 23:03:12 | INFO | fairseq.trainer | loading train data for epoch 1
2022-04-05 23:03:13 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: data-bin/wikitext-103/train
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
2022-04-05 23:03:13 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
2022-04-05 23:03:13 | INFO | fairseq.optim.adam | using FusedAdam
2022-04-05 23:03:13 | INFO | fairseq.trainer | begin training epoch 1
2022-04-05 23:03:15 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2022-04-05 23:03:45 | INFO | train_inner | epoch 001: 100 / 3151 loss=17.823, ppl=231816, wps=106430, ups=3.25, wpb=32768, bsz=64, num_updates=100, lr=1.25975e-05, gnorm=4.425, train_wall=32, wall=33
2022-04-05 23:04:16 | INFO | train_inner | epoch 001: 200 / 3151 loss=14.53, ppl=23664.6, wps=107090, ups=3.27, wpb=32768, bsz=64, num_updates=200, lr=2.5095e-05, gnorm=1.523, train_wall=31, wall=63
2022-04-05 23:04:46 | INFO | train_inner | epoch 001: 300 / 3151 loss=12.429, ppl=5513.57, wps=108991, ups=3.33, wpb=32764.3, bsz=64, num_updates=300, lr=3.75925e-05, gnorm=1.014, train_wall=30, wall=93