Created April 6, 2022 17:40
Bare metal (NCCL_ALGO=TREE, NCCL_PROTO=simple)
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated | |
and will be removed in future. Use torchrun. | |
Note that --use_env is set by default in torchrun. | |
If your script expects `--local_rank` argument to be set, please | |
change it to read from `os.environ['LOCAL_RANK']` instead. See | |
https://pytorch.org/docs/stable/distributed.html#launch-utility for | |
further instructions | |
warnings.warn( | |
WARNING:torch.distributed.run: | |
***************************************** | |
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. | |
***************************************** | |
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: | |
entrypoint : /home/ubuntu/anaconda3/envs/pytorch_p38/bin/fairseq-train | |
min_nodes : 2 | |
max_nodes : 2 | |
nproc_per_node : 8 | |
run_id : none | |
rdzv_backend : static | |
rdzv_endpoint : 10.0.0.163:12345 | |
rdzv_configs : {'rank': 0, 'timeout': 900} | |
max_restarts : 0 | |
monitor_interval : 5 | |
log_dir : None | |
metrics_cfg : {} | |
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_hebha6nc/none_rp5ooft0 | |
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python | |
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group | |
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: | |
restart_count=0 | |
master_addr=10.0.0.163 | |
master_port=12345 | |
group_rank=0 | |
group_world_size=2 | |
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7] | |
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16] | |
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16] | |
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/0/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/1/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/2/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/3/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/4/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/5/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/6/error.json | |
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_hebha6nc/none_rp5ooft0/attempt_0/7/error.json | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 6): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 1): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 2): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 7): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 3): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 4): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 0): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | distributed init (rank 5): env:// | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5 | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 0 | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 5 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 2 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 3 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 6 | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 1 | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 7 | |
2022-04-05 23:03:04 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-05 23:03:04 | INFO | fairseq.distributed_utils | initialized host ip-10-0-0-163 as rank 4 | |
ip-10-0-0-163:35459:35459 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35459:35459 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35459:35459 [0] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35459:35459 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35459:35459 [0] NCCL INFO Using network Socket | |
NCCL version 2.10.3+cuda11.1 | |
ip-10-0-0-163:35461:35461 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35466:35466 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35464:35464 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35462:35462 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35460:35460 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35464:35464 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35466:35466 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35462:35462 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35461:35461 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35460:35460 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35462:35462 [3] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35461:35461 [2] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35464:35464 [5] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35466:35466 [7] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35464:35464 [5] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35461:35461 [2] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35466:35466 [7] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35462:35462 [3] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35466:35466 [7] NCCL INFO Using network Socket | |
ip-10-0-0-163:35464:35464 [5] NCCL INFO Using network Socket | |
ip-10-0-0-163:35461:35461 [2] NCCL INFO Using network Socket | |
ip-10-0-0-163:35462:35462 [3] NCCL INFO Using network Socket | |
ip-10-0-0-163:35460:35460 [1] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35460:35460 [1] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35460:35460 [1] NCCL INFO Using network Socket | |
ip-10-0-0-163:35465:35465 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35465:35465 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35465:35465 [6] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35465:35465 [6] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35465:35465 [6] NCCL INFO Using network Socket | |
ip-10-0-0-163:35463:35463 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.163<0> | |
ip-10-0-0-163:35463:35463 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v4 symbol. | |
ip-10-0-0-163:35463:35463 [4] NCCL INFO NET/IB : No device found. | |
ip-10-0-0-163:35463:35463 [4] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.163<0> | |
ip-10-0-0-163:35463:35463 [4] NCCL INFO Using network Socket | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:16.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:17.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:18.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:19.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO ========================================== | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1a.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO ========================================== | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO ========================================== | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO ========================================== | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO ========================================== | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO ========================================== | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO ========================================== | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Could not get speed from /sys/class/net/ens5/speed. Defaulting to 10 Gbps. | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Attribute coll of node net not found | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO KV Convert to int : could not find value of 'Unknown speed' in dictionary, falling back to 60 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO === System : maxWidth 1.2 totalWidth 132.0 === | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/160 (0) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/170 (1) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/180 (2) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/190 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/160 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/190 (3) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/180 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/170 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1A0 (4) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/160 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1C0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1B0 (5) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/170 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1D0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1C0 (6) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1D0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1B0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1A0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/180 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - GPU/1D0 (7) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1A0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[44.0] - GPU/1C0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/190 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NVL[22.0] - GPU/1B0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + PCI[12.0] - NIC/50 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO ========================================== | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/160 :GPU/160 (0/5000.000000/LOC) GPU/170 (1/22.000000/NVL) GPU/180 (1/22.000000/NVL) GPU/190 (1/44.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (2/44.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/170 :GPU/160 (1/22.000000/NVL) GPU/170 (0/5000.000000/LOC) GPU/180 (1/44.000000/NVL) GPU/190 (1/22.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (2/44.000000/NVB) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/180 :GPU/160 (1/22.000000/NVL) GPU/170 (1/44.000000/NVL) GPU/180 (0/5000.000000/LOC) GPU/190 (1/44.000000/NVL) GPU/1A0 (2/22.000000/NVB) GPU/1B0 (2/44.000000/NVB) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (2/22.000000/NVB) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/190 :GPU/160 (1/44.000000/NVL) GPU/170 (1/22.000000/NVL) GPU/180 (1/44.000000/NVL) GPU/190 (0/5000.000000/LOC) GPU/1A0 (2/44.000000/NVB) GPU/1B0 (2/22.000000/NVB) GPU/1C0 (2/22.000000/NVB) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1A0 :GPU/160 (1/44.000000/NVL) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (2/44.000000/NVB) GPU/1A0 (0/5000.000000/LOC) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/22.000000/NVL) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1B0 :GPU/160 (2/22.000000/NVB) GPU/170 (1/44.000000/NVL) GPU/180 (2/44.000000/NVB) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (1/22.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1C0 :GPU/160 (2/22.000000/NVB) GPU/170 (2/44.000000/NVB) GPU/180 (1/22.000000/NVL) GPU/190 (2/22.000000/NVB) GPU/1A0 (1/22.000000/NVL) GPU/1B0 (1/44.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/44.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO GPU/1D0 :GPU/160 (2/44.000000/NVB) GPU/170 (2/22.000000/NVB) GPU/180 (2/22.000000/NVB) GPU/190 (1/22.000000/NVL) GPU/1A0 (1/44.000000/NVL) GPU/1B0 (1/22.000000/NVL) GPU/1C0 (1/44.000000/NVL) GPU/1D0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO NET/0 :GPU/160 (3/1.250000/PHB) GPU/170 (3/1.250000/PHB) GPU/180 (3/1.250000/PHB) GPU/190 (3/1.250000/PHB) GPU/1A0 (3/1.250000/PHB) GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 1.200000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 2.400000/1.200000, type NVL/PHB, sameChannels 1 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO 0 : NET/0 GPU/0 GPU/3 GPU/2 GPU/1 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Tree 0 : -1 -> 0 -> 3/8/-1 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Tree 1 : 8 -> 0 -> 3/-1/-1 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Ring 00 : 12 -> 0 -> 3 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Ring 01 : 12 -> 0 -> 3 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 3/-1/-1->0->8 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Ring 00 : 2 -> 1 -> 5 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Ring 00 : 3 -> 2 -> 1 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Ring 01 : 2 -> 1 -> 5 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Tree 0 : 0 -> 3 -> 2/-1/-1 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Ring 00 : 7 -> 4 -> 8 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Ring 01 : 7 -> 4 -> 8 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Tree 1 : 0 -> 3 -> 2/-1/-1 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Ring 01 : 3 -> 2 -> 1 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Ring 00 : 0 -> 3 -> 2 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Ring 01 : 0 -> 3 -> 2 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Ring 00 : 1 -> 5 -> 6 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Ring 01 : 1 -> 5 -> 6 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Ring 00 : 6 -> 7 -> 4 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Ring 01 : 6 -> 7 -> 4 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Ring 00 : 5 -> 6 -> 7 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Ring 01 : 5 -> 6 -> 7 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/Socket/0 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 8[160] [send] via NET/Socket/0 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/Socket/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Connected all rings | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Connected all rings | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Connected all rings | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 12[1a0] -> 0[160] [receive] via NET/Socket/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Connected all rings | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Connected all rings | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Connected all rings | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/Socket/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Connected all trees | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Connected all rings | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Connected all rings | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Connected all trees | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 8[160] -> 0[160] [receive] via NET/Socket/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Connected all trees | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Connected all trees | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 00 : 2[180] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/Socket/0 | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Connected all trees | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Connected all trees | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 8[160] [send] via NET/Socket/0 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Connected all trees | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Latency/AlgBw | Tree/ LL | Tree/ LL128 | Tree/Simple | Ring/ LL | Ring/ LL128 | Ring/Simple | CollNet/ LL | CollNet/ LL128 | CollNet/Simple | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Max NThreads | 512 | 640 | 512 | 512 | 640 | 256 | 512 | 640 | 512 | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Broadcast | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 6.3/ 0.0 | 14.0/ 0.0 | 18.0/ 1.2 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Reduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 6.3/ 0.0 | 14.0/ 0.0 | 18.0/ 1.2 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO AllGather | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 18.1/ 0.0 | 40.6/ 0.0 | 65.6/ 1.3 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO ReduceScatter | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 18.1/ 0.0 | 40.6/ 0.0 | 65.6/ 1.3 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO AllReduce | 27.5/ 0.0 | 38.9/ 0.0 | 448.0/ 1.1 | 32.7/ 0.0 | 71.2/ 0.0 | 122.8/ 0.0 | 18.2/ 0.0 | 18.8/ 0.0 | 33.7/ 0.0 | | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Connected all trees | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO NCCL_PROTO set by environment to simple | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO NCCL_ALGO set by environment to TREE | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 00 : 3[190] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO Channel 01 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0] | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO Channel 01 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0] | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 01 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0] | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 00 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0] | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO Channel 00 : 1[170] -> 7[1d0] via P2P/indirect/3[190] | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 00 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO Channel 01 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0] | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 00 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO Channel 01 : 7[1d0] -> 2[180] via P2P/indirect/3[190] | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO Channel 01 : 6[1c0] -> 3[190] via P2P/indirect/2[180] | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 01 : 5[1b0] -> 2[180] via P2P/indirect/1[170] | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 00 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0] | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO Channel 00 : 5[1b0] -> 3[190] via P2P/indirect/1[170] | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO Channel 01 : 4[1a0] -> 3[190] via P2P/indirect/0[160] | |
ip-10-0-0-163:35465:35584 [6] NCCL INFO comm 0x7f36d0002f70 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE | |
ip-10-0-0-163:35460:35583 [1] NCCL INFO comm 0x7f8d14002f70 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE | |
ip-10-0-0-163:35464:35580 [5] NCCL INFO comm 0x7fb9c8002f70 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE | |
ip-10-0-0-163:35466:35579 [7] NCCL INFO comm 0x7f4fe8002f70 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE | |
ip-10-0-0-163:35462:35581 [3] NCCL INFO comm 0x7f7dd8002f70 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE | |
ip-10-0-0-163:35461:35582 [2] NCCL INFO comm 0x7fb590002f70 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE | |
ip-10-0-0-163:35463:35585 [4] NCCL INFO comm 0x7f4840002f70 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE | |
ip-10-0-0-163:35459:35578 [0] NCCL INFO comm 0x7f9fcc002f70 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE | |
ip-10-0-0-163:35459:35459 [0] NCCL INFO Launch mode Parallel | |
2022-04-05 23:03:08 | INFO | fairseq_cli.train | Namespace(activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm', attention_dropout=0.0, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, character_embeddings=False, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data-bin/wikitext-103', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='env://', distributed_no_spawn=True, distributed_port=-1, distributed_rank=0, distributed_world_size=16, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, gen_subset='test', keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, localsgd_frequency=3, log_format=None, log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_target_positions=None, max_tokens=2048, max_tokens_valid=2048, max_update=50000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, no_decoder_final_norm=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', output_dictionary_size=-1, past_target=False, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode='none', save_dir='checkpoints/transformer_wikitext-103', save_interval=1, save_interval_updates=0, scoring='bleu', seed=1, self_target=False, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, shorten_data_split_list='', shorten_method='none', skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_time_hours=0, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=512, tpu=False, train_subset='train', update_freq=[1], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.01, zero_sharding='none') | |
2022-04-05 23:03:08 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types | |
2022-04-05 23:03:08 | INFO | fairseq.data.data_utils | loaded 3760 examples from: data-bin/wikitext-103/valid | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | TransformerLanguageModel( | |
(decoder): TransformerDecoder( | |
(dropout_module): FairseqDropout() | |
(embed_tokens): Embedding(267744, 512, padding_idx=1) | |
(embed_positions): SinusoidalPositionalEmbedding() | |
(layers): ModuleList( | |
(0): TransformerDecoderLayer( | |
(dropout_module): FairseqDropout() | |
(self_attn): MultiheadAttention( | |
(dropout_module): FairseqDropout() | |
(k_proj): Linear(in_features=512, out_features=512, bias=True) | |
(v_proj): Linear(in_features=512, out_features=512, bias=True) | |
(q_proj): Linear(in_features=512, out_features=512, bias=True) | |
(out_proj): Linear(in_features=512, out_features=512, bias=True) | |
) | |
(activation_dropout_module): FairseqDropout() | |
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(fc1): Linear(in_features=512, out_features=2048, bias=True) | |
(fc2): Linear(in_features=2048, out_features=512, bias=True) | |
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
) | |
(1): TransformerDecoderLayer( | |
(dropout_module): FairseqDropout() | |
(self_attn): MultiheadAttention( | |
(dropout_module): FairseqDropout() | |
(k_proj): Linear(in_features=512, out_features=512, bias=True) | |
(v_proj): Linear(in_features=512, out_features=512, bias=True) | |
(q_proj): Linear(in_features=512, out_features=512, bias=True) | |
(out_proj): Linear(in_features=512, out_features=512, bias=True) | |
) | |
(activation_dropout_module): FairseqDropout() | |
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(fc1): Linear(in_features=512, out_features=2048, bias=True) | |
(fc2): Linear(in_features=2048, out_features=512, bias=True) | |
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
) | |
(2): TransformerDecoderLayer( | |
(dropout_module): FairseqDropout() | |
(self_attn): MultiheadAttention( | |
(dropout_module): FairseqDropout() | |
(k_proj): Linear(in_features=512, out_features=512, bias=True) | |
(v_proj): Linear(in_features=512, out_features=512, bias=True) | |
(q_proj): Linear(in_features=512, out_features=512, bias=True) | |
(out_proj): Linear(in_features=512, out_features=512, bias=True) | |
) | |
(activation_dropout_module): FairseqDropout() | |
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(fc1): Linear(in_features=512, out_features=2048, bias=True) | |
(fc2): Linear(in_features=2048, out_features=512, bias=True) | |
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
) | |
(3): TransformerDecoderLayer( | |
(dropout_module): FairseqDropout() | |
(self_attn): MultiheadAttention( | |
(dropout_module): FairseqDropout() | |
(k_proj): Linear(in_features=512, out_features=512, bias=True) | |
(v_proj): Linear(in_features=512, out_features=512, bias=True) | |
(q_proj): Linear(in_features=512, out_features=512, bias=True) | |
(out_proj): Linear(in_features=512, out_features=512, bias=True) | |
) | |
(activation_dropout_module): FairseqDropout() | |
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(fc1): Linear(in_features=512, out_features=2048, bias=True) | |
(fc2): Linear(in_features=2048, out_features=512, bias=True) | |
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
) | |
(4): TransformerDecoderLayer( | |
(dropout_module): FairseqDropout() | |
(self_attn): MultiheadAttention( | |
(dropout_module): FairseqDropout() | |
(k_proj): Linear(in_features=512, out_features=512, bias=True) | |
(v_proj): Linear(in_features=512, out_features=512, bias=True) | |
(q_proj): Linear(in_features=512, out_features=512, bias=True) | |
(out_proj): Linear(in_features=512, out_features=512, bias=True) | |
) | |
(activation_dropout_module): FairseqDropout() | |
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(fc1): Linear(in_features=512, out_features=2048, bias=True) | |
(fc2): Linear(in_features=2048, out_features=512, bias=True) | |
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
) | |
(5): TransformerDecoderLayer( | |
(dropout_module): FairseqDropout() | |
(self_attn): MultiheadAttention( | |
(dropout_module): FairseqDropout() | |
(k_proj): Linear(in_features=512, out_features=512, bias=True) | |
(v_proj): Linear(in_features=512, out_features=512, bias=True) | |
(q_proj): Linear(in_features=512, out_features=512, bias=True) | |
(out_proj): Linear(in_features=512, out_features=512, bias=True) | |
) | |
(activation_dropout_module): FairseqDropout() | |
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(fc1): Linear(in_features=512, out_features=2048, bias=True) | |
(fc2): Linear(in_features=2048, out_features=512, bias=True) | |
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
) | |
) | |
(layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
(output_projection): Linear(in_features=512, out_features=267744, bias=False) | |
) | |
) | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | task: language_modeling (LanguageModelingTask) | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | model: transformer_lm (TransformerLanguageModel) | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | criterion: cross_entropy (CrossEntropyCriterion) | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | num. model params: 156000256 (num. trained: 156000256) | |
2022-04-05 23:03:12 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight | |
2022-04-05 23:03:12 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers*********************** | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-05 23:03:12 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers*********************** | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs) | |
2022-04-05 23:03:12 | INFO | fairseq_cli.train | max tokens per GPU = 2048 and max sentences per GPU = None | |
2022-04-05 23:03:12 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/transformer_wikitext-103/checkpoint_last.pt | |
2022-04-05 23:03:12 | INFO | fairseq.trainer | loading train data for epoch 1 | |
2022-04-05 23:03:13 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: data-bin/wikitext-103/train | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:552: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it. | |
warnings.warn( | |
2022-04-05 23:03:13 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 | |
2022-04-05 23:03:13 | INFO | fairseq.optim.adam | using FusedAdam | |
2022-04-05 23:03:13 | INFO | fairseq.trainer | begin training epoch 1 | |
2022-04-05 23:03:15 | INFO | root | Reducer buckets have been rebuilt in this iteration. | |
2022-04-05 23:03:45 | INFO | train_inner | epoch 001: 100 / 3151 loss=17.823, ppl=231816, wps=106430, ups=3.25, wpb=32768, bsz=64, num_updates=100, lr=1.25975e-05, gnorm=4.425, train_wall=32, wall=33 | |
2022-04-05 23:04:16 | INFO | train_inner | epoch 001: 200 / 3151 loss=14.53, ppl=23664.6, wps=107090, ups=3.27, wpb=32768, bsz=64, num_updates=200, lr=2.5095e-05, gnorm=1.523, train_wall=31, wall=63 | |
2022-04-05 23:04:46 | INFO | train_inner | epoch 001: 300 / 3151 loss=12.429, ppl=5513.57, wps=108991, ups=3.33, wpb=32764.3, bsz=64, num_updates=300, lr=3.75925e-05, gnorm=1.014, train_wall=30, wall=93 |