/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : fairseq_train_wrapped
min_nodes : 2
max_nodes : 2
nproc_per_node : 8
run_id : foobar
rdzv_backend : c10d
rdzv_endpoint : 10.0.0.115:29500
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
restart_count=0
master_addr=ip-10-0-0-115.us-west-2.compute.internal
master_port=47199
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/7/error.json
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 0): env:// | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 5): env:// | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 6): env:// | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 4): env:// | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6 | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 3): env:// | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 7): env:// | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 1): env:// | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 2): env:// | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7 | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2 | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1 | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0 | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 0 | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes. | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 5 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 4 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 6 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 3 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 7 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 2 | |
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 1 | |
ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric | |
NCCL version 2.10.3+cuda11.3 | |
ip-10-0-0-115:78:78 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:78:78 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:78:78 [4] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:79:79 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:79:79 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:79:79 [5] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:77:77 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:81:81 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:77:77 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:81:81 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:76:76 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:77:77 [3] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:81:81 [7] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:76:76 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:76:76 [2] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:75:75 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:75:75 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:75:75 [1] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:80:80 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0> | |
ip-10-0-0-115:80:80 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol. | |
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws | |
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1 | |
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Forcing AWS OFI ndev 4 | |
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Selected Provider is efa | |
ip-10-0-0-115:80:80 [6] NCCL INFO Using network AWS Libfabric | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:78:133 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13 | |
ip-10-0-0-115:80:139 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5 | |
ip-10-0-0-115:81:136 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
ip-10-0-0-115:75:138 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2 | |
ip-10-0-0-115:76:137 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
ip-10-0-0-115:77:135 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Connected all rings | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Connected all rings | |
ip-10-0-0-115:78:133 [4] NCCL INFO Connected all rings | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Connected all rings | |
ip-10-0-0-115:79:134 [5] NCCL INFO Connected all rings | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Connected all rings | |
ip-10-0-0-115:75:138 [1] NCCL INFO Connected all rings | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Connected all rings | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:134 [5] NCCL INFO Connected all trees | |
ip-10-0-0-115:79:134 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:79:134 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:75:138 [1] NCCL INFO Connected all trees | |
ip-10-0-0-115:75:138 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:75:138 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:133 [4] NCCL INFO Connected all trees | |
ip-10-0-0-115:78:133 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:78:133 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:74:132 [0] NCCL INFO Connected all trees | |
ip-10-0-0-115:74:132 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:74:132 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:136 [7] NCCL INFO Connected all trees | |
ip-10-0-0-115:81:136 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:81:136 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:77:135 [3] NCCL INFO Connected all trees | |
ip-10-0-0-115:77:135 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:77:135 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-115:80:139 [6] NCCL INFO Connected all trees | |
ip-10-0-0-115:80:139 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:80:139 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-115:76:137 [2] NCCL INFO Connected all trees | |
ip-10-0-0-115:76:137 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:76:137 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0] | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0] | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190] | |
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190] | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170] | |
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180] | |
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170] | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0] | |
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160] | |
ip-10-0-0-115:78:133 [4] NCCL INFO comm 0x7fb580002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE | |
ip-10-0-0-115:74:132 [0] NCCL INFO comm 0x7f6200002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE | |
ip-10-0-0-115:77:135 [3] NCCL INFO comm 0x7f334c002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE | |
ip-10-0-0-115:81:136 [7] NCCL INFO comm 0x7fefdc002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE | |
ip-10-0-0-115:75:138 [1] NCCL INFO comm 0x7f3c68002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE | |
ip-10-0-0-115:76:137 [2] NCCL INFO comm 0x7f543c002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE | |
ip-10-0-0-115:79:134 [5] NCCL INFO comm 0x7f25e0002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE | |
ip-10-0-0-115:80:139 [6] NCCL INFO comm 0x7fa3a4002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE | |
ip-10-0-0-115:74:74 [0] NCCL INFO Launch mode Parallel | |
2022-04-19 23:36:35 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 16, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 
'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 2048, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 2048, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 
'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm', 'activation_fn': relu, 'dropout': 0.1, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 512, 'decoder_output_dim': 512, 'decoder_input_dim': 512, 'decoder_ffn_embed_dim': 2048, 'decoder_layers': 6, 'decoder_attention_heads': 8, 'decoder_normalize_before': False, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 'character_embeddings': False, 'character_filters': 
'[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': False, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 1, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False, 'add_bos_token': False, 'tokens_per_sample': 512, 'max_target_positions': None, 'tpu': False}, 'task': {'_name': 'language_modeling', 'data': '/job/fairseq/data-bin/wikitext-103', 'sample_break_mode': none, 'tokens_per_sample': 512, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': none, 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 
'ema_fp32': False}} | |
2022-04-19 23:36:35 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types | |
2022-04-19 23:36:38 | INFO | fairseq_cli.train | TransformerLanguageModel( | |
  (decoder): TransformerDecoder( | |
    (dropout_module): FairseqDropout() | |
    (embed_tokens): Embedding(267744, 512, padding_idx=1) | |
    (embed_positions): SinusoidalPositionalEmbedding() | |
    (layers): ModuleList( | |
      (0): TransformerDecoderLayerBase( | |
        (dropout_module): FairseqDropout() | |
        (self_attn): MultiheadAttention( | |
          (dropout_module): FairseqDropout() | |
          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
        ) | |
        (activation_dropout_module): FairseqDropout() | |
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
      ) | |
      (1): TransformerDecoderLayerBase( | |
        (dropout_module): FairseqDropout() | |
        (self_attn): MultiheadAttention( | |
          (dropout_module): FairseqDropout() | |
          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
        ) | |
        (activation_dropout_module): FairseqDropout() | |
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
      ) | |
      (2): TransformerDecoderLayerBase( | |
        (dropout_module): FairseqDropout() | |
        (self_attn): MultiheadAttention( | |
          (dropout_module): FairseqDropout() | |
          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
        ) | |
        (activation_dropout_module): FairseqDropout() | |
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
      ) | |
      (3): TransformerDecoderLayerBase( | |
        (dropout_module): FairseqDropout() | |
        (self_attn): MultiheadAttention( | |
          (dropout_module): FairseqDropout() | |
          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
        ) | |
        (activation_dropout_module): FairseqDropout() | |
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
      ) | |
      (4): TransformerDecoderLayerBase( | |
        (dropout_module): FairseqDropout() | |
        (self_attn): MultiheadAttention( | |
          (dropout_module): FairseqDropout() | |
          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
        ) | |
        (activation_dropout_module): FairseqDropout() | |
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
      ) | |
      (5): TransformerDecoderLayerBase( | |
        (dropout_module): FairseqDropout() | |
        (self_attn): MultiheadAttention( | |
          (dropout_module): FairseqDropout() | |
          (k_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (v_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (q_proj): Linear(in_features=512, out_features=512, bias=True) | |
          (out_proj): Linear(in_features=512, out_features=512, bias=True) | |
        ) | |
        (activation_dropout_module): FairseqDropout() | |
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
        (fc1): Linear(in_features=512, out_features=2048, bias=True) | |
        (fc2): Linear(in_features=2048, out_features=512, bias=True) | |
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True) | |
      ) | |
    ) | |
    (output_projection): Linear(in_features=512, out_features=267744, bias=False) | |
  ) | |
) | |
2022-04-19 23:36:38 | INFO | fairseq_cli.train | task: LanguageModelingTask | |
2022-04-19 23:36:38 | INFO | fairseq_cli.train | model: TransformerLanguageModel | |
2022-04-19 23:36:38 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion | |
2022-04-19 23:36:38 | INFO | fairseq_cli.train | num. shared model params: 155,999,232 (num. trained: 155,999,232) | |
2022-04-19 23:36:38 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0) | |
2022-04-19 23:36:38 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: /job/fairseq/data-bin/wikitext-103/valid | |
2022-04-19 23:36:38 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0 | |
2022-04-19 23:36:38 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes. | |
2022-04-19 23:36:38 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00 | |
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/ | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2 | |
ip-10-0-0-115:77:176 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0 | |
ip-10-0-0-115:78:180 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13 | |
ip-10-0-0-115:80:175 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5 | |
ip-10-0-0-115:81:177 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Connected all rings | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Connected all rings | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Connected all rings | |
ip-10-0-0-115:77:176 [3] NCCL INFO Connected all rings | |
ip-10-0-0-115:81:177 [7] NCCL INFO Connected all rings | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Connected all rings | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Connected all rings | |
ip-10-0-0-115:80:175 [6] NCCL INFO Connected all rings | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1 | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:79:174 [5] NCCL INFO Connected all trees | |
ip-10-0-0-115:79:174 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:79:174 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Connected all trees | |
ip-10-0-0-115:75:179 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:75:179 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2 | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0 | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC | |
ip-10-0-0-115:74:173 [0] NCCL INFO Connected all trees | |
ip-10-0-0-115:74:173 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:74:173 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:78:180 [4] NCCL INFO Connected all trees | |
ip-10-0-0-115:78:180 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:78:180 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC | |
ip-10-0-0-115:77:176 [3] NCCL INFO Connected all trees | |
ip-10-0-0-115:77:176 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:77:176 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:81:177 [7] NCCL INFO Connected all trees | |
ip-10-0-0-115:81:177 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:81:177 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-115:76:178 [2] NCCL INFO Connected all trees | |
ip-10-0-0-115:76:178 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:76:178 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:80:175 [6] NCCL INFO Connected all trees | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-115:80:175 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 | |
ip-10-0-0-115:80:175 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160] | |
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0] | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170] | |
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0] | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190] | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0] | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0] | |
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190] | |
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180] | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170] | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0] | |
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170] | |
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160] | |
ip-10-0-0-115:78:180 [4] NCCL INFO comm 0x7fb524002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE | |
ip-10-0-0-115:76:178 [2] NCCL INFO comm 0x7f53e8002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE | |
ip-10-0-0-115:80:175 [6] NCCL INFO comm 0x7fa344002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE | |
ip-10-0-0-115:79:174 [5] NCCL INFO comm 0x7f2574002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE | |
ip-10-0-0-115:74:173 [0] NCCL INFO comm 0x7f61b8002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE | |
ip-10-0-0-115:75:179 [1] NCCL INFO comm 0x7f3c14002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE | |
ip-10-0-0-115:77:176 [3] NCCL INFO comm 0x7f32ec002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE | |
ip-10-0-0-115:81:177 [7] NCCL INFO comm 0x7fef80002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE | |
ip-10-0-0-115:74:74 [0] NCCL INFO Launch mode Parallel | |
2022-04-19 23:36:40 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers*********************** | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB | |
2022-04-19 23:36:40 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers*********************** | |
2022-04-19 23:36:40 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs) | |
2022-04-19 23:36:40 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None | |
2022-04-19 23:36:40 | INFO | fairseq.trainer | Preparing to load checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint_last.pt | |
2022-04-19 23:36:40 | INFO | fairseq.trainer | No existing checkpoint found /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint_last.pt | |
2022-04-19 23:36:40 | INFO | fairseq.trainer | loading train data for epoch 1 | |
2022-04-19 23:36:40 | INFO | fairseq.data.data_utils | loaded 1,801,350 examples from: /job/fairseq/data-bin/wikitext-103/train | |
2022-04-19 23:36:40 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp | |
2022-04-19 23:36:40 | INFO | fairseq.optim.adam | using FusedAdam | |
2022-04-19 23:36:41 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151 | |
2022-04-19 23:36:41 | INFO | fairseq.trainer | begin training epoch 1 | |
2022-04-19 23:36:41 | INFO | fairseq_cli.train | Start iterating over samples | |
2022-04-19 23:36:42 | INFO | root | Reducer buckets have been rebuilt in this iteration. | |
2022-04-19 23:37:08 | INFO | train_inner | epoch 001: 100 / 3151 loss=16.618, ppl=100595, wps=125528, ups=3.83, wpb=32768, bsz=64, num_updates=100, lr=1.25975e-05, gnorm=3.652, train_wall=27, gb_free=20.6, wall=28 | |
2022-04-19 23:37:34 | INFO | train_inner | epoch 001: 200 / 3151 loss=14.226, ppl=19162.3, wps=125698, ups=3.84, wpb=32768, bsz=64, num_updates=200, lr=2.5095e-05, gnorm=1.593, train_wall=26, gb_free=20.6, wall=54 | |
2022-04-19 23:38:00 | INFO | train_inner | epoch 001: 300 / 3151 loss=12.192, ppl=4678.25, wps=125282, ups=3.82, wpb=32764.3, bsz=64, num_updates=300, lr=3.75925e-05, gnorm=1.079, train_wall=26, gb_free=20.6, wall=81 | |
2022-04-19 23:38:26 | INFO | train_inner | epoch 001: 400 / 3151 loss=10.734, ppl=1702.91, wps=124818, ups=3.81, wpb=32768, bsz=64, num_updates=400, lr=5.009e-05, gnorm=0.671, train_wall=26, gb_free=20.6, wall=107 | |
2022-04-19 23:38:53 | INFO | train_inner | epoch 001: 500 / 3151 loss=10.124, ppl=1116.01, wps=124568, ups=3.8, wpb=32768, bsz=64, num_updates=500, lr=6.25875e-05, gnorm=0.558, train_wall=26, gb_free=20.6, wall=133 | |
2022-04-19 23:39:19 | INFO | train_inner | epoch 001: 600 / 3151 loss=9.818, ppl=902.65, wps=124524, ups=3.8, wpb=32768, bsz=64, num_updates=600, lr=7.5085e-05, gnorm=0.643, train_wall=26, gb_free=20.6, wall=159 | |
2022-04-19 23:39:45 | INFO | train_inner | epoch 001: 700 / 3151 loss=9.564, ppl=756.76, wps=124318, ups=3.79, wpb=32768, bsz=64, num_updates=700, lr=8.75825e-05, gnorm=0.691, train_wall=26, gb_free=20.6, wall=186 | |
2022-04-19 23:40:12 | INFO | train_inner | epoch 001: 800 / 3151 loss=9.343, ppl=649.59, wps=124355, ups=3.79, wpb=32768, bsz=64, num_updates=800, lr=0.00010008, gnorm=0.76, train_wall=26, gb_free=20.6, wall=212 | |
2022-04-19 23:40:38 | INFO | train_inner | epoch 001: 900 / 3151 loss=9.153, ppl=569.44, wps=124157, ups=3.79, wpb=32768, bsz=64, num_updates=900, lr=0.000112578, gnorm=0.818, train_wall=26, gb_free=20.6, wall=239 | |
2022-04-19 23:41:05 | INFO | train_inner | epoch 001: 1000 / 3151 loss=8.946, ppl=493.1, wps=123768, ups=3.78, wpb=32768, bsz=64, num_updates=1000, lr=0.000125075, gnorm=0.899, train_wall=26, gb_free=20.6, wall=265 | |
2022-04-19 23:41:31 | INFO | train_inner | epoch 001: 1100 / 3151 loss=8.785, ppl=441.24, wps=123914, ups=3.78, wpb=32768, bsz=64, num_updates=1100, lr=0.000137573, gnorm=0.843, train_wall=26, gb_free=20.6, wall=291 | |
2022-04-19 23:41:57 | INFO | train_inner | epoch 001: 1200 / 3151 loss=8.636, ppl=397.75, wps=124121, ups=3.79, wpb=32768, bsz=64, num_updates=1200, lr=0.00015007, gnorm=0.927, train_wall=26, gb_free=20.6, wall=318 | |
2022-04-19 23:42:24 | INFO | train_inner | epoch 001: 1300 / 3151 loss=8.488, ppl=358.98, wps=123984, ups=3.78, wpb=32768, bsz=64, num_updates=1300, lr=0.000162568, gnorm=0.932, train_wall=26, gb_free=20.6, wall=344 | |
2022-04-19 23:42:50 | INFO | train_inner | epoch 001: 1400 / 3151 loss=8.375, ppl=331.91, wps=123750, ups=3.78, wpb=32768, bsz=64, num_updates=1400, lr=0.000175065, gnorm=0.935, train_wall=26, gb_free=20.6, wall=371 | |
2022-04-19 23:43:17 | INFO | train_inner | epoch 001: 1500 / 3151 loss=8.24, ppl=302.35, wps=123999, ups=3.78, wpb=32768, bsz=64, num_updates=1500, lr=0.000187563, gnorm=0.898, train_wall=26, gb_free=20.6, wall=397 | |
2022-04-19 23:43:43 | INFO | train_inner | epoch 001: 1600 / 3151 loss=8.137, ppl=281.44, wps=123722, ups=3.78, wpb=32768, bsz=64, num_updates=1600, lr=0.00020006, gnorm=0.925, train_wall=26, gb_free=20.6, wall=424 | |
2022-04-19 23:44:10 | INFO | train_inner | epoch 001: 1700 / 3151 loss=8.029, ppl=261.19, wps=123824, ups=3.78, wpb=32768, bsz=64, num_updates=1700, lr=0.000212558, gnorm=0.907, train_wall=26, gb_free=20.6, wall=450 | |
2022-04-19 23:44:36 | INFO | train_inner | epoch 001: 1800 / 3151 loss=7.932, ppl=244.27, wps=123732, ups=3.78, wpb=32768, bsz=64, num_updates=1800, lr=0.000225055, gnorm=0.903, train_wall=26, gb_free=20.6, wall=477 | |
2022-04-19 23:45:03 | INFO | train_inner | epoch 001: 1900 / 3151 loss=7.812, ppl=224.76, wps=123761, ups=3.78, wpb=32768, bsz=64, num_updates=1900, lr=0.000237553, gnorm=0.866, train_wall=26, gb_free=20.6, wall=503 | |
2022-04-19 23:45:29 | INFO | train_inner | epoch 001: 2000 / 3151 loss=7.734, ppl=212.94, wps=123887, ups=3.78, wpb=32768, bsz=64, num_updates=2000, lr=0.00025005, gnorm=0.868, train_wall=26, gb_free=20.6, wall=530 | |
2022-04-19 23:45:56 | INFO | train_inner | epoch 001: 2100 / 3151 loss=7.647, ppl=200.5, wps=123574, ups=3.77, wpb=32768, bsz=64, num_updates=2100, lr=0.000262548, gnorm=0.859, train_wall=26, gb_free=20.6, wall=556 | |
2022-04-19 23:46:22 | INFO | train_inner | epoch 001: 2200 / 3151 loss=7.566, ppl=189.5, wps=123272, ups=3.76, wpb=32768, bsz=64, num_updates=2200, lr=0.000275045, gnorm=0.83, train_wall=26, gb_free=20.6, wall=583 | |
2022-04-19 23:46:49 | INFO | train_inner | epoch 001: 2300 / 3151 loss=7.487, ppl=179.44, wps=122972, ups=3.75, wpb=32768, bsz=64, num_updates=2300, lr=0.000287543, gnorm=0.839, train_wall=27, gb_free=20.6, wall=609 | |
2022-04-19 23:47:15 | INFO | train_inner | epoch 001: 2400 / 3151 loss=7.419, ppl=171.1, wps=123344, ups=3.76, wpb=32768, bsz=64, num_updates=2400, lr=0.00030004, gnorm=0.798, train_wall=26, gb_free=20.6, wall=636 | |
2022-04-19 23:47:42 | INFO | train_inner | epoch 001: 2500 / 3151 loss=7.339, ppl=161.9, wps=123345, ups=3.76, wpb=32768, bsz=64, num_updates=2500, lr=0.000312538, gnorm=0.809, train_wall=26, gb_free=20.6, wall=662 | |
2022-04-19 23:48:09 | INFO | train_inner | epoch 001: 2600 / 3151 loss=7.277, ppl=155.14, wps=122950, ups=3.75, wpb=32768, bsz=64, num_updates=2600, lr=0.000325035, gnorm=0.773, train_wall=27, gb_free=20.6, wall=689 | |
2022-04-19 23:48:35 | INFO | train_inner | epoch 001: 2700 / 3151 loss=7.204, ppl=147.4, wps=122741, ups=3.75, wpb=32768, bsz=64, num_updates=2700, lr=0.000337533, gnorm=0.761, train_wall=27, gb_free=20.6, wall=716 | |
2022-04-19 23:49:02 | INFO | train_inner | epoch 001: 2800 / 3151 loss=7.141, ppl=141.17, wps=123019, ups=3.75, wpb=32768, bsz=64, num_updates=2800, lr=0.00035003, gnorm=0.772, train_wall=27, gb_free=20.6, wall=742 | |
2022-04-19 23:49:29 | INFO | train_inner | epoch 001: 2900 / 3151 loss=7.099, ppl=137.1, wps=122772, ups=3.75, wpb=32768, bsz=64, num_updates=2900, lr=0.000362528, gnorm=0.758, train_wall=27, gb_free=20.6, wall=769 | |
2022-04-19 23:49:55 | INFO | train_inner | epoch 001: 3000 / 3151 loss=7.021, ppl=129.88, wps=122758, ups=3.75, wpb=32768, bsz=64, num_updates=3000, lr=0.000375025, gnorm=0.719, train_wall=27, gb_free=20.6, wall=796 | |
2022-04-19 23:50:22 | INFO | train_inner | epoch 001: 3100 / 3151 loss=6.987, ppl=126.87, wps=122945, ups=3.75, wpb=32768, bsz=64, num_updates=3100, lr=0.000387523, gnorm=0.731, train_wall=27, gb_free=20.6, wall=822 | |
2022-04-19 23:50:36 | INFO | fairseq_cli.train | begin validation on "valid" subset | |
2022-04-19 23:50:36 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 6.667 | ppl 101.62 | wps 369142 | wpb 31092.3 | bsz 60.9 | num_updates 3151 | |
2022-04-19 23:50:36 | INFO | fairseq.checkpoint_utils | Preparing to save checkpoint for epoch 1 @ 3151 updates | |
2022-04-19 23:50:36 | INFO | fairseq.trainer | Saving checkpoint to /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint1.pt | |
2022-04-19 23:50:44 | INFO | fairseq.trainer | Finished saving checkpoint to /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint1.pt | |
2022-04-19 23:50:53 | INFO | fairseq.checkpoint_utils | Saved checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint1.pt (epoch 1 @ 3151 updates, score 6.667) (writing took 16.46948338499351 seconds) | |
2022-04-19 23:50:53 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below) | |
2022-04-19 23:50:53 | INFO | train | epoch 001 | loss 8.779 | ppl 439.28 | wps 121311 | ups 3.7 | wpb 32760.1 | bsz 64 | num_updates 3151 | lr 0.000393896 | gnorm 0.933 | train_wall 831 | gb_free 20.6 | wall 853 | |
2022-04-19 23:50:53 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151 | |
2022-04-19 23:50:53 | INFO | fairseq.trainer | begin training epoch 2 | |
2022-04-19 23:50:53 | INFO | fairseq_cli.train | Start iterating over samples | |
2022-04-19 23:51:06 | INFO | train_inner | epoch 002: 49 / 3151 loss=6.918, ppl=120.93, wps=74243.5, ups=2.28, wpb=32522.2, bsz=63.5, num_updates=3200, lr=0.00040002, gnorm=0.733, train_wall=27, gb_free=20.6, wall=866 | |
2022-04-19 23:51:32 | INFO | train_inner | epoch 002: 149 / 3151 loss=6.849, ppl=115.31, wps=123288, ups=3.76, wpb=32768, bsz=64, num_updates=3300, lr=0.000412518, gnorm=0.721, train_wall=26, gb_free=20.6, wall=893 | |
2022-04-19 23:51:59 | INFO | train_inner | epoch 002: 249 / 3151 loss=6.803, ppl=111.66, wps=123003, ups=3.75, wpb=32768, bsz=64, num_updates=3400, lr=0.000425015, gnorm=0.702, train_wall=27, gb_free=20.6, wall=919 | |
2022-04-19 23:52:26 | INFO | train_inner | epoch 002: 349 / 3151 loss=6.763, ppl=108.6, wps=123206, ups=3.76, wpb=32768, bsz=64, num_updates=3500, lr=0.000437513, gnorm=0.721, train_wall=26, gb_free=20.6, wall=946 | |
2022-04-19 23:52:52 | INFO | train_inner | epoch 002: 449 / 3151 loss=6.721, ppl=105.53, wps=123142, ups=3.76, wpb=32768, bsz=64, num_updates=3600, lr=0.00045001, gnorm=0.703, train_wall=26, gb_free=20.6, wall=973 | |
2022-04-19 23:53:19 | INFO | train_inner | epoch 002: 549 / 3151 loss=6.687, ppl=103, wps=123147, ups=3.76, wpb=32768, bsz=64, num_updates=3700, lr=0.000462508, gnorm=0.689, train_wall=26, gb_free=20.6, wall=999 | |
2022-04-19 23:53:46 | INFO | train_inner | epoch 002: 649 / 3151 loss=6.642, ppl=99.9, wps=123093, ups=3.76, wpb=32768, bsz=64, num_updates=3800, lr=0.000475005, gnorm=0.694, train_wall=27, gb_free=20.6, wall=1026 | |
2022-04-19 23:54:12 | INFO | train_inner | epoch 002: 749 / 3151 loss=6.609, ppl=97.59, wps=122781, ups=3.75, wpb=32768, bsz=64, num_updates=3900, lr=0.000487503, gnorm=0.696, train_wall=27, gb_free=20.6, wall=1053 | |
2022-04-19 23:54:39 | INFO | train_inner | epoch 002: 849 / 3151 loss=6.598, ppl=96.88, wps=122812, ups=3.75, wpb=32768, bsz=64, num_updates=4000, lr=0.0005, gnorm=0.678, train_wall=27, gb_free=20.6, wall=1079 | |
2022-04-19 23:55:06 | INFO | train_inner | epoch 002: 949 / 3151 loss=6.554, ppl=93.97, wps=123015, ups=3.75, wpb=32768, bsz=64, num_updates=4100, lr=0.000493865, gnorm=0.683, train_wall=27, gb_free=20.6, wall=1106 |