@yukunlin
Created April 19, 2022 23:56
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : fairseq_train_wrapped
min_nodes : 2
max_nodes : 2
nproc_per_node : 8
run_id : foobar
rdzv_backend : c10d
rdzv_endpoint : 10.0.0.115:29500
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
restart_count=0
master_addr=ip-10-0-0-115.us-west-2.compute.internal
master_port=47199
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_6jwy1afk/foobar_erba4aeq/attempt_0/7/error.json
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:29 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 5): env://
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 6): env://
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 4): env://
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 7): env://
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 0
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 5
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 4
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 6
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 3
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 7
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 2
2022-04-19 23:36:30 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 1
ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
ip-10-0-0-115:78:78 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:78:78 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:78:78 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:78:78 [4] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:79:79 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:79:79 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:79:79 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:79:79 [5] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:77:77 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:81:81 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:77:77 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:81:81 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:76:76 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:77:77 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:77:77 [3] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:81:81 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:81:81 [7] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:76:76 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:76:76 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:76:76 [2] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:75:75 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:75:75 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:75:75 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:75:75 [1] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:80:80 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:80:80 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:80:80 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:80:80 [6] NCCL INFO Using network AWS Libfabric
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:132 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:79:134 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:137 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:78:133 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:136 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:138 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:135 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:139 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:78:133 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7
ip-10-0-0-115:79:134 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13
ip-10-0-0-115:80:139 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5
ip-10-0-0-115:81:136 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:75:138 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2
ip-10-0-0-115:76:137 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
ip-10-0-0-115:77:135 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
ip-10-0-0-115:74:132 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Connected all rings
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Connected all rings
ip-10-0-0-115:78:133 [4] NCCL INFO Connected all rings
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Connected all rings
ip-10-0-0-115:79:134 [5] NCCL INFO Connected all rings
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Connected all rings
ip-10-0-0-115:75:138 [1] NCCL INFO Connected all rings
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Connected all rings
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:134 [5] NCCL INFO Connected all trees
ip-10-0-0-115:79:134 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:79:134 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:75:138 [1] NCCL INFO Connected all trees
ip-10-0-0-115:75:138 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:75:138 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:133 [4] NCCL INFO Connected all trees
ip-10-0-0-115:78:133 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:78:133 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:74:132 [0] NCCL INFO Connected all trees
ip-10-0-0-115:74:132 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:74:132 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:136 [7] NCCL INFO Connected all trees
ip-10-0-0-115:81:136 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:81:136 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:77:135 [3] NCCL INFO Connected all trees
ip-10-0-0-115:77:135 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:77:135 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:80:139 [6] NCCL INFO Connected all trees
ip-10-0-0-115:80:139 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:80:139 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:76:137 [2] NCCL INFO Connected all trees
ip-10-0-0-115:76:137 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:76:137 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:77:135 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:76:137 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:75:138 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
ip-10-0-0-115:74:132 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:81:136 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
ip-10-0-0-115:80:139 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
ip-10-0-0-115:79:134 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
ip-10-0-0-115:78:133 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
ip-10-0-0-115:78:133 [4] NCCL INFO comm 0x7fb580002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
ip-10-0-0-115:74:132 [0] NCCL INFO comm 0x7f6200002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
ip-10-0-0-115:77:135 [3] NCCL INFO comm 0x7f334c002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
ip-10-0-0-115:81:136 [7] NCCL INFO comm 0x7fefdc002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
ip-10-0-0-115:75:138 [1] NCCL INFO comm 0x7f3c68002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
ip-10-0-0-115:76:137 [2] NCCL INFO comm 0x7f543c002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
ip-10-0-0-115:79:134 [5] NCCL INFO comm 0x7f25e0002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
ip-10-0-0-115:80:139 [6] NCCL INFO comm 0x7fa3a4002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
ip-10-0-0-115:74:74 [0] NCCL INFO Launch mode Parallel
2022-04-19 23:36:35 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 16, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 
'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 2048, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 2048, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 
'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm', 'activation_fn': relu, 'dropout': 0.1, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 512, 'decoder_output_dim': 512, 'decoder_input_dim': 512, 'decoder_ffn_embed_dim': 2048, 'decoder_layers': 6, 'decoder_attention_heads': 8, 'decoder_normalize_before': False, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 'character_embeddings': False, 'character_filters': 
'[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': False, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 1, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False, 'add_bos_token': False, 'tokens_per_sample': 512, 'max_target_positions': None, 'tpu': False}, 'task': {'_name': 'language_modeling', 'data': '/job/fairseq/data-bin/wikitext-103', 'sample_break_mode': none, 'tokens_per_sample': 512, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': none, 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 
'ema_fp32': False}}
2022-04-19 23:36:35 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types
2022-04-19 23:36:38 | INFO | fairseq_cli.train | TransformerLanguageModel(
  (decoder): TransformerDecoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(267744, 512, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
    )
    (output_projection): Linear(in_features=512, out_features=267744, bias=False)
  )
)
2022-04-19 23:36:38 | INFO | fairseq_cli.train | task: LanguageModelingTask
2022-04-19 23:36:38 | INFO | fairseq_cli.train | model: TransformerLanguageModel
2022-04-19 23:36:38 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion
2022-04-19 23:36:38 | INFO | fairseq_cli.train | num. shared model params: 155,999,232 (num. trained: 155,999,232)
2022-04-19 23:36:38 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
2022-04-19 23:36:38 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: /job/fairseq/data-bin/wikitext-103/valid
2022-04-19 23:36:38 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
2022-04-19 23:36:38 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes.
2022-04-19 23:36:38 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:81:177 [7] NCCL INFO NET/OFI [7] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:74:173 [0] NCCL INFO NET/OFI [0] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:76:178 [2] NCCL INFO NET/OFI [2] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:78:180 [4] NCCL INFO NET/OFI [4] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:75:179 [1] NCCL INFO NET/OFI [1] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 0 busId 0000:00:16.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:77:176 [3] NCCL INFO NET/OFI [3] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00
ip-10-0-0-115:79:174 [5] NCCL INFO NET/OFI [5] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00
ip-10-0-0-115:80:175 [6] NCCL INFO NET/OFI [6] getCudaPath dev 3 busId 0000:00:19.0 path /sys/devices/pci0000:00/
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 01/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 02/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04/08 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05/08 : 0 4 7 6 5 9 10 11 8 12 15 14 13 1 2 3
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06/08 : 0 1 5 6 10 11 15 12 8 9 13 14 2 3 7 4
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07/08 : 0 1 2 6 5 4 7 11 8 9 10 14 13 12 15 3
ip-10-0-0-115:74:173 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 4/-1/-1->0->3 [2] -1/-1/-1->0->3 [3] 3/-1/-1->0->4 [4] 3/-1/-1->0->8 [5] 4/-1/-1->0->3 [6] -1/-1/-1->0->3 [7] 3/-1/-1->0->4
ip-10-0-0-115:76:178 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 1/-1/-1->2->3
ip-10-0-0-115:75:179 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 2/9/-1->1->-1 [2] 2/-1/-1->1->5 [3] -1/-1/-1->1->2 [4] 5/-1/-1->1->2 [5] 2/-1/-1->1->9 [6] 2/-1/-1->1->5 [7] -1/-1/-1->1->2
ip-10-0-0-115:77:176 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 0/-1/-1->3->2 [3] 2/-1/-1->3->0 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 0/-1/-1->3->2 [7] 2/-1/-1->3->0
ip-10-0-0-115:78:180 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] 7/-1/-1->4->0 [2] 7/12/-1->4->-1 [3] 0/-1/-1->4->7 [4] -1/-1/-1->4->7 [5] 7/-1/-1->4->0 [6] 7/-1/-1->4->12 [7] 0/-1/-1->4->7
ip-10-0-0-115:79:174 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] -1/-1/-1->5->6 [2] 1/-1/-1->5->6 [3] 6/13/-1->5->-1 [4] 6/-1/-1->5->1 [5] -1/-1/-1->5->6 [6] 1/-1/-1->5->6 [7] 6/-1/-1->5->13
ip-10-0-0-115:80:175 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 5/-1/-1->6->7 [2] 5/-1/-1->6->7 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 5/-1/-1->6->7 [6] 5/-1/-1->6->7 [7] 7/-1/-1->6->5
ip-10-0-0-115:81:177 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 6/-1/-1->7->4 [2] 6/-1/-1->7->4 [3] 4/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 6/-1/-1->7->4 [6] 6/-1/-1->7->4 [7] 4/-1/-1->7->6
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 02 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 02 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 03 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 04 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 1[170] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 05 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 06 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 07 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 04 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 03 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 05 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 06 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 07 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 01 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 06 : 6[1c0] -> 10[180] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 05 : 5[1b0] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 06 : 14[1c0] -> 2[180] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 13[1b0] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 07 : 2[180] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 01 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 07 : 15[1d0] -> 3[190] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 04 : 4[1a0] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 02 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 07 : 7[1d0] -> 11[190] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 04 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 06 : 3[190] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 06 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 03 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 04 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 02 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 05 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 05 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 04 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 07 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 07 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 06 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Connected all rings
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 02 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 5[1b0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 04 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Connected all rings
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 06 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 02 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 04 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 05 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Connected all rings
ip-10-0-0-115:77:176 [3] NCCL INFO Connected all rings
ip-10-0-0-115:81:177 [7] NCCL INFO Connected all rings
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Connected all rings
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Connected all rings
ip-10-0-0-115:80:175 [6] NCCL INFO Connected all rings
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 04 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 05 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 04 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 07 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 03 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 05 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 06 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 07 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 9[170] -> 1[170] [receive] via NET/AWS Libfabric/1
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 03 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 13[1b0] -> 5[1b0] [receive] via NET/AWS Libfabric/3
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 05 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 01 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 04 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 06 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 06 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 05 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 07 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 07 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 04 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 06 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 1[170] -> 9[170] [send] via NET/AWS Libfabric/1
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 07 : 5[1b0] -> 13[1b0] [send] via NET/AWS Libfabric/3
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 02 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 04 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 06 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 12[1a0] -> 4[1a0] [receive] via NET/AWS Libfabric/2
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 8[160] -> 0[160] [receive] via NET/AWS Libfabric/0
ip-10-0-0-115:79:174 [5] NCCL INFO Connected all trees
ip-10-0-0-115:79:174 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:79:174 [5] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 02 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:75:179 [1] NCCL INFO Connected all trees
ip-10-0-0-115:75:179 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:75:179 [1] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 12[1a0] [send] via NET/AWS Libfabric/2
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 03 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 04 : 0[160] -> 8[160] [send] via NET/AWS Libfabric/0
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 01 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 03 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 05 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 07 : 4[1a0] -> 0[160] via P2P/IPC
ip-10-0-0-115:74:173 [0] NCCL INFO Connected all trees
ip-10-0-0-115:74:173 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:74:173 [0] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:78:180 [4] NCCL INFO Connected all trees
ip-10-0-0-115:78:180 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:78:180 [4] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 05 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 02 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 05 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 04 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 06 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 06 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 07 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 07 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:77:176 [3] NCCL INFO Connected all trees
ip-10-0-0-115:77:176 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:77:176 [3] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:81:177 [7] NCCL INFO Connected all trees
ip-10-0-0-115:81:177 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:81:177 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:76:178 [2] NCCL INFO Connected all trees
ip-10-0-0-115:76:178 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:76:178 [2] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:80:175 [6] NCCL INFO Connected all trees
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 02 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:80:175 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:80:175 [6] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 02 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:77:176 [3] NCCL INFO Channel 03 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 03 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:76:178 [2] NCCL INFO Channel 05 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 05 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 05 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 06 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
ip-10-0-0-115:75:179 [1] NCCL INFO Channel 06 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 03 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 02 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:74:173 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 02 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 03 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:81:177 [7] NCCL INFO Channel 03 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
ip-10-0-0-115:80:175 [6] NCCL INFO Channel 05 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 05 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 06 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
ip-10-0-0-115:79:174 [5] NCCL INFO Channel 06 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
ip-10-0-0-115:78:180 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
ip-10-0-0-115:78:180 [4] NCCL INFO comm 0x7fb524002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
ip-10-0-0-115:76:178 [2] NCCL INFO comm 0x7f53e8002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
ip-10-0-0-115:80:175 [6] NCCL INFO comm 0x7fa344002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
ip-10-0-0-115:79:174 [5] NCCL INFO comm 0x7f2574002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
ip-10-0-0-115:74:173 [0] NCCL INFO comm 0x7f61b8002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
ip-10-0-0-115:75:179 [1] NCCL INFO comm 0x7f3c14002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
ip-10-0-0-115:77:176 [3] NCCL INFO comm 0x7f32ec002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
ip-10-0-0-115:81:177 [7] NCCL INFO comm 0x7fef80002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
ip-10-0-0-115:74:74 [0] NCCL INFO Launch mode Parallel
2022-04-19 23:36:40 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-19 23:36:40 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2022-04-19 23:36:40 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
2022-04-19 23:36:40 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None
2022-04-19 23:36:40 | INFO | fairseq.trainer | Preparing to load checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint_last.pt
2022-04-19 23:36:40 | INFO | fairseq.trainer | No existing checkpoint found /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint_last.pt
2022-04-19 23:36:40 | INFO | fairseq.trainer | loading train data for epoch 1
2022-04-19 23:36:40 | INFO | fairseq.data.data_utils | loaded 1,801,350 examples from: /job/fairseq/data-bin/wikitext-103/train
2022-04-19 23:36:40 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
2022-04-19 23:36:40 | INFO | fairseq.optim.adam | using FusedAdam
2022-04-19 23:36:41 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151
2022-04-19 23:36:41 | INFO | fairseq.trainer | begin training epoch 1
2022-04-19 23:36:41 | INFO | fairseq_cli.train | Start iterating over samples
2022-04-19 23:36:42 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2022-04-19 23:37:08 | INFO | train_inner | epoch 001: 100 / 3151 loss=16.618, ppl=100595, wps=125528, ups=3.83, wpb=32768, bsz=64, num_updates=100, lr=1.25975e-05, gnorm=3.652, train_wall=27, gb_free=20.6, wall=28
2022-04-19 23:37:34 | INFO | train_inner | epoch 001: 200 / 3151 loss=14.226, ppl=19162.3, wps=125698, ups=3.84, wpb=32768, bsz=64, num_updates=200, lr=2.5095e-05, gnorm=1.593, train_wall=26, gb_free=20.6, wall=54
2022-04-19 23:38:00 | INFO | train_inner | epoch 001: 300 / 3151 loss=12.192, ppl=4678.25, wps=125282, ups=3.82, wpb=32764.3, bsz=64, num_updates=300, lr=3.75925e-05, gnorm=1.079, train_wall=26, gb_free=20.6, wall=81
2022-04-19 23:38:26 | INFO | train_inner | epoch 001: 400 / 3151 loss=10.734, ppl=1702.91, wps=124818, ups=3.81, wpb=32768, bsz=64, num_updates=400, lr=5.009e-05, gnorm=0.671, train_wall=26, gb_free=20.6, wall=107
2022-04-19 23:38:53 | INFO | train_inner | epoch 001: 500 / 3151 loss=10.124, ppl=1116.01, wps=124568, ups=3.8, wpb=32768, bsz=64, num_updates=500, lr=6.25875e-05, gnorm=0.558, train_wall=26, gb_free=20.6, wall=133
2022-04-19 23:39:19 | INFO | train_inner | epoch 001: 600 / 3151 loss=9.818, ppl=902.65, wps=124524, ups=3.8, wpb=32768, bsz=64, num_updates=600, lr=7.5085e-05, gnorm=0.643, train_wall=26, gb_free=20.6, wall=159
2022-04-19 23:39:45 | INFO | train_inner | epoch 001: 700 / 3151 loss=9.564, ppl=756.76, wps=124318, ups=3.79, wpb=32768, bsz=64, num_updates=700, lr=8.75825e-05, gnorm=0.691, train_wall=26, gb_free=20.6, wall=186
2022-04-19 23:40:12 | INFO | train_inner | epoch 001: 800 / 3151 loss=9.343, ppl=649.59, wps=124355, ups=3.79, wpb=32768, bsz=64, num_updates=800, lr=0.00010008, gnorm=0.76, train_wall=26, gb_free=20.6, wall=212
2022-04-19 23:40:38 | INFO | train_inner | epoch 001: 900 / 3151 loss=9.153, ppl=569.44, wps=124157, ups=3.79, wpb=32768, bsz=64, num_updates=900, lr=0.000112578, gnorm=0.818, train_wall=26, gb_free=20.6, wall=239
2022-04-19 23:41:05 | INFO | train_inner | epoch 001: 1000 / 3151 loss=8.946, ppl=493.1, wps=123768, ups=3.78, wpb=32768, bsz=64, num_updates=1000, lr=0.000125075, gnorm=0.899, train_wall=26, gb_free=20.6, wall=265
2022-04-19 23:41:31 | INFO | train_inner | epoch 001: 1100 / 3151 loss=8.785, ppl=441.24, wps=123914, ups=3.78, wpb=32768, bsz=64, num_updates=1100, lr=0.000137573, gnorm=0.843, train_wall=26, gb_free=20.6, wall=291
2022-04-19 23:41:57 | INFO | train_inner | epoch 001: 1200 / 3151 loss=8.636, ppl=397.75, wps=124121, ups=3.79, wpb=32768, bsz=64, num_updates=1200, lr=0.00015007, gnorm=0.927, train_wall=26, gb_free=20.6, wall=318
2022-04-19 23:42:24 | INFO | train_inner | epoch 001: 1300 / 3151 loss=8.488, ppl=358.98, wps=123984, ups=3.78, wpb=32768, bsz=64, num_updates=1300, lr=0.000162568, gnorm=0.932, train_wall=26, gb_free=20.6, wall=344
2022-04-19 23:42:50 | INFO | train_inner | epoch 001: 1400 / 3151 loss=8.375, ppl=331.91, wps=123750, ups=3.78, wpb=32768, bsz=64, num_updates=1400, lr=0.000175065, gnorm=0.935, train_wall=26, gb_free=20.6, wall=371
2022-04-19 23:43:17 | INFO | train_inner | epoch 001: 1500 / 3151 loss=8.24, ppl=302.35, wps=123999, ups=3.78, wpb=32768, bsz=64, num_updates=1500, lr=0.000187563, gnorm=0.898, train_wall=26, gb_free=20.6, wall=397
2022-04-19 23:43:43 | INFO | train_inner | epoch 001: 1600 / 3151 loss=8.137, ppl=281.44, wps=123722, ups=3.78, wpb=32768, bsz=64, num_updates=1600, lr=0.00020006, gnorm=0.925, train_wall=26, gb_free=20.6, wall=424
2022-04-19 23:44:10 | INFO | train_inner | epoch 001: 1700 / 3151 loss=8.029, ppl=261.19, wps=123824, ups=3.78, wpb=32768, bsz=64, num_updates=1700, lr=0.000212558, gnorm=0.907, train_wall=26, gb_free=20.6, wall=450
2022-04-19 23:44:36 | INFO | train_inner | epoch 001: 1800 / 3151 loss=7.932, ppl=244.27, wps=123732, ups=3.78, wpb=32768, bsz=64, num_updates=1800, lr=0.000225055, gnorm=0.903, train_wall=26, gb_free=20.6, wall=477
2022-04-19 23:45:03 | INFO | train_inner | epoch 001: 1900 / 3151 loss=7.812, ppl=224.76, wps=123761, ups=3.78, wpb=32768, bsz=64, num_updates=1900, lr=0.000237553, gnorm=0.866, train_wall=26, gb_free=20.6, wall=503
2022-04-19 23:45:29 | INFO | train_inner | epoch 001: 2000 / 3151 loss=7.734, ppl=212.94, wps=123887, ups=3.78, wpb=32768, bsz=64, num_updates=2000, lr=0.00025005, gnorm=0.868, train_wall=26, gb_free=20.6, wall=530
2022-04-19 23:45:56 | INFO | train_inner | epoch 001: 2100 / 3151 loss=7.647, ppl=200.5, wps=123574, ups=3.77, wpb=32768, bsz=64, num_updates=2100, lr=0.000262548, gnorm=0.859, train_wall=26, gb_free=20.6, wall=556
2022-04-19 23:46:22 | INFO | train_inner | epoch 001: 2200 / 3151 loss=7.566, ppl=189.5, wps=123272, ups=3.76, wpb=32768, bsz=64, num_updates=2200, lr=0.000275045, gnorm=0.83, train_wall=26, gb_free=20.6, wall=583
2022-04-19 23:46:49 | INFO | train_inner | epoch 001: 2300 / 3151 loss=7.487, ppl=179.44, wps=122972, ups=3.75, wpb=32768, bsz=64, num_updates=2300, lr=0.000287543, gnorm=0.839, train_wall=27, gb_free=20.6, wall=609
2022-04-19 23:47:15 | INFO | train_inner | epoch 001: 2400 / 3151 loss=7.419, ppl=171.1, wps=123344, ups=3.76, wpb=32768, bsz=64, num_updates=2400, lr=0.00030004, gnorm=0.798, train_wall=26, gb_free=20.6, wall=636
2022-04-19 23:47:42 | INFO | train_inner | epoch 001: 2500 / 3151 loss=7.339, ppl=161.9, wps=123345, ups=3.76, wpb=32768, bsz=64, num_updates=2500, lr=0.000312538, gnorm=0.809, train_wall=26, gb_free=20.6, wall=662
2022-04-19 23:48:09 | INFO | train_inner | epoch 001: 2600 / 3151 loss=7.277, ppl=155.14, wps=122950, ups=3.75, wpb=32768, bsz=64, num_updates=2600, lr=0.000325035, gnorm=0.773, train_wall=27, gb_free=20.6, wall=689
2022-04-19 23:48:35 | INFO | train_inner | epoch 001: 2700 / 3151 loss=7.204, ppl=147.4, wps=122741, ups=3.75, wpb=32768, bsz=64, num_updates=2700, lr=0.000337533, gnorm=0.761, train_wall=27, gb_free=20.6, wall=716
2022-04-19 23:49:02 | INFO | train_inner | epoch 001: 2800 / 3151 loss=7.141, ppl=141.17, wps=123019, ups=3.75, wpb=32768, bsz=64, num_updates=2800, lr=0.00035003, gnorm=0.772, train_wall=27, gb_free=20.6, wall=742
2022-04-19 23:49:29 | INFO | train_inner | epoch 001: 2900 / 3151 loss=7.099, ppl=137.1, wps=122772, ups=3.75, wpb=32768, bsz=64, num_updates=2900, lr=0.000362528, gnorm=0.758, train_wall=27, gb_free=20.6, wall=769
2022-04-19 23:49:55 | INFO | train_inner | epoch 001: 3000 / 3151 loss=7.021, ppl=129.88, wps=122758, ups=3.75, wpb=32768, bsz=64, num_updates=3000, lr=0.000375025, gnorm=0.719, train_wall=27, gb_free=20.6, wall=796
2022-04-19 23:50:22 | INFO | train_inner | epoch 001: 3100 / 3151 loss=6.987, ppl=126.87, wps=122945, ups=3.75, wpb=32768, bsz=64, num_updates=3100, lr=0.000387523, gnorm=0.731, train_wall=27, gb_free=20.6, wall=822
2022-04-19 23:50:36 | INFO | fairseq_cli.train | begin validation on "valid" subset
2022-04-19 23:50:36 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 6.667 | ppl 101.62 | wps 369142 | wpb 31092.3 | bsz 60.9 | num_updates 3151
2022-04-19 23:50:36 | INFO | fairseq.checkpoint_utils | Preparing to save checkpoint for epoch 1 @ 3151 updates
2022-04-19 23:50:36 | INFO | fairseq.trainer | Saving checkpoint to /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint1.pt
2022-04-19 23:50:44 | INFO | fairseq.trainer | Finished saving checkpoint to /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint1.pt
2022-04-19 23:50:53 | INFO | fairseq.checkpoint_utils | Saved checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa/checkpoint1.pt (epoch 1 @ 3151 updates, score 6.667) (writing took 16.46948338499351 seconds)
2022-04-19 23:50:53 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2022-04-19 23:50:53 | INFO | train | epoch 001 | loss 8.779 | ppl 439.28 | wps 121311 | ups 3.7 | wpb 32760.1 | bsz 64 | num_updates 3151 | lr 0.000393896 | gnorm 0.933 | train_wall 831 | gb_free 20.6 | wall 853
2022-04-19 23:50:53 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151
2022-04-19 23:50:53 | INFO | fairseq.trainer | begin training epoch 2
2022-04-19 23:50:53 | INFO | fairseq_cli.train | Start iterating over samples
2022-04-19 23:51:06 | INFO | train_inner | epoch 002: 49 / 3151 loss=6.918, ppl=120.93, wps=74243.5, ups=2.28, wpb=32522.2, bsz=63.5, num_updates=3200, lr=0.00040002, gnorm=0.733, train_wall=27, gb_free=20.6, wall=866
2022-04-19 23:51:32 | INFO | train_inner | epoch 002: 149 / 3151 loss=6.849, ppl=115.31, wps=123288, ups=3.76, wpb=32768, bsz=64, num_updates=3300, lr=0.000412518, gnorm=0.721, train_wall=26, gb_free=20.6, wall=893
2022-04-19 23:51:59 | INFO | train_inner | epoch 002: 249 / 3151 loss=6.803, ppl=111.66, wps=123003, ups=3.75, wpb=32768, bsz=64, num_updates=3400, lr=0.000425015, gnorm=0.702, train_wall=27, gb_free=20.6, wall=919
2022-04-19 23:52:26 | INFO | train_inner | epoch 002: 349 / 3151 loss=6.763, ppl=108.6, wps=123206, ups=3.76, wpb=32768, bsz=64, num_updates=3500, lr=0.000437513, gnorm=0.721, train_wall=26, gb_free=20.6, wall=946
2022-04-19 23:52:52 | INFO | train_inner | epoch 002: 449 / 3151 loss=6.721, ppl=105.53, wps=123142, ups=3.76, wpb=32768, bsz=64, num_updates=3600, lr=0.00045001, gnorm=0.703, train_wall=26, gb_free=20.6, wall=973
2022-04-19 23:53:19 | INFO | train_inner | epoch 002: 549 / 3151 loss=6.687, ppl=103, wps=123147, ups=3.76, wpb=32768, bsz=64, num_updates=3700, lr=0.000462508, gnorm=0.689, train_wall=26, gb_free=20.6, wall=999
2022-04-19 23:53:46 | INFO | train_inner | epoch 002: 649 / 3151 loss=6.642, ppl=99.9, wps=123093, ups=3.76, wpb=32768, bsz=64, num_updates=3800, lr=0.000475005, gnorm=0.694, train_wall=27, gb_free=20.6, wall=1026
2022-04-19 23:54:12 | INFO | train_inner | epoch 002: 749 / 3151 loss=6.609, ppl=97.59, wps=122781, ups=3.75, wpb=32768, bsz=64, num_updates=3900, lr=0.000487503, gnorm=0.696, train_wall=27, gb_free=20.6, wall=1053
2022-04-19 23:54:39 | INFO | train_inner | epoch 002: 849 / 3151 loss=6.598, ppl=96.88, wps=122812, ups=3.75, wpb=32768, bsz=64, num_updates=4000, lr=0.0005, gnorm=0.678, train_wall=27, gb_free=20.6, wall=1079
2022-04-19 23:55:06 | INFO | train_inner | epoch 002: 949 / 3151 loss=6.554, ppl=93.97, wps=123015, ups=3.75, wpb=32768, bsz=64, num_updates=4100, lr=0.000493865, gnorm=0.683, train_wall=27, gb_free=20.6, wall=1106