@yukunlin
Last active April 18, 2022 23:39
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
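
The deprecation warning above spells out the migration path: under torchrun (where --use_env is implied), workers no longer receive a --local_rank argument and must read LOCAL_RANK from the environment instead. A minimal sketch of that change, assuming a training script that previously parsed the flag with argparse:

import argparse
import os

import torch

parser = argparse.ArgumentParser()
# Old style: torch.distributed.launch appended --local_rank to each worker.
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# New style, per the warning: torchrun exports LOCAL_RANK instead,
# so prefer the environment variable and fall back to the CLI flag.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
torch.cuda.set_device(local_rank)
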
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : fairseq_train_wrapped
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.0.0.115:12347
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
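
The config block above comes from torch.distributed's elastic launcher. As a rough sketch, the same configuration could be built programmatically; note that LaunchConfig and elastic_launch live in a semi-internal module (torch.distributed.launcher.api) and field names may vary across PyTorch versions, so treat this as illustrative only:

from torch.distributed.launcher.api import LaunchConfig, elastic_launch

# Mirrors the logged configs: 2 nodes x 8 procs, static rendezvous on
# 10.0.0.115:12347, no restarts, 5 s monitor interval.
config = LaunchConfig(
    min_nodes=2,
    max_nodes=2,
    nproc_per_node=8,
    rdzv_backend="static",
    rdzv_endpoint="10.0.0.115:12347",
    rdzv_configs={"rank": 0, "timeout": 900},
    max_restarts=0,
    monitor_interval=5,
)

def worker():
    pass  # per-process entrypoint (here: the fairseq_train_wrapped script)

# elastic_launch(config, worker)()  # would start the 8 local workers
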
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.0.0.115
  master_port=12347
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
  global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
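
The rendezvous result encodes the global topology: two agents (group_world_size=2) with eight local workers each give a world size of 16, and a worker's global rank is group_rank * nproc_per_node + local_rank, which is why this node (group_rank 0) holds global ranks 0-7 and the second node holds 8-15. The arithmetic, spelled out:

nnodes, nproc_per_node = 2, 8   # group_world_size and local workers per agent
group_rank = 0                  # this node, which also hosts the master

world_size = nnodes * nproc_per_node                 # 16, as logged
global_ranks = [group_rank * nproc_per_node + lr
                for lr in range(nproc_per_node)]

assert world_size == 16
assert global_ranks == list(range(8))  # the node with group_rank=1 gets 8..15
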
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_mxzxvmjr/none_lwbywd8h/attempt_0/7/error.json
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:06 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 5): env://
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 6): env://
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 7): env://
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 4): env://
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 0
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 7
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 1
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 3
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 6
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 4
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 5
2022-04-18 23:34:08 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-115 as rank 2
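
fairseq's "distributed init (rank N): env://" lines map onto the standard environment-variable initialization path in torch.distributed; the store-based-barrier messages are emitted by init_process_group once all 16 ranks have registered with the TCP store at master_addr:master_port. A minimal sketch of what each worker effectively runs, assuming the launcher has exported MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK:

import os

import torch
import torch.distributed as dist

# init_method="env://" tells PyTorch to read MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE from the environment set up by the launcher.
dist.init_process_group(backend="nccl", init_method="env://")

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
print(f"initialized rank {dist.get_rank()} of {dist.get_world_size()}")
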
ip-10-0-0-115:73:73 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:73:73 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:73:73 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:73:73 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
ip-10-0-0-115:80:80 [7] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:77:77 [4] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [1] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:78:78 [5] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:80:80 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:77:77 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:80:80 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:77:77 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:77:77 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:78:78 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:80:80 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:78:78 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:78:78 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:79:79 [6] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:79:79 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:79:79 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:79:79 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:76:76 [3] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [1] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:77:77 [4] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:80:80 [7] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:78:78 [5] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:74:74 [1] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:80:80 [7] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:77:77 [4] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:78:78 [5] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:80:80 [7] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:77:77 [4] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:78:78 [5] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:74:74 [1] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:76:76 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:76:76 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:76:76 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:80:80 [7] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [1] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:78:78 [5] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:80:80 [7] NCCL INFO Using network Socket
ip-10-0-0-115:77:77 [4] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [1] NCCL INFO Using network Socket
ip-10-0-0-115:78:78 [5] NCCL INFO Using network Socket
ip-10-0-0-115:77:77 [4] NCCL INFO Using network Socket
ip-10-0-0-115:79:79 [6] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:79:79 [6] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:79:79 [6] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:79:79 [6] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:79:79 [6] NCCL INFO Using network Socket
ip-10-0-0-115:76:76 [3] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:76:76 [3] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:76:76 [3] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:76:76 [3] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:76:76 [3] NCCL INFO Using network Socket
ip-10-0-0-115:75:75 [2] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:75:75 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:75:75 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:75:75 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:75:75 [2] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:75:75 [2] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:75:75 [2] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:75:75 [2] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:75:75 [2] NCCL INFO Using network Socket
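
The NCCL warnings in this block are the notable part: the aws-ofi-nccl plugin bails out because it only supports the EFA libfabric provider and none is available, no InfiniBand device is found, and every rank therefore falls back to plain TCP sockets on ens5, which caps inter-node bandwidth well below what EFA would deliver. The environment knobs relevant to this fallback, as a sketch to set before NCCL initializes; NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL variables, and FI_EFA_FORK_SAFE is the libfabric flag the plugin sets itself per the log:

import os

os.environ.setdefault("NCCL_DEBUG", "INFO")          # produced this verbose log
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens5")  # pin the socket interface
os.environ.setdefault("FI_EFA_FORK_SAFE", "1")       # EFA fork safety, per the log
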
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 00/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:73:130 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 3/-1/-1->0->8
ip-10-0-0-115:74:132 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2
ip-10-0-0-115:75:137 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3
ip-10-0-0-115:76:136 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0
ip-10-0-0-115:77:133 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7
ip-10-0-0-115:78:134 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1
ip-10-0-0-115:79:135 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
ip-10-0-0-115:80:131 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 01 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:76:136 [3] NCCL INFO Connected all rings
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:79:135 [6] NCCL INFO Connected all rings
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:130 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:78:134 [5] NCCL INFO Connected all rings
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:74:132 [1] NCCL INFO Connected all rings
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:75:137 [2] NCCL INFO Connected all rings
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:73:130 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:75:137 [2] NCCL INFO Connected all trees
ip-10-0-0-115:75:137 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:75:137 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 00 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:74:132 [1] NCCL INFO Connected all trees
ip-10-0-0-115:74:132 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:74:132 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:73:130 [0] NCCL INFO Connected all rings
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 01 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:78:134 [5] NCCL INFO Connected all trees
ip-10-0-0-115:78:134 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:78:134 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:77:133 [4] NCCL INFO Connected all rings
ip-10-0-0-115:80:131 [7] NCCL INFO Connected all rings
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:133 [4] NCCL INFO Connected all trees
ip-10-0-0-115:77:133 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:77:133 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:130 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:80:131 [7] NCCL INFO Connected all trees
ip-10-0-0-115:80:131 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:80:131 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:79:135 [6] NCCL INFO Connected all trees
ip-10-0-0-115:79:135 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:79:135 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01 : 8[160] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:130 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01 : 0[160] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:73:130 [0] NCCL INFO Connected all trees
ip-10-0-0-115:73:130 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:76:136 [3] NCCL INFO Connected all trees
ip-10-0-0-115:73:130 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:136 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:76:136 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 00 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:76:136 [3] NCCL INFO Channel 01 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 01 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:75:137 [2] NCCL INFO Channel 01 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 01 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 00 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
ip-10-0-0-115:74:132 [1] NCCL INFO Channel 00 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 00 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:73:130 [0] NCCL INFO Channel 01 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 00 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 01 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:80:131 [7] NCCL INFO Channel 01 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
ip-10-0-0-115:79:135 [6] NCCL INFO Channel 01 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 01 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 00 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
ip-10-0-0-115:78:134 [5] NCCL INFO Channel 00 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
ip-10-0-0-115:77:133 [4] NCCL INFO Channel 01 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
ip-10-0-0-115:77:133 [4] NCCL INFO comm 0x7effec002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
ip-10-0-0-115:73:130 [0] NCCL INFO comm 0x7f50b4002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
ip-10-0-0-115:79:135 [6] NCCL INFO comm 0x7ff9d0002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
ip-10-0-0-115:75:137 [2] NCCL INFO comm 0x7f5340002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
ip-10-0-0-115:78:134 [5] NCCL INFO comm 0x7fb09c002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
ip-10-0-0-115:76:136 [3] NCCL INFO comm 0x7ffacc002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
ip-10-0-0-115:80:131 [7] NCCL INFO comm 0x7fa574002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
ip-10-0-0-115:74:132 [1] NCCL INFO comm 0x7f5570002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
ip-10-0-0-115:73:73 [0] NCCL INFO Launch mode Parallel
2022-04-18 23:34:11 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 16, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 2048, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 2048, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 
'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm', 'activation_fn': relu, 'dropout': 0.1, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 512, 'decoder_output_dim': 512, 'decoder_input_dim': 512, 'decoder_ffn_embed_dim': 2048, 'decoder_layers': 6, 'decoder_attention_heads': 8, 'decoder_normalize_before': False, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 'character_embeddings': False, 'character_filters': '[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': False, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 1, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False, 'add_bos_token': False, 'tokens_per_sample': 512, 'max_target_positions': None, 'tpu': False}, 'task': {'_name': 'language_modeling', 'data': '/job/fairseq/data-bin/wikitext-103', 'sample_break_mode': none, 'tokens_per_sample': 512, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 
'max_target_positions': None, 'shorten_method': none, 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
2022-04-18 23:34:12 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types
2022-04-18 23:34:15 | INFO | fairseq_cli.train | TransformerLanguageModel(
  (decoder): TransformerDecoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(267744, 512, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerDecoderLayerBase(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
      )
    )
    (output_projection): Linear(in_features=512, out_features=267744, bias=False)
  )
)
2022-04-18 23:34:15 | INFO | fairseq_cli.train | task: LanguageModelingTask
2022-04-18 23:34:15 | INFO | fairseq_cli.train | model: TransformerLanguageModel
2022-04-18 23:34:15 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion
2022-04-18 23:34:15 | INFO | fairseq_cli.train | num. shared model params: 155,999,232 (num. trained: 155,999,232)
2022-04-18 23:34:15 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
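
The 155,999,232 shared parameters can be reproduced from the architecture dump above: a 267,744 x 512 token embedding (tied with output_projection, so counted once) plus six identical decoder layers; the sinusoidal positional embedding has no learned weights. A quick check of that bookkeeping:

vocab, d, ffn, layers = 267_744, 512, 2_048, 6

embed = vocab * d                        # tied with output_projection
attn = 4 * (d * d + d)                   # k/v/q/out projections, each with bias
norms = 2 * (2 * d)                      # two LayerNorms (weight + bias)
mlp = (d * ffn + ffn) + (ffn * d + d)    # fc1 + fc2
per_layer = attn + norms + mlp           # 3,152,384

assert embed + layers * per_layer == 155_999_232  # matches the logged count
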
2022-04-18 23:34:15 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: /job/fairseq/data-bin/wikitext-103/valid
2022-04-18 23:34:15 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
2022-04-18 23:34:15 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes.
2022-04-18 23:34:15 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 00/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
ip-10-0-0-115:73:174 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 3/-1/-1->0->8
ip-10-0-0-115:74:179 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2
ip-10-0-0-115:75:180 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3
ip-10-0-0-115:77:181 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7
ip-10-0-0-115:76:177 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0
ip-10-0-0-115:78:178 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1
ip-10-0-0-115:80:176 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6
ip-10-0-0-115:79:175 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 01 : 1[170] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:79:175 [6] NCCL INFO Connected all rings
ip-10-0-0-115:76:177 [3] NCCL INFO Connected all rings
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:174 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:78:178 [5] NCCL INFO Connected all rings
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 01 : 4[1a0] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 01 : 5[1b0] -> 1[170] via P2P/IPC
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
ip-10-0-0-115:74:179 [1] NCCL INFO Connected all rings
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Connected all rings
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:174 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Connected all trees
ip-10-0-0-115:75:180 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:75:180 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 00 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:74:179 [1] NCCL INFO Connected all trees
ip-10-0-0-115:74:179 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:74:179 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 01 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:73:174 [0] NCCL INFO Connected all rings
ip-10-0-0-115:78:178 [5] NCCL INFO Connected all trees
ip-10-0-0-115:78:178 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:78:178 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:77:181 [4] NCCL INFO Connected all rings
ip-10-0-0-115:80:176 [7] NCCL INFO Connected all rings
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
ip-10-0-0-115:77:181 [4] NCCL INFO Connected all trees
ip-10-0-0-115:77:181 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:77:181 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:174 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
ip-10-0-0-115:80:176 [7] NCCL INFO Connected all trees
ip-10-0-0-115:80:176 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:80:176 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:79:175 [6] NCCL INFO Connected all trees
ip-10-0-0-115:79:175 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:79:175 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01 : 8[160] -> 0[160] [receive] via NET/Socket/0
ip-10-0-0-115:73:174 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01 : 0[160] -> 8[160] [send] via NET/Socket/0
ip-10-0-0-115:73:174 [0] NCCL INFO Connected all trees
ip-10-0-0-115:73:174 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:73:174 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:76:177 [3] NCCL INFO Connected all trees
ip-10-0-0-115:76:177 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
ip-10-0-0-115:76:177 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 00 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:76:177 [3] NCCL INFO Channel 01 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 01 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
ip-10-0-0-115:75:180 [2] NCCL INFO Channel 01 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 01 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 01 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 00 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 01 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:74:179 [1] NCCL INFO Channel 00 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 00 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:73:174 [0] NCCL INFO Channel 01 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 00 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
ip-10-0-0-115:80:176 [7] NCCL INFO Channel 01 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 01 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
ip-10-0-0-115:79:175 [6] NCCL INFO Channel 01 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
ip-10-0-0-115:78:178 [5] NCCL INFO Channel 00 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 00 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
ip-10-0-0-115:77:181 [4] NCCL INFO Channel 01 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
ip-10-0-0-115:76:177 [3] NCCL INFO comm 0x7ffa8c002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
ip-10-0-0-115:77:181 [4] NCCL INFO comm 0x7effa8002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
ip-10-0-0-115:78:178 [5] NCCL INFO comm 0x7fb064002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
ip-10-0-0-115:80:176 [7] NCCL INFO comm 0x7fa534002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
ip-10-0-0-115:79:175 [6] NCCL INFO comm 0x7ff990002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
ip-10-0-0-115:73:174 [0] NCCL INFO comm 0x7f5078002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
ip-10-0-0-115:75:180 [2] NCCL INFO comm 0x7f5304002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
ip-10-0-0-115:74:179 [1] NCCL INFO comm 0x7f5534002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
ip-10-0-0-115:73:73 [0] NCCL INFO Launch mode Parallel
2022-04-18 23:34:15 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
2022-04-18 23:34:15 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2022-04-18 23:34:15 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
2022-04-18 23:34:15 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None
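
Those two numbers fix the effective batch size: 2048 tokens per device across 16 devices at update_freq=1 is about 32,768 tokens per optimizer step, which is what the wpb column reports in the train_inner lines further down (slightly less whenever batches are not completely full):

max_tokens, devices, update_freq = 2048, 16, 1
tokens_per_update = max_tokens * devices * update_freq
assert tokens_per_update == 32_768  # the "wpb" column in train_inner
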
2022-04-18 23:34:15 | INFO | fairseq.trainer | Preparing to load checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu/checkpoint_last.pt
2022-04-18 23:34:15 | INFO | fairseq.trainer | No existing checkpoint found /job/fairseq/checkpoints/transformer_wikitext-103_ubuntu/checkpoint_last.pt
2022-04-18 23:34:15 | INFO | fairseq.trainer | loading train data for epoch 1
2022-04-18 23:34:16 | INFO | fairseq.data.data_utils | loaded 1,801,350 examples from: /job/fairseq/data-bin/wikitext-103/train
2022-04-18 23:34:17 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
2022-04-18 23:34:17 | INFO | fairseq.optim.adam | using FusedAdam
2022-04-18 23:34:17 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151
2022-04-18 23:34:17 | INFO | fairseq.trainer | begin training epoch 1
2022-04-18 23:34:17 | INFO | fairseq_cli.train | Start iterating over samples
2022-04-18 23:34:18 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2022-04-18 23:34:48 | INFO | train_inner | epoch 001: 100 / 3151 loss=16.618, ppl=100595, wps=108288, ups=3.3, wpb=32768, bsz=64, num_updates=100, lr=1.25975e-05, gnorm=3.652, train_wall=31, gb_free=20.6, wall=33
2022-04-18 23:35:18 | INFO | train_inner | epoch 001: 200 / 3151 loss=14.226, ppl=19162.3, wps=108465, ups=3.31, wpb=32768, bsz=64, num_updates=200, lr=2.5095e-05, gnorm=1.593, train_wall=30, gb_free=20.6, wall=63
2022-04-18 23:35:49 | INFO | train_inner | epoch 001: 300 / 3151 loss=12.192, ppl=4678.22, wps=106993, ups=3.27, wpb=32764.3, bsz=64, num_updates=300, lr=3.75925e-05, gnorm=1.079, train_wall=30, gb_free=20.6, wall=94
2022-04-18 23:36:19 | INFO | train_inner | epoch 001: 400 / 3151 loss=10.734, ppl=1702.93, wps=107755, ups=3.29, wpb=32768, bsz=64, num_updates=400, lr=5.009e-05, gnorm=0.671, train_wall=30, gb_free=20.6, wall=124
2022-04-18 23:36:51 | INFO | train_inner | epoch 001: 500 / 3151 loss=10.124, ppl=1116, wps=105255, ups=3.21, wpb=32768, bsz=64, num_updates=500, lr=6.25875e-05, gnorm=0.558, train_wall=31, gb_free=20.6, wall=155
2022-04-18 23:37:22 | INFO | train_inner | epoch 001: 600 / 3151 loss=9.818, ppl=902.66, wps=104790, ups=3.2, wpb=32768, bsz=64, num_updates=600, lr=7.5085e-05, gnorm=0.643, train_wall=31, gb_free=20.6, wall=186
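
fairseq reports loss as the average negative log-likelihood per token in base 2, so the ppl column should equal 2**loss (an assumed but easily verified convention); checking it against the first two train_inner entries above:

# loss/ppl pairs taken from the train_inner lines above
for loss, ppl in [(16.618, 100595.0), (14.226, 19162.3)]:
    assert abs(2 ** loss - ppl) / ppl < 1e-3  # holds to within rounding
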