/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
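As the FutureWarning above suggests, scripts launched with torchrun should read the local rank from the `LOCAL_RANK` environment variable instead of a `--local_rank` argument. A minimal migration shim might look like the following sketch (the helper name `get_local_rank` is illustrative, not part of any PyTorch API):

```python
import argparse
import os


def get_local_rank(argv=None) -> int:
    """Return this worker's local rank.

    Prefers the LOCAL_RANK environment variable set by torchrun
    (--use_env is the default there), and falls back to the legacy
    --local_rank flag injected by torch.distributed.launch.
    """
    env_rank = os.environ.get("LOCAL_RANK")
    if env_rank is not None:
        return int(env_rank)
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores the script's other flags
    args, _ = parser.parse_known_args(argv)
    return args.local_rank
```

This keeps the script compatible with both launchers during the transition.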
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : fairseq_train_wrapped
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 8
  run_id           : foobar
  rdzv_backend     : c10d
  rdzv_endpoint    : 10.0.0.213:29500
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
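For reference, the elastic launch config above corresponds to a torchrun command line. The following sketch reconstructs that invocation from the logged values; the flag names are the standard torchrun CLI flags, but the helper itself is hypothetical:

```python
# Rebuild the torchrun invocation implied by the elastic launch config above.
LAUNCH_CFG = {
    "min_nodes": 2,
    "max_nodes": 2,
    "nproc_per_node": 8,
    "run_id": "foobar",
    "rdzv_backend": "c10d",
    "rdzv_endpoint": "10.0.0.213:29500",
    "max_restarts": 0,
}


def torchrun_cmd(cfg: dict, entrypoint: str) -> list:
    """Map the logged config keys onto their torchrun CLI flags."""
    lo, hi = cfg["min_nodes"], cfg["max_nodes"]
    # torchrun accepts --nnodes=N or an elastic range --nnodes=MIN:MAX
    nnodes = str(lo) if lo == hi else f"{lo}:{hi}"
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={cfg['nproc_per_node']}",
        f"--rdzv_id={cfg['run_id']}",
        f"--rdzv_backend={cfg['rdzv_backend']}",
        f"--rdzv_endpoint={cfg['rdzv_endpoint']}",
        f"--max_restarts={cfg['max_restarts']}",
        entrypoint,
    ]
```

The same command must be run on both nodes (hence `group_world_size=2` below), with the rdzv endpoint pointing at a host reachable from all of them.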
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q
INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=ip-10-0-0-175.us-west-2.compute.internal
  master_port=47405
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
  global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_deupnvg9/foobar_yi7_bw6q/attempt_0/7/error.json
[0]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[1]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[3]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[4]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[7]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[2]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[5]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[6]:2022-04-18 23:07:20 | WARNING | root | Pytorch pre-release version 1.10.0a0+git36449ea - assuming intent to test it
[1]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
[3]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
[2]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
[5]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 5): env://
[6]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 6): env://
[7]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 7): env://
[0]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
[4]:2022-04-18 23:07:23 | INFO | fairseq.distributed.utils | distributed init (rank 4): env://
[3]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
[1]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
[2]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
[6]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
[7]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
[5]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
[4]:2022-04-18 23:07:24 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
[0]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
[0]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[0]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 0
[1]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[1]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 1
[2]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[2]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 2
[6]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[6]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 6
[5]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[5]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 5
[3]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[3]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 3
[7]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[7]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 7
[4]:2022-04-18 23:07:27 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 16 nodes.
[4]:2022-04-18 23:07:27 | INFO | fairseq.distributed.utils | initialized host ip-10-0-0-175.us-west-2.compute.internal as rank 4
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[3]:
[3]:ip-10-0-0-175:20:20 [3] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[3]:
[3]:ip-10-0-0-175:20:20 [3] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO NET/IB : No device found.
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[3]:ip-10-0-0-175:20:20 [3] NCCL INFO Using network Socket
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1]:
[1]:ip-10-0-0-175:18:18 [1] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[1]:
[1]:ip-10-0-0-175:18:18 [1] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO NET/IB : No device found.
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[1]:ip-10-0-0-175:18:18 [1] NCCL INFO Using network Socket
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/IB : No device found.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Using network Socket
[0]:NCCL version 2.10.3+cuda11.3
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[2]:
[2]:ip-10-0-0-175:19:19 [2] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[2]:
[2]:ip-10-0-0-175:19:19 [2] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO NET/IB : No device found.
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[2]:ip-10-0-0-175:19:19 [2] NCCL INFO Using network Socket
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[6]:
[6]:ip-10-0-0-175:23:23 [6] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[6]:
[6]:ip-10-0-0-175:23:23 [6] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO NET/IB : No device found.
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[6]:ip-10-0-0-175:23:23 [6] NCCL INFO Using network Socket
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[5]:
[5]:ip-10-0-0-175:22:22 [5] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[5]:
[5]:ip-10-0-0-175:22:22 [5] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO NET/IB : No device found.
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[5]:ip-10-0-0-175:22:22 [5] NCCL INFO Using network Socket
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[4]:
[4]:ip-10-0-0-175:21:21 [4] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[4]:
[4]:ip-10-0-0-175:21:21 [4] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO NET/IB : No device found.
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[4]:ip-10-0-0-175:21:21 [4] NCCL INFO Using network Socket
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[7]:
[7]:ip-10-0-0-175:24:24 [7] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[7]:
[7]:ip-10-0-0-175:24:24 [7] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO NET/IB : No device found.
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[7]:ip-10-0-0-175:24:24 [7] NCCL INFO Using network Socket
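The repeated WARN lines above show NCCL's transport fallback on every rank: the aws-ofi-nccl plugin supports only the EFA libfabric provider, so on an instance without EFA it fails to initialize, NET/IB finds no device, and NCCL settles on NET/Socket over eth0. A sketch of environment variables (all standard NCCL/libfabric knobs) that make this behavior explicit and easier to debug:

```shell
# NCCL falls back OFI -> IB -> Socket when EFA is unavailable, as in this log.
export NCCL_DEBUG=INFO           # verbose init logging like the lines above
export NCCL_SOCKET_IFNAME=eth0   # pin the socket transport to eth0
export FI_EFA_FORK_SAFE=1        # aws-ofi-nccl sets this itself (see log)
```

Note that the socket transport is much slower than EFA; on EFA-capable instance types the OFI warnings would disappear and throughput between nodes would improve substantially.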
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 01 : 1[170] -> 5[1b0] via P2P/IPC
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 00/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 3/-1/-1->0->8
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Connected all rings
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/Socket/0
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Connected all rings
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Connected all trees
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 01 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Connected all rings
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01 : 8[160] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/Socket/0
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01 : 0[160] -> 8[160] [send] via NET/Socket/0
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Connected all rings
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Connected all trees
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 00 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Connected all rings
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Connected all trees
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Connected all rings
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 01 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Connected all trees
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 01 : 4[1a0] -> 8[160] [send] via NET/Socket/0
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Connected all rings
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Connected all trees
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Connected all rings
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Connected all trees
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 01 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO Channel 00 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Connected all trees
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 00 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO Channel 01 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 01 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO Channel 01 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 00 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 01 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO Channel 01 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
[6]:ip-10-0-0-175:23:98 [6] NCCL INFO comm 0x7f8f58002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 01 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 01 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO Channel 00 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:22:97 [5] NCCL INFO comm 0x7f4170002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Connected all trees
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 00 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO Channel 01 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
[3]:ip-10-0-0-175:20:92 [3] NCCL INFO comm 0x7f1560002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 01 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 00 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO Channel 01 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
[4]:ip-10-0-0-175:21:95 [4] NCCL INFO comm 0x7f9b64002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 00 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO Channel 01 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
[7]:ip-10-0-0-175:24:94 [7] NCCL INFO comm 0x7f1718002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
[1]:ip-10-0-0-175:18:93 [1] NCCL INFO comm 0x7f7ae4002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
[0]:ip-10-0-0-175:17:91 [0] NCCL INFO comm 0x7fb154002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Launch mode Parallel
[2]:ip-10-0-0-175:19:96 [2] NCCL INFO comm 0x7f72a4002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
[0]:2022-04-18 23:07:31 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 16, 'distributed_num_procs': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_base_algorithm': 'localsgd', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': 
False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'not_fsdp_flatten_parameters': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 2048, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 2048, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0, 'grouped_shuffling': False, 'update_epoch_batch_itr': False, 'update_ordered_indices_seed': False}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 50000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': -1.0, 'use_bmuf': False, 'skip_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2', 'restore_file': 'checkpoint_last.pt', 'continue_once': None, 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 
'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm', 'activation_fn': relu, 'dropout': 0.1, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 512, 'decoder_output_dim': 512, 'decoder_input_dim': 512, 'decoder_ffn_embed_dim': 2048, 'decoder_layers': 6, 'decoder_attention_heads': 8, 'decoder_normalize_before': False, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 
'character_embeddings': False, 'character_filters': '[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': False, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 1, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False, 'add_bos_token': False, 'tokens_per_sample': 512, 'max_target_positions': None, 'tpu': False}, 'task': {'_name': 'language_modeling', 'data': '/job/fairseq/data-bin/wikitext-103', 'sample_break_mode': none, 'tokens_per_sample': 512, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': none, 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 
'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}}
[0]:2022-04-18 23:07:32 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | TransformerLanguageModel(
[0]: (decoder): TransformerDecoder(
[0]: (dropout_module): FairseqDropout()
[0]: (embed_tokens): Embedding(267744, 512, padding_idx=1)
[0]: (embed_positions): SinusoidalPositionalEmbedding()
[0]: (layers): ModuleList(
[0]: (0): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (1): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (2): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (3): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (4): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: (5): TransformerDecoderLayerBase(
[0]: (dropout_module): FairseqDropout()
[0]: (self_attn): MultiheadAttention(
[0]: (dropout_module): FairseqDropout()
[0]: (k_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (v_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (q_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: (out_proj): Linear(in_features=512, out_features=512, bias=True)
[0]: )
[0]: (activation_dropout_module): FairseqDropout()
[0]: (self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: (fc1): Linear(in_features=512, out_features=2048, bias=True)
[0]: (fc2): Linear(in_features=2048, out_features=512, bias=True)
[0]: (final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
[0]: )
[0]: )
[0]: (output_projection): Linear(in_features=512, out_features=267744, bias=False)
[0]: )
[0]:)
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | task: LanguageModelingTask
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | model: TransformerLanguageModel
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | num. shared model params: 155,999,232 (num. trained: 155,999,232)
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
[0]:2022-04-18 23:07:37 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: /job/fairseq/data-bin/wikitext-103/valid
[0]:2022-04-18 23:07:37 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
[0]:2022-04-18 23:07:37 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes.
[0]:2022-04-18 23:07:37 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 00 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 01 : 1[170] -> 5[1b0] via P2P/IPC
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Connected all rings
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 00 : 1[170] -> 2[180] via P2P/IPC
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 00/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01/02 : 0 3 2 1 5 6 7 4 8 11 10 9 13 14 15 12
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Trees [0] 3/8/-1->0->-1 [1] 3/-1/-1->0->8
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 00 : 2[180] -> 1[170] via P2P/IPC
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 01 : 2[180] -> 1[170] via P2P/IPC
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 00 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 01 : 3[190] -> 2[180] via P2P/IPC
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Connected all rings
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 00 : 3[190] -> 0[160] via P2P/IPC
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 01 : 3[190] -> 0[160] via P2P/IPC
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 00 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 01 : 6[1c0] -> 7[1d0] via P2P/IPC
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Connected all rings
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 00 : 6[1c0] -> 5[1b0] via P2P/IPC
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 01 : 6[1c0] -> 5[1b0] via P2P/IPC
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 00 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 01 : 5[1b0] -> 6[1c0] via P2P/IPC
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Connected all rings
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 00 : 5[1b0] -> 1[170] via P2P/IPC
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 01 : 5[1b0] -> 1[170] via P2P/IPC
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 00 : 4[1a0] -> 8[160] [send] via NET/Socket/0
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 01 : 4[1a0] -> 8[160] [send] via NET/Socket/0
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 00 : 7[1d0] -> 4[1a0] via P2P/IPC
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 01 : 7[1d0] -> 4[1a0] via P2P/IPC
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 01 : 1[170] -> 2[180] via P2P/IPC
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Connected all trees
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 01 : 1[170] -> 4[1a0] via P2P/indirect/0[160]
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 00 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01 : 0[160] -> 3[190] via P2P/IPC
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01 : 12[1a0] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Connected all rings
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 00 : 8[160] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01 : 8[160] -> 0[160] [receive] via NET/Socket/0
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 00 : 0[160] -> 8[160] [send] via NET/Socket/0
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01 : 0[160] -> 8[160] [send] via NET/Socket/0
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Connected all rings
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 00 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 01 : 2[180] -> 3[190] via P2P/IPC
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Connected all trees
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 00 : 2[180] -> 4[1a0] via P2P/indirect/0[160]
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Connected all trees
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Connected all trees
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Connected all rings
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 00 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 01 : 4[1a0] -> 7[1d0] via P2P/IPC
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Connected all trees
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Connected all rings
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 00 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 01 : 7[1d0] -> 6[1c0] via P2P/IPC
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Connected all trees
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 01 : 1[170] -> 6[1c0] via P2P/indirect/5[1b0]
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO Channel 00 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
[1]:ip-10-0-0-175:18:139 [1] NCCL INFO comm 0x7f7aa4002fb0 rank 1 nranks 16 cudaDev 1 busId 170 - Init COMPLETE
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Connected all trees
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01 : 0[160] -> 5[1b0] via P2P/indirect/1[170]
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 00 : 0[160] -> 6[1c0] via P2P/indirect/4[1a0]
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO Channel 01 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
[0]:ip-10-0-0-175:17:135 [0] NCCL INFO comm 0x7fb118002fb0 rank 0 nranks 16 cudaDev 0 busId 160 - Init COMPLETE
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Launch mode Parallel
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 01 : 2[180] -> 5[1b0] via P2P/indirect/1[170]
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO Channel 01 : 2[180] -> 7[1d0] via P2P/indirect/6[1c0]
[2]:ip-10-0-0-175:19:136 [2] NCCL INFO comm 0x7f726c002fb0 rank 2 nranks 16 cudaDev 2 busId 180 - Init COMPLETE
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Connected all trees
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 01 : 3[190] -> 4[1a0] via P2P/indirect/0[160]
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 00 : 3[190] -> 5[1b0] via P2P/indirect/1[170]
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO Channel 01 : 3[190] -> 6[1c0] via P2P/indirect/7[1d0]
[3]:ip-10-0-0-175:20:142 [3] NCCL INFO comm 0x7f1528002fb0 rank 3 nranks 16 cudaDev 3 busId 190 - Init COMPLETE
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 00 : 6[1c0] -> 0[160] via P2P/indirect/4[1a0]
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 01 : 6[1c0] -> 1[170] via P2P/indirect/5[1b0]
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO Channel 01 : 6[1c0] -> 3[190] via P2P/indirect/2[180]
[6]:ip-10-0-0-175:23:138 [6] NCCL INFO comm 0x7f8f18002fb0 rank 6 nranks 16 cudaDev 6 busId 1c0 - Init COMPLETE
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 01 : 5[1b0] -> 0[160] via P2P/indirect/4[1a0]
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 01 : 5[1b0] -> 2[180] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO Channel 00 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
[5]:ip-10-0-0-175:22:141 [5] NCCL INFO comm 0x7f412c002fb0 rank 5 nranks 16 cudaDev 5 busId 1b0 - Init COMPLETE
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 01 : 4[1a0] -> 1[170] via P2P/indirect/5[1b0]
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 00 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO Channel 01 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
[4]:ip-10-0-0-175:21:137 [4] NCCL INFO comm 0x7f9b1c002fb0 rank 4 nranks 16 cudaDev 4 busId 1a0 - Init COMPLETE
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 01 : 7[1d0] -> 0[160] via P2P/indirect/4[1a0]
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 00 : 7[1d0] -> 1[170] via P2P/indirect/5[1b0]
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO Channel 01 : 7[1d0] -> 2[180] via P2P/indirect/3[190]
[7]:ip-10-0-0-175:24:140 [7] NCCL INFO comm 0x7f16d8002fb0 rank 7 nranks 16 cudaDev 7 busId 1d0 - Init COMPLETE
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 8: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 9: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 10: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 11: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 12: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 13: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 14: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | rank 15: capabilities = 7.0 ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB
[0]:2022-04-18 23:07:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
[0]:2022-04-18 23:07:37 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None
[0]:2022-04-18 23:07:37 | INFO | fairseq.trainer | Preparing to load checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2/checkpoint_last.pt
[0]:2022-04-18 23:07:44 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
[0]:2022-04-18 23:07:44 | INFO | fairseq.optim.adam | using FusedAdam
[0]:2022-04-18 23:07:46 | INFO | fairseq.trainer | Loaded checkpoint /job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2/checkpoint_last.pt (epoch 8 @ 22057 updates)
[0]:2022-04-18 23:07:46 | INFO | fairseq.trainer | loading train data for epoch 8
[0]:2022-04-18 23:07:46 | INFO | fairseq.data.data_utils | loaded 1,801,350 examples from: /job/fairseq/data-bin/wikitext-103/train
[0]:2022-04-18 23:07:47 | INFO | fairseq.data.iterators | grouped total_num_itrs = 3151
[0]:2022-04-18 23:07:47 | INFO | fairseq.trainer | begin training epoch 8
[0]:2022-04-18 23:07:47 | INFO | fairseq_cli.train | Start iterating over samples
[0]:2022-04-18 23:07:49 | INFO | root | Reducer buckets have been rebuilt in this iteration.
[0]:2022-04-18 23:08:24 | INFO | train_inner | epoch 008:     43 / 3151 loss=5.215, ppl=37.15, wps=39610.6, ups=1.21, wpb=32768, bsz=64, num_updates=22100, lr=0.000212718, gnorm=0.682, train_wall=37, gb_free=20.6, wall=46
[0]:2022-04-18 23:09:34 | INFO | train_inner | epoch 008:    143 / 3151 loss=5.236, ppl=37.69, wps=46427.2, ups=1.42, wpb=32768, bsz=64, num_updates=22200, lr=0.000212238, gnorm=0.679, train_wall=70, gb_free=20.6, wall=117
[0]:2022-04-18 23:10:46 | INFO | train_inner | epoch 008:    243 / 3151 loss=5.24, ppl=37.8, wps=45564.2, ups=1.39, wpb=32768, bsz=64, num_updates=22300, lr=0.000211762, gnorm=0.645, train_wall=72, gb_free=20.6, wall=189
[0]:2022-04-18 23:11:56 | INFO | train_inner | epoch 008:    343 / 3151 loss=5.248, ppl=37.99, wps=47210.9, ups=1.44, wpb=32768, bsz=64, num_updates=22400, lr=0.000211289, gnorm=0.691, train_wall=69, gb_free=20.6, wall=258
[0]:2022-04-18 23:13:05 | INFO | train_inner | epoch 008:    443 / 3151 loss=5.253, ppl=38.13, wps=47045.7, ups=1.44, wpb=32768, bsz=64, num_updates=22500, lr=0.000210819, gnorm=0.662, train_wall=69, gb_free=20.6, wall=328
[0]:2022-04-18 23:14:16 | INFO | train_inner | epoch 008:    543 / 3151 loss=5.267, ppl=38.51, wps=46143.4, ups=1.41, wpb=32768, bsz=64, num_updates=22600, lr=0.000210352, gnorm=0.672, train_wall=71, gb_free=20.6, wall=399
[0]:2022-04-18 23:15:29 | INFO | train_inner | epoch 008:    643 / 3151 loss=5.257, ppl=38.23, wps=45269.2, ups=1.38, wpb=32768, bsz=64, num_updates=22700, lr=0.000209888, gnorm=0.668, train_wall=72, gb_free=20.6, wall=471
[0]:2022-04-18 23:16:40 | INFO | train_inner | epoch 008:    743 / 3151 loss=5.263, ppl=38.39, wps=46129.8, ups=1.41, wpb=32768, bsz=64, num_updates=22800, lr=0.000209427, gnorm=0.709, train_wall=71, gb_free=20.6, wall=542
[0]:2022-04-18 23:17:50 | INFO | train_inner | epoch 008:    843 / 3151 loss=5.258, ppl=38.27, wps=46394, ups=1.42, wpb=32768, bsz=64, num_updates=22900, lr=0.000208969, gnorm=0.664, train_wall=70, gb_free=20.6, wall=613
[0]:2022-04-18 23:19:01 | INFO | train_inner | epoch 008:    943 / 3151 loss=5.279, ppl=38.82, wps=46158.8, ups=1.41, wpb=32768, bsz=64, num_updates=23000, lr=0.000208514, gnorm=0.716, train_wall=71, gb_free=20.6, wall=684
[0]:2022-04-18 23:20:11 | INFO | train_inner | epoch 008:   1043 / 3151 loss=5.283, ppl=38.94, wps=47114.8, ups=1.44, wpb=32768, bsz=64, num_updates=23100, lr=0.000208063, gnorm=0.661, train_wall=69, gb_free=20.6, wall=753
[0]:2022-04-18 23:21:24 | INFO | train_inner | epoch 008:   1143 / 3151 loss=5.281, ppl=38.89, wps=45072.8, ups=1.38, wpb=32768, bsz=64, num_updates=23200, lr=0.000207614, gnorm=0.701, train_wall=72, gb_free=20.6, wall=826
[0]:2022-04-18 23:22:36 | INFO | train_inner | epoch 008:   1243 / 3151 loss=5.279, ppl=38.83, wps=45500.7, ups=1.39, wpb=32768, bsz=64, num_updates=23300, lr=0.000207168, gnorm=0.682, train_wall=72, gb_free=20.6, wall=898
[0]:2022-04-18 23:23:46 | INFO | train_inner | epoch 008:   1343 / 3151 loss=5.285, ppl=38.99, wps=46510, ups=1.42, wpb=32768, bsz=64, num_updates=23400, lr=0.000206725, gnorm=0.683, train_wall=70, gb_free=20.6, wall=969
[0]:2022-04-18 23:24:59 | INFO | train_inner | epoch 008:   1443 / 3151 loss=5.289, ppl=39.1, wps=44618.2, ups=1.36, wpb=32768, bsz=64, num_updates=23500, lr=0.000206284, gnorm=0.664, train_wall=73, gb_free=20.6, wall=1042