Last active November 13, 2017 06:48
Gist: tobyyouup/ca7ba6542deed1c6a473be70061b41a5
Running distributed tensorflow_mnist.py on two machines hangs and does not show any training log output
m12:2037:2131 [0] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:2037:2131 [0] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:2037:2131 [0] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:2037:2131 [0] INFO Using internal Network IB | |
NCCL version 2.0.5 compiled with CUDA 8.0 | |
m12:2039:2134 [2] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:2039:2134 [2] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:2041:2220 [4] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:2041:2220 [4] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:42973:43395 [0] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:42973:43395 [0] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:42975:43396 [2] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:42975:43396 [2] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:42977:43394 [4] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:42977:43394 [4] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:42973:43395 [0] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:42973:43395 [0] INFO Using internal Network IB | |
m13:42977:43394 [4] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:42975:43396 [2] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:42975:43396 [2] INFO Using internal Network IB | |
m13:42977:43394 [4] INFO Using internal Network IB | |
m12:2041:2220 [4] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:2041:2220 [4] INFO Using internal Network IB | |
m12:2039:2134 [2] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:2039:2134 [2] INFO Using internal Network IB | |
m13:42974:43409 [1] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:42974:43409 [1] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:42976:43403 [3] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:42976:43403 [3] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:42978:43404 [5] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:42978:43404 [5] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:2038:2123 [1] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:2038:2123 [1] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:2040:2128 [3] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:2040:2128 [3] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:2042:2122 [5] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:2042:2122 [5] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:2040:2128 [3] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:2040:2128 [3] INFO Using internal Network IB | |
m12:2038:2123 [1] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:2042:2122 [5] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:2042:2122 [5] INFO Using internal Network IB | |
m13:42974:43409 [1] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:42974:43409 [1] INFO Using internal Network IB | |
m13:42976:43403 [3] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:42978:43404 [5] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:42978:43404 [5] INFO Using internal Network IB | |
m12:2038:2123 [1] INFO Using internal Network IB | |
m13:42976:43403 [3] INFO Using internal Network IB | |
m13:42973:43395 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(PIX) | |
m12:2041:2220 [4] INFO CUDA Dev 4, IB Ports : mlx5_0/1(SOC) | |
m12:2037:2131 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(PIX) | |
m13:42975:43396 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC) | |
m13:42977:43394 [4] INFO CUDA Dev 4, IB Ports : mlx5_0/1(SOC) | |
m12:2039:2134 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC) | |
m13:42976:43403 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC) | |
m13:42974:43409 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(PIX) | |
m13:42978:43404 [5] INFO CUDA Dev 5, IB Ports : mlx5_0/1(SOC) | |
m12:2040:2128 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC) | |
m12:2042:2122 [5] INFO CUDA Dev 5, IB Ports : mlx5_0/1(SOC) | |
m12:2038:2123 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(PIX) | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO Using 256 threads | |
m12:2037:2131 [0] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8 9 10 11 | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2042:2122 [5] INFO 5 -> 4 via P2P/IPC | |
m12:2038:2123 [1] INFO 1 -> 0 via P2P/IPC | |
m12:2040:2128 [3] INFO 3 -> 2 via P2P/IPC | |
m12:2041:2220 [4] INFO 4 -> 3 via P2P/IPC | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2040:2128 [3] INFO 3 -> 4 via P2P/IPC | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2041:2220 [4] INFO 4 -> 5 via P2P/IPC | |
m12:2038:2123 [1] INFO 1 -> 2 via direct shared memory | |
m12:2037:2131 [0] INFO 11 -> 0 via NET/IB/0/GDRDMA | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2037:2131 [0] INFO 0 -> 1 via P2P/IPC | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO 9 -> 8 via P2P/IPC | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:2039:2134 [2] INFO 2 -> 3 via P2P/IPC | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO 10 -> 9 via P2P/IPC | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42978:43404 [5] INFO 11 -> 10 via P2P/IPC | |
m13:42976:43403 [3] INFO 9 -> 10 via P2P/IPC | |
m13:42977:43394 [4] INFO 10 -> 11 via P2P/IPC | |
m13:42974:43409 [1] INFO 7 -> 6 via P2P/IPC | |
m13:42974:43409 [1] INFO 7 -> 8 via direct shared memory | |
m13:42973:43395 [0] INFO 5 -> 6 via NET/IB/0/GDRDMA | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42973:43395 [0] INFO 6 -> 7 via P2P/IPC | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:42975:43396 [2] INFO 8 -> 9 via P2P/IPC |
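The log above interleaves many repeated `nvmlDeviceGetNvLinkCapability()` messages with the handful of lines that say how each ring hop is connected. A minimal sketch for pulling out just those hop lines and counting the transports is shown below. The function name `summarize_hops` and the regular expression are illustrative assumptions, not part of NCCL; the pattern is written only against the line shape visible in this paste, including its trailing `| |` column residue.

```python
import re
from collections import Counter

# Matches hop lines such as:
#   "m12:2042:2122 [5] INFO 5 -> 4 via P2P/IPC | |"
# host:pid:tid [device] INFO <src> -> <dst> via <transport>
HOP_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[(?P<dev>\d+)\] INFO "
    r"(?P<src>\d+) -> (?P<dst>\d+) via (?P<transport>[^|]+?)\s*(?:\|.*)?$"
)


def summarize_hops(lines):
    """Return (hops, counts) for the 'A -> B via X' lines in an NCCL INFO log.

    hops   : list of (host, src_rank, dst_rank, transport) tuples
    counts : Counter mapping transport name to number of hop lines
    """
    hops = []
    for line in lines:
        m = HOP_RE.match(line.strip())
        if m:
            hops.append(
                (m["host"], int(m["src"]), int(m["dst"]), m["transport"])
            )
    counts = Counter(h[3] for h in hops)
    return hops, counts


if __name__ == "__main__":
    sample = [
        "m12:2042:2122 [5] INFO 5 -> 4 via P2P/IPC | |",
        "m12:2038:2123 [1] INFO 1 -> 2 via direct shared memory | |",
        "m12:2037:2131 [0] INFO 11 -> 0 via NET/IB/0/GDRDMA | |",
        "m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |",
    ]
    hops, counts = summarize_hops(sample)
    for hop in hops:
        print(hop)
    print(dict(counts))
```

Feeding the full log file to `summarize_hops` gives a short list of hops per host, which is easier to compare between the two machines than the raw output.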
m12:25666:25697 [0] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:25666:25697 [0] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:25666:25697 [0] INFO Using internal Network Socket | |
m12:25666:25697 [0] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:25666:25697 [0] INFO NET/Socket : 1 interfaces found | |
NCCL version 2.0.5 compiled with CUDA 8.0 | |
m12:25667:25694 [1] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:25667:25694 [1] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:4865:5012 [0] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:4865:5012 [0] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:4866:5009 [1] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:4866:5009 [1] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:25667:25694 [1] INFO Using internal Network Socket | |
m13:4865:5012 [0] INFO Using internal Network Socket | |
m13:4866:5009 [1] INFO Using internal Network Socket | |
m13:4866:5009 [1] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:4866:5009 [1] INFO NET/Socket : 1 interfaces found | |
m13:4865:5012 [0] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:4865:5012 [0] INFO NET/Socket : 1 interfaces found | |
m12:25667:25694 [1] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:25667:25694 [1] INFO NET/Socket : 1 interfaces found | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO Using 256 threads | |
m12:25666:25697 [0] INFO [0] Ring 0 : 0 1 2 3 | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25667:25694 [1] INFO 1 -> 0 via P2P/IPC | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4866:5009 [1] INFO 3 -> 2 via P2P/IPC | |
m12:25666:25697 [0] INFO 3 -> 0 via NET/Socket/0 | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m12:25666:25697 [0] INFO 0 -> 1 via P2P/IPC | |
m13:4865:5012 [0] INFO 1 -> 2 via NET/Socket/0 | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported | |
m13:4865:5012 [0] INFO 2 -> 3 via P2P/IPC | |
**training log for model** |
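The first log reports `Using internal Network IB` for every process, while the second reports `Using internal Network Socket`. A small sketch for extracting that choice per host and device, so the two runs can be compared mechanically, might look like the following. The helper name `network_backends` is hypothetical, and the pattern assumes the exact message wording seen in these two logs.

```python
import re

# Matches lines such as:
#   "m13:4865:5012 [0] INFO Using internal Network Socket | |"
NET_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[(?P<dev>\d+)\] INFO "
    r"Using internal Network (?P<net>\w+)"
)


def network_backends(lines):
    """Map (host, device index) to the network backend name each process logged."""
    backends = {}
    for line in lines:
        m = NET_RE.match(line.strip())
        if m:
            backends[(m["host"], int(m["dev"]))] = m["net"]
    return backends


if __name__ == "__main__":
    sample = [
        "m12:2037:2131 [0] INFO Using internal Network IB | |",
        "m13:4865:5012 [0] INFO Using internal Network Socket | |",
        "m13:4865:5012 [0] INFO NET/Socket : 1 interfaces found | |",
    ]
    for key, net in sorted(network_backends(sample).items()):
        print(key, net)
```

If a process is missing from the result, that process never printed the backend line, which is one way to spot where a run stalled during initialization.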
hca_id: mlx5_3 | |
transport: InfiniBand (0) | |
fw_ver: 12.18.1000 | |
node_guid: 9cdc:71ff:ff42:f5d1 | |
sys_image_guid: 9cdc:71ff:ff42:f5d0 | |
vendor_id: 0x02c9 | |
vendor_part_id: 4115 | |
hw_ver: 0x0 | |
board_id: HP_2190110032 | |
phys_port_cnt: 1 | |
max_mr_size: 0xffffffffffffffff | |
page_size_cap: 0xfffffffffffff000 | |
max_qp: 262144 | |
max_qp_wr: 32768 | |
device_cap_flags: 0xc17e1c36 | |
BAD_PKEY_CNTR | |
BAD_QKEY_CNTR | |
AUTO_PATH_MIG | |
CHANGE_PHY_PORT | |
PORT_ACTIVE_EVENT | |
SYS_IMAGE_GUID | |
RC_RNR_NAK_GEN | |
XRC | |
Unknown flags: 0xc16e0000 | |
device_cap_exp_flags: 0x504060F100000000 | |
EXP_DC_TRANSPORT | |
EXP_CROSS_CHANNEL | |
EXP_MR_ALLOCATE | |
EXT_ATOMICS | |
EXT_SEND NOP | |
EXP_UMR | |
EXP_ODP | |
EXP_DC_INFO | |
EXP_MASKED_ATOMICS | |
Unknown flags: 0x200000000000 | |
max_sge: 30 | |
max_sge_rd: 30 | |
max_cq: 16777216 | |
max_cqe: 4194303 | |
max_mr: 16777216 | |
max_pd: 16777216 | |
max_qp_rd_atom: 16 | |
max_ee_rd_atom: 0 | |
max_res_rd_atom: 4194304 | |
max_qp_init_rd_atom: 16 | |
max_ee_init_rd_atom: 0 | |
atomic_cap: ATOMIC_HCA (1) | |
log atomic arg sizes (mask) 0x8 | |
masked_log_atomic_arg_sizes (mask) 0x3c | |
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34 | |
max fetch and add bit boundary 64 | |
log max atomic inline 5 | |
max_ee: 0 | |
max_rdd: 0 | |
max_mw: 16777216 | |
max_raw_ipv6_qp: 0 | |
max_raw_ethy_qp: 0 | |
max_mcast_grp: 2097152 | |
max_mcast_qp_attach: 48 | |
max_total_mcast_qp_attach: 100663296 | |
max_ah: 2147483647 | |
max_fmr: 0 | |
max_srq: 8388608 | |
max_srq_wr: 32767 | |
max_srq_sge: 31 | |
max_pkeys: 128 | |
local_ca_ack_delay: 16 | |
hca_core_clock: 156250 | |
max_klm_list_size: 65536 | |
max_send_wqe_inline_klms: 20 | |
max_umr_recursion_depth: 4 | |
max_umr_stride_dimension: 1 | |
general_odp_caps: | |
ODP_SUPPORT | |
rc_odp_caps: | |
SUPPORT_SEND | |
SUPPORT_RECV | |
SUPPORT_WRITE | |
SUPPORT_READ | |
uc_odp_caps: | |
NO SUPPORT | |
ud_odp_caps: | |
SUPPORT_SEND | |
dc_odp_caps: | |
NO SUPPORT | |
xrc_odp_caps: | |
NO SUPPORT | |
raw_eth_odp_caps: | |
NO SUPPORT | |
max_dct: 262144 | |
max_device_ctx: 1020 | |
Multi-Packet RQ is not supported | |
rx_pad_end_addr_align: 64 | |
tso_caps: | |
max_tso: 0 | |
packet_pacing_caps: | |
qp_rate_limit_min: 0kbps | |
qp_rate_limit_max: 0kbps | |
Device ports: | |
port: 1 | |
state: PORT_DOWN (1) | |
max_mtu: 4096 (5) | |
active_mtu: 4096 (5) | |
sm_lid: 0 | |
port_lid: 65535 | |
port_lmc: 0x00 | |
link_layer: InfiniBand | |
max_msg_sz: 0x40000000 | |
port_cap_flags: 0x2651e848 | |
max_vl_num: 4 (3) | |
bad_pkey_cntr: 0x0 | |
qkey_viol_cntr: 0x0 | |
sm_sl: 0 | |
pkey_tbl_len: 128 | |
gid_tbl_len: 8 | |
subnet_timeout: 0 | |
init_type_reply: 0 | |
active_width: 4X (2) | |
active_speed: invalid speed (0) | |
phys_state: DISABLED (3) | |
GID[ 0]: fe80:0000:0000:0000:9cdc:71ff:ff42:f5d1 | |
hca_id: mlx5_2 | |
transport: InfiniBand (0) | |
fw_ver: 12.18.1000 | |
node_guid: 9cdc:71ff:ff42:f5d0 | |
sys_image_guid: 9cdc:71ff:ff42:f5d0 | |
vendor_id: 0x02c9 | |
vendor_part_id: 4115 | |
hw_ver: 0x0 | |
board_id: HP_2190110032 | |
phys_port_cnt: 1 | |
max_mr_size: 0xffffffffffffffff | |
page_size_cap: 0xfffffffffffff000 | |
max_qp: 262144 | |
max_qp_wr: 32768 | |
device_cap_flags: 0xc17e1c36 | |
BAD_PKEY_CNTR | |
BAD_QKEY_CNTR | |
AUTO_PATH_MIG | |
CHANGE_PHY_PORT | |
PORT_ACTIVE_EVENT | |
SYS_IMAGE_GUID | |
RC_RNR_NAK_GEN | |
XRC | |
Unknown flags: 0xc16e0000 | |
device_cap_exp_flags: 0x504060F100000000 | |
EXP_DC_TRANSPORT | |
EXP_CROSS_CHANNEL | |
EXP_MR_ALLOCATE | |
EXT_ATOMICS | |
EXT_SEND NOP | |
EXP_UMR | |
EXP_ODP | |
EXP_DC_INFO | |
EXP_MASKED_ATOMICS | |
Unknown flags: 0x200000000000 | |
max_sge: 30 | |
max_sge_rd: 30 | |
max_cq: 16777216 | |
max_cqe: 4194303 | |
max_mr: 16777216 | |
max_pd: 16777216 | |
max_qp_rd_atom: 16 | |
max_ee_rd_atom: 0 | |
max_res_rd_atom: 4194304 | |
max_qp_init_rd_atom: 16 | |
max_ee_init_rd_atom: 0 | |
atomic_cap: ATOMIC_HCA (1) | |
log atomic arg sizes (mask) 0x8 | |
masked_log_atomic_arg_sizes (mask) 0x3c | |
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34 | |
max fetch and add bit boundary 64 | |
log max atomic inline 5 | |
max_ee: 0 | |
max_rdd: 0 | |
max_mw: 16777216 | |
max_raw_ipv6_qp: 0 | |
max_raw_ethy_qp: 0 | |
max_mcast_grp: 2097152 | |
max_mcast_qp_attach: 48 | |
max_total_mcast_qp_attach: 100663296 | |
max_ah: 2147483647 | |
max_fmr: 0 | |
max_srq: 8388608 | |
max_srq_wr: 32767 | |
max_srq_sge: 31 | |
max_pkeys: 128 | |
local_ca_ack_delay: 16 | |
hca_core_clock: 156250 | |
max_klm_list_size: 65536 | |
max_send_wqe_inline_klms: 20 | |
max_umr_recursion_depth: 4 | |
max_umr_stride_dimension: 1 | |
general_odp_caps: | |
ODP_SUPPORT | |
rc_odp_caps: | |
SUPPORT_SEND | |
SUPPORT_RECV | |
SUPPORT_WRITE | |
SUPPORT_READ | |
uc_odp_caps: | |
NO SUPPORT | |
ud_odp_caps: | |
SUPPORT_SEND | |
dc_odp_caps: | |
NO SUPPORT | |
xrc_odp_caps: | |
NO SUPPORT | |
raw_eth_odp_caps: | |
NO SUPPORT | |
max_dct: 262144 | |
max_device_ctx: 1020 | |
Multi-Packet RQ is not supported | |
rx_pad_end_addr_align: 64 | |
tso_caps: | |
max_tso: 0 | |
packet_pacing_caps: | |
qp_rate_limit_min: 0kbps | |
qp_rate_limit_max: 0kbps | |
Device ports: | |
port: 1 | |
state: PORT_ACTIVE (4) | |
max_mtu: 4096 (5) | |
active_mtu: 4096 (5) | |
sm_lid: 1 | |
port_lid: 10 | |
port_lmc: 0x00 | |
link_layer: InfiniBand | |
max_msg_sz: 0x40000000 | |
port_cap_flags: 0x2651e848 | |
max_vl_num: 4 (3) | |
bad_pkey_cntr: 0x0 | |
qkey_viol_cntr: 0x0 | |
sm_sl: 0 | |
pkey_tbl_len: 128 | |
gid_tbl_len: 8 | |
subnet_timeout: 18 | |
init_type_reply: 0 | |
active_width: 4X (2) | |
active_speed: 25.0 Gbps (32) | |
phys_state: LINK_UP (5) | |
GID[ 0]: fe80:0000:0000:0000:9cdc:71ff:ff42:f5d0 | |
hca_id: mlx5_1 | |
transport: InfiniBand (0) | |
fw_ver: 12.18.1000 | |
node_guid: 9cdc:71ff:ff42:f599 | |
sys_image_guid: 9cdc:71ff:ff42:f598 | |
vendor_id: 0x02c9 | |
vendor_part_id: 4115 | |
hw_ver: 0x0 | |
board_id: HP_2190110032 | |
phys_port_cnt: 1 | |
max_mr_size: 0xffffffffffffffff | |
page_size_cap: 0xfffffffffffff000 | |
max_qp: 262144 | |
max_qp_wr: 32768 | |
device_cap_flags: 0x65721c36 | |
BAD_PKEY_CNTR | |
BAD_QKEY_CNTR | |
AUTO_PATH_MIG | |
CHANGE_PHY_PORT | |
PORT_ACTIVE_EVENT | |
SYS_IMAGE_GUID | |
RC_RNR_NAK_GEN | |
XRC | |
Unknown flags: 0x65620000 | |
device_cap_exp_flags: 0x5001F8F000000000 | |
EXP_CROSS_CHANNEL | |
EXP_MR_ALLOCATE | |
EXT_ATOMICS | |
EXT_SEND NOP | |
EXP_UMR | |
EXP_ODP | |
EXP_RX_CSUM_TCP_UDP_PKT | |
EXP_RX_CSUM_IP_PKT | |
EXP_MASKED_ATOMICS | |
EXP_RX_TCP_UDP_PKT_TYPE | |
EXP_SCATTER_FCS | |
Unknown flags: 0x200000000000 | |
max_sge: 30 | |
max_sge_rd: 30 | |
max_cq: 16777216 | |
max_cqe: 4194303 | |
max_mr: 16777216 | |
max_pd: 16777216 | |
max_qp_rd_atom: 16 | |
max_ee_rd_atom: 0 | |
max_res_rd_atom: 4194304 | |
max_qp_init_rd_atom: 16 | |
max_ee_init_rd_atom: 0 | |
atomic_cap: ATOMIC_HCA (1) | |
log atomic arg sizes (mask) 0x8 | |
masked_log_atomic_arg_sizes (mask) 0x3c | |
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34 | |
max fetch and add bit boundary 64 | |
log max atomic inline 5 | |
max_ee: 0 | |
max_rdd: 0 | |
max_mw: 16777216 | |
max_raw_ipv6_qp: 0 | |
max_raw_ethy_qp: 0 | |
max_mcast_grp: 2097152 | |
max_mcast_qp_attach: 48 | |
max_total_mcast_qp_attach: 100663296 | |
max_ah: 2147483647 | |
max_fmr: 0 | |
max_srq: 8388608 | |
max_srq_wr: 32767 | |
max_srq_sge: 31 | |
max_pkeys: 128 | |
local_ca_ack_delay: 16 | |
hca_core_clock: 156250 | |
max_klm_list_size: 65536 | |
max_send_wqe_inline_klms: 20 | |
max_umr_recursion_depth: 4 | |
max_umr_stride_dimension: 1 | |
general_odp_caps: | |
rc_odp_caps: | |
NO SUPPORT | |
uc_odp_caps: | |
NO SUPPORT | |
ud_odp_caps: | |
NO SUPPORT | |
dc_odp_caps: | |
NO SUPPORT | |
xrc_odp_caps: | |
NO SUPPORT | |
raw_eth_odp_caps: | |
NO SUPPORT | |
max_dct: 0 | |
max_device_ctx: 1020 | |
Multi-Packet RQ is not supported | |
VLAN offloads caps: | |
C-VLAN stripping offload | |
C-VLAN insertion offload | |
rx_pad_end_addr_align: 64 | |
tso_caps: | |
max_tso: 262144 | |
supported_qp: | |
SUPPORT_RAW_PACKET | |
packet_pacing_caps: | |
qp_rate_limit_min: 0kbps | |
qp_rate_limit_max: 0kbps | |
Device ports: | |
port: 1 | |
state: PORT_ACTIVE (4) | |
max_mtu: 4096 (5) | |
active_mtu: 1024 (3) | |
sm_lid: 0 | |
port_lid: 0 | |
port_lmc: 0x00 | |
link_layer: Ethernet | |
max_msg_sz: 0x40000000 | |
port_cap_flags: 0x04010000 | |
max_vl_num: invalid value (0) | |
bad_pkey_cntr: 0x0 | |
qkey_viol_cntr: 0x0 | |
sm_sl: 0 | |
pkey_tbl_len: 1 | |
gid_tbl_len: 256 | |
subnet_timeout: 0 | |
init_type_reply: 0 | |
active_width: 4X (2) | |
active_speed: 10.0 Gbps (4) | |
phys_state: LINK_UP (5) | |
GID[ 0]: fe80:0000:0000:0000:9edc:71ff:fe42:f599 | |
GID[ 1]: fe80:0000:0000:0000:9edc:71ff:fe42:f599 | |
GID[ 2]: 0000:0000:0000:0000:0000:ffff:ac17:e94b | |
GID[ 3]: 0000:0000:0000:0000:0000:ffff:ac17:e94b | |
hca_id: mlx5_0 | |
transport: InfiniBand (0) | |
fw_ver: 12.18.1000 | |
node_guid: 9cdc:71ff:ff42:f598 | |
sys_image_guid: 9cdc:71ff:ff42:f598 | |
vendor_id: 0x02c9 | |
vendor_part_id: 4115 | |
hw_ver: 0x0 | |
board_id: HP_2190110032 | |
phys_port_cnt: 1 | |
max_mr_size: 0xffffffffffffffff | |
page_size_cap: 0xfffffffffffff000 | |
max_qp: 262144 | |
max_qp_wr: 32768 | |
device_cap_flags: 0xc17e1c36 | |
BAD_PKEY_CNTR | |
BAD_QKEY_CNTR | |
AUTO_PATH_MIG | |
CHANGE_PHY_PORT | |
PORT_ACTIVE_EVENT | |
SYS_IMAGE_GUID | |
RC_RNR_NAK_GEN | |
XRC | |
Unknown flags: 0xc16e0000 | |
device_cap_exp_flags: 0x504060F100000000 | |
EXP_DC_TRANSPORT | |
EXP_CROSS_CHANNEL | |
EXP_MR_ALLOCATE | |
EXT_ATOMICS | |
EXT_SEND NOP | |
EXP_UMR | |
EXP_ODP | |
EXP_DC_INFO | |
EXP_MASKED_ATOMICS | |
Unknown flags: 0x200000000000 | |
max_sge: 30 | |
max_sge_rd: 30 | |
max_cq: 16777216 | |
max_cqe: 4194303 | |
max_mr: 16777216 | |
max_pd: 16777216 | |
max_qp_rd_atom: 16 | |
max_ee_rd_atom: 0 | |
max_res_rd_atom: 4194304 | |
max_qp_init_rd_atom: 16 | |
max_ee_init_rd_atom: 0 | |
atomic_cap: ATOMIC_HCA (1) | |
log atomic arg sizes (mask) 0x8 | |
masked_log_atomic_arg_sizes (mask) 0x3c | |
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34 | |
max fetch and add bit boundary 64 | |
log max atomic inline 5 | |
max_ee: 0 | |
max_rdd: 0 | |
max_mw: 16777216 | |
max_raw_ipv6_qp: 0 | |
max_raw_ethy_qp: 0 | |
max_mcast_grp: 2097152 | |
max_mcast_qp_attach: 48 | |
max_total_mcast_qp_attach: 100663296 | |
max_ah: 2147483647 | |
max_fmr: 0 | |
max_srq: 8388608 | |
max_srq_wr: 32767 | |
max_srq_sge: 31 | |
max_pkeys: 128 | |
local_ca_ack_delay: 16 | |
hca_core_clock: 156250 | |
max_klm_list_size: 65536 | |
max_send_wqe_inline_klms: 20 | |
max_umr_recursion_depth: 4 | |
max_umr_stride_dimension: 1 | |
general_odp_caps: | |
ODP_SUPPORT | |
rc_odp_caps: | |
SUPPORT_SEND | |
SUPPORT_RECV | |
SUPPORT_WRITE | |
SUPPORT_READ | |
uc_odp_caps: | |
NO SUPPORT | |
ud_odp_caps: | |
SUPPORT_SEND | |
dc_odp_caps: | |
NO SUPPORT | |
xrc_odp_caps: | |
NO SUPPORT | |
raw_eth_odp_caps: | |
NO SUPPORT | |
max_dct: 262144 | |
max_device_ctx: 1020 | |
Multi-Packet RQ is not supported | |
rx_pad_end_addr_align: 64 | |
tso_caps: | |
max_tso: 0 | |
packet_pacing_caps: | |
qp_rate_limit_min: 0kbps | |
qp_rate_limit_max: 0kbps | |
Device ports: | |
port: 1 | |
state: PORT_ACTIVE (4) | |
max_mtu: 4096 (5) | |
active_mtu: 4096 (5) | |
sm_lid: 1 | |
port_lid: 9 | |
port_lmc: 0x00 | |
link_layer: InfiniBand | |
max_msg_sz: 0x40000000 | |
port_cap_flags: 0x2651e848 | |
max_vl_num: 4 (3) | |
bad_pkey_cntr: 0x0 | |
qkey_viol_cntr: 0x0 | |
sm_sl: 0 | |
pkey_tbl_len: 128 | |
gid_tbl_len: 8 | |
subnet_timeout: 18 | |
init_type_reply: 0 | |
active_width: 4X (2) | |
active_speed: 25.0 Gbps (32) | |
phys_state: LINK_UP (5) | |
GID[ 0]: fe80:0000:0000:0000:9cdc:71ff:ff42:f598 |
[m12:09052] mca_base_component_repository_open: unable to open mca_oob_ud: libmca_common_verbs.so.40: cannot open shared object file: No such file or directory (ignored) | |
-------------------------------------------------------------------------- | |
WARNING: There are more than one active ports on host 'm13', but the | |
default subnet GID prefix was detected on more than one of these | |
ports. If these ports are connected to different physical IB | |
networks, this configuration will fail in Open MPI. This version of | |
Open MPI requires that every physically separate IB subnet that is | |
used between connected MPI processes must have different subnet ID | |
values. | |
Please see this FAQ entry for more details: | |
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid | |
NOTE: You can turn off this warning by setting the MCA parameter | |
btl_openib_warn_default_gid_prefix to 0. | |
-------------------------------------------------------------------------- | |
-------------------------------------------------------------------------- | |
No OpenFabrics connection schemes reported that they were able to be | |
used on a specific port. As such, the openib BTL (OpenFabrics | |
support) will be disabled for this port. | |
Local host: msragpum13 | |
Local device: mlx5_1 | |
Local port: 1 | |
CPCs attempted: rdmacm, udcm | |
-------------------------------------------------------------------------- | |
Extracting MNIST-data-0/train-images-idx3-ubyte.gz | |
Extracting MNIST-data-1/train-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz | |
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz | |
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz | |
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz | |
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz | |
INFO:tensorflow:Create CheckpointSaverHook. | |
2017-11-06 11:11:36.314355: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:11:36.314395: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:11:36.314402: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:11:36.314408: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:11:36.314414: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:11:37.074464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: | |
name: Tesla M40 24GB | |
major: 5 minor: 2 memoryClockRate (GHz) 1.112 | |
pciBusID 0000:13:00.0 | |
Total memory: 22.40GiB | |
Free memory: 1.94GiB | |
2017-11-06 11:11:37.074503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 | |
2017-11-06 11:11:37.074510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y | |
2017-11-06 11:11:37.074525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0) | |
2017-11-06 11:10:34.707113: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:10:34.707152: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:10:34.707159: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:10:34.707164: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:10:34.707170: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:10:35.415034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: | |
name: Tesla M40 24GB | |
major: 5 minor: 2 memoryClockRate (GHz) 1.112 | |
pciBusID 0000:13:00.0 | |
Total memory: 22.40GiB | |
Free memory: 14.11GiB | |
2017-11-06 11:10:35.415081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 | |
2017-11-06 11:10:35.415089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y | |
2017-11-06 11:10:35.415100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0) | |
[m13:18029] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix | |
[m13:18029] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages | |
[m13:18029] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port |
[m12:04452] mca_base_component_repository_open: unable to open mca_oob_ud: libmca_common_verbs.so.40: cannot open shared object file: No such file or directory (ignored) | |
Extracting MNIST-data-1/train-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/train-images-idx3-ubyte.gz | |
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz | |
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz | |
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz | |
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz | |
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz | |
INFO:tensorflow:Create CheckpointSaverHook. | |
2017-11-06 11:37:37.821414: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:37:37.821453: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:37:37.821460: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:37:37.821466: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:37:37.821472: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:36:34.567018: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:36:34.567059: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:36:34.567066: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:36:34.567072: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:36:34.567078: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 11:37:38.670994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: | |
name: Tesla M40 24GB | |
major: 5 minor: 2 memoryClockRate (GHz) 1.112 | |
pciBusID 0000:13:00.0 | |
Total memory: 22.40GiB | |
Free memory: 1.94GiB | |
2017-11-06 11:37:38.671038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 | |
2017-11-06 11:37:38.671045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y | |
2017-11-06 11:37:38.671055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0) | |
2017-11-06 11:36:35.740282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: | |
name: Tesla M40 24GB | |
major: 5 minor: 2 memoryClockRate (GHz) 1.112 | |
pciBusID 0000:13:00.0 | |
Total memory: 22.40GiB | |
Free memory: 14.11GiB | |
2017-11-06 11:36:35.740434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 | |
2017-11-06 11:36:35.740442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y | |
2017-11-06 11:36:35.740453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0) | |
m13:45462:45667 [0] INFO NET : Using interface eth0:10.150.144.114<0> | |
m13:45462:45667 [0] INFO NET/IB : Using interface eth0 for sideband communication | |
m13:45462:45667 [0] INFO NET/IB: [1] mlx5_2:1/IB | |
m13:45462:45667 [0] INFO NET/IB: [2] mlx5_1:1/RoCE | |
m13:45462:45667 [0] INFO NET/IB: [3] mlx5_0:1/IB | |
m13:45462:45667 [0] INFO Using internal Network IB | |
NCCL version 2.0.5 compiled with CUDA 8.0 | |
m12:4456:4572 [0] INFO NET : Using interface eth0:10.150.144.115<0> | |
m12:4456:4572 [0] INFO NET/IB : Using interface eth0 for sideband communication | |
m12:4456:4572 [0] INFO NET/IB: [1] mlx5_2:1/IB | |
m12:4456:4572 [0] INFO NET/IB: [2] mlx5_1:1/RoCE | |
m12:4456:4572 [0] INFO NET/IB: [3] mlx5_0:1/IB | |
m12:4456:4572 [0] INFO Using internal Network IB | |
m13:45462:45667 [0] INFO CUDA Dev 0, IB Ports : mlx5_2/1(SOC) mlx5_1/1(PIX) mlx5_0/1(PIX) | |
m12:4456:4572 [0] INFO CUDA Dev 0, IB Ports : mlx5_2/1(SOC) mlx5_1/1(PIX) mlx5_0/1(PIX) | |
m13:45462:45667 [0] INFO Using 256 threads | |
m13:45462:45667 [0] INFO [0] Ring 0 : 0 1 | |
m13:45462:45667 [0] INFO [0] Ring 1 : 0 1 | |
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA. | |
m13:45462:45667 [0] INFO 1 -> 0 via NET/IB/1 | |
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA. | |
m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/1/GDRDMA | |
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA. | |
m13:45462:45667 [0] INFO 1 -> 0 via NET/IB/2 | |
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA. | |
m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/2/GDRDMA | |
INFO:tensorflow:loss = 2.30775, step = 1 | |
INFO:tensorflow:Saving checkpoints for 1 into ./checkpoints/model.ckpt. | |
INFO:tensorflow:loss = 2.29844, step = 1 | |
INFO:tensorflow:loss = 2.28173, step = 11 (0.430 sec) | |
INFO:tensorflow:loss = 2.29419, step = 11 (0.660 sec) | |
INFO:tensorflow:loss = 2.24214, step = 21 (0.413 sec) | |
INFO:tensorflow:loss = 2.24506, step = 21 (0.407 sec) | |
INFO:tensorflow:loss = 2.08036, step = 31 (0.373 sec) | |
INFO:tensorflow:loss = 2.12947, step = 31 (0.374 sec) | |
INFO:tensorflow:loss = 1.772, step = 41 (0.345 sec) | |
INFO:tensorflow:loss = 1.72418, step = 41 (0.345 sec) | |
INFO:tensorflow:loss = 1.17998, step = 51 (0.366 sec) | |
INFO:tensorflow:loss = 0.99576, step = 51 (0.365 sec) | |
INFO:tensorflow:loss = 1.2166, step = 61 (0.377 sec) | |
INFO:tensorflow:loss = 1.18057, step = 61 (0.379 sec) | |
INFO:tensorflow:loss = 2.65005, step = 71 (0.350 sec) | |
INFO:tensorflow:loss = 2.63248, step = 71 (0.352 sec) | |
INFO:tensorflow:loss = 0.766145, step = 81 (0.345 sec) | |
INFO:tensorflow:loss = 0.774453, step = 81 (0.345 sec) | |
INFO:tensorflow:loss = 0.940878, step = 91 (0.368 sec) | |
INFO:tensorflow:loss = 1.13829, step = 91 (0.371 sec) | |
INFO:tensorflow:Saving checkpoints for 100 into ./checkpoints/model.ckpt. |
[m12:14019] mca_base_component_repository_open: unable to open mca_oob_ud: libmca_common_verbs.so.40: cannot open shared object file: No such file or directory (ignored) | |
-------------------------------------------------------------------------- | |
WARNING: There is at least non-excluded one OpenFabrics device found, | |
but there are no active ports detected (or Open MPI was unable to use | |
them). This is most certainly not what you wanted. Check your | |
cables, subnet manager configuration, etc. The openib BTL will be | |
ignored for this job. | |
Local host: m12 | |
-------------------------------------------------------------------------- | |
Extracting MNIST-data-1/train-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/train-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz | |
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz | |
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz | |
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz | |
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz | |
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz | |
INFO:tensorflow:Create CheckpointSaverHook. | |
2017-11-06 12:09:21.228394: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:09:21.228439: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:09:21.228445: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:09:21.228451: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:09:21.228456: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:09:22.064501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: | |
name: Tesla M40 24GB | |
major: 5 minor: 2 memoryClockRate (GHz) 1.112 | |
pciBusID 0000:13:00.0 | |
Total memory: 22.40GiB | |
Free memory: 1.94GiB | |
2017-11-06 12:09:22.064541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 | |
2017-11-06 12:09:22.064547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y | |
2017-11-06 12:09:22.064558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0) | |
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-100 | |
2017-11-06 12:08:20.496613: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:08:20.496651: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:08:20.496658: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:08:20.496663: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:08:20.496668: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. | |
2017-11-06 12:08:21.147244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: | |
name: Tesla M40 24GB | |
major: 5 minor: 2 memoryClockRate (GHz) 1.112 | |
pciBusID 0000:13:00.0 | |
Total memory: 22.40GiB | |
Free memory: 14.11GiB | |
2017-11-06 12:08:21.147284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 | |
2017-11-06 12:08:21.147291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y | |
2017-11-06 12:08:21.147300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0) | |
[m13:44317] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found | |
[m13:44317] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages | |
m13:44370:44598 [0] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:44370:44598 [0] INFO NET/IB : Using interface eth2 for sideband communication | |
m13:44370:44598 [0] INFO Using internal Network Socket | |
m13:44370:44598 [0] INFO NET : Using interface eth2:172.23.233.75<0> | |
m13:44370:44598 [0] INFO NET/Socket : 1 interfaces found | |
NCCL version 2.0.5 compiled with CUDA 8.0 | |
m12:14023:14113 [0] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:14023:14113 [0] INFO NET/IB : Using interface eth2 for sideband communication | |
m12:14023:14113 [0] INFO Using internal Network Socket | |
m12:14023:14113 [0] INFO NET : Using interface eth2:172.23.233.77<0> | |
m12:14023:14113 [0] INFO NET/Socket : 1 interfaces found | |
m13:44370:44598 [0] INFO Using 256 threads | |
m13:44370:44598 [0] INFO [0] Ring 0 : 0 1 | |
m12:14023:14113 [0] INFO 0 -> 1 via NET/Socket/0 | |
m13:44370:44598 [0] INFO 1 -> 0 via NET/Socket/0 | |
INFO:tensorflow:Saving checkpoints for 101 into ./checkpoints/model.ckpt. | |
INFO:tensorflow:loss = 0.554918, step = 101 | |
INFO:tensorflow:loss = 0.560171, step = 101 |
I installed Horovod (https://github.com/uber/horovod) with RDMA and GPUDirect, following https://github.com/uber/horovod/blob/master/docs/gpus.md#advanced-have-gpus-and-networking-with-rdma-and-gpudirect
When I run the example tensorflow_mnist.py on two machines with the command:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -x NCCL_DEBUG=INFO -mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0 -H 10.150.144.115:1,10.150.144.114:1 python tensorflow_mnist.py
the job hangs and prints no further log output. The processes on both machines are holding GPU memory. The logs are in the file logs.
However, if I run on a single machine, there is no problem:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.75:2 python tensorflow_mnist.py
or
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:2 python tensorflow_mnist.py
Both commands work properly.
My system is Ubuntu 14.04 with TensorFlow 1.3.0 and Python 2.7. Horovod was installed following the install guide, using NCCL 2.0 and Open MPI 3.0. What could be the problem?
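To compare the hanging run with the runs that train, the NCCL_DEBUG=INFO output can be reduced to the transport each rank reports. The following is a minimal sketch, not part of Horovod or NCCL: the function name summarize_routes and the regular expression are assumptions based only on the "A -> B via ..." line format seen in the logs above.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the logs in this gist, e.g.:
#   m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/1/GDRDMA
ROUTE_RE = re.compile(
    r"^(?P<host>\S+?):\d+:\d+\s+\[\d+\]\s+INFO\s+"
    r"(?P<src>\d+)\s+->\s+(?P<dst>\d+)\s+via\s+(?P<transport>\S+)"
)


def summarize_routes(lines):
    """Group (src_rank, dst_rank, transport) tuples by reporting host.

    Lines that do not describe a rank-to-rank route are skipped.
    """
    routes = defaultdict(list)
    for line in lines:
        match = ROUTE_RE.match(line.strip())
        if match is None:
            continue
        routes[match.group("host")].append(
            (int(match.group("src")), int(match.group("dst")), match.group("transport"))
        )
    return dict(routes)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_routes.py < logs
    for host, entries in sorted(summarize_routes(sys.stdin).items()):
        for src, dst, transport in entries:
            print(f"{host}: rank {src} -> rank {dst} via {transport}")
```

Applied to the run that trains over IB, this lists m12 reporting NET/IB/1/GDRDMA while m13 reports NET/IB/1 next to the "No module present for GPU Direct RDMA" warning. Whether that asymmetry is related to the hang is not something the logs confirm.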