@tobyyouup
Last active November 13, 2017 06:48
Running distributed tensorflow_mnist.py on two machines hangs and does not show any training log
m12:2037:2131 [0] INFO NET : Using interface eth2:172.23.233.77<0>
m12:2037:2131 [0] INFO NET/IB : Using interface eth2 for sideband communication
m12:2037:2131 [0] INFO NET/IB: [3] mlx5_0:1/IB
m12:2037:2131 [0] INFO Using internal Network IB
NCCL version 2.0.5 compiled with CUDA 8.0
m12:2039:2134 [2] INFO NET : Using interface eth2:172.23.233.77<0>
m12:2039:2134 [2] INFO NET/IB : Using interface eth2 for sideband communication
m12:2041:2220 [4] INFO NET : Using interface eth2:172.23.233.77<0>
m12:2041:2220 [4] INFO NET/IB : Using interface eth2 for sideband communication
m13:42973:43395 [0] INFO NET : Using interface eth2:172.23.233.75<0>
m13:42973:43395 [0] INFO NET/IB : Using interface eth2 for sideband communication
m13:42975:43396 [2] INFO NET : Using interface eth2:172.23.233.75<0>
m13:42975:43396 [2] INFO NET/IB : Using interface eth2 for sideband communication
m13:42977:43394 [4] INFO NET : Using interface eth2:172.23.233.75<0>
m13:42977:43394 [4] INFO NET/IB : Using interface eth2 for sideband communication
m13:42973:43395 [0] INFO NET/IB: [3] mlx5_0:1/IB
m13:42973:43395 [0] INFO Using internal Network IB
m13:42977:43394 [4] INFO NET/IB: [3] mlx5_0:1/IB
m13:42975:43396 [2] INFO NET/IB: [3] mlx5_0:1/IB
m13:42975:43396 [2] INFO Using internal Network IB
m13:42977:43394 [4] INFO Using internal Network IB
m12:2041:2220 [4] INFO NET/IB: [3] mlx5_0:1/IB
m12:2041:2220 [4] INFO Using internal Network IB
m12:2039:2134 [2] INFO NET/IB: [3] mlx5_0:1/IB
m12:2039:2134 [2] INFO Using internal Network IB
m13:42974:43409 [1] INFO NET : Using interface eth2:172.23.233.75<0>
m13:42974:43409 [1] INFO NET/IB : Using interface eth2 for sideband communication
m13:42976:43403 [3] INFO NET : Using interface eth2:172.23.233.75<0>
m13:42976:43403 [3] INFO NET/IB : Using interface eth2 for sideband communication
m13:42978:43404 [5] INFO NET : Using interface eth2:172.23.233.75<0>
m13:42978:43404 [5] INFO NET/IB : Using interface eth2 for sideband communication
m12:2038:2123 [1] INFO NET : Using interface eth2:172.23.233.77<0>
m12:2038:2123 [1] INFO NET/IB : Using interface eth2 for sideband communication
m12:2040:2128 [3] INFO NET : Using interface eth2:172.23.233.77<0>
m12:2040:2128 [3] INFO NET/IB : Using interface eth2 for sideband communication
m12:2042:2122 [5] INFO NET : Using interface eth2:172.23.233.77<0>
m12:2042:2122 [5] INFO NET/IB : Using interface eth2 for sideband communication
m12:2040:2128 [3] INFO NET/IB: [3] mlx5_0:1/IB
m12:2040:2128 [3] INFO Using internal Network IB
m12:2038:2123 [1] INFO NET/IB: [3] mlx5_0:1/IB
m12:2042:2122 [5] INFO NET/IB: [3] mlx5_0:1/IB
m12:2042:2122 [5] INFO Using internal Network IB
m13:42974:43409 [1] INFO NET/IB: [3] mlx5_0:1/IB
m13:42974:43409 [1] INFO Using internal Network IB
m13:42976:43403 [3] INFO NET/IB: [3] mlx5_0:1/IB
m13:42978:43404 [5] INFO NET/IB: [3] mlx5_0:1/IB
m13:42978:43404 [5] INFO Using internal Network IB
m12:2038:2123 [1] INFO Using internal Network IB
m13:42976:43403 [3] INFO Using internal Network IB
m13:42973:43395 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(PIX)
m12:2041:2220 [4] INFO CUDA Dev 4, IB Ports : mlx5_0/1(SOC)
m12:2037:2131 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(PIX)
m13:42975:43396 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
m13:42977:43394 [4] INFO CUDA Dev 4, IB Ports : mlx5_0/1(SOC)
m12:2039:2134 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
m13:42976:43403 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
m13:42974:43409 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(PIX)
m13:42978:43404 [5] INFO CUDA Dev 5, IB Ports : mlx5_0/1(SOC)
m12:2040:2128 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
m12:2042:2122 [5] INFO CUDA Dev 5, IB Ports : mlx5_0/1(SOC)
m12:2038:2123 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(PIX)
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO Using 256 threads
m12:2037:2131 [0] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8 9 10 11
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2038:2123 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2042:2122 [5] INFO 5 -> 4 via P2P/IPC
m12:2038:2123 [1] INFO 1 -> 0 via P2P/IPC
m12:2040:2128 [3] INFO 3 -> 2 via P2P/IPC
m12:2041:2220 [4] INFO 4 -> 3 via P2P/IPC
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2040:2128 [3] INFO 3 -> 4 via P2P/IPC
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2041:2220 [4] INFO 4 -> 5 via P2P/IPC
m12:2038:2123 [1] INFO 1 -> 2 via direct shared memory
m12:2037:2131 [0] INFO 11 -> 0 via NET/IB/0/GDRDMA
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2037:2131 [0] INFO 0 -> 1 via P2P/IPC
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42974:43409 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO 9 -> 8 via P2P/IPC
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:2039:2134 [2] INFO 2 -> 3 via P2P/IPC
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42976:43403 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO 10 -> 9 via P2P/IPC
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42977:43394 [4] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42978:43404 [5] INFO 11 -> 10 via P2P/IPC
m13:42976:43403 [3] INFO 9 -> 10 via P2P/IPC
m13:42977:43394 [4] INFO 10 -> 11 via P2P/IPC
m13:42974:43409 [1] INFO 7 -> 6 via P2P/IPC
m13:42974:43409 [1] INFO 7 -> 8 via direct shared memory
m13:42973:43395 [0] INFO 5 -> 6 via NET/IB/0/GDRDMA
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42973:43395 [0] INFO 6 -> 7 via P2P/IPC
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:42975:43396 [2] INFO 8 -> 9 via P2P/IPC
m12:25666:25697 [0] INFO NET : Using interface eth2:172.23.233.77<0>
m12:25666:25697 [0] INFO NET/IB : Using interface eth2 for sideband communication
m12:25666:25697 [0] INFO Using internal Network Socket
m12:25666:25697 [0] INFO NET : Using interface eth2:172.23.233.77<0>
m12:25666:25697 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.0.5 compiled with CUDA 8.0
m12:25667:25694 [1] INFO NET : Using interface eth2:172.23.233.77<0>
m12:25667:25694 [1] INFO NET/IB : Using interface eth2 for sideband communication
m13:4865:5012 [0] INFO NET : Using interface eth2:172.23.233.75<0>
m13:4865:5012 [0] INFO NET/IB : Using interface eth2 for sideband communication
m13:4866:5009 [1] INFO NET : Using interface eth2:172.23.233.75<0>
m13:4866:5009 [1] INFO NET/IB : Using interface eth2 for sideband communication
m12:25667:25694 [1] INFO Using internal Network Socket
m13:4865:5012 [0] INFO Using internal Network Socket
m13:4866:5009 [1] INFO Using internal Network Socket
m13:4866:5009 [1] INFO NET : Using interface eth2:172.23.233.75<0>
m13:4866:5009 [1] INFO NET/Socket : 1 interfaces found
m13:4865:5012 [0] INFO NET : Using interface eth2:172.23.233.75<0>
m13:4865:5012 [0] INFO NET/Socket : 1 interfaces found
m12:25667:25694 [1] INFO NET : Using interface eth2:172.23.233.77<0>
m12:25667:25694 [1] INFO NET/Socket : 1 interfaces found
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO Using 256 threads
m12:25666:25697 [0] INFO [0] Ring 0 : 0 1 2 3
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25667:25694 [1] INFO 1 -> 0 via P2P/IPC
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4866:5009 [1] INFO 3 -> 2 via P2P/IPC
m12:25666:25697 [0] INFO 3 -> 0 via NET/Socket/0
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m12:25666:25697 [0] INFO 0 -> 1 via P2P/IPC
m13:4865:5012 [0] INFO 1 -> 2 via NET/Socket/0
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
m13:4865:5012 [0] INFO 2 -> 3 via P2P/IPC
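The `A -> B via T` lines in the two NCCL logs above are easy to lose among the repeated `nvmlDeviceGetNvLinkCapability()` messages. The following is a minimal sketch, not part of the original gist, of how those lines could be extracted and the transports counted. The helper names `parse_edges` and `transport_counts` are hypothetical, and the regular expression only assumes the line layout visible in the logs above.

```python
import re
from collections import Counter

# Matches NCCL INFO lines of the form seen in the logs above, e.g.
#   m12:2037:2131 [0] INFO 11 -> 0 via NET/IB/0/GDRDMA
EDGE_RE = re.compile(
    r"^(?P<host>\S+?):\d+:\d+ \[\d+\] INFO "
    r"(?P<src>\d+) -> (?P<dst>\d+) via (?P<transport>.+)$"
)


def parse_edges(lines):
    """Return (host, src_rank, dst_rank, transport) for each 'A -> B via T' line."""
    edges = []
    for line in lines:
        m = EDGE_RE.match(line.strip())
        if m:
            edges.append(
                (m.group("host"), int(m.group("src")), int(m.group("dst")), m.group("transport"))
            )
    return edges


def transport_counts(edges):
    """Count how many connections use each transport string."""
    return Counter(transport for _, _, _, transport in edges)


# A few lines copied from the first log above.
sample = [
    "m12:2037:2131 [0] INFO 11 -> 0 via NET/IB/0/GDRDMA",
    "m12:2037:2131 [0] INFO 0 -> 1 via P2P/IPC",
    "m12:2038:2123 [1] INFO 1 -> 2 via direct shared memory",
    "m12:2037:2131 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported",
]

edges = parse_edges(sample)
counts = transport_counts(edges)
for transport, n in sorted(counts.items()):
    print(f"{n:3d}  {transport}")
```

Running this over a full log gives a quick view of which hops go over the network (`NET/IB/...` in the first run, `NET/Socket/...` in the second) and which stay inside a node.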
**training log for model**
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 12.18.1000
node_guid: 9cdc:71ff:ff42:f5d1
sys_image_guid: 9cdc:71ff:ff42:f5d0
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2190110032
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xc17e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
XRC
Unknown flags: 0xc16e0000
device_cap_exp_flags: 0x504060F100000000
EXP_DC_TRANSPORT
EXP_CROSS_CHANNEL
EXP_MR_ALLOCATE
EXT_ATOMICS
EXT_SEND NOP
EXP_UMR
EXP_ODP
EXP_DC_INFO
EXP_MASKED_ATOMICS
Unknown flags: 0x200000000000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
log atomic arg sizes (mask) 0x8
masked_log_atomic_arg_sizes (mask) 0x3c
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34
max fetch and add bit boundary 64
log max atomic inline 5
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 48
max_total_mcast_qp_attach: 100663296
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
hca_core_clock: 156250
max_klm_list_size: 65536
max_send_wqe_inline_klms: 20
max_umr_recursion_depth: 4
max_umr_stride_dimension: 1
general_odp_caps:
ODP_SUPPORT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
dc_odp_caps:
NO SUPPORT
xrc_odp_caps:
NO SUPPORT
raw_eth_odp_caps:
NO SUPPORT
max_dct: 262144
max_device_ctx: 1020
Multi-Packet RQ is not supported
rx_pad_end_addr_align: 64
tso_caps:
max_tso: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
Device ports:
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 65535
port_lmc: 0x00
link_layer: InfiniBand
max_msg_sz: 0x40000000
port_cap_flags: 0x2651e848
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 8
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: invalid speed (0)
phys_state: DISABLED (3)
GID[ 0]: fe80:0000:0000:0000:9cdc:71ff:ff42:f5d1
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 12.18.1000
node_guid: 9cdc:71ff:ff42:f5d0
sys_image_guid: 9cdc:71ff:ff42:f5d0
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2190110032
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xc17e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
XRC
Unknown flags: 0xc16e0000
device_cap_exp_flags: 0x504060F100000000
EXP_DC_TRANSPORT
EXP_CROSS_CHANNEL
EXP_MR_ALLOCATE
EXT_ATOMICS
EXT_SEND NOP
EXP_UMR
EXP_ODP
EXP_DC_INFO
EXP_MASKED_ATOMICS
Unknown flags: 0x200000000000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
log atomic arg sizes (mask) 0x8
masked_log_atomic_arg_sizes (mask) 0x3c
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34
max fetch and add bit boundary 64
log max atomic inline 5
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 48
max_total_mcast_qp_attach: 100663296
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
hca_core_clock: 156250
max_klm_list_size: 65536
max_send_wqe_inline_klms: 20
max_umr_recursion_depth: 4
max_umr_stride_dimension: 1
general_odp_caps:
ODP_SUPPORT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
dc_odp_caps:
NO SUPPORT
xrc_odp_caps:
NO SUPPORT
raw_eth_odp_caps:
NO SUPPORT
max_dct: 262144
max_device_ctx: 1020
Multi-Packet RQ is not supported
rx_pad_end_addr_align: 64
tso_caps:
max_tso: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 10
port_lmc: 0x00
link_layer: InfiniBand
max_msg_sz: 0x40000000
port_cap_flags: 0x2651e848
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 8
subnet_timeout: 18
init_type_reply: 0
active_width: 4X (2)
active_speed: 25.0 Gbps (32)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:9cdc:71ff:ff42:f5d0
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 12.18.1000
node_guid: 9cdc:71ff:ff42:f599
sys_image_guid: 9cdc:71ff:ff42:f598
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2190110032
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0x65721c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
XRC
Unknown flags: 0x65620000
device_cap_exp_flags: 0x5001F8F000000000
EXP_CROSS_CHANNEL
EXP_MR_ALLOCATE
EXT_ATOMICS
EXT_SEND NOP
EXP_UMR
EXP_ODP
EXP_RX_CSUM_TCP_UDP_PKT
EXP_RX_CSUM_IP_PKT
EXP_MASKED_ATOMICS
EXP_RX_TCP_UDP_PKT_TYPE
EXP_SCATTER_FCS
Unknown flags: 0x200000000000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
log atomic arg sizes (mask) 0x8
masked_log_atomic_arg_sizes (mask) 0x3c
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34
max fetch and add bit boundary 64
log max atomic inline 5
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 48
max_total_mcast_qp_attach: 100663296
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
hca_core_clock: 156250
max_klm_list_size: 65536
max_send_wqe_inline_klms: 20
max_umr_recursion_depth: 4
max_umr_stride_dimension: 1
general_odp_caps:
rc_odp_caps:
NO SUPPORT
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
NO SUPPORT
dc_odp_caps:
NO SUPPORT
xrc_odp_caps:
NO SUPPORT
raw_eth_odp_caps:
NO SUPPORT
max_dct: 0
max_device_ctx: 1020
Multi-Packet RQ is not supported
VLAN offloads caps:
C-VLAN stripping offload
C-VLAN insertion offload
rx_pad_end_addr_align: 64
tso_caps:
max_tso: 262144
supported_qp:
SUPPORT_RAW_PACKET
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
max_msg_sz: 0x40000000
port_cap_flags: 0x04010000
max_vl_num: invalid value (0)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 1
gid_tbl_len: 256
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: 10.0 Gbps (4)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:9edc:71ff:fe42:f599
GID[ 1]: fe80:0000:0000:0000:9edc:71ff:fe42:f599
GID[ 2]: 0000:0000:0000:0000:0000:ffff:ac17:e94b
GID[ 3]: 0000:0000:0000:0000:0000:ffff:ac17:e94b
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.18.1000
node_guid: 9cdc:71ff:ff42:f598
sys_image_guid: 9cdc:71ff:ff42:f598
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2190110032
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xc17e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
XRC
Unknown flags: 0xc16e0000
device_cap_exp_flags: 0x504060F100000000
EXP_DC_TRANSPORT
EXP_CROSS_CHANNEL
EXP_MR_ALLOCATE
EXT_ATOMICS
EXT_SEND NOP
EXP_UMR
EXP_ODP
EXP_DC_INFO
EXP_MASKED_ATOMICS
Unknown flags: 0x200000000000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
log atomic arg sizes (mask) 0x8
masked_log_atomic_arg_sizes (mask) 0x3c
masked_log_atomic_arg_sizes_network_endianness (mask) 0x34
max fetch and add bit boundary 64
log max atomic inline 5
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 48
max_total_mcast_qp_attach: 100663296
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
hca_core_clock: 156250
max_klm_list_size: 65536
max_send_wqe_inline_klms: 20
max_umr_recursion_depth: 4
max_umr_stride_dimension: 1
general_odp_caps:
ODP_SUPPORT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
dc_odp_caps:
NO SUPPORT
xrc_odp_caps:
NO SUPPORT
raw_eth_odp_caps:
NO SUPPORT
max_dct: 262144
max_device_ctx: 1020
Multi-Packet RQ is not supported
rx_pad_end_addr_align: 64
tso_caps:
max_tso: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 9
port_lmc: 0x00
link_layer: InfiniBand
max_msg_sz: 0x40000000
port_cap_flags: 0x2651e848
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 8
subnet_timeout: 18
init_type_reply: 0
active_width: 4X (2)
active_speed: 25.0 Gbps (32)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:9cdc:71ff:ff42:f598
[m12:09052] mca_base_component_repository_open: unable to open mca_oob_ud: libmca_common_verbs.so.40: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'm13', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: msragpum13
Local device: mlx5_1
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
Extracting MNIST-data-0/train-images-idx3-ubyte.gz
Extracting MNIST-data-1/train-images-idx3-ubyte.gz
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz
INFO:tensorflow:Create CheckpointSaverHook.
2017-11-06 11:11:36.314355: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:11:36.314395: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:11:36.314402: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:11:36.314408: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:11:36.314414: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:11:37.074464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:13:00.0
Total memory: 22.40GiB
Free memory: 1.94GiB
2017-11-06 11:11:37.074503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-06 11:11:37.074510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-06 11:11:37.074525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0)
2017-11-06 11:10:34.707113: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:10:34.707152: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:10:34.707159: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:10:34.707164: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:10:34.707170: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:10:35.415034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:13:00.0
Total memory: 22.40GiB
Free memory: 14.11GiB
2017-11-06 11:10:35.415081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-06 11:10:35.415089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-06 11:10:35.415100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0)
[m13:18029] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[m13:18029] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[m13:18029] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[m12:04452] mca_base_component_repository_open: unable to open mca_oob_ud: libmca_common_verbs.so.40: cannot open shared object file: No such file or directory (ignored)
Extracting MNIST-data-1/train-images-idx3-ubyte.gz
Extracting MNIST-data-0/train-images-idx3-ubyte.gz
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz
INFO:tensorflow:Create CheckpointSaverHook.
2017-11-06 11:37:37.821414: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:37:37.821453: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:37:37.821460: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:37:37.821466: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:37:37.821472: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:36:34.567018: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:36:34.567059: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:36:34.567066: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:36:34.567072: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:36:34.567078: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 11:37:38.670994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:13:00.0
Total memory: 22.40GiB
Free memory: 1.94GiB
2017-11-06 11:37:38.671038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-06 11:37:38.671045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-06 11:37:38.671055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0)
2017-11-06 11:36:35.740282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:13:00.0
Total memory: 22.40GiB
Free memory: 14.11GiB
2017-11-06 11:36:35.740434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-06 11:36:35.740442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-06 11:36:35.740453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0)
m13:45462:45667 [0] INFO NET : Using interface eth0:10.150.144.114<0>
m13:45462:45667 [0] INFO NET/IB : Using interface eth0 for sideband communication
m13:45462:45667 [0] INFO NET/IB: [1] mlx5_2:1/IB
m13:45462:45667 [0] INFO NET/IB: [2] mlx5_1:1/RoCE
m13:45462:45667 [0] INFO NET/IB: [3] mlx5_0:1/IB
m13:45462:45667 [0] INFO Using internal Network IB
NCCL version 2.0.5 compiled with CUDA 8.0
m12:4456:4572 [0] INFO NET : Using interface eth0:10.150.144.115<0>
m12:4456:4572 [0] INFO NET/IB : Using interface eth0 for sideband communication
m12:4456:4572 [0] INFO NET/IB: [1] mlx5_2:1/IB
m12:4456:4572 [0] INFO NET/IB: [2] mlx5_1:1/RoCE
m12:4456:4572 [0] INFO NET/IB: [3] mlx5_0:1/IB
m12:4456:4572 [0] INFO Using internal Network IB
m13:45462:45667 [0] INFO CUDA Dev 0, IB Ports : mlx5_2/1(SOC) mlx5_1/1(PIX) mlx5_0/1(PIX)
m12:4456:4572 [0] INFO CUDA Dev 0, IB Ports : mlx5_2/1(SOC) mlx5_1/1(PIX) mlx5_0/1(PIX)
m13:45462:45667 [0] INFO Using 256 threads
m13:45462:45667 [0] INFO [0] Ring 0 : 0 1
m13:45462:45667 [0] INFO [0] Ring 1 : 0 1
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA.
m13:45462:45667 [0] INFO 1 -> 0 via NET/IB/1
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA.
m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/1/GDRDMA
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA.
m13:45462:45667 [0] INFO 1 -> 0 via NET/IB/2
m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA.
m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/2/GDRDMA
INFO:tensorflow:loss = 2.30775, step = 1
INFO:tensorflow:Saving checkpoints for 1 into ./checkpoints/model.ckpt.
INFO:tensorflow:loss = 2.29844, step = 1
INFO:tensorflow:loss = 2.28173, step = 11 (0.430 sec)
INFO:tensorflow:loss = 2.29419, step = 11 (0.660 sec)
INFO:tensorflow:loss = 2.24214, step = 21 (0.413 sec)
INFO:tensorflow:loss = 2.24506, step = 21 (0.407 sec)
INFO:tensorflow:loss = 2.08036, step = 31 (0.373 sec)
INFO:tensorflow:loss = 2.12947, step = 31 (0.374 sec)
INFO:tensorflow:loss = 1.772, step = 41 (0.345 sec)
INFO:tensorflow:loss = 1.72418, step = 41 (0.345 sec)
INFO:tensorflow:loss = 1.17998, step = 51 (0.366 sec)
INFO:tensorflow:loss = 0.99576, step = 51 (0.365 sec)
INFO:tensorflow:loss = 1.2166, step = 61 (0.377 sec)
INFO:tensorflow:loss = 1.18057, step = 61 (0.379 sec)
INFO:tensorflow:loss = 2.65005, step = 71 (0.350 sec)
INFO:tensorflow:loss = 2.63248, step = 71 (0.352 sec)
INFO:tensorflow:loss = 0.766145, step = 81 (0.345 sec)
INFO:tensorflow:loss = 0.774453, step = 81 (0.345 sec)
INFO:tensorflow:loss = 0.940878, step = 91 (0.368 sec)
INFO:tensorflow:loss = 1.13829, step = 91 (0.371 sec)
INFO:tensorflow:Saving checkpoints for 100 into ./checkpoints/model.ckpt.
[m12:14019] mca_base_component_repository_open: unable to open mca_oob_ud: libmca_common_verbs.so.40: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: m12
--------------------------------------------------------------------------
Extracting MNIST-data-1/train-images-idx3-ubyte.gz
Extracting MNIST-data-0/train-images-idx3-ubyte.gz
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz
INFO:tensorflow:Create CheckpointSaverHook.
2017-11-06 12:09:21.228394: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:09:21.228439: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:09:21.228445: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:09:21.228451: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:09:21.228456: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:09:22.064501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:13:00.0
Total memory: 22.40GiB
Free memory: 1.94GiB
2017-11-06 12:09:22.064541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-06 12:09:22.064547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-06 12:09:22.064558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0)
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-100
2017-11-06 12:08:20.496613: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:08:20.496651: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:08:20.496658: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:08:20.496663: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:08:20.496668: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-06 12:08:21.147244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:13:00.0
Total memory: 22.40GiB
Free memory: 14.11GiB
2017-11-06 12:08:21.147284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-06 12:08:21.147291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-06 12:08:21.147300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:13:00.0)
[m13:44317] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[m13:44317] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
m13:44370:44598 [0] INFO NET : Using interface eth2:172.23.233.75<0>
m13:44370:44598 [0] INFO NET/IB : Using interface eth2 for sideband communication
m13:44370:44598 [0] INFO Using internal Network Socket
m13:44370:44598 [0] INFO NET : Using interface eth2:172.23.233.75<0>
m13:44370:44598 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.0.5 compiled with CUDA 8.0
m12:14023:14113 [0] INFO NET : Using interface eth2:172.23.233.77<0>
m12:14023:14113 [0] INFO NET/IB : Using interface eth2 for sideband communication
m12:14023:14113 [0] INFO Using internal Network Socket
m12:14023:14113 [0] INFO NET : Using interface eth2:172.23.233.77<0>
m12:14023:14113 [0] INFO NET/Socket : 1 interfaces found
m13:44370:44598 [0] INFO Using 256 threads
m13:44370:44598 [0] INFO [0] Ring 0 : 0 1
m12:14023:14113 [0] INFO 0 -> 1 via NET/Socket/0
m13:44370:44598 [0] INFO 1 -> 0 via NET/Socket/0
INFO:tensorflow:Saving checkpoints for 101 into ./checkpoints/model.ckpt.
INFO:tensorflow:loss = 0.554918, step = 101
INFO:tensorflow:loss = 0.560171, step = 101

tobyyouup commented Nov 6, 2017

I installed Horovod (https://github.com/uber/horovod) with RDMA and GPUDirect, following https://github.com/uber/horovod/blob/master/docs/gpus.md#advanced-have-gpus-and-networking-with-rdma-and-gpudirect

When I run the example code tensorflow_mnist.py on two machines with this command:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -x NCCL_DEBUG=INFO -mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0 -H 10.150.144.115:1,10.150.144.114:1 python tensorflow_mnist.py

the job hangs and prints no further log output. The processes on both machines are using GPU memory. The logs are in the file named logs.

However, if I run on a single machine, there is no problem:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.75:2 python tensorflow_mnist.py
or
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:2 python tensorflow_mnist.py
Both commands work properly.

My system is Ubuntu 14.04 with TensorFlow 1.3.0 and Python 2.7. Horovod was installed following the install guide, using NCCL 2.0 and Open MPI 3.0. What could the problem be?
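
To compare what each host reports, it may help to pull the "src -> dst via transport" lines out of the NCCL_DEBUG=INFO output. In the logs file, for example, m12 reports NET/IB/1/GDRDMA while m13 reports NET/IB/1 together with a "No module present for GPU Direct RDMA" warning. The following is only a minimal sketch: the function name summarize_transports and the regular expression are hypothetical helpers written for this log format, not part of NCCL or Horovod, and the script only summarizes the log without diagnosing the hang.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the logs file in this gist:
#   m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/1/GDRDMA
LINK_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] INFO "
    r"(?P<src>\d+) -> (?P<dst>\d+) via (?P<transport>\S+)"
)


def summarize_transports(lines):
    """Group 'src -> dst via transport' entries by the host that logged them."""
    by_host = defaultdict(list)
    for line in lines:
        match = LINK_RE.match(line.strip())
        if match:
            by_host[match.group("host")].append(
                (
                    int(match.group("src")),
                    int(match.group("dst")),
                    match.group("transport"),
                )
            )
    return dict(by_host)


# A few lines copied from the logs file; WARN and Ring lines are skipped.
sample = [
    "m13:45462:45667 [0] INFO [0] Ring 0 : 0 1",
    "m13:45462:45667 [0] transport/net_ib.cu:192 WARN No module present for GPU Direct RDMA.",
    "m13:45462:45667 [0] INFO 1 -> 0 via NET/IB/1",
    "m12:4456:4572 [0] INFO 0 -> 1 via NET/IB/1/GDRDMA",
]

for host, links in sorted(summarize_transports(sample).items()):
    print(host, links)
```

To run it over the whole file instead of the sample, pass an open file object, e.g. summarize_transports(open("logs")).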
