@jpkenny
Last active November 29, 2018 18:19
Trouble verifying btl for tcp and RoCE
Hi,
I’m trying to do some RoCE benchmarking on a cluster with Mellanox HCAs:
02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
MLNX_OFED_LINUX-4.4-2.0.7.0
I’m finding it quite challenging to determine which BTL is actually being used from Open MPI’s debug output. I’m using Open MPI 4.0.0 (along with a handful of older releases). For example, here’s the command line I use to run a 16-node HPL test, trying to ensure that internode communication goes over a RoCE-capable BTL rather than TCP:
/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 64 -N 4 -hostfile hosts.txt ./xhpl
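In case it matters, I also sometimes turn up PML verbosity so I can see which PML component each rank selects (a sketch of the same run as above, with one extra MCA parameter):

```shell
# Same run as above, but also asking the PML framework to report
# which component it selects on each rank (e.g. ob1 vs ucx vs cm).
/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun \
    --mca btl_base_verbose 100 \
    --mca pml_base_verbose 100 \
    --mca btl ^tcp \
    -n 64 -N 4 -hostfile hosts.txt ./xhpl
```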
Among the interesting debug messages I see are messages of the form:
[en257.eth:118902] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; skipped
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: en254
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[en262.eth:103810] select: init of component openib returned failure
[en264.eth:171198] select: init of component openib returned failure
[en264.eth:171198] mca: base: close: component openib closed
[en264.eth:171198] mca: base: close: unloading component openib
[en264.eth:171198] select: initializing btl component uct
[en264.eth:171198] select: init of component uct returned failure
[en264.eth:171198] mca: base: close: component uct closed
[en264.eth:171198] mca: base: close: unloading component uct
So, it looks to me like the openib and uct transports are both failing, yet when I read out RDMA counters with ethtool I see that the bulk of the traffic is somehow going over RDMA (eth2 is the MT27800):
ib counters before:
rx_vport_rdma_unicast_packets: 115943830
rx_vport_rdma_unicast_bytes: 195602189248
tx_vport_rdma_unicast_packets: 273170117
tx_vport_rdma_unicast_bytes: 374057100818
eth0 counters before:
RX packets 87474728 bytes 43335706060 (40.3 GiB)
TX packets 61137838 bytes 71187999781 (66.2 GiB)
eth2 counters before:
RX packets 49490077 bytes 81084834515 (75.5 GiB)
TX packets 532970764 bytes 1742134134428 (1.5 TiB)
ib counters after:
rx_vport_rdma_unicast_packets: 117188033
rx_vport_rdma_unicast_bytes: 200088022302
tx_vport_rdma_unicast_packets: 274456328
tx_vport_rdma_unicast_bytes: 378587627052
eth0 counters after:
RX packets 87481208 bytes 43336915153 (40.3 GiB)
TX packets 61143485 bytes 71189606766 (66.3 GiB)
eth2 counters after:
RX packets 49490077 bytes 81084834515 (75.5 GiB)
TX packets 532970764 bytes 1742134134428 (1.5 TiB)
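For reference, the per-run deltas fall straight out of the samples above (a minimal sketch; the rdma vport counters were read with `ethtool -S eth2`, and the literals below are copied from the before/after output):

```shell
# Deltas computed from the before/after counter samples above.
# The tx_vport_rdma_* values come from `ethtool -S eth2`.
tx_rdma_before=374057100818
tx_rdma_after=378587627052
eth2_tx_before=1742134134428
eth2_tx_after=1742134134428

echo "rdma tx delta: $((tx_rdma_after - tx_rdma_before)) bytes"
echo "eth2 tx delta: $((eth2_tx_after - eth2_tx_before)) bytes"
```

So roughly 4.5 GB went out via RDMA during the run, while the eth2 interface counters didn’t move at all.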
Yet, looking at the debug output after xhpl runs, I only see vader and self getting unloaded. The evidence suggests that there is no working internode BTL, yet the job runs properly and it looks like RDMA transfers are occurring. Equally perplexing behavior is observed when I exclude openib/uct and expect to run over TCP. What’s actually going on here?
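One experiment I’m considering (just a sketch, assuming the UCX PML is built into this install and could be carrying traffic independently of the BTLs) is to pin the PML to ob1 with an explicit BTL list, so nothing else can silently take over:

```shell
# Force the ob1 PML (which routes traffic through BTLs directly) and an
# explicit BTL list, so no other PML (e.g. ucx) can carry the traffic.
/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun \
    --mca pml ob1 \
    --mca btl self,vader,tcp \
    -n 64 -N 4 -hostfile hosts.txt ./xhpl
```

If the RDMA counters stop moving under that configuration, that would at least confirm where the traffic is actually going.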
I’ll attach output from ompi_info along with the debug output that I’m referring to. I tried to include a compressed config.log, but the message was too big.
Thanks,
Joe