[root@lava ~]# prtdiag | head
System Configuration: Supermicro SSG-2028R-ACR24L
[root@volcano ~]# prtdiag | head
System Configuration: Dell Inc. Joyent-Compute-Platform-3302
A bhyve instance with 8 VCPUs and 4GB RAM acts as an iperf client on lava, talking to an iperf server in the volcano GZ, over a 10Gbit ixgbe link.
[root@volcano ~]# /zones/jlevon/iperf -v
iperf version 2.0.5 (08 Jul 2010) pthreads
[root@volcano ~]# /zones/jlevon/iperf -s -B 192.168.9.2
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 192.168.9.2
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
The aim of the test is to see if pinning the bhyve VCPUs (https://cr.joyent.us/#/c/3703/) improves performance across the 10Gbit link. Previous testing showed some (though not huge) advantage for iperf between two co-located bhyve instances.
To double-check that the NIC/host setup can hit line rate, I first ran iperf between the two GZs: with -P4 we get ~9.4Gbits/s.
Running https://github.com/jlevon/grot/blob/master/compare-pinning.sh
See pinning.png.
Note that we periodically reboot the VM, which can cause the bindings to change.
The results are basically bimodal in both cases: if we end up on the second socket, we can't do much better than 4Gbits/s max. Some DTracing shows that ixgbe tx interrupts are happening on CPU12 (socket 1), while viona tx sits on socket 2.
With VCPU pinning, since we prefer a single socket for all pinned threads, once we land on the bad socket we're going to keep getting that kind of result until a reboot; unpinned, we basically jump between the two modes.
Looking at the mean of a bunch of those runs:
[moz@pent joyent]$ ~/src/grot/avg <./unpinned.vals
4.53028
[moz@pent joyent]$ ~/src/grot/avg <./pinned.vals
5.34364
We gain fairly significantly from pinning here. The likely explanation is that in the pinned case we never straddle the sockets: we're either all on the good socket or all on the bad one. But it's not clear.
If we strip out what we presume to be "bad socket":
[moz@pent joyent]$ ~/src/grot/avg 4.0 <./unpinned.vals
6.13625
[moz@pent joyent]$ ~/src/grot/avg 4.0 <./pinned.vals
6.62965
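The avg helper itself isn't shown above (the real script lives in jlevon/grot); a minimal sketch of what it presumably does, as a shell function taking an optional cutoff below which values are dropped, assuming one bandwidth value per line:

```shell
# avg: mean of one number per line on stdin, skipping values below an
# optional cutoff (first argument). A guess at what the avg helper in
# jlevon/grot does, based on its usage above.
avg() {
    awk -v min="${1:-0}" '$1 >= min { sum += $1; n++ } END { printf "%.5f\n", sum / n }'
}
```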
But it's not clear that this is actually reasonable, as seen below (i.e. we can dip below 4.0 even when on the same socket).
We'll disable the second socket.
This is socket1-pinning.png.
jlevon@kent:~$ ~/src/grot/avg <s1-unpinned.vals
6.18593
jlevon@kent:~$ ~/src/grot/avg <s1-pinned.vals
5.866
It actually appears to be worse to pin in this case. We see big variability within single runs (i.e. without reboots), so it's not directly a matter of our pinning choices themselves. Note that the viona TX thread will still be wandering around the socket. One possibility is that sometimes the TX thread is scheduled onto a CPU (or HT sibling) of a pinned VCPU thread: this can significantly hurt performance when it happens. If this is less likely when the scheduler can dispatch the VCPU elsewhere, it might explain this result.
Example variability:
[ ID] Interval Transfer Bandwidth
[SUM] 0.0- 1.0 sec 708 MBytes 5.94 Gbits/sec
[SUM] 1.0- 2.0 sec 528 MBytes 4.43 Gbits/sec
[SUM] 2.0- 3.0 sec 563 MBytes 4.72 Gbits/sec
[SUM] 3.0- 4.0 sec 698 MBytes 5.85 Gbits/sec
[SUM] 4.0- 5.0 sec 638 MBytes 5.36 Gbits/sec
[SUM] 5.0- 6.0 sec 597 MBytes 5.01 Gbits/sec
[SUM] 6.0- 7.0 sec 509 MBytes 4.27 Gbits/sec
[SUM] 7.0- 8.0 sec 376 MBytes 3.16 Gbits/sec
[SUM] 8.0- 9.0 sec 624 MBytes 5.24 Gbits/sec
[SUM] 9.0-10.0 sec 551 MBytes 4.62 Gbits/sec
[SUM] 0.0-10.0 sec 5.66 GBytes 4.86 Gbits/sec
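One way to quantify that spread is to pull the per-interval bandwidths out of the [SUM] lines. A rough sketch (the sumstats name is mine; it assumes the bandwidth is the next-to-last field and skips the final whole-run 0.0-10.0 summary line):

```shell
# sumstats: min/max/mean of the per-interval [SUM] bandwidths from an
# iperf -P run, skipping the final whole-run summary line.
sumstats() {
    awk '/\[SUM\]/ && !/0\.0-10\.0/ {
        bw = $(NF - 1)                      # next-to-last field: Gbits/sec
        if (n == 0 || bw < min) min = bw
        if (n == 0 || bw > max) max = bw
        sum += bw; n++
    } END { printf "min %.2f max %.2f mean %.2f\n", min, max, sum / n }'
}
```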
More details on one run where we bind VCPUs. Results in this run are highly variable (4-7 Gbits/s). Bindings are:
The physical processor has 14 cores and 28 virtual processors (0-13 28-41)
The core has 2 virtual processors (0 28)
The core has 2 virtual processors (1 29)
The core has 2 virtual processors (2 30)
2: b3
The core has 2 virtual processors (3 31)
31: b3
The core has 2 virtual processors (4 32)
The core has 2 virtual processors (5 33)
5: b3
The core has 2 virtual processors (6 34)
6: b3
The core has 2 virtual processors (7 35)
35: b3
The core has 2 virtual processors (8 36)
36: b3
The core has 2 virtual processors (9 37)
The core has 2 virtual processors (10 38)
10: b3
The core has 2 virtual processors (11 39)
The core has 2 virtual processors (12 40)
40: b3
The core has 2 virtual processors (13 41)
Running net-cpus we see:
tx
9 1722
3 3613
ixgbe tx
0 22
37 141254
ixgbe rx
37 1130029
10 1186968
31 1300790
rx
9 1722
3 3613
Note that CPU31 is hosting a pinned VCPU and is also taking ixgbe rx interrupts; the same goes for CPU10. Perhaps the variability comes from the guest deciding to run iperf threads on those VCPUs, where they contend with the interrupts?
If we rebind these VCPUs:
# ./bindings | egrep -e 'bind:10|bind:31'
lwpid:20 state:sleep bind:31 lastcpu:31
lwpid:23 state:sleep bind:10 lastcpu:10
# pbind -b 0 $(pgrep bhyve)/20
lwp id 98859/20: was 31, now 0
# pbind -b 1 $(pgrep bhyve)/23
lwp id 98859/23: was 10, now 1
This doesn't seem to make much difference to the results, i.e. it's not the apparent cause of the variability. Since the ixgbe interrupt CPUs are now clear of VCPUs, this seems to imply that fencing them off won't help either?
Let's try pinning the tx thread: https://github.com/jlevon/grot/blob/master/compare-pinning-tx.sh
Note that we're careful to avoid sharing a CPU core between a VCPU and a TX thread (including HT); this would have a significant impact on performance.
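The core-sharing check can be sketched as follows. free_cores is a hypothetical helper (the real logic is in compare-pinning-tx.sh above); it reads "cpu sibling" pairs of the kind psrinfo -vp reports, plus the list of CPUs hosting pinned VCPUs, and prints the cores where neither HT sibling is occupied:

```shell
# free_cores: given "cpu sibling" pairs on stdin and a space-separated
# list of CPUs hosting pinned VCPUs as $1, print the cores (both HT
# siblings) that are entirely free -- candidates for binding the viona
# TX thread without sharing a core with a VCPU.
free_cores() {
    pinned=" $1 "
    while read a b; do
        case "$pinned" in
            *" $a "*|*" $b "*) ;;   # this core shares an HT pair with a pinned VCPU
            *) echo "$a $b" ;;      # whole core is free
        esac
    done
}
```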
socket1-tx-pinning.png is the results from pinning TX thread, versus pinning both TX and VCPUs.
[moz@pent joyent]$ ~/src/grot/avg <s1-tx-only.vals
6.3992
[moz@pent joyent]$ ~/src/grot/avg <s1-txvcpu.vals
6.4804
If we also bind the TX thread in the case from above (pbind -b 3 $(pgrep bhyve)/28), we finally seem to get stable numbers of around 6.2 Gbits/s. An unpinned TX thread:
# dtrace -n 'sched:::on-cpu /pid == 98859 && tid == 28/ { @[cpu] = count(); }'
...
32 385
10 550
41 866
31 918
0 1086
37 13348
is spending a whole bunch of its time on CPUs used by ixgbe. If we bind it to CPU31, we start getting 6Gbits/s consistently; binding to CPU37 hits 7Gbits/s. (Unfortunately, DTracing at this point perturbs the results completely.) Binding to a busy VCPU's CPU such as 1 sometimes affects the guest so badly it can't report the iperf SUMs, though not consistently.
Binding to CPU12 drops performance to 4.4Gbits/s; it is the HT sibling of a CPU hosting a busy VCPU thread:
...
The core has 2 virtual processors (12 40)
12: b3 40: b3
...
# ./lprstat -c -Lm -p $(pgrep bhyve) 1 | grep cpu40
98859 cpu40 0.0 67 0.0 0.0 0.0 0.0 31 2.0 11K 1 12 0 bhyve/24
98859 cpu40 0.0 82 0.0 0.0 0.0 0.0 16 2.1 11K 3 4 0 bhyve/24
98859 cpu40 0.0 84 0.0 0.0 0.0 0.0 14 2.0 11K 2 3 0 bhyve/24
Although it's somewhat difficult to catch, it looks like most of the variability we see above is when we co-schedule a VCPU thread and the TX thread onto the same CPU core.
stddev:
> my.d = read.csv('s1-unpinned.vals')
> sd(my.d[,1])
[1] 1.034487
> my.d = read.csv('s1-tx-only.vals')
> sd(my.d[,1])
[1] 0.5827125
> my.d = read.csv('s1-txvcpu.vals')
> sd(my.d[,1])
[1] 0.5557916
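The same standard deviations can be computed without R; a sketch as a shell function (sample standard deviation, one value per line on stdin):

```shell
# sdev: sample standard deviation of one number per line, a shell
# equivalent of the R sd() calls above.
sdev() {
    awk '{ x[NR] = $1; sum += $1 } END {
        mean = sum / NR
        for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
        printf "%.4f\n", sqrt(ss / (NR - 1))    # divide by n-1: sample stddev
    }'
}
```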
Pinning of any kind doesn't get us to line rate; other factors prevent us exceeding around 7.5Gbits/s peak. It's also possible that these other factors (such as lack of LSO) would significantly change the effect that VCPU pinning can have.
Pinning runs the risk of ending up on the wrong socket permanently (rather than for a period of time). The huge cross-socket difference in performance seems to imply we might need different configurations for bhyve at least?
Pinning decisions can produce very bad results if threads end up sharing the wrong CPUs; this is presumably especially true when there are multiple VMs on a system rather than just the one under test here.
It looks like VCPU pinning on its own will not help overall; pinning the viona TX thread is the only way here to reduce both the low performance and the variability.
The scheduler seems to make some non-ideal choices for where to run threads in this case, in that it will apparently pick an idle HT sibling of a CPU hosting a very busy pinned VCPU. This might benefit from some tweaking.