CPU pinning results

Basic set up

[root@lava ~]# prtdiag | head
System Configuration: Supermicro SSG-2028R-ACR24L

[root@volcano ~]# prtdiag | head
System Configuration: Dell Inc. Joyent-Compute-Platform-3302

A bhyve instance with 8 VCPUs and 4 GB RAM runs on lava as an iperf client, sending to an iperf server in the volcano GZ over a 10 Gbit ixgbe link.

[root@volcano ~]# /zones/jlevon/iperf -v
iperf version 2.0.5 (08 Jul 2010) pthreads
[root@volcano ~]# /zones/jlevon/iperf -s -B 192.168.9.2
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 192.168.9.2
TCP window size: 1.00 MByte (default)
------------------------------------------------------------

The aim of the test is to see if pinning the bhyve VCPUs (https://cr.joyent.us/#/c/3703/) improves performance across the 10 Gbit link. Previous testing showed some (though not a huge) advantage for iperf between two co-located bhyve instances.

To double-check that the NIC/host setup can hit line rate, I first ran iperf between the two GZs: with -P4 we get ~9.4 Gbits/s.
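
For reference, the client-side invocation on lava looks roughly like this (the exact flags and the binary path on lava are assumptions here, chosen to match the four-stream, per-second-interval runs shown later):

[root@lava ~]# /zones/jlevon/iperf -c 192.168.9.2 -P 4 -t 10 -i 1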

bhyve VCPU pinning versus current

Running https://github.com/jlevon/grot/blob/master/compare-pinning.sh

See pinning.png.

Note that we periodically reboot the VM, which may cause the bindings to change.

There are basically bimodal results in both cases: if we end up on the second socket, we can't do much better than 4 Gbits/s max. Some dtracing shows that ixgbe tx interrupts are happening on CPU12 (socket 1), while viona tx sits on socket 2.

With VCPU pinning this means that (since we prefer a single socket for all pinned threads) once we land there, we're going to get that kind of result until a reboot, whereas unpinned we basically jump between the two modes.
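
(The placement above was observed with dtrace; one rough way to reproduce it, not necessarily what was actually used, is intrstat(1M) for the interrupt side plus a sched one-liner for the bhyve/viona threads:)

# intrstat 1 | grep ixgbe
# dtrace -n 'sched:::on-cpu /execname == "bhyve"/ { @[cpu, tid] = count(); }'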

Looking at the mean of a bunch of those runs:

[moz@pent joyent]$ ~/src/grot/avg <./unpinned.vals 
4.53028
[moz@pent joyent]$ ~/src/grot/avg <./pinned.vals 
5.34364

We gain fairly significantly from pinning here. The likely explanation is that in the pinned case we never straddle the sockets: we're either entirely on the good socket or entirely on the bad one. But it's not clear.

If we strip out what we presume to be "bad socket" results:

[moz@pent joyent]$ ~/src/grot/avg 4.0  <./unpinned.vals 
6.13625
[moz@pent joyent]$ ~/src/grot/avg 4.0  <./pinned.vals 
6.62965

But as seen below, it's not clear that this cutoff is actually reasonable (i.e. we can dip below 4.0 even when on the same socket).

Second socket disabled

We'll disable the second socket.
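
(How the socket was disabled isn't recorded here; on illumos, psradm is one way to do it. The CPU IDs below are illustrative only, assuming the second socket is CPUs 14-27 and 42-55:)

# psradm -f $(seq 14 27) $(seq 42 55)   # take the socket 2 CPUs offline
# psradm -n $(seq 14 27) $(seq 42 55)   # bring them back online afterwards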

This is socket1-pinning.png.

jlevon@kent:~$ ~/src/grot/avg <s1-unpinned.vals 
6.18593
jlevon@kent:~$ ~/src/grot/avg <s1-pinned.vals 
5.866

It actually appears to be worse to pin in this case. We see big variability within single runs (i.e. without reboots), so it's not directly a matter of our pinning choices themselves. Note that the viona TX thread will still be wandering around the socket. One possibility is that sometimes the TX thread is scheduled onto a CPU (or HT sibling) of a pinned VCPU thread: this can significantly knock performance if it happens. If this is less likely when the scheduler is free to dispatch the VCPU elsewhere, it might explain this result.

Example variability:

[ ID] Interval       Transfer     Bandwidth
[SUM]  0.0- 1.0 sec   708 MBytes  5.94 Gbits/sec
[SUM]  1.0- 2.0 sec   528 MBytes  4.43 Gbits/sec 
[SUM]  2.0- 3.0 sec   563 MBytes  4.72 Gbits/sec
[SUM]  3.0- 4.0 sec   698 MBytes  5.85 Gbits/sec
[SUM]  4.0- 5.0 sec   638 MBytes  5.36 Gbits/sec
[SUM]  5.0- 6.0 sec   597 MBytes  5.01 Gbits/sec
[SUM]  6.0- 7.0 sec   509 MBytes  4.27 Gbits/sec
[SUM]  7.0- 8.0 sec   376 MBytes  3.16 Gbits/sec
[SUM]  8.0- 9.0 sec   624 MBytes  5.24 Gbits/sec
[SUM]  9.0-10.0 sec   551 MBytes  4.62 Gbits/sec
[SUM]  0.0-10.0 sec  5.66 GBytes  4.86 Gbits/sec

More details on one run where we bind VCPUs. Results in this run are highly variable (4-7 Gbits/s). Bindings are:

The physical processor has 14 cores and 28 virtual processors (0-13 28-41)
The core has 2 virtual processors (0 28)
The core has 2 virtual processors (1 29)
The core has 2 virtual processors (2 30)
2: b3
The core has 2 virtual processors (3 31)
31: b3
The core has 2 virtual processors (4 32)
The core has 2 virtual processors (5 33)
5: b3
The core has 2 virtual processors (6 34)
6: b3
The core has 2 virtual processors (7 35)
35: b3
The core has 2 virtual processors (8 36)
36: b3
The core has 2 virtual processors (9 37)
The core has 2 virtual processors (10 38)
10: b3
The core has 2 virtual processors (11 39)
The core has 2 virtual processors (12 40)
40: b3
The core has 2 virtual processors (13 41)

Running net-cpus we see:

tx
        9             1722
        3             3613
ixgbe tx
        0               22
       37           141254
ixgbe rx
       37          1130029
       10          1186968
       31          1300790
rx
        9             1722
        3             3613
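
(net-cpus itself isn't included in this gist; a rough equivalent that counts where viona and ixgbe tx/rx work runs, per CPU, is sketched below. The fbt probe sites are assumptions and not necessarily what net-cpus actually traces.)

# dtrace -n '
    fbt::viona_tx:entry      { @["tx", cpu] = count(); }
    fbt::viona_rx:entry      { @["rx", cpu] = count(); }
    fbt::ixgbe_ring_tx:entry { @["ixgbe tx", cpu] = count(); }
    fbt::ixgbe_ring_rx:entry { @["ixgbe rx", cpu] = count(); }'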

Note that CPU31 is hosting a pinned VCPU and is also taking ixgbe rx interrupts. The same goes for CPU10. Perhaps the variability arises when the guest decides to run iperf threads on those VCPUs and they contend with the interrupts?

If we rebind these VCPUs:

# ./bindings | egrep -e 'bind:10|bind:31'
lwpid:20 state:sleep bind:31 lastcpu:31
lwpid:23 state:sleep bind:10 lastcpu:10
# pbind -b 0 $(pgrep bhyve)/20
lwp id 98859/20: was 31, now 0
# pbind -b 1 $(pgrep bhyve)/23
lwp id 98859/23: was 10, now 1

this doesn't seem to make much difference to the results, i.e. it's apparently not the cause of the variability. Since the ixgbe interrupt CPUs are now clear of VCPUs, this seems to imply that fencing them off won't help either?

Pinning the viona tx thread

Let's try pinning the tx thread: https://github.com/jlevon/grot/blob/master/compare-pinning-tx.sh

Note that we're careful to avoid sharing a CPU core between a VCPU and a TX thread (including HT); this would have a significant impact on performance.
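
(compare-pinning-tx.sh presumably locates the viona TX lwp itself; by hand, one rough way, assuming the TX thread shows up as the busy non-VCPU lwp during an iperf run, is simply:

# prstat -mLc -p $(pgrep bhyve) 1

and pick out the lwp with high SYS time that isn't one of the eight VCPU threads.)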

socket1-tx-pinning.png shows the results from pinning the TX thread only, versus pinning both TX and VCPUs.

[moz@pent joyent]$ ~/src/grot/avg <s1-tx-only.vals 
6.3992
[moz@pent joyent]$ ~/src/grot/avg <s1-txvcpu.vals 
6.4804

If we also bind the TX thread in the case from above (pbind -b 3 $(pgrep bhyve)/28), we finally seem to get stable numbers of around 6.2 Gbits/s. An unpinned TX thread:

# dtrace -n 'sched:::on-cpu /pid == 98859 && tid == 28/ { @[cpu] = count(); }'
...
       32              385
       10              550
       41              866
       31              918
        0             1086
       37            13348

is spending a whole bunch of its time on CPUs used by ixgbe. If we bind it to 31, we start getting 6 Gbits/s consistently. Binding to 37 hits 7 Gbits/s. (Unfortunately, dtracing at this point perturbs the results completely.) Binding to a busy VCPU CPU such as 1 seems to affect the guest so badly that it can't report the iperf SUMs, though not consistently.
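
(Concretely, these bindings follow the same form as above, e.g. pbind -b 37 $(pgrep bhyve)/28.)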

Binding to CPU12 drops perf to 4.4 Gbits/s. This is the HT sibling of a busy VCPU thread:

...
The core has 2 virtual processors (12 40)
12: b3 40: b3
...
# ./lprstat -c -Lm -p $(pgrep bhyve) 1  | grep cpu40
 98859 cpu40    0.0  67 0.0 0.0 0.0 0.0  31 2.0 11K   1  12   0 bhyve/24
 98859 cpu40    0.0  82 0.0 0.0 0.0 0.0  16 2.1 11K   3   4   0 bhyve/24
 98859 cpu40    0.0  84 0.0 0.0 0.0 0.0  14 2.0 11K   2   3   0 bhyve/24

Although it's somewhat difficult to catch, it looks like most of the variability we see above is when we co-schedule a VCPU thread and the TX thread onto the same CPU core.

stddev:

> my.d = read.csv('s1-unpinned.vals')
> sd(my.d[,1])
[1] 1.034487
> my.d  = read.csv('s1-tx-only.vals')
> sd(my.d[,1])
[1] 0.5827125
> my.d = read.csv('s1-txvcpu.vals')
> sd(my.d[,1])
[1] 0.5557916

Conclusions

Pinning of any kind doesn't get us to line rate; there are other factors preventing us from exceeding a peak of around 7.5 Gbits/s. It's also possible that these other factors (such as lack of LSO) will significantly change the effect that VCPU pinning can have.

Pinning runs the risk of ending up on the wrong socket permanently (rather than for a period of time). The huge difference in cross-socket performance seems to imply we might need different configurations, for bhyve at least?

Pinning decisions can end up with very bad results if they end up sharing the wrong CPUs; this is presumably especially true when we have multiple VMs on a system, rather than just the one under test here.

It looks like VCPU pinning on its own will not help overall; pinning viona TX threads is the only way to reduce low perf + variability here.

The scheduler seems to make some non-ideal choices for where to run threads in this case, in that it will apparently pick an idle HT sibling of a CPU hosting a very busy pinned VCPU. This might benefit from some tweaking.
