[root@lava ~]# prtdiag | head
System Configuration: Supermicro SSG-2028R-ACR24L
[root@volcano ~]# prtdiag | head
System Configuration: Dell Inc. Joyent-Compute-Platform-3302
A bhyve instance with 8 VCPUs and 4GB RAM acts as an iperf client on lava, talking to an iperf server in the volcano GZ, over a 10Gbit ixgbe link.
[root@volcano ~]# /zones/jlevon/iperf -v
iperf version 2.0.5 (08 Jul 2010) pthreads
[root@volcano ~]# /zones/jlevon/iperf -s -B 192.168.9.2
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 192.168.9.2
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
The aim of the test is to see if pinning the bhyve VCPUs (https://cr.joyent.us/#/c/3703/) improves performance across the 10Gbit link. Previous testing showed some (though not huge) advantage for iperf between two co-located bhyve instances.
To double-check that the NIC/host setup can hit line rate, I first ran iperf between the two GZs: with -P4 we get ~9.4Gbits/s.
Running https://github.com/jlevon/grot/blob/master/compare-pinning.sh
See pinning.png.
Note that we periodically reboot the VM, which can cause the bindings to change.
The results are basically bimodal in both cases: if we end up on the second socket, we can't do much better than 4Gbits/s max. Some DTracing shows that ixgbe tx interrupts are happening on CPU12 (socket 1), while viona tx sits on socket 2.
With VCPU pinning, since we prefer a single socket for all pinned threads, once we land on the bad socket we're going to keep getting that kind of result until a reboot; unpinned, we basically jump between the two modes.
Looking at the mean of a bunch of those runs:
[moz@pent joyent]$ ~/src/grot/avg <./unpinned.vals
4.53028
[moz@pent joyent]$ ~/src/grot/avg <./pinned.vals
5.34364
We gain fairly significantly from pinning here. The likely explanation is that in the pinned case we never straddle the sockets: we're either all on the good socket or all on the bad one. But it's not clear.
If we strip out what we presume to be "bad socket":
[moz@pent joyent]$ ~/src/grot/avg 4.0 <./unpinned.vals
6.13625
[moz@pent joyent]$ ~/src/grot/avg 4.0 <./pinned.vals
6.62965
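The avg helper itself isn't shown above (the real script lives in jlevon/grot); a minimal sketch of what it presumably does, as a shell function taking an optional cutoff below which values are dropped, assuming one bandwidth value per line:

```shell
# avg: mean of one number per line on stdin, skipping values below an
# optional cutoff (first argument). A guess at what the avg helper in
# jlevon/grot does, based on its usage above.
avg() {
    awk -v min="${1:-0}" '$1 >= min { sum += $1; n++ } END { printf "%.5f\n", sum / n }'
}
```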
But it's not clear that this is actually reasonable, as seen below (i.e. we can dip below 4.0 even when on the same socket).
We'll disable the second socket.
This is socket1-pinning.png.
jlevon@kent:~$ ~/src/grot/avg <s1-unpinned.vals
6.18593
jlevon@kent:~$ ~/src/grot/avg <s1-pinned.vals
5.866
It actually appears to be worse to pin in this case. We see big variability within single runs (i.e. without reboots), so it's not directly a matter of our pinning choices themselves. Note that the viona TX thread will still be wandering around the socket. One possibility is that sometimes the TX thread is scheduled onto a CPU (or HT sibling) of a pinned VCPU thread: this can significantly hurt performance when it happens. If this is less likely when the scheduler can dispatch the VCPU elsewhere, it might explain this result.
Example variability:
[ ID] Interval Transfer Bandwidth
[SUM] 0.0- 1.0 sec 708 MBytes 5.94 Gbits/sec
[SUM] 1.0- 2.0 sec 528 MBytes 4.43 Gbits/sec
[SUM] 2.0- 3.0 sec 563 MBytes 4.72 Gbits/sec
[SUM] 3.0- 4.0 sec 698 MBytes 5.85 Gbits/sec
[SUM] 4.0- 5.0 sec 638 MBytes 5.36 Gbits/sec
[SUM] 5.0- 6.0 sec 597 MBytes 5.01 Gbits/sec
[SUM] 6.0- 7.0 sec 509 MBytes 4.27 Gbits/sec
[SUM] 7.0- 8.0 sec 376 MBytes 3.16 Gbits/sec
[SUM] 8.0- 9.0 sec 624 MBytes 5.24 Gbits/sec
[SUM] 9.0-10.0 sec 551 MBytes 4.62 Gbits/sec
[SUM] 0.0-10.0 sec 5.66 GBytes 4.86 Gbits/sec
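One way to quantify that spread is to pull the per-interval bandwidths out of the [SUM] lines. A rough sketch (the sumstats name is mine; it assumes the bandwidth is the next-to-last field and skips the final whole-run 0.0-10.0 summary line):

```shell
# sumstats: min/max/mean of the per-interval [SUM] bandwidths from an
# iperf -P run, skipping the final whole-run summary line.
sumstats() {
    awk '/\[SUM\]/ && !/0\.0-10\.0/ {
        bw = $(NF - 1)                      # next-to-last field: Gbits/sec
        if (n == 0 || bw < min) min = bw
        if (n == 0 || bw > max) max = bw
        sum += bw; n++
    } END { printf "min %.2f max %.2f mean %.2f\n", min, max, sum / n }'
}
```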
More details on one run where we bind VCPUs. Results in this run are highly variable (4-7 Gbits/s). Bindings are:
The physical processor has 14 cores and 28 virtual processors (0-13 28-41)
The core has 2 virtual processors (0 28)
The core has 2 virtual processors (1 29)
The core has 2 virtual processors (2 30)
2: b3
The core has 2 virtual processors (3 31)
31: b3
The core has 2 virtual processors (4 32)
The core has 2 virtual processors (5 33)
5: b3
The core has 2 virtual processors (6 34)
6: b3
The core has 2 virtual processors (7 35)
35: b3
The core has 2 virtual processors (8 36)
36: b3
The core has 2 virtual processors (9 37)
The core has 2 virtual processors (10 38)
10: b3
The core has 2 virtual processors (11 39)
The core has 2 virtual processors (12 40)
40: b3
The core has 2 virtual processors (13 41)
Running net-cpus we see:
tx
9 1722
3 3613
ixgbe tx
0 22
37 141254
ixgbe rx
37 1130029
10 1186968
31 1300790
rx
9 1722
3 3613
Note that CPU31 is hosting a pinned VCPU and is also taking ixgbe rx interrupts; the same goes for CPU10. Perhaps the variability comes from the guest deciding to run iperf threads on those VCPUs, where they contend with the interrupts?
If we rebind these VCPUs:
# ./bindings | egrep -e 'bind:10|bind:31'
lwpid:20 state:sleep bind:31 lastcpu:31
lwpid:23 state:sleep bind:10 lastcpu:10
# pbind -b 0 $(pgrep bhyve)/20
lwp id 98859/20: was 31, now 0
# pbind -b 1 $(pgrep bhyve)/23
lwp id 98859/23: was 10, now 1
This doesn't seem to make much difference to the results, i.e. it's not the apparent cause of the variability. Since the ixgbe interrupt CPUs are now clear of VCPUs, this seems to imply that fencing them off won't help either?
Let's try pinning the tx thread: https://github.com/jlevon/grot/blob/master/compare-pinning-tx.sh
Note that we're careful to avoid sharing a CPU core between a VCPU and a TX thread (including HT); this would have a significant impact on performance.
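The core-sharing check can be sketched as follows. free_cores is a hypothetical helper (the real logic is in compare-pinning-tx.sh above); it reads "cpu sibling" pairs of the kind psrinfo -vp reports, plus the list of CPUs hosting pinned VCPUs, and prints the cores where neither HT sibling is occupied:

```shell
# free_cores: given "cpu sibling" pairs on stdin and a space-separated
# list of CPUs hosting pinned VCPUs as $1, print the cores (both HT
# siblings) that are entirely free -- candidates for binding the viona
# TX thread without sharing a core with a VCPU.
free_cores() {
    pinned=" $1 "
    while read a b; do
        case "$pinned" in
            *" $a "*|*" $b "*) ;;   # this core shares an HT pair with a pinned VCPU
            *) echo "$a $b" ;;      # whole core is free
        esac
    done
}
```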
socket1-tx-pinning.png is the results from pinning TX thread, versus pinning both TX and VCPUs.
[moz@pent joyent]$ ~/src/grot/avg <s1-tx-only.vals
6.3992
[moz@pent joyent]$ ~/src/grot/avg <s1-txvcpu.vals
6.4804
If we also bind the TX thread in the case from above (pbind -b 3 $(pgrep bhyve)/28), we finally seem to get stable numbers of around 6.2 Gbits/s. An unpinned TX thread:
# dtrace -n 'sched:::on-cpu /pid == 98859 && tid == 28/ { @[cpu] = count(); }'
...
32 385
10 550
41 866
31 918
0 1086
37 13348
is spending a whole bunch of its time on CPUs used by ixgbe. If we bind it to CPU31, we start getting 6Gbits/s consistently; binding to CPU37 hits 7Gbits/s. (Unfortunately, DTracing at this point perturbs the results completely.) Binding to a busy VCPU's CPU such as 1 sometimes affects the guest so badly it can't report the iperf SUMs, though not consistently.
Binding to CPU12 drops performance to 4.4Gbits/s; it is the HT sibling of a CPU hosting a busy VCPU thread:
...
The core has 2 virtual processors (12 40)
12: b3 40: b3
...
# ./lprstat -c -Lm -p $(pgrep bhyve) 1 | grep cpu40
98859 cpu40 0.0 67 0.0 0.0 0.0 0.0 31 2.0 11K 1 12 0 bhyve/24
98859 cpu40 0.0 82 0.0 0.0 0.0 0.0 16 2.1 11K 3 4 0 bhyve/24
98859 cpu40 0.0 84 0.0 0.0 0.0 0.0 14 2.0 11K 2 3 0 bhyve/24
Although it's somewhat difficult to catch, it looks like most of the variability we see above is when we co-schedule a VCPU thread and the TX thread onto the same CPU core.
stddev:
> my.d = read.csv('s1-unpinned.vals')
> sd(my.d[,1])
[1] 1.034487
> my.d = read.csv('s1-tx-only.vals')
> sd(my.d[,1])
[1] 0.5827125
> my.d = read.csv('s1-txvcpu.vals')
> sd(my.d[,1])
[1] 0.5557916
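The same standard deviations can be computed without R; a sketch as a shell function (sample standard deviation, one value per line on stdin):

```shell
# sdev: sample standard deviation of one number per line, a shell
# equivalent of the R sd() calls above.
sdev() {
    awk '{ x[NR] = $1; sum += $1 } END {
        mean = sum / NR
        for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
        printf "%.4f\n", sqrt(ss / (NR - 1))    # divide by n-1: sample stddev
    }'
}
```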
Pinning of any kind doesn't get us to line rate; other factors prevent us exceeding around 7.5Gbits/s peak. It's also possible that these other factors (such as lack of LSO) would significantly change the effect that VCPU pinning can have.
Pinning runs the risk of ending up on the wrong socket permanently (rather than for a period of time). The huge cross-socket difference in performance seems to imply we might need different configurations for bhyve at least?
Pinning decisions can produce very bad results if threads end up sharing the wrong CPUs; this is presumably especially true when there are multiple VMs on a system rather than just the one under test here.
It looks like VCPU pinning on its own will not help overall; pinning the viona TX thread is the only way here to reduce both the low performance and the variability.
The scheduler seems to make some non-ideal choices for where to run threads in this case, in that it will apparently pick an idle HT sibling of a CPU hosting a very busy pinned VCPU. This might benefit from some tweaking.