tl;dr: turn off the second socket if you want consistent results
for comparisons.
My system is two-socket, with two 8-vCPU bhyve guests. This script:
https://github.com/jlevon/grot/blob/master/bhyve-pin
will bind a VM to specific CPUs in a socket (nothing else should be
running other than the VMs under test). It's not smart enough to pick
idle CPUs and is fairly hard-coded (make sure your system has CPUs in
the order 0 2 4 6 8 for socket 1 in psrinfo -vp).
https://github.com/jlevon/grot/blob/master/bindings
reports on thread bindings for you.
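For reference, here's a rough sketch of the kind of thing a bhyve-pin-style
helper does: bind each LWP of a bhyve process round-robin onto one socket's
CPUs. The CPU ids and the ps/pbind usage here are my illustration, not taken
from the actual script, so treat it as a sketch and check psrinfo -vp for
your own topology.
```
#!/bin/sh
# Hypothetical sketch of a bhyve-pin-style helper (NOT the real script):
# bind every LWP of one bhyve process round-robin onto one socket's CPUs.

PID=$1                      # pid of the bhyve process to pin
CPUS="0 2 4 6 8 10 12 14"   # example CPU ids for one socket; check psrinfo -vp

set -- $CPUS
for lwp in $(ps -o lwp= -p "$PID" -L); do
    # pbind accepts a pid/lwpid argument to bind a single thread to a processor
    pbind -b "$1" "$PID/$lwp"
    shift
    [ $# -eq 0 ] && set -- $CPUS   # wrap around if the guest has more LWPs than CPUs
done
```
Afterwards, pbind -q $PID (or the bindings script above) will report what
actually got bound.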
With current bits, I saw a lot less variability than previously, even when
restarting tests. Previously, iperf's second-by-second throughput would vary
wildly between 1 and 6 Gbit/s. Now we're much more consistent, for example:
```
root@be033899-0295-4caa-ec7f-8b63edc7f39d:~# iperf -t 30 -i 1 -c b5 -P8 | grep SUM
[SUM] 0.0- 1.0 sec 2.23 GBytes 19.2 Gbits/sec
[SUM] 1.0- 2.0 sec 2.35 GBytes 20.2 Gbits/sec
[SUM] 2.0- 3.0 sec 2.36 GBytes 20.2 Gbits/sec
[SUM] 3.0- 4.0 sec 2.37 GBytes 20.4 Gbits/sec
[SUM] 4.0- 5.0 sec 2.30 GBytes 19.7 Gbits/sec
[SUM] 5.0- 6.0 sec 2.37 GBytes 20.4 Gbits/sec
[SUM] 6.0- 7.0 sec 2.30 GBytes 19.8 Gbits/sec
[SUM] 7.0- 8.0 sec 2.32 GBytes 19.9 Gbits/sec
[SUM] 8.0- 9.0 sec 2.38 GBytes 20.4 Gbits/sec
[SUM] 9.0-10.0 sec 2.36 GBytes 20.3 Gbits/sec
[SUM] 10.0-11.0 sec 2.32 GBytes 20.0 Gbits/sec
[SUM] 11.0-12.0 sec 2.24 GBytes 19.2 Gbits/sec
[SUM] 12.0-13.0 sec 2.38 GBytes 20.4 Gbits/sec
[SUM] 13.0-14.0 sec 2.35 GBytes 20.2 Gbits/sec
[SUM] 14.0-15.0 sec 2.34 GBytes 20.1 Gbits/sec
[SUM] 15.0-16.0 sec 2.36 GBytes 20.3 Gbits/sec
[SUM] 16.0-17.0 sec 2.32 GBytes 19.9 Gbits/sec
[SUM] 17.0-18.0 sec 2.37 GBytes 20.4 Gbits/sec
[SUM] 18.0-19.0 sec 2.30 GBytes 19.7 Gbits/sec
[SUM] 19.0-20.0 sec 2.33 GBytes 20.1 Gbits/sec
[SUM] 20.0-21.0 sec 2.36 GBytes 20.2 Gbits/sec
[SUM] 21.0-22.0 sec 2.32 GBytes 19.9 Gbits/sec
[SUM] 22.0-23.0 sec 2.41 GBytes 20.7 Gbits/sec
[SUM] 23.0-24.0 sec 2.40 GBytes 20.6 Gbits/sec
[SUM] 24.0-25.0 sec 2.37 GBytes 20.3 Gbits/sec
[SUM] 25.0-26.0 sec 2.36 GBytes 20.3 Gbits/sec
[SUM] 26.0-27.0 sec 2.42 GBytes 20.8 Gbits/sec
[SUM] 27.0-28.0 sec 2.36 GBytes 20.3 Gbits/sec
[SUM] 28.0-29.0 sec 2.33 GBytes 20.0 Gbits/sec
[SUM] 29.0-30.0 sec 2.28 GBytes 19.6 Gbits/sec
[SUM] 0.0-30.0 sec 70.3 GBytes 20.1 Gbits/sec
```
I was only able to test bhyve-to-bhyve on a single system.
Server b5 is running: iperf -s -P8
Client b4 is running: iperf -t 30 -i 1 -c b5 -P8
Unpinned, I got something like:
[SUM] 0.0-30.0 sec 53.4 GBytes 15.3 Gbits/sec
Variation seems much reduced from my old tests (from long ago, well before LSO).
Let's pin them to separate sockets:
# ./bhyve-pin $b5 1 ; ./bhyve-pin $b4 1
[SUM] 0.0-30.0 sec 54.0 GBytes 15.5 Gbits/sec
So, much the same. Let's take the second socket out of the equation entirely, and unbind:
# /zones/jlevon/socketoff
# pbind -u $(pgrep bhyve)
[SUM] 0.0-30.0 sec 71.7 GBytes 20.5 Gbits/sec
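(As a sanity check -- my addition, not part of the original runs -- pbind can
confirm the unbind took effect: pbind -q $(pgrep bhyve) should report the
bhyve processes as not bound.)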
I also experimented with a few other variants, such as binding only the viona
threads, but the basic takeaway here is that for meaningful comparisons,
we need to be sure the instances are running on the same socket each time.
I can't test it, but this probably matters even for off-box traffic.
The simplest way to arrange this is to offline all the CPUs in the other socket.
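socketoff above is a local script, but offlining a socket is just psradm over
its CPU ids. This is a sketch only -- the CPU ids below are placeholders, so
substitute whatever psrinfo -vp reports for the socket you want to disable:
```
# Offline every CPU in the socket you're not using (placeholder ids).
psradm -f 1 3 5 7 9 11 13 15

# ...run your comparisons...

# Bring the socket back online afterwards.
psradm -n 1 3 5 7 9 11 13 15
```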