Other benchmarks make me sad: people run them in the cloud, where VMs can migrate between different hosts on reboot, and on old kernels. Here we use bare metal with the same recent kernel to remove extra variables and measure the impact of KPTI and nothing else.
- Kernel: 4.14.11
- CPU: 2 x Xeon E5-2630 v3 @ 2.40GHz (32 logical cores)
Some prerequisites:
- Every measurement was taken against a freshly restarted Redis with persistence disabled
- No other services were using any significant amount of CPU
- The machine was rebooted between configurations to keep conditions consistent, but results from only one boot per configuration are included
The following command was used to launch the server:
$ docker run --rm -it --net host redis:4.0.6 redis-server --save "" --appendonly no
$ dmesg -T | fgrep 'page tables isolation'
[Sat Jan 13 04:28:34 2018] Kernel/User page tables isolation: enabled
$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 04:49:25 UTC 2018 try 1
SET: 94948.73 requests per second
GET: 95648.02 requests per second
Sat Jan 13 04:49:46 UTC 2018 try 2
SET: 96627.70 requests per second
GET: 95593.15 requests per second
Sat Jan 13 04:50:07 UTC 2018 try 3
SET: 95410.74 requests per second
GET: 94446.54 requests per second
- SET difference between low and high:
(96627.70 - 94948.73) / 94948.73 * 100 = 1.77%
- GET difference between low and high:
(95648.02 - 94446.54) / 94446.54 * 100 = 1.27%
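The run-to-run spread above is just (high - low) / low. A small sketch (numbers copied from the three KPTI runs above) recomputes it:

```python
# Recompute the run-to-run spread from the three KPTI runs above.
set_rps = [94948.73, 96627.70, 95410.74]
get_rps = [95648.02, 95593.15, 94446.54]

def spread_pct(samples):
    """Spread between the worst and best run, relative to the worst."""
    low, high = min(samples), max(samples)
    return (high - low) / low * 100

print(f"SET: {spread_pct(set_rps):.2f}%")  # SET: 1.77%
print(f"GET: {spread_pct(get_rps):.2f}%")  # GET: 1.27%
```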
$ sudo perf stat -d -p 3144
^C
Performance counter stats for process id '3144':
58904.467362 task-clock (msec) # 0.804 CPUs utilized
308,662 context-switches # 0.005 M/sec
4 cpu-migrations # 0.000 K/sec
23,828 page-faults # 0.405 K/sec
183,214,645,662 cycles # 3.110 GHz (50.10%)
141,173,051,301 instructions # 0.77 insn per cycle (62.62%)
27,713,334,373 branches # 470.479 M/sec (62.59%)
171,263,576 branch-misses # 0.62% of all branches (62.50%)
42,720,689,753 L1-dcache-loads # 725.254 M/sec (62.42%)
3,632,866,438 L1-dcache-load-misses # 8.50% of all L1-dcache hits (24.95%)
1,014,432,733 LLC-loads # 17.222 M/sec (25.05%)
426,430,439 LLC-load-misses # 42.04% of all LL-cache hits (37.58%)
73.309047198 seconds time elapsed
$ sudo perf stat -d -p 4398
^C
Performance counter stats for process id '4398':
58918.071006 task-clock (msec) # 0.866 CPUs utilized
475,459 context-switches # 0.008 M/sec
5 cpu-migrations # 0.000 K/sec
23,780 page-faults # 0.404 K/sec
182,551,645,439 cycles # 3.098 GHz (50.13%)
142,093,978,992 instructions # 0.78 insn per cycle (62.68%)
27,936,369,361 branches # 474.156 M/sec (62.57%)
173,701,977 branch-misses # 0.62% of all branches (62.44%)
43,069,902,734 L1-dcache-loads # 731.013 M/sec (62.43%)
3,745,293,484 L1-dcache-load-misses # 8.70% of all L1-dcache hits (25.01%)
1,077,645,288 LLC-loads # 18.291 M/sec (25.10%)
432,856,886 LLC-load-misses # 40.17% of all LL-cache hits (37.57%)
68.039181383 seconds time elapsed
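perf prints IPC directly, but it is simply instructions divided by cycles; a quick sketch recomputing it from the two KPTI-enabled runs above:

```python
# instructions and cycles copied from the two perf stat runs above (KPTI on).
runs = {
    3144: (141_173_051_301, 183_214_645_662),
    4398: (142_093_978_992, 182_551_645_439),
}
for pid, (instructions, cycles) in runs.items():
    # Matches perf's own "insn per cycle" column: 0.77 and 0.78.
    print(f"pid {pid}: {instructions / cycles:.2f} insn per cycle")
```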
$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -P 32 -n 10000000 -t set,get -r 1000000; done
Sat Jan 13 04:52:03 UTC 2018 try 1
SET: 863632.38 requests per second
GET: 1037344.38 requests per second
Sat Jan 13 04:52:24 UTC 2018 try 2
SET: 859254.19 requests per second
GET: 1027643.62 requests per second
Sat Jan 13 04:52:46 UTC 2018 try 3
SET: 862738.38 requests per second
GET: 1016156.94 requests per second
- SET difference between low and high:
(863632.38 - 859254.19) / 859254.19 * 100 = 0.51%
- GET difference between low and high:
(1037344.38 - 1016156.94) / 1016156.94 * 100 = 2.01%
$ dmesg -T | fgrep 'page tables isolation'
[Sat Jan 13 04:56:44 2018] Kernel/User page tables isolation: disabled on command line.
$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 05:02:32 UTC 2018 try 1
SET: 109745.38 requests per second
GET: 110277.90 requests per second
Sat Jan 13 05:02:50 UTC 2018 try 2
SET: 113934.14 requests per second
GET: 108003.02 requests per second
Sat Jan 13 05:03:08 UTC 2018 try 3
SET: 112637.98 requests per second
GET: 112498.59 requests per second
- SET difference between low and high:
(113934.14 - 109745.38) / 109745.38 * 100 = 3.82%
- GET difference between low and high:
(112498.59 - 108003.02) / 108003.02 * 100 = 4.16%
$ sudo perf stat -d -p 9498
^C
Performance counter stats for process id '9498':
51859.650102 task-clock (msec) # 0.853 CPUs utilized
171,547 context-switches # 0.003 M/sec
4 cpu-migrations # 0.000 K/sec
23,724 page-faults # 0.457 K/sec
162,137,087,109 cycles # 3.126 GHz (49.95%)
137,847,228,034 instructions # 0.85 insn per cycle (62.46%)
26,968,561,543 branches # 520.030 M/sec (62.46%)
167,141,761 branch-misses # 0.62% of all branches (62.50%)
41,819,831,836 L1-dcache-loads # 806.404 M/sec (62.50%)
3,393,654,584 L1-dcache-load-misses # 8.11% of all L1-dcache hits (25.03%)
953,380,618 LLC-loads # 18.384 M/sec (24.99%)
430,826,721 LLC-load-misses # 45.19% of all LL-cache hits (37.49%)
60.789997981 seconds time elapsed
$ sudo perf stat -d -p 11934
^C
Performance counter stats for process id '11934':
52241.822970 task-clock (msec) # 0.745 CPUs utilized
251,987 context-switches # 0.005 M/sec
4 cpu-migrations # 0.000 K/sec
23,686 page-faults # 0.453 K/sec
162,948,839,545 cycles # 3.119 GHz (49.94%)
138,960,314,319 instructions # 0.85 insn per cycle (62.48%)
27,243,372,526 branches # 521.486 M/sec (62.48%)
167,741,490 branch-misses # 0.62% of all branches (62.49%)
42,188,015,450 L1-dcache-loads # 807.553 M/sec (62.54%)
3,429,701,506 L1-dcache-load-misses # 8.13% of all L1-dcache hits (25.02%)
962,609,210 LLC-loads # 18.426 M/sec (24.99%)
432,863,320 LLC-load-misses # 44.97% of all LL-cache hits (37.45%)
70.121123867 seconds time elapsed
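Putting the KPTI and no-KPTI perf runs side by side, the IPC drop can be computed directly (instruction and cycle counts copied from the runs above):

```python
def ipc(instructions, cycles):
    """Instructions per cycle, as perf reports it."""
    return instructions / cycles

kpti_on = ipc(141_173_051_301, 183_214_645_662)   # pid 3144, KPTI enabled
kpti_off = ipc(137_847_228_034, 162_137_087_109)  # pid 9498, KPTI disabled
drop_pct = (kpti_off - kpti_on) / kpti_off * 100
print(f"{kpti_off:.2f} -> {kpti_on:.2f} IPC ({drop_pct:.1f}% down)")
# 0.85 -> 0.77 IPC (9.4% down)
```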
$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -P 32 -n 10000000 -t set,get -r 1000000; done
Sat Jan 13 05:04:10 UTC 2018 try 1
SET: 931532.38 requests per second
GET: 1127777.25 requests per second
Sat Jan 13 05:04:30 UTC 2018 try 2
SET: 914243.94 requests per second
GET: 1135718.25 requests per second
Sat Jan 13 05:04:49 UTC 2018 try 3
SET: 921489.12 requests per second
GET: 1127650.00 requests per second
- SET difference between low and high:
(931532.38 - 914243.94) / 914243.94 * 100 = 1.89%
- GET difference between low and high:
(1135718.25 - 1127650.00) / 1127650.00 * 100 = 0.72%
Minimum throughput comparison across runs is presented below.
In each case we calculate by how much no-PTI is faster than its PTI equivalent.

Non-pipelined:
- SET:
(109745.38 - 94948.73) / 94948.73 * 100 = 15.58%
(13.48% slower with KPTI)
- GET:
(108003.02 - 94446.54) / 94446.54 * 100 = 14.35%
(12.55% slower with KPTI)

Pipelined (-P 32):
- SET:
(914243.94 - 859254.19) / 859254.19 * 100 = 6.40%
(6.02% slower with KPTI)
- GET:
(1127650.00 - 1016156.94) / 1016156.94 * 100 = 10.98%
(9.90% slower with KPTI)
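Both percentages in each pair describe the same gap: if no-PTI is a fraction f faster, then PTI is f / (1 + f) slower. A sketch of the conversion:

```python
def faster_to_slower(faster_pct):
    """'X% faster without KPTI' expressed as 'Y% slower with KPTI'."""
    f = faster_pct / 100
    return f / (1 + f) * 100

print(f"{faster_to_slower(15.58):.2f}")  # 13.48 (SET, non-pipelined)
print(f"{faster_to_slower(14.35):.2f}")  # 12.55 (GET, non-pipelined)
```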
Keep in mind that while these numbers look high, there were fluctuations between runs within the same setup, so some amount of noise is present, especially in the non-pipelined case.
IPC in perf stat went down from 0.85 without KPTI to 0.77-0.78 with KPTI (roughly 9% down).
There is no conclusion, make your own.
Local vs remote

Same setup as above, but this time the client runs over a 2x10G link under the same ToR switch. The network is 2 ports of Intel 82599ES 10-Gigabit SFI/SFP+ bonded together. There is no magical packet steering involved.
KPTI

Local
- SET difference between low and high:
(97522.92 - 96227.87) / 96227.87 * 100 = 1.35%
- GET difference between low and high:
(96274.19 - 95283.47) / 95283.47 * 100 = 1.04%

Network
- SET difference between low and high:
(100553.05 - 96237.13) / 96237.13 * 100 = 4.48%
- GET difference between low and high:
(97675.33 - 95556.62) / 95556.62 * 100 = 2.22%

No KPTI

Local
- SET difference between low and high:
(104920.78 - 103444.71) / 103444.71 * 100 = 1.43%
- GET difference between low and high:
(104504.12 - 101122.46) / 101122.46 * 100 = 3.34%

Network
- SET difference between low and high:
(102103.33 - 96581.03) / 96581.03 * 100 = 5.72%
- GET difference between low and high:
(100020.00 - 98541.59) / 98541.59 * 100 = 1.50%
Comparison
Minimum throughput comparison below.
In each case we calculate by how much no-PTI is faster than its PTI equivalent.
Local
- SET:
(103444.71 - 96227.87) / 96227.87 * 100 = 7.50%
(6.98% slower with KPTI, previously 13.48% slower)
- GET:
(101122.46 - 94446.54) / 94446.54 * 100 = 7.06%
(6.59% slower with KPTI, previously 12.55% slower)

Network
- SET:
(96581.03 - 96237.13) / 96237.13 * 100 = 0.36%
(0.36% slower with KPTI)
- GET:
(98541.59 - 95556.62) / 95556.62 * 100 = 3.12%
(3.03% slower with KPTI)

Conclusion