@bobrik bobrik/README.md
Last active Mar 1, 2018

KPTI Redis benchmark on bare metal

Redis KPTI benchmark

Other benchmarks make me sad: people run them in the cloud, where VMs can migrate between hosts on reboots and often run old kernels. Here we use bare metal with the same recent kernel to remove extra variables and measure the KPTI impact, and only the KPTI impact.

Setup

  • Kernel: 4.14.11
  • CPU: 2 x Xeon E5-2630 v3 @ 2.40GHz (32 logical cores)

Some prerequisites:

  • Every measurement is taken with a freshly restarted Redis without persistence
  • There were no other services that could use any significant CPU
  • The machine was rebooted multiple times to check consistency, but only results from one boot are included

The following command was used to launch the server:

$ docker run --rm -it --net host redis:4.0.6 redis-server --save "" --appendonly no

With PTI

$ dmesg -T | fgrep 'page tables isolation'
[Sat Jan 13 04:28:34 2018] Kernel/User page tables isolation: enabled

Regular (1M GET + 1M SET)

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 04:49:25 UTC 2018 try 1
SET: 94948.73 requests per second
GET: 95648.02 requests per second

Sat Jan 13 04:49:46 UTC 2018 try 2
SET: 96627.70 requests per second
GET: 95593.15 requests per second

Sat Jan 13 04:50:07 UTC 2018 try 3
SET: 95410.74 requests per second
GET: 94446.54 requests per second
  • SET difference between low and high: (96627.70 - 94948.73) / 94948.73 * 100 = 1.77%
  • GET difference between low and high: (95648.02 - 94446.54) / 94446.54 * 100 = 1.27%
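The low/high spread is plain percent arithmetic; it can be reproduced with a quick awk sketch (SET numbers hardcoded from the three runs above):

```shell
# spread between the best and worst SET result with PTI
awk 'BEGIN { low = 94948.73; high = 96627.70; printf "%.2f%%\n", (high - low) / low * 100 }'
# prints 1.77%
```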

Bonus perf stats

$ sudo perf stat -d -p 3144
^C
 Performance counter stats for process id '3144':

      58904.467362      task-clock (msec)         #    0.804 CPUs utilized
           308,662      context-switches          #    0.005 M/sec
                 4      cpu-migrations            #    0.000 K/sec
            23,828      page-faults               #    0.405 K/sec
   183,214,645,662      cycles                    #    3.110 GHz                      (50.10%)
   141,173,051,301      instructions              #    0.77  insn per cycle           (62.62%)
    27,713,334,373      branches                  #  470.479 M/sec                    (62.59%)
       171,263,576      branch-misses             #    0.62% of all branches          (62.50%)
    42,720,689,753      L1-dcache-loads           #  725.254 M/sec                    (62.42%)
     3,632,866,438      L1-dcache-load-misses     #    8.50% of all L1-dcache hits    (24.95%)
     1,014,432,733      LLC-loads                 #   17.222 M/sec                    (25.05%)
       426,430,439      LLC-load-misses           #   42.04% of all LL-cache hits     (37.58%)

      73.309047198 seconds time elapsed
$ sudo perf stat -d -p 4398
^C
 Performance counter stats for process id '4398':

      58918.071006      task-clock (msec)         #    0.866 CPUs utilized
           475,459      context-switches          #    0.008 M/sec
                 5      cpu-migrations            #    0.000 K/sec
            23,780      page-faults               #    0.404 K/sec
   182,551,645,439      cycles                    #    3.098 GHz                      (50.13%)
   142,093,978,992      instructions              #    0.78  insn per cycle           (62.68%)
    27,936,369,361      branches                  #  474.156 M/sec                    (62.57%)
       173,701,977      branch-misses             #    0.62% of all branches          (62.44%)
    43,069,902,734      L1-dcache-loads           #  731.013 M/sec                    (62.43%)
     3,745,293,484      L1-dcache-load-misses     #    8.70% of all L1-dcache hits    (25.01%)
     1,077,645,288      LLC-loads                 #   18.291 M/sec                    (25.10%)
       432,856,886      LLC-load-misses           #   40.17% of all LL-cache hits     (37.57%)

      68.039181383 seconds time elapsed

Pipelined (10M GET + 10M SET)

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -P 32 -n 10000000 -t set,get -r 1000000; done
Sat Jan 13 04:52:03 UTC 2018 try 1
SET: 863632.38 requests per second
GET: 1037344.38 requests per second

Sat Jan 13 04:52:24 UTC 2018 try 2
SET: 859254.19 requests per second
GET: 1027643.62 requests per second

Sat Jan 13 04:52:46 UTC 2018 try 3
SET: 862738.38 requests per second
GET: 1016156.94 requests per second
  • SET difference between low and high: (863632.38 - 859254.19) / 859254.19 * 100 = 0.51%
  • GET difference between low and high: (1037344.38 - 1016156.94) / 1016156.94 * 100 = 2.01%

Without PTI (nopti kernel cmdline)

$ dmesg -T | fgrep 'page tables isolation'
[Sat Jan 13 04:56:44 2018] Kernel/User page tables isolation: disabled on command line.

Regular (1M GET + 1M SET)

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 05:02:32 UTC 2018 try 1
SET: 109745.38 requests per second
GET: 110277.90 requests per second

Sat Jan 13 05:02:50 UTC 2018 try 2
SET: 113934.14 requests per second
GET: 108003.02 requests per second

Sat Jan 13 05:03:08 UTC 2018 try 3
SET: 112637.98 requests per second
GET: 112498.59 requests per second
  • SET difference between low and high: (113934.14 - 109745.38) / 109745.38 * 100 = 3.82%
  • GET difference between low and high: (112498.59 - 108003.02) / 108003.02 * 100 = 4.16%

Bonus perf stats

$ sudo perf stat -d -p 9498
^C
 Performance counter stats for process id '9498':

      51859.650102      task-clock (msec)         #    0.853 CPUs utilized
           171,547      context-switches          #    0.003 M/sec
                 4      cpu-migrations            #    0.000 K/sec
            23,724      page-faults               #    0.457 K/sec
   162,137,087,109      cycles                    #    3.126 GHz                      (49.95%)
   137,847,228,034      instructions              #    0.85  insn per cycle           (62.46%)
    26,968,561,543      branches                  #  520.030 M/sec                    (62.46%)
       167,141,761      branch-misses             #    0.62% of all branches          (62.50%)
    41,819,831,836      L1-dcache-loads           #  806.404 M/sec                    (62.50%)
     3,393,654,584      L1-dcache-load-misses     #    8.11% of all L1-dcache hits    (25.03%)
       953,380,618      LLC-loads                 #   18.384 M/sec                    (24.99%)
       430,826,721      LLC-load-misses           #   45.19% of all LL-cache hits     (37.49%)

      60.789997981 seconds time elapsed
$ sudo perf stat -d -p 11934
^C
 Performance counter stats for process id '11934':

      52241.822970      task-clock (msec)         #    0.745 CPUs utilized
           251,987      context-switches          #    0.005 M/sec
                 4      cpu-migrations            #    0.000 K/sec
            23,686      page-faults               #    0.453 K/sec
   162,948,839,545      cycles                    #    3.119 GHz                      (49.94%)
   138,960,314,319      instructions              #    0.85  insn per cycle           (62.48%)
    27,243,372,526      branches                  #  521.486 M/sec                    (62.48%)
       167,741,490      branch-misses             #    0.62% of all branches          (62.49%)
    42,188,015,450      L1-dcache-loads           #  807.553 M/sec                    (62.54%)
     3,429,701,506      L1-dcache-load-misses     #    8.13% of all L1-dcache hits    (25.02%)
       962,609,210      LLC-loads                 #   18.426 M/sec                    (24.99%)
       432,863,320      LLC-load-misses           #   44.97% of all LL-cache hits     (37.45%)

      70.121123867 seconds time elapsed

Pipelined (10M GET + 10M SET)

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -P 32 -n 10000000 -t set,get -r 1000000; done
Sat Jan 13 05:04:10 UTC 2018 try 1
SET: 931532.38 requests per second
GET: 1127777.25 requests per second

Sat Jan 13 05:04:30 UTC 2018 try 2
SET: 914243.94 requests per second
GET: 1135718.25 requests per second

Sat Jan 13 05:04:49 UTC 2018 try 3
SET: 921489.12 requests per second
GET: 1127650.00 requests per second
  • SET difference between low and high: (931532.38 - 914243.94) / 914243.94 * 100 = 1.89%
  • GET difference between low and high: (1135718.25 - 1127650.00) / 1127650.00 * 100 = 0.72%

Comparison

Minimum throughput comparison across runs is presented below.

For each, we calculate how much faster the no-PTI setup is than its PTI equivalent.

Regular

  • SET: (109745.38 - 94948.73) / 94948.73 * 100 = 15.58% (13.48% slower)
  • GET: (108003.02 - 94446.54) / 94446.54 * 100 = 14.35% (12.55% slower)

Pipelined

  • SET: (914243.94 - 859254.19) / 859254.19 * 100 = 6.40% (6.02% slower)
  • GET: (1127650.00 - 1016156.94) / 1016156.94 * 100 = 10.98% (9.90% slower)
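The two percentages on each line are the same delta divided by different bases: "faster" divides by the PTI minimum, "slower" divides by the no-PTI minimum. A minimal awk sketch using the regular SET minimums from above:

```shell
# no-PTI speedup vs PTI, and the equivalent PTI slowdown,
# using the minimum regular SET throughput from each setup
awk 'BEGIN {
  pti = 94948.73; nopti = 109745.38
  printf "no PTI faster by %.2f%%\n", (nopti - pti) / pti * 100
  printf "PTI slower by %.2f%%\n", (nopti - pti) / nopti * 100
}'
# prints:
# no PTI faster by 15.58%
# PTI slower by 13.48%
```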

Notes

Keep in mind that while these numbers look high, there were fluctuations between runs within the same setup, so some amount of noise is present, especially in the non-pipelined version.

IPC in perf stat went down from 0.85 insn per cycle without PTI to 0.77-0.78 with PTI (roughly 9% down).
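The IPC figures can be rechecked directly from the cycle and instruction counters in the perf stat output above (first PTI run vs first no-PTI run):

```shell
# instructions per cycle from the perf counters quoted above
awk 'BEGIN {
  printf "PTI IPC: %.2f\n", 141173051301 / 183214645662
  printf "no-PTI IPC: %.2f\n", 137847228034 / 162137087109
}'
# prints:
# PTI IPC: 0.77
# no-PTI IPC: 0.85
```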

Conclusion

There is no conclusion; make your own.

@bobrik commented Jan 13, 2018

Local vs remote

Same setup as above, but here the client runs over a 2x10G link under the same ToR switch.

The network is two ports of an Intel 82599ES 10-Gigabit SFI/SFP+ NIC bonded together.

There is no magical packet steering involved.

KPTI

Local

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 18:12:06 UTC 2018 try 1
SET: 96227.87 requests per second
GET: 96274.19 requests per second

Sat Jan 13 18:12:27 UTC 2018 try 2
SET: 97304.66 requests per second
GET: 95283.47 requests per second

Sat Jan 13 18:12:48 UTC 2018 try 3
SET: 97522.92 requests per second
GET: 95565.75 requests per second
  • SET difference between low and high: (97522.92 - 96227.87) / 96227.87 * 100 = 1.35%
  • GET difference between low and high: (96274.19 - 95283.47) / 95283.47 * 100 = 1.04%

Network

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -h 36s152 -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 18:13:47 UTC 2018 try 1
SET: 100553.05 requests per second
GET: 97370.98 requests per second

Sat Jan 13 18:14:08 UTC 2018 try 2
SET: 96237.13 requests per second
GET: 95556.62 requests per second

Sat Jan 13 18:14:29 UTC 2018 try 3
SET: 99373.95 requests per second
GET: 97675.33 requests per second
  • SET difference between low and high: (100553.05 - 96237.13) / 96237.13 * 100 = 4.48%
  • GET difference between low and high: (97675.33 - 95556.62) / 95556.62 * 100 = 2.22%

No KPTI

Local

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 18:29:51 UTC 2018 try 1
SET: 104810.81 requests per second
GET: 102469.52 requests per second

Sat Jan 13 18:30:11 UTC 2018 try 2
SET: 104920.78 requests per second
GET: 104504.12 requests per second

Sat Jan 13 18:30:30 UTC 2018 try 3
SET: 103444.71 requests per second
GET: 101122.46 requests per second
  • SET difference between low and high: (104920.78 - 103444.71) / 103444.71 * 100 = 1.43%
  • GET difference between low and high: (104504.12 - 101122.46) / 101122.46 * 100 = 3.34%

Network

$ for i in $(seq 1 3); do echo $(date) "try $i"; docker run --rm -it --net host redis:4.0.6 redis-benchmark -q -h 36s152 -n 1000000 -t set,get -r 1000000; done
Sat Jan 13 18:31:10 UTC 2018 try 1
SET: 97143.97 requests per second
GET: 99492.59 requests per second

Sat Jan 13 18:31:30 UTC 2018 try 2
SET: 96581.03 requests per second
GET: 98541.59 requests per second

Sat Jan 13 18:31:51 UTC 2018 try 3
SET: 102103.33 requests per second
GET: 100020.00 requests per second
  • SET difference between low and high: (102103.33 - 96581.03) / 96581.03 * 100 = 5.72%
  • GET difference between low and high: (100020.00 - 98541.59) / 98541.59 * 100 = 1.50%

Comparison

Minimum throughput comparison below.

For each, we calculate how much faster the no-PTI setup is than its PTI equivalent.

Local

  • SET: (103444.71 - 96227.87) / 96227.87 * 100 = 7.50% (6.98% slower, previously 13.48% slower)
  • GET: (101122.46 - 95283.47) / 95283.47 * 100 = 6.13% (5.77% slower, previously 12.55% slower)

Network

  • SET: (96581.03 - 96237.13) / 96237.13 * 100 = 0.36% (0.36% slower)
  • GET: (98541.59 - 95556.62) / 95556.62 * 100 = 3.12% (3.03% slower)

Conclusion

  • Impact here is 2x smaller on local tests, even though nothing changed
  • Impact over the network is within the noise, smaller than the run-to-run variation between tests