Meltdown fix impact on Redis performance in virtualized environments
UPDATE: apparently the kernel difference may have a serious impact, so I'll redo the test from scratch.
Test performed with AOF enabled, fsync policy 1 second, allowing the rewrites to be triggered.
Command lines used:
./redis-benchmark -q -P 32 -n 1000000 -t set,get -r 1000000
./redis-benchmark -q -n 1000000 -t set,get -r 1000000
(Best run of 5 was used)
Patched c4.8xlarge, Amazon Linux 2 image
Linux ip-172-31-16-156 4.9.62-10.57.amzn2.x86_64 #1 SMP Wed Dec 6 00:07:49 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Pipelined:
SET: 554631.19 requests per second
GET: 857632.94 requests per second
Not pipelined:
SET: 121285.63 requests per second
GET: 119502.87 requests per second
Unpatched c4.8xlarge, Ubuntu image with an older kernel
Linux ip-172-31-26-106 4.4.0-1022-aws #31-Ubuntu SMP Tue Jun 27 11:27:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Pipelined:
SET: 602772.75 requests per second
GET: 1012145.75 requests per second
Not pipelined:
SET: 170590.25 requests per second
GET: 159261.03 requests per second
The test used two different EC2 c4.8xlarge instances with two different Linux distributions, one patched for Meltdown and one not. Several things were done in order to amplify the effect of slower syscall invocations:
- The test is performed on the loopback interface: in this case the ability to perform fast context switches is important, especially without pipelining.
- A non-pipelined test was performed as well, where the impact should be higher (more syscalls per query).
- AOF was enabled with a credible configuration (fsync every sec).
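Why the non-pipelined run pays for more syscalls per query can be sketched in a few lines of Python. This is only an illustration, not what redis-benchmark does internally: `encode_command` and `client_writes` are hypothetical helper names, and the model assumes each buffer handed to the socket costs at least one write on the client and one read on the server.

```python
def encode_command(*args: str) -> bytes:
    """Encode one command in the Redis protocol (RESP): an array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def client_writes(commands: list, pipeline: int) -> list:
    """Group encoded commands into the buffers a client would pass to send().

    Simplifying assumption: one buffer corresponds to one write syscall.
    """
    return [
        b"".join(commands[i:i + pipeline])
        for i in range(0, len(commands), pipeline)
    ]


commands = [encode_command("SET", f"key:{i}", "x") for i in range(64)]
unpipelined = client_writes(commands, 1)   # one write per command
pipelined = client_writes(commands, 32)    # mirrors the -P 32 option above
print(len(unpipelined), len(pipelined))    # prints "64 2"
```

With a pipeline of 32 the same 64 commands need 2 writes instead of 64, so any fixed extra cost per syscall is spread over many more queries.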
It's important to note that part of the difference may in theory be due to the different kernel versions used. I consider this unlikely, but it is certainly a defect in the methodology I used.
The performance hit is severe in the non-pipelined case, with the unpatched kernel able to perform roughly 35% more requests per second (about 41% more for SET and 33% more for GET).
As expected, with pipelining this difference is considerably smaller (about 9% for SET and 18% for GET).
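The relative differences can be recomputed directly from the figures reported above. The snippet below is just that arithmetic; the `advantage` helper is a name introduced here for illustration.

```python
# Requests per second, copied from the results listed above.
results = {
    "pipelined": {
        "SET": {"patched": 554631.19, "unpatched": 602772.75},
        "GET": {"patched": 857632.94, "unpatched": 1012145.75},
    },
    "not pipelined": {
        "SET": {"patched": 121285.63, "unpatched": 170590.25},
        "GET": {"patched": 119502.87, "unpatched": 159261.03},
    },
}


def advantage(unpatched: float, patched: float) -> float:
    """Percentage of extra requests per second served by the unpatched system."""
    return (unpatched / patched - 1.0) * 100.0


for mode, commands in results.items():
    for command, rps in commands.items():
        pct = advantage(rps["unpatched"], rps["patched"])
        print(f"{mode:>13} {command}: unpatched is {pct:.1f}% faster")
```

Keep in mind that these percentages compare two different kernels and distributions, so they are an upper bound on what the fix alone could account for in this test.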
AFAIK, since Redis is already very conservative in the number of slow syscalls performed per query, there is no easy way to change the code in order to regain the lost performance. However, in our community the use of pipelining is well established because of the very significant gains you get anyway, regardless of the page table isolation issue, so this is something that mitigates the problem.
Also, the fact that the benchmarks above were conceived in order to maximize the performance penalty, and the results are still acceptable, means that while it is important to investigate whether a kernel upgrade is going to put too much pressure on a Redis server already near its CPU usage limits, the long-term effects of this fix on the Redis experience are relatively contained.
People using Redis on servers allocated just for that purpose, not requiring any cross-process security (because all the processes and the kernels can be trusted), may consider not upgrading the kernel, or disabling the fix via specific boot parameters (AFAIK this should be possible with the new versions of the fix).
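To check whether a given machine was booted with the fix disabled, something like the following sketch may help. It assumes the upstream Linux parameter names `nopti` and `pti=off` and the `/sys/devices/system/cpu/vulnerabilities/meltdown` status file; distribution backports of the fix (including the kernels tested above) may use different names or lack the sysfs file, so treat both as assumptions to verify against your kernel's documentation. The helper names are hypothetical.

```python
from pathlib import Path


def pti_disabled_by_cmdline(cmdline: str) -> bool:
    """Return True if the kernel command line asks to disable page table isolation.

    Assumes the upstream parameter names; backported kernels may differ.
    """
    tokens = cmdline.split()
    return "nopti" in tokens or "pti=off" in tokens


def read_text_or_none(path: str):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("PTI disabled on the command line:", pti_disabled_by_cmdline(cmdline))
    status = read_text_or_none("/sys/devices/system/cpu/vulnerabilities/meltdown")
    print("Kernel-reported Meltdown status:", status or "not available on this kernel")
```

The command-line check only tells you what was requested at boot; where the sysfs file exists, it is the more reliable indication of what the kernel actually does.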
EDIT: as noted on Twitter by Cristoph Sturm, the systems running the hypervisor itself are patched in both cases, and there is no way to test with an unpatched hypervisor because Amazon has already updated all of its systems.
EDIT2: You may wonder if Redis could be used directly to exploit the vulnerability via the Lua scripting engine that is built into Redis and can be called via the EVAL command, similarly to what the POC did with the Linux kernel eBPF JIT. Well, attacks always get better, so who knows, but Redis does not use LuaJIT, just the vanilla Lua interpreter, so it should be very hard to mount an attack like that.