
Meltdown fix impact on Redis performance in virtualized environments

UPDATE: apparently kernel differences may have a serious impact, so I'll redo the tests from scratch.

Test performed with AOF enabled, fsync policy 1 second, allowing the rewrites to be triggered.

Command lines used:

./redis-benchmark -q -P 32 -n 1000000 -t set,get -r 1000000
./redis-benchmark -q -n 1000000 -t set,get -r 1000000

(Best run of 5 was used)
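The -q flag makes redis-benchmark print one summary line per command, which is easy to post-process when picking the best of several runs. As a small illustration (the parse_benchmark helper below is mine, not part of Redis), the output format assumed here matches the result lines reported below:

```python
import re

def parse_benchmark(output: str) -> dict:
    """Parse `redis-benchmark -q` output lines of the form
    'SET: 554631.19 requests per second' into a dict."""
    results = {}
    for line in output.splitlines():
        m = re.match(r"([A-Z]+): ([\d.]+) requests per second", line.strip())
        if m:
            results[m.group(1)] = float(m.group(2))
    return results

sample = """SET: 554631.19 requests per second
GET: 857632.94 requests per second"""
print(parse_benchmark(sample))  # {'SET': 554631.19, 'GET': 857632.94}
```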


Patched c4.8xlarge, Amazon Linux 2 image

Linux ip-172-31-16-156 4.9.62-10.57.amzn2.x86_64 #1 SMP Wed Dec 6 00:07:49 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Pipelined:
SET: 554631.19 requests per second
GET: 857632.94 requests per second

Not pipelined:
SET: 121285.63 requests per second
GET: 119502.87 requests per second

Unpatched c4.8xlarge, Ubuntu image with an older kernel

Linux ip-172-31-26-106 4.4.0-1022-aws #31-Ubuntu SMP Tue Jun 27 11:27:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Pipelined:
SET: 602772.75 requests per second
GET: 1012145.75 requests per second

Not pipelined:
SET: 170590.25 requests per second
GET: 159261.03 requests per second


The test used two different EC2 c4.8xlarge instances with two different Linux distributions, one patched for Meltdown and one not. Several things were done here in order to amplify the effect of slower syscall invocations:

  1. The test is performed on the loopback interface: in this case the ability to perform fast context switches is important, especially without pipelining.
  2. A non-pipelined test was performed as well, where the impact should be higher (more syscalls per query).
  3. AOF was enabled with a credible configuration (fsync every second).

It's important to note that part of the difference may in theory be due to the different kernel versions used. I believe this is unlikely, but it is certainly a defect in the methodology I used.

The performance hit is severe in the non-pipelined case, with the unpatched kernel able to perform roughly 35% more requests per second.

As expected, with pipelining this difference is largely reduced.
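For reference, the percentages implied by the numbers reported above can be recomputed directly (delta_pct is just an illustrative helper, computing the unpatched kernel's advantage):

```python
# Sanity check of the claim above: percentage advantage of the unpatched
# kernel, computed as (unpatched/patched - 1) * 100 from the reported numbers.
def delta_pct(unpatched: float, patched: float) -> float:
    return round((unpatched / patched - 1) * 100, 1)

# Non-pipelined, where per-command syscall overhead dominates:
print(delta_pct(170590.25, 121285.63))   # SET: 40.7
print(delta_pct(159261.03, 119502.87))   # GET: 33.3

# Pipelined with -P 32, syscalls amortized over 32 commands:
print(delta_pct(602772.75, 554631.19))   # SET: 8.7
print(delta_pct(1012145.75, 857632.94))  # GET: 18.0
```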

Since Redis is already very conservative in the number of slow syscalls performed per query, AFAIK there is no easy way to change the code in order to win back the performance loss. However, in our community the use of pipelining is well established because of the very significant gains it provides anyway, regardless of the page table isolation issue, so this is something that mitigates the problem.

Also, the fact that the benchmarks above were conceived to maximize the performance penalty, and the results are still acceptable, means that while it is important to check whether a kernel upgrade will put too much pressure on a Redis server already near its CPU limits, the long-term effect of this fix on the Redis experience is relatively contained.

People using Redis on servers allocated just for that purpose, not requiring any cross-process security (because all the processes and the kernel can be trusted), may consider not upgrading the kernel, or disabling the fix via specific boot parameters (AFAIK this should be possible with the new versions of the fix).
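As an illustration of the boot-parameter route: the x86 kernels carrying the KPTI patches accept `nopti` (and, in later versions, `pti=off`) on the kernel command line. A toy check for those flags might look like the sketch below (pti_disabled is a hypothetical helper; on a live box you would read the real command line from /proc/cmdline):

```python
# Sketch: detect whether page table isolation was disabled at boot,
# assuming the `nopti` / `pti=off` x86 kernel parameters.
def pti_disabled(cmdline: str) -> bool:
    params = cmdline.split()
    return "nopti" in params or "pti=off" in params

print(pti_disabled("BOOT_IMAGE=/vmlinuz root=/dev/xvda1 nopti"))  # True
print(pti_disabled("BOOT_IMAGE=/vmlinuz root=/dev/xvda1"))        # False
```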

EDIT: as noted on Twitter by Cristoph Sturm, note that the systems running the hypervisor itself are patched anyway here, and there is no way to test with a non-patched hypervisor because Amazon already updated all their systems.

EDIT2: You may wonder if Redis could be used directly to exploit the vulnerability via the Lua scripting engine embedded inside Redis, which can be called via the EVAL command, similarly to what the PoC basically did with the Linux kernel eBPF JIT. Well, attacks always get better, so who knows, but Redis does not use LuaJIT, just the vanilla Lua interpreter, so it should be very hard to mount an attack like that.

Cheers, Salvatore

tailhook commented Jan 5, 2018

The interesting questions are:

  1. Can Meltdown be exploited via Lua? I guess yes. So gaining root access to the server (on an unpatched kernel) is still a problem.
  2. Can Spectre be exploited via non-Lua commands? Like using GETRANGE or BITOP with out-of-range indexes.
  3. If (2) is true, can it somehow be exploited through the application layer? I.e. if your app uses indexes for GETRANGE that are received from user data.

Not sure if this is actionable somehow, just curious.


antirez commented Jan 5, 2018

@tailhook I believe that since Redis uses the vanilla Lua interpreter and not LuaJIT, exploiting this via Lua is going to be very hard. Even harder with the other commands, because there is no practical way to perform the timing part required to read the side channel.
However, attacks always get better, so who knows. It's just unlikely based on what we know today.

gothicx commented Jan 5, 2018

Were those tests executed with transparent hugepages enabled or disabled?

I hope to get to the point where I'm concerned about the drop from 1,012,145 to 857,632 fetches per second. Redis is still quite an amazing database. It will be interesting to see how the whole landscape develops as many smart people work on mitigating the slowdowns.

elhh82 commented Jan 6, 2018

"Performance hit is severe in the non pipelined case, with the non patched kernel being able to perform more or less 35% more requests per second." This seems to be consistent with what we have observed on our ElastiCache Redis instances; we might be seeing an even larger performance hit than that in some of our workloads.

Thanks for looking into this.

AlekseyKorzun commented Jan 6, 2018

Sorry to be that guy, but how did you manage to establish the baseline on EC2? I had to do a similar benchmark for something else a couple of months ago (also on a c4.8xlarge) and found that results vary (greatly) after each reboot. You are sharing the resources.

viraptor commented Jan 6, 2018

The difference between 4.9.62-10.57.amzn2.x86_64 and 4.4.0-1022-aws + the system config (ubuntu/amazon2) may play a big part here. It may be worth retesting on the same system with and without the nopti kernel parameter instead.

@AlekseyKorzun not if you're using Dedicated Instances, although it doesn't appear to have been the case.


antirez commented Jan 6, 2018

Yes, as I wrote in this gist, I believe my methodology is flawed, and now there is evidence that what I feared, that is, big changes because of the kernel version, is more significant than the other factors.

thekev commented Jan 13, 2018

Confirmed. Booted up a fresh Ubuntu 14.04 on an r3.large, ran the same tests, then downgraded the kernel one version (from Dec 13, 2017). ~35% delta.
redis-benchmark -q -n 1000000 -t set,get -r 1000000

3.13.0-138 (unpatched):
  SET: 84473.73 requests per second
  GET: 92816.04 requests per second
3.13.0-139 (patched):
  SET: 63171.20 requests per second
  GET: 63303.16 requests per second

There is some variance between runs and reboots, but only in the ~5% range. Note the r3.large is 2 threads/1 core, so there's some contention between the benchmark and the server process, but it's good enough to show me I need to hold back on this kernel, or launch a lot more nodes. Also, I did not enable AOF, as we just run Redis as a big LRU cache.
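The quoted ~35% delta can be double-checked from the loopback numbers above (delta_pct below is just an illustrative helper, computing the unpatched kernel's advantage):

```python
# Recompute the unpatched (3.13.0-138) vs patched (3.13.0-139) delta
# from the loopback results, as (unpatched/patched - 1) * 100.
def delta_pct(unpatched: float, patched: float) -> float:
    return round((unpatched / patched - 1) * 100, 1)

print(delta_pct(84473.73, 63171.20))  # SET: 33.7
print(delta_pct(92816.04, 63303.16))  # GET: 46.6
```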

Since the loopback interface is the worst case for needing fast syscalls, I also tested from two non-local AZs. The box running redis-benchmark stayed on 3.13.0-138, and the redis-server nodes were tested with both kernels:

1.226ms average icmp round trip:

3.13.0-138 (unpatched):
  SET: 38300.97 requests per second
  GET: 38681.73 requests per second
3.13.0-139 (patched):
  SET: 38869.67 requests per second
  GET: 38907.48 requests per second

0.640ms average icmp round trip:

3.13.0-138 (unpatched):
  SET: 62897.04 requests per second
  GET: 63283.13 requests per second
3.13.0-139 (patched):
  SET: 62235.50 requests per second
  GET: 61908.00 requests per second

So, when adding a little latency, I'm not seeing much effect after all, at least when not pipelined. Bring pipelining back and it gets more interesting:

redis-benchmark -q -n 1000000 -t set,get -r 1000000 -h [redacted] -P 32

1.222ms average icmp round trip:

3.13.0-138 (unpatched)
  SET: 633312.25 requests per second
  GET: 724637.69 requests per second
3.13.0-139 (patched)
  SET: 390930.41 requests per second
  GET: 445831.47 requests per second

0.640ms average icmp round trip:

3.13.0-138 (unpatched):
  SET: 613496.94 requests per second
  GET: 704721.62 requests per second
3.13.0-139 (patched)
  SET: 442673.75 requests per second
  GET: 487804.88 requests per second

So, it definitely has an effect, when network latency isn't the limiting factor (I assume).
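The size of that effect in the remote, pipelined (-P 32) runs can be recomputed from the numbers above (same illustrative ratio, unpatched over patched):

```python
# Unpatched (3.13.0-138) advantage over patched (3.13.0-139) for the
# remote pipelined runs, as (unpatched/patched - 1) * 100.
def delta_pct(unpatched: float, patched: float) -> float:
    return round((unpatched / patched - 1) * 100, 1)

# 1.222ms average RTT:
print(delta_pct(633312.25, 390930.41))  # SET: 62.0
print(delta_pct(724637.69, 445831.47))  # GET: 62.5

# 0.640ms average RTT:
print(delta_pct(613496.94, 442673.75))  # SET: 38.6
print(delta_pct(704721.62, 487804.88))  # GET: 44.5
```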

bobrik commented Jan 13, 2018

Bare metal results on the same kernel version:
