
@antirez
Last active February 3, 2021 14:08

Meltdown fix impact on Redis performance in virtualized environments

UPDATE: apparently the kernel version difference may have a serious impact on its own, so I'll redo the test from scratch.

Test performed with AOF enabled, fsync policy of 1 second, with rewrites allowed to trigger.
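
For reference, a minimal redis.conf fragment matching this setup could look like the following (the rewrite thresholds shown are just the stock defaults, assumed here since the exact values are not stated):

appendonly yes
appendfsync everysec
# AOF rewrites left enabled with the default thresholds (assumption)
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb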

Command lines used:

./redis-benchmark -q -P 32 -n 1000000 -t set,get -r 1000000
./redis-benchmark -q -n 1000000 -t set,get -r 1000000

(Best run of 5 was used)

Results

Patched c4.8xlarge, Amazon Linux 2 image

Linux ip-172-31-16-156 4.9.62-10.57.amzn2.x86_64 #1 SMP Wed Dec 6 00:07:49 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Pipelined:
SET: 554631.19 requests per second
GET: 857632.94 requests per second

Not pipelined:
SET: 121285.63 requests per second
GET: 119502.87 requests per second

Unpatched c4.8xlarge, Ubuntu image with an older kernel

Linux ip-172-31-26-106 4.4.0-1022-aws #31-Ubuntu SMP Tue Jun 27 11:27:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Pipelined:
SET: 602772.75 requests per second
GET: 1012145.75 requests per second

Not pipelined:
SET: 170590.25 requests per second
GET: 159261.03 requests per second

Discussion

The test used two different EC2 c4.8xlarge instances with two different Linux distributions, one patched for Meltdown and one not. Several things were done here in order to amplify the effect of slower syscall invocations:

  1. The test is performed on the loopback interface: in this case the ability to perform fast context switches is important, especially without pipelining.
  2. A non-pipelined test was performed as well, where the impact should be higher (more syscalls per query); see the sketch after this list.
  3. AOF was enabled with a credible configuration (fsync every second).
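
A rough way to verify point 2, assuming a local redis-server and strace being available, is to count the syscalls issued while the benchmark runs (strace itself adds overhead, so this is only useful for counting, not for timing):

# Attach to all redis-server threads, run the benchmark, then stop
# strace with Ctrl-C to print the per-syscall counts
sudo strace -c -f -p "$(pidof redis-server)"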

It's important to note that part of the difference may in theory be due to the different kernel versions used. While this seems unlikely, it is certainly a defect in the methodology I used.

The performance hit is severe in the non-pipelined case, with the unpatched kernel able to perform roughly 35% more requests per second.

As expected, with pipelining this difference is largely reduced.

AFAIK, since Redis is already very conservative in the number of slow syscalls performed per query, there is no easy way to change the code in order to regain the lost performance. However, in our community the use of pipelining is well established because of the very significant gains it provides anyway, regardless of the page table isolation issue, so this mitigates the problem.
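
As an illustration of client-side pipelining, many commands can be fed to redis-cli in pipe mode so that a single connection round trip (and far fewer syscalls) carries a whole batch; a quick sketch with made-up key names (for binary-safe or very large imports, the RESP format described in the Redis mass insertion docs is preferable):

# Send 100k SET commands through a single pipelined connection
for i in $(seq 1 100000); do echo "SET key:$i $i"; done | redis-cli --pipe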

Also, the fact that the benchmarks above were conceived in order to maximize the performance penalty, and the results are still acceptable, means that while it is important to check whether a kernel upgrade will put too much pressure on a Redis server already near its CPU usage limits, the long-term effect of this fix on the Redis experience is relatively contained.

People running Redis on servers allocated just for that purpose, not requiring any cross-process security (because all the processes and the kernel can be trusted), may consider not upgrading the kernel, or disabling the fix via specific boot parameters (AFAIK this should be possible with the new versions of the fix).
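
For example, on kernels that ship the page table isolation patch, the fix can reportedly be disabled at boot with the nopti (or pti=off) kernel parameter; a sketch for a Debian/Ubuntu-style system using GRUB, with the line below purely illustrative:

# /etc/default/grub: add nopti to the kernel command line
GRUB_CMDLINE_LINUX="nopti"   # append to any existing options rather than replacing them
# Regenerate the GRUB configuration and reboot for it to take effect
sudo update-grub && sudo reboot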

EDIT: as noted on Twitter by Cristoph Sturm, the systems running the hypervisor itself are patched anyway here; there is no way to test with an unpatched hypervisor because Amazon has already updated all their systems.

EDIT2: You may wonder if Redis could be used directly to exploit the vulnerability via the Lua scripting engine embedded inside Redis, which can be invoked via the EVAL command, similarly to what the PoC did with the Linux kernel eBPF JIT. Well, attacks always get better, so who knows, but Redis does not use LuaJIT, just the vanilla Lua interpreter, so it should be very hard to mount an attack like that.
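
For context, EVAL runs an interpreted Lua script on the server side; a trivial (and harmless) example, using an arbitrary key name:

# The script just reads a key through the embedded Lua interpreter
redis-cli EVAL "return redis.call('GET', KEYS[1])" 1 mykey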

Cheers, Salvatore

@tailhook
Copy link

tailhook commented Jan 5, 2018

The interesting questions are:

  1. Can Meltdown be exploited via Lua? I guess yes. So gaining root access to the server (on an unpatched kernel) is still a problem.
  2. Can Spectre be exploited via non-Lua commands? Like using GETRANGE or BITOP with out-of-range indexes.
  3. If (2) is true, can it somehow be exploited through the application layer? I.e. if your app uses indexes for GETRANGE that are received from user data.

Not sure if this is actionable somehow, just curious.

@antirez
Copy link
Author

antirez commented Jan 5, 2018

@tailhook I believe that since Redis uses the vanilla Lua interpreter and not LuaJIT, exploiting this via Lua is going to be very hard. It's even harder with the other commands, because there is no practical way to perform the timing part required to read the side channel.
However, attacks always get better, so who knows. It's just unlikely based on what we know today.

@gothicx
Copy link

gothicx commented Jan 5, 2018

Were those tests executed with transparent hugepages enabled or disabled?
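
(For reference, the THP state can be checked and changed at runtime via the standard sysfs paths; a sketch, with the second command requiring root:)

# The bracketed value is the active setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable THP until the next reboot
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled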

@xeoncross
Copy link

I hope to get to the point where I'm concerned about the drop from 1,012,145 to 857,632 fetches a second. Redis is still quite an amazing database. It will be interesting to see how the whole landscape develops as many smart people are working on mitigating the speed decreases.

@elhh82
Copy link

elhh82 commented Jan 6, 2018

"Performance hit is severe in the non pipelined case, with the non patched kernel being able to perform more or less 35% more requests per second." This seems to be consistent with what we have observed on our Elasticache Redis instances, we might be seeing even larger performance hit than that in some of our workloads.

Thanks for looking into this.

@AlekseyKorzun
Copy link

Sorry to be that guy, but how did you manage to establish a baseline on EC2? I had to do a similar benchmark for something else a couple of months ago (also c4.8xlarge) and found that results vary greatly after each reboot, since you are sharing the hardware resources.

@viraptor
Copy link

viraptor commented Jan 6, 2018

The difference between 4.9.62-10.57.amzn2.x86_64 and 4.4.0-1022-aws, plus the system configuration (Ubuntu vs Amazon Linux 2), may play a big part here. It may be worth retesting on the same system with and without the nopti kernel parameter instead.
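
A quick way to confirm whether page table isolation is actually active after changing the parameter, assuming a kernel recent enough to log it (the exact message may vary):

# Kernels with the KPTI patch log at boot whether isolation is enabled
dmesg | grep -i isolation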

@ruimarinho
Copy link

@AlekseyKorzun not if you're using Dedicated Instances, although it doesn't appear to have been the case.

@antirez
Copy link
Author

antirez commented Jan 6, 2018

Yes, as I wrote in this gist, I believe my methodology is flawed, and there is now evidence that what I feared, namely big differences caused by the kernel version itself, is more significant than the other factors.

@thekev
Copy link

thekev commented Jan 13, 2018

Ubuntu 12.04 3.13.0-141 kernel default:

SET: 109745.38 requests per second
GET: 108613.02 requests per second

Ubuntu 12.04 3.13.0-141 kernel with nopti:

SET: 132925.70 requests per second
GET: 130616.51 requests per second

Tests done on r4.xlarge. Let's not pick any more nits about testing methods. Same kernel, just disable the feature. Boom/done.

@bobrik
Copy link

bobrik commented Jan 13, 2018

Bare metal results on the same Kernel version: https://gist.github.com/bobrik/c67189e88efcc2a1491c54c15f5fe006

@jaipaddy
Copy link

jaipaddy commented May 14, 2018

I am using memtier_benchmark to evaluate Redis 3.2.0 and am "surprisingly" seeing some performance improvements when applying the Meltdown patch to the hypervisor only, i.e. an unpatched guest OS on an unpatched hypervisor vs an unpatched guest OS on a patched hypervisor, in both pipelined and non-pipelined mode. I am seeing a ~35% increase in performance for non-pipelined requests and ~15% for pipelined requests. I do see numbers similar to those reported for the patched hypervisor in https://news.ycombinator.com/item?id=16079457, but am surprised that the hypervisor patch by itself can improve performance, which doesn't make sense. Is memtier_benchmark doing something wrong? Has anyone else seen this?
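
For comparison with the redis-benchmark invocations above, a roughly equivalent pair of memtier_benchmark runs might look like this (client, thread and request counts are assumptions, not the actual settings used):

# Non-pipelined, 1:1 SET/GET ratio with random keys
memtier_benchmark -s 127.0.0.1 -p 6379 --ratio=1:1 -n 100000 -c 50 -t 4
# Pipelined, 32 commands per pipeline
memtier_benchmark -s 127.0.0.1 -p 6379 --ratio=1:1 -n 100000 -c 50 -t 4 --pipeline=32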
