We are going to see how you can track a memory leak using the most recent and performant tools. This article's goal is to give an up-to-date and as-simple-as-it-can-be reference on the main steps towards tracking a memory leak. If you want to get the most out of it, I've added (IMHO) very useful links all along the article.
If you still want to enjoy the read and do not have a leak, you can create one.
And if your app is still leaking, or if you found another way around your leak, comment below!
DISCLAIMER: If you are not using Jemalloc, please use it. Or at least set `MALLOC_ARENA_MAX=2`. Come back to this article if you still have ever-growing memory.
Another quick note about Jemalloc: sometimes `LD_PRELOAD` can be overridden, preventing Jemalloc from being used. We had that exact problem using `bin/qgtunnel`.
You can make sure that Jemalloc is running using `MALLOC_CONF=stats_print:true ruby -e '1 + 1' | grep jemalloc`.
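As a complementary check from inside the running process, the sketch below looks for Jemalloc in the process's memory maps. It assumes Linux (it reads `/proc/self/maps`) and that Jemalloc is loaded as a shared library. `jemalloc_loaded?` is a hypothetical helper name, not part of any gem.

```ruby
# Hypothetical helper: returns true or false when the maps file is readable,
# and nil when it is not (for instance on macOS, where /proc does not exist).
def jemalloc_loaded?(maps_path = "/proc/self/maps")
  return nil unless File.readable?(maps_path)

  # Each line of the maps file describes one mapped memory region, including
  # the path of the shared library backing it, if any.
  File.foreach(maps_path).any? { |line| line.include?("jemalloc") }
end

case jemalloc_loaded?
when true  then puts "jemalloc is loaded in this process"
when false then puts "jemalloc is NOT loaded (LD_PRELOAD may have been overridden)"
else            puts "cannot tell: /proc/self/maps is not available"
end
```

Calling it from a console attached to the real application tells you about the process you actually care about, rather than about a fresh `ruby -e`.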
Ok, now let's start talking about real memory leaks. Maybe. If you are reading this article, chances are that you are experiencing OOM issues, or that your app starts swapping at some point and you don't understand why.
The first thing to do in that case should be to ask yourself: what does that memory evolution mean? I know of three answers to that question:
- My app is too big for my available RAM
- I have a memory bloat
- I have a memory leak
Fortunately, those 3 answers can be distinguished fairly easily. First, increase your available RAM and then take a look at memory evolution.
NOTE: if you cannot do that in production, or do not want to, you can use `derailed exec perf:mem_over_time`. See Using derailed benchmarks below.
If it keeps growing slowly, you can be fairly sure you have a leak. If it stabilizes after a while, your app is simply too big for your RAM: the plateau gives you an indication of how much your app really consumes, and if it is too much, you will have to drop useless gems (and maybe more). If memory spikes suddenly at some point, you have a bloat.
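If you want to watch that evolution from a plain Ruby script, here is a minimal sketch that samples the resident memory of the current process while repeating a workload. It is a simplified stand-in for this kind of measurement, not what derailed does internally. `rss_mb`, `memory_over_time` and the `LEAK` constant are hypothetical names, and the `/proc/self/status` lookup assumes Linux (with a fallback to `ps` elsewhere).

```ruby
# Resident set size of the current process, in megabytes.
# Assumption: Linux exposes it in /proc/self/status; elsewhere we ask `ps`.
def rss_mb
  if File.readable?("/proc/self/status")
    kb = File.read("/proc/self/status")[/^VmRSS:\s+(\d+)\s+kB/, 1]
    return kb.to_f / 1024 if kb
  end
  `ps -o rss= -p #{Process.pid}`.to_f / 1024
end

# Hypothetical helper: runs the block `iterations` times and records memory
# once every `every` iterations, so that you can eyeball the trend.
def memory_over_time(iterations:, every:)
  samples = []
  iterations.times do |i|
    yield
    samples << rss_mb.round(1) if ((i + 1) % every).zero?
  end
  samples
end

LEAK = [] # deliberately leaking container, for demonstration only
samples = memory_over_time(iterations: 50_000, every: 10_000) { LEAK << "x" * 100 }
puts samples.join(" MB, ") + " MB"
```

A series that keeps creeping up suggests a leak, a plateau suggests an app that is just big, and a sudden jump suggests a bloat.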
Before digging into your own code, check whether one of your gems is already known to leak:
gem install bundler-leak
bundle leak update
bundle leak
That's it. If the last command shows something other than No leaks found, you are lucky: here's the leak. Fix it, ship it and let's go! See rubymem/bundler-leak for more information.
Unfortunately, this tool (and maybe others that I don't know about) is not enough for two reasons:
- Not all leaks are referenced there (hence if you find one, please let @memruby know).
- Sorry to say this, but more often than not, the issue is in our own codebase, not in the libraries we use.
There are already some articles on how to track a leak in your own code. Hence I'll focus on finding the leak locally (not on your production machine) and on using existing tools rather than coding them ourselves.
So, once you are sure your issue is a leak, you can start searching for it. This task is harder than searching for a memory bloat: since memory grows slowly over time, you cannot just measure memory before and after a method call. You wouldn't know whether objects were still there because no GC had occurred yet, or because they were leaking.
Fortunately, there is one really great Ruby tool: `ObjectSpace.dump_all` (and a bit more), which lets us analyze our application's memory at a given time. We can take advantage of that to take a few snapshots, and then compare them.
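To give an idea of what such a snapshot looks like, here is a minimal sketch that takes one by hand. `dump_heap` and the `RETAINED` constant are hypothetical names chosen for the example, and the output path is arbitrary. The block form used to parse the dump assumes Ruby 2.7 or later.

```ruby
require "objspace"
require "json"
require "tmpdir"

# Record the file and line of every allocation made from now on, so that the
# dump can tell us where each object comes from.
ObjectSpace.trace_object_allocations_start

# Hypothetical helper: writes one heap snapshot to `path`.
def dump_heap(path)
  GC.start # drop whatever is already collectable before taking the snapshot
  File.open(path, "w") { |file| ObjectSpace.dump_all(output: file) }
  path
end

RETAINED = []
RETAINED << "I am still referenced"

snapshot = dump_heap(File.join(Dir.tmpdir, "heap-#{Process.pid}.ndjson"))

# Each line of the dump is a JSON document describing one object.
objects = File.foreach(snapshot).filter_map do |line|
  JSON.parse(line)
rescue JSON::ParserError, EncodingError
  nil # skip any line that cannot be parsed
end

traced = objects.select { |object| object["file"] == __FILE__ }
puts "#{objects.size} objects dumped, #{traced.size} of them allocated in this file"
```

Objects allocated before `trace_object_allocations_start` have no `file` or `line` in the dump, which is why tracing should be enabled as early as possible.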
Now is a good time to talk about `derailed_benchmarks`. If you don't know about it yet, it is a tool for benchmarking the performance of Rails (and Rack) applications, for either time or memory issues. You may dig through the README for more information; I'll stick with a simple and essential usage here:
gem install derailed_benchmarks
RAILS_ENV=production RACK_ENV=production derailed exec <cmd>
The idea here is to generate three consecutive dumps (A, B, and C for instance) separated by a few requests to your application, and then compare them to detect what is retained and what is not. I will not detail the content of those dumps here, just how to exploit it.
So what we are going to do is check which objects are in B and not in A, that is, objects that have been created during the requests received between both memory snapshots.
Then we'll keep only the objects that are in both B and C, that is, objects that were present at B and retained afterward.
For instance, if you have three dumps with the following objects in memory:
| A | B | C |
|---|---|---|
| 0x01 | 0x01 | 0x01 |
| | 0x02 | |
| | 0x03 | 0x03 |
We can say that our object at `0x01` is a long-term object, part of the usual memory growth at boot. `0x02`, however, is ALLOCATED at B, hence if it is big it could be the cause of a bloat. It cannot be the cause of our leak, though, since it has been garbage collected by C. Finally, `0x03` is RETAINED: it appears after some time (B) and stays there (C), so it has a good chance of being the reason for our leak.
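The same reasoning can be written as plain set arithmetic on object addresses. This is only an illustration of the idea using the three example dumps. The variable names are mine.

```ruby
require "set"

# Addresses present in each snapshot of the example.
a = Set["0x01"]
b = Set["0x01", "0x02", "0x03"]
c = Set["0x01", "0x03"]

allocated = b - a          # created between A and B
retained  = allocated & c  # ...and still alive at C
collected = allocated - c  # ...but garbage collected before C

puts "allocated: #{allocated.to_a.inspect}"
puts "retained:  #{retained.to_a.inspect}"
puts "collected: #{collected.to_a.inspect}"
```

In other words, the retained set is (B - A) ∩ C: what appeared after A and is still around at C.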
To do that with your application, you just need to run `derailed exec perf:heap_diff`. This will save three dumps for you and run a three-way diff against them. You can then check which retained lines concern your application.
$ TEST_COUNT=2000 RACK_ENV=production PATH_TO_HIT=/v1/status derailed exec perf:heap_diff
$ heapy diff tmp/2021-05-06T11:15:43+02:00-heap-0.ndjson tmp/2021-05-06T11:15:52+02:00-heap-1.ndjson tmp/2021-05-06T11:16:04+02:00-heap-2.ndjson \
| grep klaxit-via
Retained DATA 1984 objects of size 158720/13482787 (in bytes) at: /Users/ulysse/Dev/klaxit/klaxit-via/app.rb:56
Retained IMEMO 1982 objects of size 142704/13482787 (in bytes) at: /Users/ulysse/Dev/klaxit/klaxit-via/app.rb:56
If you get a few lines with a huge size, bingo! It is now a matter of understanding why the code is leaking, and fixing that. Otherwise, consider testing another endpoint by changing `PATH_TO_HIT`. I suggest always starting with a simple endpoint, as it will tell you whether the leak lies in your whole stack or only in a single endpoint.
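If you are curious about what a three-way diff does with those files, here is a minimal sketch in plain Ruby. It is not heapy's implementation, just an illustration of the idea under a few assumptions: the dumps were taken with allocation tracing enabled (so that `file` and `line` are present), Ruby is 2.7 or later, and `load_dump` and `retained_by_location` are hypothetical helper names. Addresses can be reused after garbage collection, so treat the result as a hint rather than a proof.

```ruby
require "json"
require "set"

# Parses one heap dump, skipping lines that are not valid JSON or that have
# no address (such as the ROOT entries).
def load_dump(path)
  File.foreach(path).filter_map do |line|
    object = JSON.parse(line)
    object if object["address"]
  rescue JSON::ParserError, EncodingError
    nil
  end
end

# Hypothetical helper: objects allocated between dumps A and B that still
# exist in dump C, grouped by allocation site, biggest total size first.
def retained_by_location(path_a, path_b, path_c)
  before = load_dump(path_a).map { |object| object["address"] }.to_set
  after  = load_dump(path_c).map { |object| object["address"] }.to_set

  load_dump(path_b)
    .reject { |object| before.include?(object["address"]) }
    .select { |object| after.include?(object["address"]) }
    .group_by { |object| "#{object['file']}:#{object['line']}" }
    .map { |location, objects| [location, objects.size, objects.sum { |o| o["memsize"].to_i }] }
    .sort_by { |_location, _count, bytes| -bytes }
end

# Usage: ruby this_script.rb heap-0.ndjson heap-1.ndjson heap-2.ndjson
if ARGV.size == 3
  retained_by_location(*ARGV).first(10).each do |location, count, bytes|
    puts "Retained #{count} objects of size #{bytes} bytes at: #{location}"
  end
end
```

Grouping by allocation site is what turns thousands of individual addresses into the handful of lines worth reading.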
It is now time to commit, run `derailed exec perf:mem_over_time` on the formerly leaking endpoint, and see that nice plateau after a few requests, indicating that you've solved the leak (congrats!).
First pass, I'll read it again a bit later.
=>
The first thing to do in that case shall be to ask
(should or shall, by the way?)
=> You should point to your references here, because it is not very clear
=> This seems fine to me here
=> Reading this, I imagine you mean (B ∩ C) - (A ∩ B); there must be a way to clarify the - (A ∩ B) part (maybe with a small set diagram?)