@karuppiah7890 · Created September 22, 2023 12:46

Ingestion Service memory leak issue and solution. Write ✍️ Blog post

ClinicJS Doctor command showing heap issue

Running tests against local using autocannon. Running tests against the staging environment using autocannon. Running the ClinicJS heap profiler tool along with autocannon to find the memory leak issue
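
For reference, an autocannon run typically looks something like this; the port, endpoint path, and payload below are placeholders, not the actual service's:

```sh
# Load test a local instance with autocannon: 100 connections for 60 seconds.
# The port, path and JSON body are placeholders for the real ingestion endpoint.
npx autocannon -c 100 -d 60 \
  -m POST \
  -H 'content-type: application/json' \
  -b '{"events": [{"name": "sample", "value": 1}]}' \
  http://localhost:3000/ingest

# Same flags against the staging environment, just a different base URL.
npx autocannon -c 100 -d 60 -m POST \
  -H 'content-type: application/json' \
  -b '{"events": [{"name": "sample", "value": 1}]}' \
  https://staging.example.com/ingest
```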

Show how heap dump is done

Share the blog post with team :)

Put all the details in it. ALL the details :)


Post-fix ideas 💡 that seemed like they would have helped

Run the service and hit it with many API calls, and then, once memory usage is high, take a heap dump and visualize the data / items / objects in the heap dump, which would show what was using up the heap memory (basically the memory)
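
A sketch of one way to take a heap dump from a running Node.js process (assuming Node.js 12+; the entry point is a placeholder):

```sh
# Start the service so it writes a V8 heap snapshot when it receives SIGUSR2
# (available in Node.js 12+). dist/index.js is a placeholder entry point.
node --heapsnapshot-signal=SIGUSR2 dist/index.js

# Hit it with many API calls, and once memory usage is high,
# trigger the snapshot from another terminal:
kill -USR2 <pid-of-the-node-process>

# A *.heapsnapshot file is written to the working directory; load it in
# Chrome DevTools (Memory tab -> Load) to see which objects hold the heap.
```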


Saw alerts that the ingestion service was down - multiple times, over a span of a few days

Saw ups and downs in the graph 📉📈📉📈📉📈

Show the weird graph from New Relic

We saw that the alert times and the graph had a clear pattern. Whenever the service-down alert came, the service's memory usage peaked to a value and then went down. The same thing happened multiple times. It was like a sequence of triangular 📐▶️ waves 🌊, like a sine or cos wave

The Node.js site had mentioned that Node.js can crash. We also noticed, by chance, that the systemd service had a very low uptime, which basically meant the service had been started recently - it had actually been restarted after a stop / crash
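
The low uptime was visible with something like this (the unit name is a placeholder):

```sh
# systemd shows how long the unit has been up; a very recent
# "Active: active (running) since ..." timestamp means a recent (re)start.
systemctl status ingestion.service
```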

Did memory profiling, found some data, and came to the conclusion that the LoopBack ➿➰ logging library was the problem - the Winston logger library

A colleague, Hari, said that Winston is a popular library and that we don't need to change the logging library, because of the idea that Winston is pretty good (idk, I don't have any opinions on this 😅). He asked if we were using an older version, whether there was any known issue or bug 🪲🐛, and whether we could upgrade

I did a lot of digging but didn't come to any conclusion. I was kinda stuck. Later we didn't notice the issue anymore. This was <> ago

Recently we started getting alerts for high memory usage, and also service-down alerts

This time I did the digging again, and I was able to collect a heap profile locally by load testing a local instance. Of course, production could have been facing a different problem too - I'm realizing that now, and I have to come up with ways to validate the assumption that I reproduced the same high memory usage problem. I guess we just have to inspect the same data in production: do heap profiling while production traffic is coming in, take a heap dump, and try to understand what's happening, what's going on

In between all this, I was wondering why the process went down and how. I wondered if it was the kernel OOM killer, which kills processes when available memory is too low - but we had more than 1GB of free 🆓 RAM available. I validated my assumptions by checking the kernel logs in journalctl and found out that it wasn't the Linux OOM killer. While doing this, I had the idea to check the Node.js logs. I started checking the Node.js logs with journalctl and found crash logs, with some GC data etc., showing that the maximum heap memory was reached and that the process was crashing. This gave the answer to what happened to the Node.js process and why it went down. The answer gave some relief 😮‍💨
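
Roughly what those checks look like, as a sketch (the unit name is a placeholder). The first command looks for kernel OOM-killer activity; the second looks at the service's own logs captured by journald, where a Node.js out-of-memory crash typically shows GC traces followed by a fatal "JavaScript heap out of memory" error:

```sh
# Kernel logs: was the Linux OOM killer involved? (It wasn't, in our case.)
journalctl -k --since "2 days ago" | grep -iE "out of memory|oom-killer|killed process"

# Service logs captured by systemd/journald: the Node.js crash showed up here,
# with GC traces followed by something like:
#   FATAL ERROR: Ineffective mark-compacts near heap limit
#   Allocation failed - JavaScript heap out of memory
journalctl -u ingestion.service --since "2 days ago" | less
```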

I did load testing with the autocannon tool and heap profiling with the ClinicJS tool
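
The combined run looked roughly like this, with placeholder paths; newer ClinicJS versions can also drive the load generator themselves, but two terminals keep it simple:

```sh
# Terminal 1: run the service under the ClinicJS heap profiler. After the
# load test, stop the process (Ctrl+C) and clinic opens an interactive
# flame graph of heap allocations.
npx clinic heapprofiler -- node dist/index.js

# Terminal 2: generate load against the leaking endpoint with autocannon.
npx autocannon -c 100 -d 60 -m POST \
  -H 'content-type: application/json' \
  -b '{"events": []}' \
  http://localhost:3000/ingest
```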

ClinicJS heap profiling gave me a nice flame graph to interact with and find the culprit causing the memory leak of sorts

This was the second time I was seeing these flame graphs; I had seen them the last time too. But this time I spent more time doing local testing. I used ClinicJS along with autocannon to make it easy for me to get a heap profile

I still wasn't sure what code to change, or what to do to fix this issue of high memory usage. The files, line numbers, and methods in the stack that showed up as the items causing lots of memory allocations didn't make sense to me - as in, I couldn't make sense out of them, unfortunately, and I didn't know the program's line of execution well enough to pinpoint everything that happened when a request hit the API

So I just had more data, but it didn't lead anywhere

Looking at the ClinicJS help, I noticed it asked me to run the doctor command before heap profiling etc. I ran the doctor command instead of the heap profiler and got this nice graph which said there might be some issue at the memory level. You don't say. This is how the graph looked - 📉📈📉📈
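
The doctor run has the same shape, just a different subcommand (placeholder entry point):

```sh
# clinic doctor watches CPU, memory, event-loop delay and active handles,
# and points at the category it suspects (memory, in this case).
npx clinic doctor -- node dist/index.js
# ...generate load with autocannon in another terminal, then Ctrl+C.
```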

It also shows information ℹ️ about this data, which is useful to read and follow up on to understand and learn more

I looked at multiple GitHub issues in different repositories

I was still all over the place, confused, not knowing what to do next

Then I started noticing another thing, something I had noticed before too, and validated again this time. The two ingestion endpoints, when hit, caused memory issues and showed up in the heap profile. I thought it was the logs, but then the issue didn't happen for the status endpoint of the same service, and that endpoint also had logs - actually that log was a logging 🪵 middleware, implemented as an interceptor to capture every request, response code, etc.
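
For context, a request-logging interceptor in LoopBack 4 usually has the shape below. This is not our exact code, just a sketch with hypothetical names, and it logs to the console instead of an injected logger to keep it short.

```ts
import {
  asGlobalInterceptor,
  injectable,
  Interceptor,
  InvocationContext,
  InvocationResult,
  Provider,
  ValueOrPromise,
} from '@loopback/core';

// Hypothetical global interceptor that logs every invocation and its outcome.
@injectable(asGlobalInterceptor('logging'))
export class RequestLoggingInterceptor implements Provider<Interceptor> {
  value() {
    return this.intercept.bind(this);
  }

  async intercept(
    invocationCtx: InvocationContext,
    next: () => ValueOrPromise<InvocationResult>,
  ) {
    const started = Date.now();
    try {
      const result = await next();
      console.log(`${invocationCtx.targetName} ok in ${Date.now() - started}ms`);
      return result;
    } catch (err) {
      console.error(`${invocationCtx.targetName} failed`, err);
      throw err;
    }
  }
}
```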

I started reading the service code more, and also started looking for differences between the status endpoint and the ingestion endpoints

I noticed that we use the logger at the controller level in the ingestion endpoints' implementation, and also at the service level. In both cases we were injecting the logger - the dependency injection concept
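
The injection pattern was roughly like this. The binding key, class names, and method bodies here are made up for illustration; the point is that both the controller and the service receive the logger through @inject rather than constructing it themselves.

```ts
import {inject} from '@loopback/core';
import {post, requestBody} from '@loopback/rest';
import {Logger} from 'winston';

// Hypothetical service: the logger is injected, not created inside the class.
export class IngestionService {
  constructor(@inject('services.logger') private logger: Logger) {}

  async ingest(payload: object) {
    this.logger.info('ingesting payload');
    // ...the actual ingestion work goes here
  }
}

// Hypothetical controller: same injected logger, plus the service.
export class IngestionController {
  constructor(
    @inject('services.logger') private logger: Logger,
    @inject('services.IngestionService') private ingestionService: IngestionService,
  ) {}

  @post('/ingest')
  async ingest(@requestBody() payload: object) {
    this.logger.info('received an ingestion request');
    return this.ingestionService.ingest(payload);
  }
}
```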

As I said, my plan was to find the root cause and then act appropriately with more data in hand - especially about what we can do to fix the issue and what the underlying issue is. The high-level issue was clear

Later, in a discussion, my CTO (my boss) said that I could just change the logging library

Now changing the logging library was my next task and plan. For doing this, I was checking the code already written: the loggers, the annotations, and how the dependency injection code is written, etc.

Somehow, while doing this, I came across a service being of transient scope. I read up on what transient scope is and started suspecting it
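
For anyone unfamiliar: transient scope in LoopBack 4 means a fresh instance is created every time the binding is resolved (potentially once per request), while singleton scope means one shared instance. A minimal sketch, with hypothetical class names:

```ts
import {BindingScope, injectable} from '@loopback/core';

// TRANSIENT (the default): a new instance is created each time the binding
// is resolved, e.g. for every request that needs it.
@injectable({scope: BindingScope.TRANSIENT})
export class TransientThing {}

// SINGLETON: one instance is created lazily and then shared by everyone.
@injectable({scope: BindingScope.SINGLETON})
export class SharedThing {}
```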

While reading GitHub issues too, some people mentioned some things to avoid. For example, using the same Winston transport multiple times (or something like that) is a problem
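
Roughly the failure mode those issues describe, as I understood it: if a fresh Winston logger (with fresh transports) gets built on every resolution instead of being reused, the loggers and transports pile up on the heap. A hedged sketch, with hypothetical names:

```ts
import {createLogger, format, transports, Logger} from 'winston';

// If something like this runs per request (for example inside a provider in
// transient scope), every call allocates a new logger plus new transports,
// none of which are ever closed, and they accumulate on the heap.
export function buildLogger(): Logger {
  return createLogger({
    format: format.json(),
    transports: [new transports.Console()],
  });
}

// Building it once and reusing the instance avoids that pile-up.
export const sharedLogger = buildLogger();
```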

Putting two and two together, and reading some LoopBack docs, I was able to make a guess about how the memory leak happened. I fixed the issue and tested it out, and it worked. But my fix caused some logging features to be gone. So then I came up with another fix which has minimal code changes and keeps all the features of the code
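
I won't claim this is the exact change, but the shape of the minimal fix is: keep the same logger factory, just bind it in singleton scope so it's created once and shared instead of being re-created per resolution. A sketch against the hypothetical binding key and factory from the earlier snippets:

```ts
import {Application, BindingScope} from '@loopback/core';
import {buildLogger} from './logger'; // the hypothetical factory from the earlier sketch

export class IngestionApplication extends Application {
  constructor() {
    super();
    // Bind the logger in SINGLETON scope so it is created once and shared,
    // instead of a fresh Winston logger (and fresh transports) being created
    // every time 'services.logger' is resolved.
    this.bind('services.logger')
      .toDynamicValue(buildLogger)
      .inScope(BindingScope.SINGLETON);
  }
}
```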

With this guess of how things work, a fix based on it, and testing that made me believe I had fixed things, I went ahead and deployed it to staging and noticed that it did improve memory usage, especially when traffic was high - due to the load testing I did

Next we deployed to Sandbox ⌛ and Production. I could visibly see how the response time in Production decreased due to my change, and how the memory usage went down to a great extent

Finally, our customers will not face any impact due to this anymore; it will fix any issues they had


Was this a memory leak? Were logger instances created but never deleted, and is that why the memory leak happened? Or is it just that a lot of loggers got created and GC happened less often, leaving a lot of loggers alive in the heap and causing high memory usage?
