A user reported a memory leak in version 5.9.0 of the @azure/event-hubs SDK.
We unintentionally introduced a bug in 5.9.0.
The stress testing framework is intended to run the Azure SDKs under non-optimal conditions or for long durations.
We currently run these tests on Kubernetes (AKS) pods. The setup can also use chaos-mesh to introduce artificial network issues, so we can understand how our SDKs behave under unintended conditions.
Our stress tests are connected to AppInsights, which allows us to log events and even dump files to a file share.
#internal.wiki/Reliability-Testing
Such a framework helps in simulating long running situations like these.
Adding stress tests to the Event Hubs SDK: Azure/azure-sdk-for-js#25661.
I set up the stress testing framework and ran a couple of tests to confirm and understand the memory leak. We did see the leak in our stress tests.
With the stress testing infrastructure now running, you can run stress tests for as long as your experiments need. The tests that I ran lasted up to 20 days.
Memory usage exploded when long-lived parent signals stayed latched on to all of their child signals, so the child signals could not be freed.
We also remembered an issue that was discussed long ago: Azure/azure-sdk-for-js#12030.
Fix: Azure/azure-sdk-for-js#25682 (Jeremy's PR)
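The parent/child signal pattern described above can be sketched roughly as follows. This is only an illustration of how listeners can pile up on a long-lived signal, not the SDK's actual code or the change made in the fix PR. The name `runOperations` and its parameters are hypothetical, and the listener counter stands in for what a heap snapshot would show.

```typescript
// Illustrative only: `runOperations` and its parameters are hypothetical names,
// not code from @azure/event-hubs or from the fix PR.
function runOperations(count: number, detachWhenDone: boolean): number {
  // Long-lived parent, for example a signal that lives as long as the client.
  const parent = new AbortController();
  // Listeners currently registered on the parent; each one keeps its child reachable.
  let liveListeners = 0;

  for (let i = 0; i < count; i++) {
    // Short-lived child created for a single operation.
    const child = new AbortController();
    const onParentAbort = (): void => child.abort();

    // The child latches on to the parent so that aborting the parent aborts the child.
    parent.signal.addEventListener("abort", onParentAbort);
    liveListeners++;

    // ... the operation would run here using child.signal ...

    if (detachWhenDone) {
      // Detach once the operation completes so the parent no longer references the child.
      parent.signal.removeEventListener("abort", onParentAbort);
      liveListeners--;
    }
  }
  return liveListeners;
}

const leaked = runOperations(1000, false); // listeners pile up on the parent
const cleaned = runOperations(1000, true); // every listener is removed again
console.log({ leaked, cleaned });
```

Without the detach step, the number of listeners retained by the parent grows with every operation, which is the kind of growth a long-running stress test makes visible.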
The heap snapshots hinted that the issue had to do with AbortSignals.
You can generate heap snapshots with the "heapdump" npm package, as simply as shown below, and then load them in Chrome DevTools.
import heapdump from "heapdump";
heapdump.writeSnapshot(`dump.heapsnapshot`);
- Alloc. Size: how much memory has been allocated to be used.
- Freed Size: how much memory has been freed for new objects.
- Size Delta: the change in the overall amount of free memory.
- Shallow Size: the amount of memory held by an object itself (generally, arrays and strings have larger shallow sizes).
- Retained Size: the size of memory that can be freed once an object is deleted.
Version 5.9.0 is the problematic one; the fix is coming in 5.10.0.
Fix: Azure/azure-sdk-for-js#25682
How will we keep such issues from being published?
- Run perf and stress tests before every release and every other significant PR, at least for the AMQP SDKs, to ensure there are no performance hits or memory leaks.
- Add memory metrics to the perf results, just like ops/sec, along with a heapdump option.
- Add the same optional settings to the stress tests as well.
- Study the heap snapshots and CPU profiles to find any more low-hanging fruit.
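One possible shape for the memory metric mentioned in the list above is sketched here, assuming heap usage is sampled once per iteration and summarized as a growth trend. The helper names `sampleHeap` and `heapGrowthPerSample` are hypothetical and are not part of the existing perf or stress frameworks. The heap reader is passed in as a function so the sketch stays self-contained.

```typescript
// Hypothetical helpers; only the idea (sample heap usage, report a trend) comes from the notes.

/** Least-squares slope of the samples, in bytes per sample interval. */
function heapGrowthPerSample(samples: number[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, y) => sum + y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  samples.forEach((y, x) => {
    numerator += (x - meanX) * (y - meanY);
    denominator += (x - meanX) ** 2;
  });
  return numerator / denominator;
}

/** Runs `work` `count` times and records a heap reading after each run. */
function sampleHeap(readHeapUsed: () => number, work: () => void, count: number): number[] {
  const samples: number[] = [];
  for (let i = 0; i < count; i++) {
    work();
    samples.push(readHeapUsed());
  }
  return samples;
}

// Simulated workload that "retains" 100 bytes per iteration, standing in for a real leak.
let simulatedHeapBytes = 0;
const samples = sampleHeap(
  () => simulatedHeapBytes,
  () => { simulatedHeapBytes += 100; },
  4
);
console.log(heapGrowthPerSample(samples)); // positive slope suggests steady growth
```

In a Node.js test, the reader could be `() => process.memoryUsage().heapUsed`. A slope that stays clearly positive over a long run would be a signal to capture heap snapshots and investigate.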