@HarshaNalluru
Last active August 2, 2023 20:46
How we found and fixed a memory leak in Event Hubs SDK 5.9.0

Background

A user logged an issue reporting a memory leak with version 5.9.0 of the @azure/event-hubs SDK. We had unintentionally introduced a bug in 5.9.0.

Stress-testing framework

The stress-testing framework is intended to run the Azure SDKs under non-optimal conditions or over long periods of time.

We currently run these tests on Kubernetes (AKS) pods. The framework also has capabilities such as chaos-mesh, which introduces artificial network issues so that we can understand how our SDKs behave under unintended conditions.

Our stress tests are connected to AppInsights, which allows us to log events and even dump files to a file share.

#internal.wiki/Reliability-Testing

Such a framework helps in simulating long-running situations like this one.

Testing

Stress tests were added to the Event Hubs SDK in Azure/azure-sdk-for-js#25661.

I set up the stress-testing framework and ran a couple of tests to confirm and understand the memory leak. The leak did show up in our stress tests.

Memory usage for the test

image

To reiterate, the stress-testing infrastructure is now up and running, and you can run long-running stress tests for as long as your experiments require. The tests that I ran lasted as long as 20 days.

Back to the test & The fix

Memory usage exploded because long-lived parent signals remained latched on to all of their child signals, which kept those child signals from being garbage collected.

We also remembered an issue that was discussed long ago: Azure/azure-sdk-for-js#12030.

The fix: Azure/azure-sdk-for-js#25682 (Jeremy's PR)
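To make the failure mode concrete, here is a minimal sketch of the general pattern, written against the standard AbortController and AbortSignal globals. It is not taken from the SDK or from the fix PR; the helper name linkToParent, the LinkedSignal shape, and the activeParentListeners counter are all hypothetical and exist only for illustration. Every child signal registers a listener on the parent, and unless that listener is removed, the long-lived parent keeps a reference to every child ever created.

```typescript
// Hypothetical shape for illustration: a child signal plus a way to detach it.
interface LinkedSignal {
  signal: AbortSignal;
  dispose: () => void;
}

// Bookkeeping for this sketch only: how many listeners the parent still holds.
let activeParentListeners = 0;

// Hypothetical helper: creates a child signal that aborts when `parent` aborts.
function linkToParent(parent: AbortSignal): LinkedSignal {
  const child = new AbortController();
  if (parent.aborted) {
    child.abort();
    return { signal: child.signal, dispose: () => {} };
  }
  // The parent's listener list now references this closure, and through it the child.
  const onAbort = (): void => child.abort();
  parent.addEventListener("abort", onAbort);
  activeParentListeners++;

  let disposed = false;
  return {
    signal: child.signal,
    dispose: () => {
      if (disposed) return;
      disposed = true;
      // Detaching lets the child be garbage collected once the operation is done.
      parent.removeEventListener("abort", onAbort);
      activeParentListeners--;
    },
  };
}

const longLivedParent = new AbortController();

// Leaky usage: one child per operation, never detached from the parent.
for (let i = 0; i < 1000; i++) {
  linkToParent(longLivedParent.signal);
}
const leakedListeners = activeParentListeners;

// Safer usage: detach in `finally` once the operation has completed.
for (let i = 0; i < 1000; i++) {
  const linked = linkToParent(longLivedParent.signal);
  try {
    // ... run the operation with linked.signal ...
  } finally {
    linked.dispose();
  }
}
const listenersAfterCleanup = activeParentListeners;
```

In the first loop the listener count grows with every operation; in the second it stays flat, which is the behavior you want from anything tied to a signal that lives as long as the client.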

Heap Snapshots

The heap snapshots hinted that the issue had to do with AbortSignals.

You can generate heap snapshots with the "heapdump" npm package, as simply as shown below, and then load them in Chrome DevTools.

import heapdump from "heapdump";
// Writes a V8 heap snapshot of the current process to the given file.
heapdump.writeSnapshot("dump.heapsnapshot");
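As an alternative that needs no extra dependency, Node.js ships a built-in v8.writeHeapSnapshot() function. The sketch below is an assumption about how one might use it in a test, not what our stress tests actually do; the helper names snapshotFileName and captureSnapshot are hypothetical. It gives each snapshot a labeled, timestamped file name so that a "before" and an "after" snapshot are easy to tell apart.

```typescript
import { writeHeapSnapshot } from "node:v8";

// Hypothetical helper: builds a file name such as "before-<timestamp>.heapsnapshot".
// Colons and dots are replaced so the name is valid on every file system.
function snapshotFileName(label: string, at: Date): string {
  const stamp = at.toISOString().replace(/[:.]/g, "-");
  return `${label}-${stamp}.heapsnapshot`;
}

// Hypothetical helper: writes a heap snapshot and returns the file name used.
// Writing a snapshot is synchronous and can pause the process for a while on large heaps.
function captureSnapshot(label: string): string {
  return writeHeapSnapshot(snapshotFileName(label, new Date()));
}

// Example usage (left commented out so that loading this file has no side effects):
// const before = captureSnapshot("before");
// ... run the workload under test ...
// const after = captureSnapshot("after");
```

Taking one snapshot before the workload and one after it gives you the two files needed for a comparison.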

Loading snapshot

image

Comparison view

image

image

image

image

  • Alloc. Size: how much memory was allocated for new objects between the two snapshots.

  • Freed Size: how much memory was freed by objects deleted between the two snapshots.

  • Size Delta: the net change in memory used between the two snapshots (Alloc. Size minus Freed Size).
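The relationship between these three columns can be sketched in a few lines. The row shape, the function names, and the numbers below are all made up for illustration; they are not values from our snapshots. The idea is simply that the rows with the largest positive size delta are the first candidates to inspect when hunting a leak.

```typescript
// Hypothetical row shape mirroring the comparison view columns.
interface ComparisonRow {
  constructorName: string;
  allocSizeBytes: number;
  freedSizeBytes: number;
}

// Size delta is the memory allocated minus the memory freed between two snapshots.
function sizeDeltaBytes(row: ComparisonRow): number {
  return row.allocSizeBytes - row.freedSizeBytes;
}

// Sorts a copy of the rows so that the largest net growth comes first.
function rankByGrowth(rows: ComparisonRow[]): ComparisonRow[] {
  return [...rows].sort((a, b) => sizeDeltaBytes(b) - sizeDeltaBytes(a));
}

// Made-up example data.
const exampleRows: ComparisonRow[] = [
  { constructorName: "Array", allocSizeBytes: 3000, freedSizeBytes: 3000 },
  { constructorName: "LeakyThing", allocSizeBytes: 5000, freedSizeBytes: 200 },
  { constructorName: "string", allocSizeBytes: 12000, freedSizeBytes: 11900 },
];

const ranked = rankByGrowth(exampleRows);
```

Note that the constructor with the most allocations ("string" in this made-up data) is not the top suspect, because almost all of that memory was freed again.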

Summary view

image

  • Shallow size is the amount of memory held by an object itself (generally, arrays and strings have larger shallow sizes).

  • Retained size is the amount of memory that would be freed if the object, along with everything reachable only through it, were deleted.

Version 5.9.0 is the problematic one; the fix is coming in 5.10.0.

Upon fixing... checkpoint store test

  • Blue - 5.9.0
  • Red - 5.10.0
  • White - 5.8.0

image

Fix being: Azure/azure-sdk-for-js#25682

Going forward

How will we keep issues like this from being published?

  • Run perf and stress tests before every release and on every other significant PR, at least for the AMQP SDKs, to ensure there are no performance hits or memory leaks.
  • Add memory metrics to the perf results, alongside ops/sec, along with an option to capture heap dumps
    • Add the same optional settings to the stress tests as well
  • Study the heap snapshots and CPU profiles to find any more low-hanging fruit
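As a rough idea of what reporting memory next to ops/sec could look like, here is a minimal sketch built on Node's process.memoryUsage() and process.hrtime.bigint(). The PerfResult shape and the measure function are hypothetical and are not part of our perf framework; a real implementation would also need to handle async operations and warm-up runs.

```typescript
// Hypothetical result shape: throughput plus a few memory readings.
interface PerfResult {
  opsPerSec: number;
  heapUsedStartBytes: number;
  heapUsedEndBytes: number;
  rssEndBytes: number;
}

// Hypothetical helper: runs a synchronous operation `iterations` times and
// records throughput together with heap usage before and after the run.
function measure(operation: () => void, iterations: number): PerfResult {
  const heapUsedStartBytes = process.memoryUsage().heapUsed;
  const startNs = process.hrtime.bigint();

  for (let i = 0; i < iterations; i++) {
    operation();
  }

  // Guard against a zero elapsed time so the division below stays finite.
  const elapsedNs = Math.max(Number(process.hrtime.bigint() - startNs), 1);
  const memoryAtEnd = process.memoryUsage();

  return {
    opsPerSec: iterations / (elapsedNs / 1e9),
    heapUsedStartBytes,
    heapUsedEndBytes: memoryAtEnd.heapUsed,
    rssEndBytes: memoryAtEnd.rss,
  };
}

// Example usage with a trivial stand-in operation.
const sampleResult = measure(() => {
  JSON.stringify({ hello: "world" });
}, 1000);
```

A single before/after reading is noisy because garbage collection can run at any time, so for leak detection it is more useful to record these numbers periodically over a long run and look at the trend.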