andreyvit/MemoryLeakReport.md

## MemoryLeakReport.md

      
Memory Leak Report

Many web apps leak memory on IE11 despite running fine on other browsers. We've identified two main causes of this.
The first kind of leak is simple, entirely predictable and caused by a quirk in how IE's JavaScript engine handles closures. Let's call it an “undead closure leak”. This leak goes away if you properly clean up all references or reload the page, and thus affects only AJAXy parts of the apps.
The second kind of leak is a complex permanent one, leaking the entire page context even if you reload the page. Let's call it a “GC singularity leak”. It's triggered by certain JavaScript code patterns, seemingly because IE's garbage collector never finishes its job (our theory is that it exhibits exponential complexity and encounters some sort of timeout/threshold).
This might not be an exhaustive list; internet posts point to other issues, notably when handling IFRAMEs, but we haven't encountered any of these.
The Undead Closures

Here's some innocent-looking code:
(function () {
  var foo = someVeryLargeObject;       // 1
  function unused() {
    alert(foo);                        // 2
  }
  Math.someGlobalFunc = function () {  // 3
    return 0;
  };
})();
Executing this fragment leaks someVeryLargeObject because functions unused and Math.someGlobalFunc share the same closure in IE's JavaScript engine. Within the same scope, all functions share a single closure object; anything that they reference, combined, is kept alive as long as any of the functions are alive.
The leaked memory goes away on page reload. In fact, this isn't really a leak as far as IE is concerned — the reference truly is alive.
Avoiding Closure Sharing Issues

This pattern occurs very frequently, and needs to be understood and avoided by developers. There are 3 key ingredients to triggering the leak:

A local variable that references a large object, e.g. an IFRAME constructed for a pop-up.
A function that references the local variable. Referencing a global variable like window or document  is not problematic.
A long-lived function within the same local scope. It can be an event listener that is never removed (e.g. window.addEventListener("scroll", ...)), or a repeating timer, or a function assigned to a global variable (e.g. window.$ = ...), or a method added to a built-in class (Math.easeInOutSine = ...).
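One way to remove the third ingredient is to create the long-lived function in a scope that holds no large locals, and to clear the large local once it is no longer needed. The sketch below only illustrates that idea; it is not a measured fix. `makeLargeObject`, `registry` and `makeConstantFunc` are hypothetical names, with `registry` standing in for a long-lived global such as Math or window.

```javascript
// Hypothetical stand-in for a large object, e.g. an IFRAME built for a pop-up.
function makeLargeObject() {
  return { payload: new Array(100001).join("x") };
}

// Hypothetical stand-in for a long-lived global such as Math or window.
var registry = {};

// Defined at the top level, so the function it returns has no large
// local variables anywhere in its scope chain.
function makeConstantFunc() {
  return function () {
    return 0;
  };
}

(function () {
  var foo = makeLargeObject();

  function useFoo() {
    return foo.payload.length; // references the large local
  }
  registry.lastLength = useFoo();

  // The long-lived function comes from another scope, so it does not
  // share a closure with useFoo and cannot keep foo alive.
  registry.someGlobalFunc = makeConstantFunc();

  // Additionally drop the reference once the large object is not needed.
  foo = null;
})();
```

The same shape applies to event listeners and timers: build the callback in a small helper function that receives only what it needs as arguments.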

The GC Singularity

Truly problematic code is created by combining React with the undead closures pattern, triggering abnormal behavior of IE's garbage collector. In this case, the leaked memory accumulates across page reloads, and the behavior is semi-random: depending on the specific code, it may leak always, never, almost always or almost never, or it may leak always but with varying amounts of memory.
Sometimes IE tries to combat this accumulation of memory by restarting the entire process. This may happen within 700–1700 MB of memory usage, and helps with the leak. Unfortunately, this workaround is rarely triggered; far more often, IE simply hangs or crashes while navigating.
We've produced a small-ish reproduction test case using React; unfortunately, it does not really tell much, and we weren't able to reproduce the problem without React.
Note that the leak requires a certain amount of garbage to be produced; in simpler cases, the GC is able to finish and collect everything.
We found that React versions 16.5 and later don't leak, versions 16.0 through 16.4 leak, and 15.x versions (pre-fiber) leak less, but still leak. We've bisected the difference down to React commit 97af3e, a part of the fibers transition, which fixes the leak, albeit entirely by accident and without any clear or even hypothetical reason why.
That's good news, of course, but any new React commit may restore the leak via another accident. Which is why we recommend setting up a continuous integration system to monitor the difference in memory usage of IE11 for all production changes.
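As a rough sketch of what such a CI check could compare, assume each build is exercised several times and the final IE memory usage of each run is recorded in MB. The function names and the tolerance value are hypothetical; the median is used here only because individual runs can be noisy.

```javascript
// Median of a list of numbers (the input array is not modified).
function median(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// baselineRuns / candidateRuns: final memory usage in MB, one number per run.
// toleranceMb: hypothetical threshold; tune it to the noise you observe.
function checkMemoryRegression(baselineRuns, candidateRuns, toleranceMb) {
  var baseline = median(baselineRuns);
  var candidate = median(candidateRuns);
  var deltaMb = candidate - baseline;
  return {
    baseline: baseline,
    candidate: candidate,
    deltaMb: deltaMb,
    regressed: deltaMb > toleranceMb
  };
}
```

For example, `checkMemoryRegression([210, 220, 215], [700, 690, 1200], 100)` reports a regression, because the candidate median (700) exceeds the baseline median (215) by more than the tolerance.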
Fixing the Problem

Many libraries we use exhibit the undead closures leak, but typically the leaked amount is minuscule. The real problem is the libraries that leak user-provided objects or DOM objects on each invocation, and especially the ones that leak the document object (because that seems to be a key ingredient in triggering the singularity).
React

React should be upgraded to 16.5.0+; preferably, to 16.6.3. Both have been confirmed to resolve the leak on our test cases. If we're lucky, this alone might save us from the singularity.
jQuery

Adding jQuery significantly increases the leak once the singularity is triggered, although jQuery is harmless on its own. All versions seem to behave similarly. Of course, we can't avoid jQuery, and it does not seem to be problematic by itself, so no action here.
Highcharts

Highcharts contains one of the undead closure patterns that leaks the entire document. Normally, this wouldn't be a problem (leaking the document is harmless, it is alive until page reload anyway), but, together with React, it is a key component in triggering the GC singularity.
Note that the functionality of Highcharts itself is irrelevant here; the problem lies in the 9-line undead pattern demoed above. Sadly, this also means that upgrading to the latest version of Highcharts doesn't change anything.
We don't strictly need to act here, except maybe transitioning to a simple global non-UMD packaging of Highcharts.
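For illustration, the shape in question looks roughly like the sketch below. This is not Highcharts source code: the wrapper is a simplified UMD-style stand-in, `unrelated` is a hypothetical name, and the easing formula is just a conventional sine ease-in-out so that the function does something. The fallback object lets the snippet run outside a browser.

```javascript
(function (root, factory) {
  // Simplified UMD-style wrapper: the module body receives a large global.
  factory(root);
})(typeof window !== "undefined" ? window : { document: {} }, function (win) {
  var doc = win.document;                // 1: local holding the document

  function unrelated() {                 // 2: references the local
    return doc;
  }

  Math.easeInOutSine = function (pos) {  // 3: long-lived function, same scope
    return -0.5 * (Math.cos(Math.PI * pos) - 1);
  };
});
```

Because `Math.easeInOutSine` lives as long as the page does, the closure it shares with `unrelated` keeps `doc` reachable for just as long.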
New Relic

New Relic's JavaScript collector wraps all native methods that accept a function (think addEventListener, setTimeout, fetch etc), significantly contributing to the singularity and making it much easier to trigger.
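To see why this kind of wrapping multiplies closures, consider a generic instrumentation wrapper. This is a sketch of the general technique only, not New Relic's actual code; `wrapCallbackMethod`, `wrapCallback` and `fakeHost` are hypothetical names.

```javascript
// Wraps a callback so that onCall runs before it. Each wrapped callback is a
// new function whose closure keeps the original callback (and everything the
// original references) alive.
function wrapCallback(callback, name, onCall) {
  return function () {
    onCall(name);
    return callback.apply(this, arguments);
  };
}

// Replaces target[name] with a version that wraps every function argument.
function wrapCallbackMethod(target, name, onCall) {
  var original = target[name];
  target[name] = function () {
    var args = Array.prototype.slice.call(arguments);
    for (var i = 0; i < args.length; i++) {
      if (typeof args[i] === "function") {
        args[i] = wrapCallback(args[i], name, onCall);
      }
    }
    return original.apply(this, args);
  };
}

// Usage with a stand-in object instead of a real browser API.
var fakeHost = { run: function (fn) { return fn(); } };
var calls = [];
wrapCallbackMethod(fakeHost, "run", function (name) { calls.push(name); });
var result = fakeHost.run(function () { return 42; });
```

Every call through the wrapped method allocates one extra function and one extra scope per callback, on top of whatever the callback itself captures.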
Fancybox

This library used undead closures to leak every popup's DOM. We should make sure we never use it again.
Problematic Code

Note that the same pattern — a sharing of closures — seems to be contributing to both issues. It is not enough to trigger the singularity alone, by far, but once the singularity is triggered, the extra leaks increase the amount of memory leaked per page.
A few specific patterns make the singularity so much easier to trigger:

UMD, Webpack and Babel. Every module gets wrapped into an anonymous function that accepts document, window and/or other large global objects as argument, typically immediately followed by an assignment to a local variable. This alone is a huge chunk of the shared closures pattern, you just need to bring your own functions inside. We've trimmed the Highcharts library down to exactly this pattern — a function assigned to Math.easeInOutSine plus an unrelated function that references doc (which holds the document), all nested inside a UMD wrapper.
React refs and event handler functions. If created within a render method, these produce a new undead closure on every render. We should educate developers to use the new React.createRef API and to bind event listeners in constructors, which is a best practice anyway. Also, many of our components assign refs ‘just in case’, without actually using them; we should avoid that.
We have multiple places that set up global event listeners or create never-ending timers. These can be refactored to avoid sharing a large closure. But, better yet, these need to be reviewed, because they seem to be entirely avoidable in many cases.
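The advice about event handlers can be sketched without React itself. In the plain-JavaScript class below (a hypothetical `SearchBox`), the handler is bound once in the constructor, so every render hands out the same function instead of creating a new closure. In a real component the class would extend `React.Component` and `render` would return JSX; the object returned here is only a stand-in.

```javascript
class SearchBox {
  constructor() {
    this.value = "";
    // Bound once per instance; render() reuses this function every time.
    this.handleChange = this.handleChange.bind(this);
  }

  handleChange(event) {
    this.value = event.target.value;
  }

  render() {
    // Avoid creating the handler here, e.g. function (e) { ... } inline:
    // that would allocate a new closure on every render.
    return { type: "input", props: { onChange: this.handleChange } };
  }
}
```

Calling `render()` twice on the same instance yields the same `onChange` function, which is the property we want.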

Methodology (for the Brave and Curious)

Things to Know Before You Begin


Opening IE Dev Tools contaminates the entire browser, causing a completely different memory behavior. Closing Dev Tools does not fix that. Reboot the machine afterwards, and knock on wood three times.
Generally, you should reboot the machine between tests. Killing all instances of iexplore.exe using Process Manager helps most of the time, but sometimes causes weird inter-test interference of unknown origin. We've learned the hard way. Please do reboot. (And note that simply closing the IE window is definitely not enough, because after you trigger the singularity, the process tends to stay behind for a long time.)
You should clear the browser cache between tests. Not doing that definitely causes inter-test interference.
You should run every test at least twice, important tests 5 times, and key tests 10–30 times. Note that for 95% of the tests, the results are completely consistent across runs; don't let that discourage you from discovering those 5% where you'll be amazed what happens next.
Reloading a page is not the same as navigating between two pages. Reloading a single page causes more leaks, and is not representative of the user behavior, so you should navigate instead. Avoid using the Back button; that triggers something completely different as well. Just make two identical pages (or alias the same page under different URLs), and navigate between the two using bookmarks or links.
You need to do about 100 navigations per test to be sure. A singularity-style memory leak does not cause memory to increase steadily. For a number of initial page loads, GC will mostly keep up, and memory usage will grow in a few big, reasonable-looking jumps as garbage accumulates. The real steady page-to-page growth may only come later. The magic threshold seems to be around 700 MB, although it varies a lot. You may be stuck around that number for a while.
We've been testing with 2 GB RAM VMs. We've tried 4 GB RAM; that seemingly makes the leaks take slightly longer to reproduce.
We found that an iMac Pro can handle 8-10 parallel testing VMs with smaller tests, and 4-6 VMs with larger tests.
Just as a bonus fact, pressing F5 twice with a steady hand and enlightened heart (not too quickly, not too slowly) often causes the garbage to be properly collected and memory usage to drop down to almost zero, even when all hope seemed lost.
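Because memory grows in jumps at first and only later settles into steady page-to-page growth, it helps to look at the trend over the most recent navigations rather than at the total. A minimal sketch, assuming memory usage in MB is recorded after every navigation; the function name and the default window of 30 are hypothetical:

```javascript
// samplesMb: memory usage in MB recorded after each navigation.
// windowSize: how many of the most recent navigations to consider.
// Returns the least-squares slope, in MB per navigation, over that window.
function trailingGrowthPerPage(samplesMb, windowSize) {
  var recent = samplesMb.slice(-(windowSize || 30));
  var n = recent.length;
  if (n < 2) {
    return 0;
  }
  var meanX = (n - 1) / 2;
  var meanY = recent.reduce(function (sum, y) { return sum + y; }, 0) / n;
  var numerator = 0;
  var denominator = 0;
  for (var i = 0; i < n; i++) {
    numerator += (i - meanX) * (recent[i] - meanY);
    denominator += (i - meanX) * (i - meanX);
  }
  return numerator / denominator;
}
```

A slope near zero over the last few dozen navigations suggests the GC is keeping up or that memory is sitting on a plateau; a consistently positive slope suggests per-page growth.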

IE Tester

We've been stepping up our automation over time, until we figured out that nothing short of a fully-automated scriptable parallel testing system driving 10 VMs has any chance of figuring this stuff out.
Here's a quick summary of the journey:

manual testing is fun, just press F5 and watch memory usage grow (or not); those were the days, before we discovered all the complications mentioned in the previous section;
Selenium + WebDriver work, but affect IE's memory behavior, which is unsurprising given that they traverse live DOM via COM;
screenshot-based Sikuli automation is cleaner, but is very flaky and a huge pain to run;
then we wrote our own driver in Go, which combines the approach of Sikuli (screenshot matching and mouse clicking) with Win32 API-based process monitoring (killing the IE on out-of-memory or when anything looks weird), COM-based navigation and dynamic scripting.

Meanwhile, we also needed to modify pages as they got served to experiment with various changes:

Charles Proxy served us well for a long while, using its broad abilities to modify and override requests;
then we made an HTTP proxy server in Go to handle more complex modifications, and configured Charles to use it as a downstream proxy;
then we extended the Go proxy server to handle everything that Charles did;
then we introduced named rules and profiles, so that each test's proxy configuration would be immutable and reproducible;
then we merged the testing tool and the proxy, so that the tests would specify their proxy settings inline.

Finally, we've split the tester into a controller (runs on a main machine and schedules work) and an agent  (runs on each VM, accepts work from the controller, reports progress and results back), so that we could run a farm of 10 VMs instead of just one.
We went from Sikuli tests requiring constant attention and manual restarts to never having to look at the VMs and a fully automated system that happily runs overnight. (Note: rare bugs still cause things to hang or crash occasionally, but checking the status every few hours is generally enough.)
The resulting IE tester project is a single pure-Go binary that can be easily cross-compiled for Windows. It is driven by a custom DSL program, which is reloaded by all workers before each test. The program defines the scenarios (pages, navigation, delays, screenshots to compare) and the proxy profiles and rules. You can modify the program without recompiling the workers.
The controller uses a single human-readable and human-modifiable (and Emoji-rich) queue.txt file as its database. It checks and updates it every 3 seconds as long as any workers are connected. The human is expected to add/edit stuff in this file, and to monitor the results. (The human may occasionally overwrite the machine's changes, and vice versa. That's a small inconvenience that can be fixed by editing via VS Code, which is smart enough to avoid overwriting external changes and to avoid losing the unsaved changes.)
Investigating the Problem

We followed a few lines of inquiry:

cutting HTML and JavaScript down until it stops leaking (this was our tool of choice at the end)
tinkering with stuff to see what affects the leak (trying different library combinations, HTML tags, versions, JavaScript modes, HTTP headers — this was our second tool of choice)
tinkering with stuff until the leak gets worse (increasing the amount of data rendered, adding extra unused code or data — we've applied this at key steps to make diagnosis faster and to replace complicated chunks of code with smaller simpler chunks doing more iterations)
adding stuff to make the leak go away (we've tried various approaches to automatic cleanup and to un-override things that New Relic overrides; this largely failed to produce useful results, but uncovered a number of spectacular IE crashes)
reproducing the culprits in a new file, trying to build up a combination that leaks independently of the main site (we've failed spectacularly with this approach, in more than one way — initial unreliable testing contributed to a lot of time wasted here)

In the end, methodically bisecting pieces of HTML and JavaScript proved to be the most useful approach. It is very labour-intensive, though; it helps to run multiple VMs, and in some cases we've allowed ourselves to make bisection choices after 30 pages instead of 100, after carefully observing the behavior of the leak we're interested in.
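The cutting-down process can be thought of as the loop sketched below: try removing blocks of the page, keep a removal whenever the leak survives it, then repeat with smaller blocks. This is a simplified illustration rather than the exact procedure we followed. `reduceChunks` and `stillLeaks` are hypothetical names, and in practice `stillLeaks` stands for a full test run of many navigations, not a cheap function call.

```javascript
// chunks: pieces of HTML/JavaScript, in page order.
// stillLeaks(subset): returns true if a page built from subset still leaks.
// Returns a smaller list of chunks that still leaks.
function reduceChunks(chunks, stillLeaks) {
  var current = chunks.slice();
  var size = Math.max(1, Math.floor(current.length / 2));
  for (; size >= 1; size = Math.floor(size / 2)) {
    var start = 0;
    while (start < current.length) {
      // Candidate page with the block [start, start + size) removed.
      var candidate = current.slice(0, start).concat(current.slice(start + size));
      if (stillLeaks(candidate)) {
        current = candidate; // block was not needed for the leak
      } else {
        start += size;       // block is needed; keep it and move on
      }
    }
  }
  return current;
}
```

With block sizes halving on each pass, large irrelevant sections disappear early, which matters when every evaluation of `stillLeaks` costs a reboot and a long run.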
We've learned some useful things along the way:

minified and unminified versions of a library sometimes behave differently, but most of the time the behavior is, surprisingly, the same;
many de-minifiers produce invalid code, only the best ones are suitable to de-minify a large bundle for IE;
packaging matters, as mentioned above, so simply replacing a Webpack bundle with the libraries it contains sometimes produces different results;
production and development versions of React are like two different libraries;
Safari dev tools have a useful option to gray out non-executed code; this can be used to quickly nail down the less interesting parts of the code.

We've stripped the live site down to the part that renders the header. With a development account, the header contains a list of all companies and projects, which is enough to cause a GC singularity. The minimal reproduction test case above is the result of analyzing this header. This isn't the only combination that triggers the leak, but it's the most actionable and useful one that we found.
Caching

We've experimented with disabling browser cache via Cache-Control and Expires headers. It had a huge effect on many tests; with caching disabled, we were sometimes able to come up with a set of page modifications (notably, disabling New Relic and WalkMe scripts, and modifying some of the dependencies) that completely stopped the leak.
With caching enabled, we were unable to produce a simple set of changes that resolves the leak on the entire live pages of our site. However, in our final round of tests with stripped-down examples, caching had no effect.
We've tried to separately disable caching for JavaScript files and for the main HTML file. Both had an effect on the leak, and both had to be disabled to avoid the leak in the cases mentioned above.
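A proxy rule for these experiments could look like the sketch below. The header values are an assumption (a commonly used "do not cache" combination), not necessarily the exact values we served, and `headersFor` with its `options.html` / `options.js` switches is a hypothetical helper mirroring the separate HTML and JavaScript experiments.

```javascript
// Assumed header values: a common "do not cache" combination.
var NO_CACHE_HEADERS = {
  "Cache-Control": "no-cache, no-store, must-revalidate",
  "Expires": "0"
};

// Returns the extra response headers for a given Content-Type.
// options.html: disable caching for the main HTML file.
// options.js:   disable caching for JavaScript files.
function headersFor(contentType, options) {
  var type = String(contentType || "").toLowerCase();
  var isHtml = type.indexOf("text/html") === 0;
  var isJs = type.indexOf("javascript") !== -1;
  if ((isHtml && options.html) || (isJs && options.js)) {
    return Object.assign({}, NO_CACHE_HEADERS);
  }
  return {};
}
```

Toggling the two options independently reproduces the four combinations described above: caching on for both, off for HTML only, off for JavaScript only, and off for both.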
Browser Settings

Internet Explorer has a number of settings that could potentially affect the problem:


We've tried private mode manually; it did not seem to have any effect, although we didn't use automated tests to verify that.


We've tried disabling IE protected mode; it did not seem to have an effect, but similarly we only performed manual tests.


We've tried disabling GPU acceleration. It seems to change the amount of memory leaked in the first kind of leak; perhaps some elements may be holding on to GPU textures. It does not have a big effect on the outcome, though. A caveat is that we did these experiments early on, before learning of all the complications.


We've tried switching the IE process to 64-bit while running the Sikuli stage of our scripting pipeline. This allows the browser to live a bit longer, because it lifts a hard address space limit of 2 GB. Does not seem to affect the leak itself, though.


Additional Tools

While our main tools are a text editor and IETester, before we go down that long, long route, it makes sense to get better acquainted with the leaked memory and GC behavior overall.
WinDbg

Windows has a family of powerful (and free) debugging tools, dating back decades. These include WinDbg, NTSD and others, currently available as the “Debugging Tools for Windows” download. (NTSD used to even ship with Windows.) All of them run using the same debugging engine internally; WinDbg provides a GUI on top of it, while CDB and NTSD are console apps, and KD is a kernel debugger.
Unlike traditional debuggers like the one in Visual Studio, these debuggers are extremely powerful, but are also driven by keyboard commands (kinda like gdb), not mouse.
The reason we're interested in these is that you can meaningfully debug Internet Explorer using these tools. Here are the steps (NOTE: all commands here link to their reference documentation):


Install the tools and launch WinDbg. Note: you want the X86 version.


Configure access to the Microsoft Symbol Server so that you get actual function names inside Windows libraries and even in the Internet Explorer binary. Create a directory like C:\Symbols, and then use System control panel to set _NT_SYMBOL_PATH environment variable to srv*c:\symbols*https://msdl.microsoft.com/download/symbols under User variables.


Reproduce the case you're interested in.
When you want to look at a leak, wait until IE leaks over 1.5 GB of memory — then pretty much any random memory location will contain the leaked data.
When investigating a case of IE not displaying anything when approaching its memory limit, find a reproducible case that hangs IE for a few minutes.


Attach WinDbg to IE process (F6).


Make a minidump with full memory contents: .dump /ma c:\dev\ie.dmp (if /ma fails due to reading errors, try /mA instead) — of course, c:\dev\ needs to already exist


Detach from IE: .detach


Open the minidump via File → Open Crash Dump.


Explore the dump.
First look at the call stack. Give it a couple of hours to download all symbols the first time you do this.
Then investigate the memory. The dump contains a copy of the entire IE memory space. If the leak is big enough, a random memory location should be inside the leaked data. Feel free to explore the contents of memory.
Use !address to obtain the memory layout and learn which memory regions to look at. There's going to be a LOT of memory regions; you need something like this script to summarize the list.
Be sure to scroll around enough to get a good general idea of what sort of data gets leaked.


Memory Leak Dump

So, you've looked at the memory itself, but why and when does it get leaked? The user-mode dump heap (UMDH) utility will help here.
You need to follow the UMDH guide, but in a few words:


Be sure to configure _NT_SYMBOL_PATH as described in the previous section. We need the function names!


Use GFlags tool to enable “Create user mode stack trace database” on iexplore.exe (press TAB after typing the image name; image is a synonym for file).


Prepare Internet Explorer in the initial, pre-leak state.


Run umdh -p:1234 -f:ie1.txt, where 1234 is the PID (Process ID) of iexplore.exe process (the leaf one).


Reproduce the leak.


Run umdh -p:1234 -f:ie2.txt.


Find the difference between the two files by running umdh -d ie1.txt ie2.txt >ie.txt. This will take a long time (can be hours), and will download the symbols if WinDbg hasn't already downloaded them.


Review ie.txt file; see UMDH manual on interpreting the results.


Very important — use GFlags to disable heap recording on iexplore.exe, otherwise your future tests will be useless (yes, happened to us).


This gives you a list of leaked heap allocations, grouped by call stack (and with a readable meaningful call stack for each one).
Here's an example:
+ 2672800 ( 2672800 -      0)  12850 allocs	BackTrace1FE937A0
+   12850 (  12850 -      0)	BackTrace1FE937A0	allocations

    ntdll!RtlpCallInterceptRoutine+26
    ntdll!RtlAllocateHeap+45C82
    msvcrt!malloc+90
    jscript9!HeapAllocator::NoThrowAllocZero+F
    jscript9!SmallNormalHeapBlock::New+3A
    jscript9!HeapBucketT<SmallNormalHeapBlock>::CreateHeapBlock+47
    jscript9!HeapBucketT<SmallNormalHeapBlock>::SnailAlloc+53
    jscript9!Recycler::AllocZero+172961
    jscript9!JsUtil::BaseDictionary<Js::PropertyRecord const *,Js::SimpleDictionaryPropertyDescriptor<unsigned short>,RecyclerNonLeafAllocator,DictionarySizePolicy<PowerOf2Policy,1,2,1,4>,Js::PropertyRecordStringHashComparer,Js::PropertyMapKeyTraits<Js::PropertyRecord const *>::Entry>::Initialize+204
    jscript9!JsUtil::BaseDictionary<Js::PropertyRecord const *,Js::SimpleDictionaryPropertyDescriptor<unsigned short>,RecyclerNonLeafAllocator,DictionarySizePolicy<PowerOf2Policy,1,2,1,4>,Js::PropertyRecordStringHashComparer,Js::PropertyMapKeyTraits<Js::PropertyRecord const *>::Entry>::BaseDictionary<Js::PropertyRecord const *,Js::SimpleDictionaryPropertyDescriptor<unsigned short>,RecyclerNonLeafAllocator,DictionarySizePolicy<PowerOf2Policy,1,2,1,4>,Js::PropertyRecordStringHashComparer,Js::PropertyMapKeyTraits<Js::Prop+32
    jscript9!Js::SimpleDictionaryTypeHandlerBase<unsigned short,Js::PropertyRecord const *,0>::SimpleDictionaryTypeHandlerBase<unsigned short,Js::PropertyRecord const *,0>+A1
    jscript9!Js::SimpleDictionaryTypeHandlerBase<unsigned short,Js::PropertyRecord const *,0>::New+54
    jscript9!Js::PathTypeHandlerBase::CreateNewScopeObject+2F
    jscript9!Js::JavascriptOperators::LoadHeapArguments+59

This says that 2.5 MB has been leaked over 12850 allocations, and (looking at CreateNewScopeObject call) these all seem to be JavaScript scope objects.
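When there are many such entries, it is handy to pull the numbers out of the header line and look at the average allocation size. The sketch below parses only the first line of an entry, in the format shown in the example above; `parseUmdhEntry` is a hypothetical helper, and other UMDH output variants are not handled.

```javascript
// Parses the first line of a UMDH diff entry, for example:
// "+ 2672800 ( 2672800 -      0)  12850 allocs\tBackTrace1FE937A0"
// Returns null for lines that do not match this shape.
function parseUmdhEntry(line) {
  var pattern = /^([+-])\s*(\d+)\s*\(\s*(\d+)\s*-\s*(\d+)\s*\)\s*(\d+)\s+allocs\s+(\S+)/;
  var m = pattern.exec(line);
  if (!m) {
    return null;
  }
  var bytes = Number(m[2]);
  var allocs = Number(m[5]);
  return {
    sign: m[1],
    bytes: bytes,
    allocs: allocs,
    backtrace: m[6],
    bytesPerAlloc: allocs ? bytes / allocs : 0,
    mib: bytes / (1024 * 1024)
  };
}

var entry = parseUmdhEntry("+ 2672800 ( 2672800 -      0)  12850 allocs\tBackTrace1FE937A0");
```

For the example entry this gives 2672800 / 12850 = 208 bytes per allocation, which fits small, uniform objects such as scope objects.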
Pay attention to the total size of memory leaks found using this tool. In our testing, this was a small percentage of the overall memory leaked. That means that most of the leaked memory wasn't allocated on the heap, which in turn means either memory was allocated in large page-sized chunks (and the heap was bypassed), or something funky is going on — data structures and code generated by a JIT compiler would be a perfect fit for this.
Even though we can't see the most significant leaks via this tool, we do get some very useful ideas. For example, we've got a lot of leaked scope objects. They are small, but they still are leaked, which is a useful thing to know; and they may point to the remaining large leaked objects.
Process Explorer

Okay, that was hardcore. Let's move on to something much simpler: a great way to monitor memory usage of a process.
Process Explorer (by Sysinternals, later purchased by Microsoft) allows you to view all sorts of process statistics, and includes memory, I/O and CPU graphs. Double-click a process to open extra details.

Setting Up the VMs

We've been testing using “IE11 on Windows 8.1” virtual machine from Microsoft, set up with 2 GB RAM and 1 CPU, at 1366x768 resolution.
You need to generate a self-signed root certificate. IE11 is picky, and some ways of generating a certificate that Safari and Chrome accept get rejected by IE. A recommended approach is to install Charles Proxy and export its root certificate (from the Help menu). Install the certificate as a trusted root on the Windows VM.
We've set up a folder (windows-env) shared by all VMs, and run ietester.exe in that folder using the following config in ietester.cfg:
# agent
-prg-dir prg
-manager-host 10.211.55.2
-rproxy-host 10.211.55.2

# manager
-queue-dir .

We then create a shortcut called “Run Tests” that executes ietester -g from the mapped folder, and add that shortcut to the Startup folder (Win-R shell:startup). This makes the tester run automatically after each reboot, which is crucial because the tester reboots the machine after each test.
Then we open IE and save the login password:

Navigate to app.procore.com.
Enter your credentials, do NOT check “Remember me”.
Submit and allow IE to save your password.

Then you can shut down this VM and clone it multiple times.
For each VM, set a unique system hostname (right-click Start → System → “Computer name, domain and workgroup settings” pane → “Change settings” → “Change...”). IMPORTANT: the first letter of each hostname must be unique. Use a list like “57 Selected Stars for Navigation”.
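If you pick hostnames from a longer list, a small helper can enforce the first-letter rule. This is a sketch with a hypothetical function name; the sample names in the usage note are a handful of star names used purely as an example.

```javascript
// Picks `count` hostnames from `candidates` so that no two share a first
// letter (case-insensitive). Throws if there are not enough distinct letters.
function pickHostnames(candidates, count) {
  var used = {};
  var picked = [];
  for (var i = 0; i < candidates.length && picked.length < count; i++) {
    var letter = candidates[i].charAt(0).toUpperCase();
    if (!used[letter]) {
      used[letter] = true;
      picked.push(candidates[i]);
    }
  }
  if (picked.length < count) {
    throw new Error("Not enough names with distinct first letters");
  }
  return picked;
}
```

For example, `pickHostnames(["Acamar", "Achernar", "Bellatrix", "Canopus", "Capella", "Deneb"], 4)` skips Achernar and Capella because their first letters are already taken.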