@Zhentar
Last active October 3, 2021 16:06
Unlocking the secrets of ETW: Stack Caching

ETW Stack Caching

"Stack Caching" (or Stack Compression as PerfView calls it) is a feature of ETW designed to reduce trace buffer & etl file sizes by de-duplicating stack traces. Naturally, as an ETW feature it is documented solely through obtuse (likely accidental) references and hints in Microsoft tooling. And so the documentation is left to stubborn reverse engineers dedicated ETW enthusiasts such as myself.

The Windows version studied for this was Windows 10 1809 64-bit. I do not think this feature has changed significantly since its introduction, but I have not verified that.

Basics

In trace buffers, the compressed stacks are emitted with the Stackwalk task guid, like regular stackwalks, but with distinct opcodes, for events that WPA labels as "Stack Walk: Delete Definition" and "Stack Walk: Reference [User]". "Reference" entries contain a 'StackKey' value that uniquely identifies a stack trace definition. "Stack Walk: Delete Definition" is logged when cached stacks are evicted; from the MOF definition and the name you might assume it is logged in pairs with "Stack Walk: Create Definition", but it does not appear that Create events are ever logged (probably because logging two events would spoil the space savings, and logging the definition only after its final usage is what works for circular buffers). There is also a "Stack Walk: Rundown Definition" event, which records all stacks remaining in the cache at the end of a trace.
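To make the bookkeeping concrete, here is a minimal sketch of how a consumer can stitch "Reference" events back to their stack definitions. It relies only on the semantics described above (a definition is logged at eviction or at rundown, i.e. after its last reference); the ReferenceEvent type and EmitResolvedStack sink are hypothetical stand-ins for whatever event-reading plumbing you use.

// Minimal sketch: resolving "Stack Walk: Reference" events against their
// definitions. Definitions arrive after the references that use them (at
// eviction or at end-of-trace rundown), so references are buffered per key.
// ReferenceEvent and EmitResolvedStack are hypothetical placeholders.
static readonly Dictionary<UInt64, List<ReferenceEvent>> pendingRefs = new Dictionary<UInt64, List<ReferenceEvent>>();

static void OnStackReference(ReferenceEvent e)          // "Stack Walk: Reference [Kernel/User]"
{
    if (!pendingRefs.TryGetValue(e.StackKey, out var refs))
        pendingRefs[e.StackKey] = refs = new List<ReferenceEvent>();
    refs.Add(e);
}

static void OnStackDefinition(UInt64 stackKey, UInt64[] frames)  // Delete/Rundown Definition
{
    // Everything buffered under this key belongs to this definition; the key
    // may be reused for a different stack after the Delete Definition event.
    if (pendingRefs.Remove(stackKey, out var refs))
        foreach (var r in refs)
            EmitResolvedStack(r, frames);               // hypothetical downstream handling
}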

Enabling Stack Caching

If you are using the .NET TraceEvent library, you can simply set the StackCompression property on your session (a sketch follows the native example below). Otherwise, there are two options for enabling it: when calling StartTrace, you can append the configuration to the end of the EVENT_TRACE_PROPERTIES struct (the basic approach I have described in another gist, used in combination with the constants/struct described by Geoff Chappell), or you can use NtSetSystemInformation like so:

enum SYSTEM_INFORMATION_CLASS 
{
    SystemPerformanceTraceInformation = 0x1f
}
enum EVENT_TRACE_INFORMATION_CLASS 
{
    EventTraceStackCachingInformation = 0x10
}

[DllImport("ntdll.dll")]
static extern UInt32 NtSetSystemInformation([In] SYSTEM_INFORMATION_CLASS InfoClass,[In] void* Info,[In] UInt32 Length);

struct EVENT_TRACE_STACK_CACHING_INFORMATION {
    public EVENT_TRACE_INFORMATION_CLASS EventTraceInformationClass;
    public UInt64 TraceHandle;    // the session handle returned by StartTrace
    public bool Enabled;          // BOOLEAN (1 byte) in the native struct
    public byte Reserved1;
    public byte Reserved2;
    public byte Reserved3;
    public UInt32 CacheSize;      // total entry memory, in bytes (see Tuning below)
    public UInt32 BucketCount;    // number of hash buckets (see Tuning below)
}

static unsafe uint EnableStackCaching(UInt64 traceHandle)
{
    var info = new EVENT_TRACE_STACK_CACHING_INFORMATION 
    {
        EventTraceInformationClass = EVENT_TRACE_INFORMATION_CLASS.EventTraceStackCachingInformation,
        TraceHandle = traceHandle,
        Enabled = true
        // CacheSize/BucketCount left at 0 - out-of-range values are clamped,
        // so this yields the 3 MB / 256-bucket defaults described below
    };
    return NtSetSystemInformation(SYSTEM_INFORMATION_CLASS.SystemPerformanceTraceInformation, &info, (UInt32)sizeof(EVENT_TRACE_STACK_CACHING_INFORMATION));
}
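For comparison, the TraceEvent route mentioned above is only a property set away. A minimal sketch follows; the session name is a placeholder, and I'm assuming a TraceEvent version whose TraceEventSession exposes StackCompression (I believe it needs to be set before the kernel provider is enabled):

// TraceEvent sketch: enable stack compression on a kernel session that
// collects sampled profile events with stacks. Session name is a placeholder.
using (var session = new TraceEventSession("StackCachingDemo"))
{
    session.StackCompression = true;    // enables ETW stack caching for this session
    session.EnableKernelProvider(
        KernelTraceEventParser.Keywords.Profile,     // CPU sampling events
        KernelTraceEventParser.Keywords.Profile);    // ...and capture stacks for them
    System.Threading.Thread.Sleep(10000);            // collect for a bit
}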

Tuning Stack Caching

We can see from the struct definition that there are two parameters we can adjust, CacheSize and BucketCount. But what do they mean? What values are appropriate? Well, there's no documentation anywhere to be found; for this you have to dig deep into kernel internals... or read what I'm about to share!

The stack cache structure is a hashmap with linked-list chaining. BucketCount is the number of buckets in this hashmap. The minimum and default value is 256 buckets, and the maximum is 4096. The power-of-2 limits might lead you to think that hash codes are mapped to buckets using bitmasks, but this is not the case - the mapping uses division. The simple hash algorithm used probably gets adequate entropy in the low bits, but it is likely better to pick a prime number rather than a power of 2 for your bucket count.
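As a toy illustration of that last point (the hash function itself is the kernel's and isn't shown; this is only the bucket-selection step):

// Toy illustration of bucket selection. Because the kernel maps a stack's
// hash to a bucket by division rather than masking, any bucket count in range
// works, and a prime count spreads hashes whose low bits collide.
static UInt32 BucketFor(UInt64 stackHash, UInt32 bucketCount)
{
    return (UInt32)(stackHash % bucketCount);   // division, not (stackHash & (bucketCount - 1))
}
// e.g. two hashes differing only above bit 8 land in the same bucket with
// bucketCount = 256, but usually in different buckets with bucketCount = 251.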

CacheSize is the total amount of memory allocated for entries within the hashmap. With a 64-bit kernel, the minimum and default size is 0x300000 bytes (3 MB), and the maximum is 0x3200000 bytes (50 MB). This is all allocated upfront, using non-paged memory, in 0x128-byte (296-byte) pieces.
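A quick back-of-the-envelope on what those sizes buy you, using the 0x128-byte piece size from above (illustrative arithmetic, not kernel code):

// How many cache pieces the default and maximum CacheSize values provide.
static void CachePieceCounts()
{
    const UInt32 pieceSize = 0x128;                  // 296 bytes per piece
    Console.WriteLine(0x300000 / pieceSize);         // default 3 MB  -> ~10,627 pieces
    Console.WriteLine(0x3200000 / pieceSize);        // maximum 50 MB -> ~177,124 pieces
}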

Both parameters are clamped rather than validated - that is, providing values outside of the defined ranges will simply select the nearest valid value rather than return an invalid parameter error.

The 0x128-byte pieces seem odd at first - how can 256-frame stack traces fit in there? That's barely 1 byte per frame! The answer is that each 0x128-byte piece holds only 32 frames (32 pointer-sized frames account for 0x100 of the 0x128 bytes; the remaining 0x28 are presumably links and bookkeeping); not only does the hashmap chain entries in linked lists, the entries themselves are also linked lists of stack frames! This allows the kernel to optimize the number of stack traces that can be kept in the cache without requiring any dynamic allocations. However...

Each bucket is capped at 4 stack traces. When there is a successful lookup, the stack is moved to the front of the bucket's linked list, keeping the list in MRU order. When there is a miss, a new stack is added to the front, and if that would exceed 4 entries the tail is evicted. If there aren't enough free pieces to hold the new stack, a regular stack trace event is logged instead. I haven't followed that logic in much depth, but it appears that undersizing your cache could lead to pathological cases where most of the cached traces are shallow stacks while the largest stacks don't get cached effectively. The cost of oversizing your cache, on the other hand, is just that a bit of non-paged memory hangs out in RAM for the duration of the trace at the tail end of the free list, never accessed.

Note that even with the maximum 4096 buckets, each filled with four 256-frame stacks, and 4096 cores concurrently each adding a new 256-frame stack to a different bucket, you would still only hit 46.25 MB, so the 50 MB cap is complete overkill. Likewise, the 3 MB default for 256 buckets is basically impossible to use up. My quick spitball estimate is that 4 MB per 1024 buckets should be more than sufficient for most applications.
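For the curious, the 46.25 MB figure falls out of the piece arithmetic like so (a re-derivation of the number above, not kernel code):

// Worst-case cache demand: 4096 buckets, each holding four 256-frame stacks,
// plus 4096 cores concurrently inserting one more 256-frame stack each.
static void WorstCaseCacheDemand()
{
    const UInt32 pieceSize = 0x128;                   // bytes per piece
    const UInt32 piecesPerStack = 256 / 32;           // 8 pieces for a full-depth stack
    const UInt64 cachedStacks = 4096UL * 4;           // buckets * stacks-per-bucket
    const UInt64 inFlightStacks = 4096;               // one new stack per core

    UInt64 bytes = (cachedStacks + inFlightStacks) * piecesPerStack * pieceSize;
    Console.WriteLine(bytes / (1024.0 * 1024.0));     // 46.25 (MB)
}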
