The counters that are the easiest to understand and the best for making ratios that are internally consistent (i.e., always fall in the range 0.0 to 1.0) are the mem_load_retired events, e.g., mem_load_retired.l1_hit and mem_load_retired.l1_miss.
These count at the instruction level, i.e., over the universe of retired instructions. For example, you could make a reasonable hit ratio from mem_load_retired.l1_hit / mem_inst_retired.all_loads, and it will be sane (never indicating a hit rate above 100%, for example).
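As a trivial sketch of that internal consistency (the counter values are invented; only the event names are real):

```cpp
#include <cstdint>

// Both events count in the same universe (retired load instructions), so
// the ratio is guaranteed to land in [0.0, 1.0] by construction.
double l1_hit_ratio(std::uint64_t mem_load_retired_l1_hit,
                    std::uint64_t mem_inst_retired_all_loads) {
    if (mem_inst_retired_all_loads == 0) return 0.0;
    return double(mem_load_retired_l1_hit)
         / double(mem_inst_retired_all_loads);
}
```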
That ratio isn't perfect though: it may not reflect the true cost of cache misses or the real behavior of the program, for at least the following reasons:
- It applies only to loads and can't catch misses imposed by stores (AFAICT there is no event that counts store misses).
- It only counts loads that retire - a lot of the load activity in your process may be due to loads on a speculative path that never retire. Loads on a speculative path may bring in data that is never used, causing misses and data movement that are not accounted for. More insidiously, loads on a speculative path may bring in data that you ultimately do end up using on the good path, but the event that gets counted comes from the good-path retired load, at which point the data may already be in the cache following the bad-path miss. Effectively, the speculative loads launder misses into hits: you take the miss on the bad path, but the counter increments later based only on what happens on the good path.
- It doesn't handle gathers well, because it counts at the instruction level, but a gather logically consists of multiple loads, up to 8 in AVX2 code (or 16 in AVX-512), and you'd probably like to know the result of each of them. The event only tells you whether any load missed or any load hit, so a single gather can be counted as both a hit and a miss, but you don't know how many of each.
- It's a bit awkward when it comes to prefetching: if prefetching is doing work, how the event counts is effectively a race between whether the prefetch arrives before the load does, or vice-versa. In practice, performance will be similar if the load and prefetched data arrive at around the same time (i.e., the prefetch hid nearly all the latency), but the event can swing wildly depending on small timing differences which change the "winner", without really changing the performance. Sometimes you can turn prefetching off to get the "expected" counts, but if this significantly changes the performance of your code the measurement is pretty useless.
Enter into these already muddy waters the l1d.replacement event, as the primary alternative. This one doesn't count at the instruction level at all, but rather events that occur at the L1D cache: it counts the number of lines replaced in the L1D. Its characteristics are more or less the mirror opposite of the mem_load_retired events:
- It counts all replacement events, whether they were triggered by loads, stores, prefetch or something else (perhaps a remote snoop hit).
- It counts at cache line granularity: so if N loads all miss to the same line it counts once (the mem_load_retired OTOH will count 1 miss and N-1 fb_hit events).
- It counts replacements triggered by speculative paths identically to non-speculative.
- It tells you only about the single replacement event per line, not how many subsequent loads required that same line and had to block (i.e., there is nothing corresponding to the mem_load_retired.fb_hit event).
Whether you like those behaviors or not is largely context dependent.
For example, although this avoids hiding the bad-path speculative loads, maybe you don't want to count those. Consider a branch-based binary search over N elements, with the branches basically taken randomly (looking for a randomly located element). You expect this to load log2(N) elements, at worst (if none were cached initially), right?
Well, l1d.replacement will give you a much larger number than that, maybe 5x larger, because the CPU is constantly running ahead of the search, guessing left or right at each level, and loading elements that aren't even on the "textbook" binary search path. You don't really have to wait for those wrong-path elements though, so maybe you don't care about them. Also, it's weird to report that you executed 10 load instructions but got 50 l1 misses out of them: it doesn't mesh with our model of each load instruction incurring a single hit or miss. That's why the l1d.replacement event doesn't play nice in ratios: it seems to make for an OK numerator, but what are you going to use as the denominator? There's nothing obvious that would dynamically count all the possible reasons a replacement could happen.
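To make the "textbook" load count concrete, here's a hypothetical harness (the function and counting scheme are mine, not a standard tool): a branchy binary search instrumented to count its architectural loads. A hardware counter like l1d.replacement can report several times this number on a cold array, for the speculation reasons described above.

```cpp
#include <cstddef>
#include <vector>

// Branchy lower-bound search that counts every element access - the
// "textbook" load count. Speculative wrong-path loads issued by the CPU
// while it guesses left/right are invisible to this count, but not to
// cache-level events like l1d.replacement.
std::size_t branchy_lower_bound(const std::vector<int>& v, int key,
                                std::size_t& loads) {
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        ++loads;                            // one architectural load of v[mid]
        if (v[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;                              // index of first element >= key
}
```

For N = 2^20 elements this performs at most 21 counted loads per lookup, i.e., about the log2(N) figure above.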
Another similar alternative is to use the l2_rqsts counters. These count at the l2, again at the cache line/request granularity like l1d.replacement. You can create a synthetic L1D miss counter out of these by summing l2 hits and misses, for example (under the assumption that l1 misses turn into l2 requests). The primary advantage of the l2_rqsts counters is that these events are fine-grained in their cause: you have, for example, l2_rqsts.demand_data_rd_hit for demand data load hits, and l2_rqsts.rfo_hit for RFO (store) hits, and so on. You can also distinguish requests triggered by prefetching (both L1 and L2 prefetching) from demand requests. You'll chew up more counters doing that, though.
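The composition is just a sum; a sketch with invented counter readings (the field names mirror real l2_rqsts sub-events, but the struct and the choice of fields are illustrative - include the prefetch-initiated request events too if you want prefetch traffic counted):

```cpp
#include <cstdint>

struct L2Counts {
    std::uint64_t demand_data_rd_hit;   // l2_rqsts.demand_data_rd_hit
    std::uint64_t demand_data_rd_miss;  // l2_rqsts.demand_data_rd_miss
    std::uint64_t rfo_hit;              // l2_rqsts.rfo_hit
    std::uint64_t rfo_miss;             // l2_rqsts.rfo_miss
};

// Synthetic L1D demand-miss count: every demand L1 miss becomes an L2
// request, which either hits or misses there, so hits + misses at the L2
// approximates misses at the L1 (ignoring prefetch traffic).
std::uint64_t synthetic_l1d_demand_misses(const L2Counts& c) {
    return c.demand_data_rd_hit + c.demand_data_rd_miss
         + c.rfo_hit + c.rfo_miss;
}
```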
My advice? First you have to understand your code, at the assembly level (unfortunately): know how many load instructions you expect and check that the performance counters agree, before doing anything else.
Then, you should use both the mem_load_retired and l1d.replacement/l2_rqsts counters, and try to understand the values in terms of your mental model of your code and of how the counters count (as above). If you're lucky, they will have similar values, in which case you can assume that the many gotchas above didn't apply (or they did but sneakily cancelled out - that's fairly unusual, though, because the sign of the divergence is usually the same: l1d.replacement being higher). If they are quite different, you'll want to know the cause. If you can't figure out what's going on simply by squinting at the code, drinking coffee and applying brainpower, you've got a lot of experiments you can perform:
- Replace branches with branch-free idioms to remove speculative paths and observe how the counters change.
- Forget the L1 events and try to model the system at the L2 where the options are clearer and you have finer-grained counters.
- Run the system under some type of instrumenting cache profiler like cachegrind to get an idea of the non-speculative load and caching behavior.
- Use perf record + report/annotate to get an idea of exactly where events are coming from, rather than looking at aggregates, and then perhaps create a more isolated benchmark if code you don't care about is confounding things. You can also use something like pmu-tools/jevents to isolate the region of interest.
- Turn on and off the various prefetchers to isolate prefetching effects.
- Use synthetic inputs to your algorithm that will isolate various effects (e.g., input that avoids branch misprediction, input that can be perfectly prefetched or cannot be prefetched, etc).
- Examine the (uops_dispatched_port.port_2 + port_3 - port_4) counters to estimate the number of loads actually dispatched, and compare against mem_inst_retired.all_loads to get a bound on the number of speculative loads (this is tricky though, because these counters also count replays).
- Use execution barriers like lfence in the design of experiments to resolve timing/overlap-related effects like fb_hit vs l1_hit.
- Use intrusive cache modeling, e.g., a std::array-like thing that implements operator[] in a way that tracks accesses and applies them against a cache model to give you "textbook" numbers.
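For the branch-free suggestion, a classic example is replacing a branchy binary search with a conditional-move-friendly one. This is a sketch; whether the compiler actually emits cmov rather than a branch has to be verified in the assembly, per the advice above:

```cpp
#include <cstddef>

// Branch-free lower_bound: the comparison feeds a select instead of a
// conditional jump, so there is no branch to mispredict and far fewer
// wrong-path speculative loads to muddy the cache-level miss counters.
std::size_t branchless_lower_bound(const int* a, std::size_t n, int key) {
    if (n == 0) return 0;
    const int* base = a;
    while (n > 1) {
        std::size_t half = n / 2;
        base = (base[half] < key) ? base + half : base;  // ideally a cmov
        n -= half;
    }
    return std::size_t(base - a) + (*base < key);
}
```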
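And a minimal sketch of the intrusive cache modeling idea (the class name and the model are invented here for illustration): an array wrapper whose operator[] runs each read through a trivial line-tracking model. This toy models compulsory misses only; a real version would add capacity, associativity and replacement:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Array wrapper whose reads are run through a toy cache model: a 64-byte
// line is a miss the first time it is touched and a hit thereafter
// (compulsory misses only). The resulting counts are deterministic
// "textbook" numbers, with no speculation or prefetching in the picture.
template <typename T, std::size_t N>
class TrackedArray {
    alignas(64) std::array<T, N> data_{};
    mutable std::unordered_set<std::uintptr_t> lines_;  // lines touched so far
    mutable std::size_t hits_ = 0, misses_ = 0;
public:
    T& raw(std::size_t i) { return data_[i]; }          // untracked access
    const T& operator[](std::size_t i) const {
        auto line = reinterpret_cast<std::uintptr_t>(&data_[i]) / 64;
        if (lines_.insert(line).second) ++misses_; else ++hits_;
        return data_[i];
    }
    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }
};
```

For example, sequentially scanning a TrackedArray<int, 1024> (4 KiB, 64 lines) reports 64 misses and 960 hits: one miss per line, hit rate 15/16.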
All of these have helped me get to the bottom of cache-related performance numbers in the past.