asahilina/agx-coherency-and-tlbs.md

AGX coherency, caching, and TLBs

These are just some notes on my current understanding of the subtleties of the AGX memory model and the TLB/caching issues I'm seeing.
Hypervisor shenanigans

TLBI instructions do not broadcast to the GPU from EL1 with stage 2 translation enabled. That's it. That's what the bug was.
GPU side

MMU (UAT)

The AGX MMU has 64 context slots (63 usable for user contexts). The page table base addresses are stored in a page (Apple calls it "TTBAT") in main memory. There are user and kernel halves (low and high), but we can ignore the high half from the GPU perspective since it is only used in very specific cases. The permission bits are funny and the same page tables are shared (and interpreted differently) by the ASC, but they are basically ARM page tables. Each context slot has an ASID (8 bits, see the TLBs section), like a TTBR in ARM.
GPU user objects are always mapped either GPU=R- or GPU=RW by macOS:

GPU=RW, EL1=RW, GL1=RW, perm=10011, Shared, Local, Owner=OS, AF=1, SH=0
GPU=R-, EL1=RW, GL1=RW, perm=10111, Shared, Local, Owner=OS, AF=1, SH=0

The page tables themselves are accessed by macOS as ff:OS (Normal-Cached Outer Shareable). The TTBAT and the handoff area use Normal-Uncached Outer Shareable (44:OS), which is the same as the DCP boot framebuffer (Apple calls this "real time").
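Since these are "basically ARM page tables", the raw PTE values that show up in the tracer output further down can be picked apart by hand. Here is a minimal sketch of that. The bit positions are an assumption: they are the ARM-style positions that happen to reproduce the fields printed in the trace (with 16 KiB pages), not a documented layout. `UatPte`, `bits` and `decode_pte` are made-up names for illustration.

```python
from dataclasses import dataclass

PAGE_SHIFT = 14  # assumption: 16 KiB pages, which matches OFFSET in the trace


@dataclass(frozen=True)
class UatPte:
    valid: int
    pte_type: int
    attr_index: int
    ap: int
    sh: int
    af: int
    ng: int
    offset: int
    pxn: int
    uxn: int
    os_bit: int

    @property
    def phys_addr(self) -> int:
        # OFFSET is the physical page number
        return self.offset << PAGE_SHIFT


def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_pte(raw: int) -> UatPte:
    # Assumed ARM-like positions, checked only against the trace values.
    return UatPte(
        valid=bits(raw, 0, 0),
        pte_type=bits(raw, 1, 1),
        attr_index=bits(raw, 4, 2),
        ap=bits(raw, 7, 6),
        sh=bits(raw, 9, 8),
        af=bits(raw, 10, 10),
        ng=bits(raw, 11, 11),
        offset=bits(raw, 47, 14),
        pxn=bits(raw, 53, 53),
        uxn=bits(raw, 54, 54),
        os_bit=bits(raw, 55, 55),
    )


user_pte = decode_pte(0x00E0000961DF4C0B)  # PTE value taken from the trace
print(hex(user_pte.phys_addr), user_pte.attr_index, user_pte.ng)
```

Decoding the user PTE from the GPU VA inval trace this way gives back the same physical address and AttrIndex the tracer printed, which is at least a sanity check on the assumed layout.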
TLBs

The GPU MMU's TLBs are managed with AP-side TLBI instructions (!) in the Outer Shareable domain. This means that the CPU pretends GPU VAs are CPU VAs (!!) and just issues TLBI instructions directly. So shooting down GPU TLB entries might inadvertently shoot down unrelated CPU TLB entries. In the future, this will need some ASID reservation mechanism to avoid the conflicts... though for now it doesn't really matter much. macOS solves this by just reserving the GPU VA ranges in the CPU userspace VA allocator globally... (ouch! T_T).
TLBI instructions use the ASID configured in the TTBAT. ASIDs are 8 bits.
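To make the "GPU VAs pretend to be CPU VAs" part concrete, here is a small sketch of how the register operand for a by-VA TLBI such as `TLBI VAE1OS` is put together. It follows the architectural Arm layout (ASID in bits [63:48], VA[55:12] in bits [43:0], TTL hint left at zero). `tlbi_va_operand` is a made-up helper name, and nothing here actually issues the instruction.

```python
def tlbi_va_operand(asid: int, va: int) -> int:
    """Build the Xt operand for a by-VA TLBI (e.g. TLBI VAE1OS).

    Arm layout: ASID in bits [63:48], VA[55:12] in bits [43:0].
    The TTL hint in bits [47:44] is left as zero (no level hint).
    """
    asid_field = (asid & 0xFFFF) << 48          # field is 16 bits wide; AGX uses 8
    va_field = (va >> 12) & ((1 << 44) - 1)     # always in 4 KiB units
    return asid_field | va_field


# GPU context 1, GPU VA 0x1500d50000 (values from the inval trace in these notes)
print(hex(tlbi_va_operand(1, 0x1500D50000)))
```

With ASID 1 and VA 0x1500d50000 this produces the same operand that shows up in the `TLBI VAE1OS` line of the macOS inval trace. A CPU mapping with the same ASID and VA would be hit by exactly the same operand, which is where the conflict comes from.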
Caching

There are obviously caches in the GPU, though their coherency is not currently known. The render commands take a list of "attachments", which normally contains the framebuffer/z-buffer, but not including them in this list doesn't seem to break anything. I suspect this list is similar to MTRRs on x86 and configures those memory ranges as write-combine (bypassing the cache), which makes sense for framebuffers since you wouldn't want them to evict useful cache lines on the way to main memory.
On the AP side, all GPU-accessible buffers seem to be mapped as Normal-Cached Outer Shareable, which implies the caches are coherent (at least some of the time). I haven't dumped the PTEs yet, but AT instruction output on written VAs shows attributes ff:OS.
ASC (coprocessor) side

AXI2AF Bridge

There is an AXI2AF (AXI to Apple Fabric) bridge that bridges the main SoC fabric to the AXI coprocessor interface. This bridge needs to be configured with some hardcoded pokes to pass TLBIs properly. These pokes are in m1n1 right now.
MMU

The coprocessor shares page tables with the GPU kernel half. This includes firmware code/data. The permission bits are interpreted differently for the ASC, so you can have things that are only accessible to the ASC (this includes all the FW code/data bits).
The ASC listens to CPU TLBI instructions like the GPU. macOS issues firmware-side TLBIs with ASID 0x40 (64), but this field is actually ignored and you can use any ASID (at least for top-half kernel addresses).
macOS almost never unmaps/remaps things in firmware address space, so it largely works even without TLBIs inside a broken hypervisor (...).
Caches

It seems the ASC's caches are not coherent with the AP caches. However, the AP caches are coherent with the overall SoC fabric. So we have an asymmetric cache situation: the AP can pretend everything is coherent, while the ASC has to do cache management.
Structures are mapped (by the OS) using two caching modes in the ASC: uncached and cached. Uncached is used for global flags, statistics type stuff, AP→ASC counters, ring buffer pointers, and ASC→AP ring buffers. From the point of view of both sides, these mappings look coherent (there's just no cache on the ASC side). Cached is used for other things, like command buffers and firmware objects allocated/donated by the AP for the ASC. The ASC invalidates these ranges before accessing them.
I've seen these mapping modes for the ASC PTEs:

GPU=--, EL1=RW, GL1=RW, perm=10110, Shared, Global, Owner=OS, AF=1, SH=0
GPU=RW, EL1=RW, GL1=RW, perm=10011, Shared, Global, Owner=OS, AF=1, SH=0
GPU=--, EL1=RW, GL1=RW, perm=10110, Normal, Global, Owner=OS, AF=1, SH=0
GPU=RW, EL1=RW, GL1=RW, perm=10011, Normal, Global, Owner=OS, AF=1, SH=0

Things can be ASC-only or shared with the GPU, and can use Shared (uncached) or Normal (cached) attributes. SH always seems to be 0 (Non-shareable), which matches the idea that shareability/cache coherency doesn't actually matter/work from the ASC side.
Unmapping ASC-cached buffers requires a special dance to ensure the caches are flushed. This is implemented in my driver right now.
UAT PPL Handoff

There is a shared page used to coordinate communications between the PPL in macOS and the uPPL in the ASC. These are privileged pieces of code that manage page tables (so they can ensure memory map control even if the rest of the OS/firmware is compromised). The handoff contains:

Magic numbers
A Dekker lock for UAT TTB modifications
Some state flags
A field containing the current user context mapped by the ASC, if any.
Flush structs for each context 0-63, plus 64 for the ASC.

It's not clear what the Dekker lock is for, since TTB writes are effectively atomic 64-bit stores and the ASC never writes them anyway. I implement it anyway, just in case.
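For reference, this is what a Dekker lock looks like in its textbook two-party form. This is only a sketch of the general algorithm: the field names (`wants`, `turn`) and the thread-based demo are assumptions for illustration and say nothing about the actual layout of the lock in the handoff page. On real hardware the loads and stores would also need the appropriate barriers, which Python hides here.

```python
import threading
import time


class DekkerLock:
    """Textbook Dekker mutual exclusion for exactly two parties (0 and 1)."""

    def __init__(self) -> None:
        self.wants = [False, False]  # wants[i]: party i wants the lock
        self.turn = 0                # who gets priority when both want it

    def lock(self, me: int) -> None:
        other = 1 - me
        self.wants[me] = True
        while self.wants[other]:
            if self.turn != me:
                # Not our turn: back off until the other side hands it over.
                self.wants[me] = False
                while self.turn != me:
                    time.sleep(0)  # yield to the other thread
                self.wants[me] = True
            else:
                time.sleep(0)

    def unlock(self, me: int) -> None:
        self.turn = 1 - me
        self.wants[me] = False


def run_demo(iterations: int) -> int:
    """Two threads (think: AP side and ASC side) bump a shared counter."""
    lock = DekkerLock()
    shared = {"updates": 0}

    def worker(me: int) -> None:
        for _ in range(iterations):
            lock.lock(me)
            value = shared["updates"]
            time.sleep(0)  # widen the race window inside the critical section
            shared["updates"] = value + 1
            lock.unlock(me)

    threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["updates"]


print(run_demo(100))
```

The point of Dekker over a compare-exchange lock is that it only needs plain loads and stores, which is relevant given that atomic RMW instructions do not work on the uncached handoff mapping.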
Funny anecdote: after I switched the handoff area to uncached, I got an IMPDEF (!!) sync fault because I was doing a compare-exchange on TTBs when freeing VMs, and those atomic RMW instructions apparently don't work in uncached mode... I wasn't expecting an IMPDEF fault though!
The flush structs are used (for an unknown purpose, possibly dummy?) during TLB invals, but are also used (with a known purpose) for the cached mapping unmap dance.
macOS TLB invals for GPU VAs look like this:
# [cpu3] [HandoffTracer] MMIO: R.8   MAGIC_FW = 0x4b1d000000000002 ()
## This is the same PTE that was already present (no change)
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT write L0 at 1:0x1500000000 (#0x354) -> 0x00E0000961DF4C0B
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT map 1:0x1500d50000 -> 0x961df4000 (0xe0000961df4c0b (OS=1, UXN=1, PXN=1, OFFSET=0x25877d, nG=1, AF=1, SH=0, AP=0, AttrIndex=2, TYPE=1, VALID=1))
# [cpu3] [HandoffTracer] MMIO: R.4   FLUSH_STATE[1] = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: W.8   FLUSH_ADDR[1] = 0x1500d50000 ()
# [cpu3] [HandoffTracer] MMIO: W.8   FLUSH_SIZE[1] = 0x4000 ()
# [cpu3] [HandoffTracer] MMIO: R.1   UNK2 = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: R.4   UNK = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: R.8   FLUSH_ADDR[1] = 0x1500d50000 ()
# [cpu3] [HandoffTracer] MMIO: W.8   FLUSH_ADDR[1] = 0x1500d50000 ()
# [cpu3] [HandoffTracer] MMIO: R.8   FLUSH_ADDR[1] = 0x1500d50000 ()
# [cpu3] [HandoffTracer] MMIO: W.8   FLUSH_ADDR[1] = 0xdead001500d50000 ()
# [cpu3] [HandoffTracer] MMIO: W.4   FLUSH_STATE[1] = 0x2 ()
## There seems to sometimes be a long delay here with activity from other threads, possibly waiting for something? Unclear what...
# [cpu3] [HandoffTracer] MMIO: R.8   MAGIC_FW = 0x4b1d000000000002 ()
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT write L0 at 1:0x1500000000 (#0x354) -> 0x0000000000000000
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT unmap 1:0x1500d50000 (0x0 (OS=0, UXN=0, PXN=0, OFFSET=0x0, nG=0, AF=0, SH=0, AP=0, AttrIndex=0, TYPE=0, VALID=0))
# [cpu3] Pass: msr TLBI VAE1OS, x8 = 1000001500d50 (OK) (TLBI VAE1OS)
# [cpu3] [HandoffTracer] MMIO: R.1   UNK2 = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: R.4   FLUSH_STATE[1] = 0x2 ()
# [cpu3] [HandoffTracer] MMIO: W.4   FLUSH_STATE[1] = 0x0 ()

It is a mystery what this is for. I don't see anything that would wake up the firmware to actually process this in any way. I think it's just noise.
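Stripped of the redundant reads, the sequence in that trace boils down to a handful of steps. Here is a sketch of it against a plain Python object standing in for one flush struct. `FlushSlot`, `invalidate_user_range` and the callback parameters are made-up names, and the UNK/UNK2 reads and the repeated FLUSH_ADDR read/write pairs are left out since their purpose is unknown.

```python
from dataclasses import dataclass

DEAD_TAG = 0xDEAD << 48   # tag macOS ORs into FLUSH_ADDR in the trace
STATE_IDLE = 0
STATE_FLUSHED = 2         # hypothetical name for the value 2 seen in the trace


@dataclass
class FlushSlot:
    """Stand-in for one per-context flush struct in the handoff page."""
    state: int = STATE_IDLE
    addr: int = 0
    size: int = 0


def invalidate_user_range(slot, va, size, write_ptes, tlbi):
    """Mirror the order of operations macOS uses for a GPU VA inval."""
    if slot.state != STATE_IDLE:
        raise RuntimeError("flush slot is busy")
    slot.addr = va
    slot.size = size
    slot.addr = DEAD_TAG | va     # macOS rewrites the address with the tag
    slot.state = STATE_FLUSHED    # note: set by the AP itself in this path
    write_ptes()                  # the actual page table update
    tlbi(va)                      # AP-side TLBI for the GPU VA
    if slot.state != STATE_FLUSHED:
        raise RuntimeError("flush state changed unexpectedly")
    slot.state = STATE_IDLE


slot = FlushSlot()
invalidate_user_range(
    slot, 0x1500D50000, 0x4000,
    write_ptes=lambda: print("PTE cleared, FLUSH_ADDR =", hex(slot.addr)),
    tlbi=lambda va: print("TLBI for", hex(va)),
)
```

Written out like this it is easier to see why it looks like noise: the AP sets the state to 2 and then clears it again itself, with nothing in between that involves the firmware.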
macOS unmaps for ASC-cached pages involve a special dance that actually wakes up the coprocessor and calls into the uPPL. This is because the PPL has to securely ensure those pages have been flushed from the cache:
# [cpu3] [HandoffTracer] MMIO: R.8   MAGIC_FW = 0x4b1d000000000002 ()
# [cpu3] [0xfffffe00135b46d0] MMIO: R.8   0x9fff78010 (gfx_shared_region, offset 0x10) = 0x823350003
## This first remaps the pages as uncached (AttrIndex=2).
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT write L0 at 0:0xfa00c000000 (#0x10a) -> 0x00C00009109BC44B
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT map 0:0xfa00c428000 -> 0x9109bc000 (0xc00009109bc44b (OS=1, UXN=1, PXN=0, OFFSET=0x24426f, nG=0, AF=1, SH=0, AP=1, AttrIndex=2, TYPE=1, VALID=1))
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT write L0 at 0:0xfa00c000000 (#0x10b) -> 0x00C000090FD8044B
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT map 0:0xfa00c42c000 -> 0x90fd80000 (0xc000090fd8044b (OS=1, UXN=1, PXN=0, OFFSET=0x243f60, nG=0, AF=1, SH=0, AP=1, AttrIndex=2, TYPE=1, VALID=1))
## Then there's a TLB invalidate... but there's a bug here! The address is 0xfa00c430000 (the *end* of the range) while it should be the start!
# [cpu3] Pass: msr TLBI RVAE1OS, x14 = 40801ffe80310c (OK) (TLBI RVAE1OS)
## Then the PPL puts the range into the handoff area and sets the state to 1 (pending cache inval)
# [cpu3] [HandoffTracer] MMIO: R.4   FLUSH_STATE[64] = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: W.8   FLUSH_ADDR[64] = 0xffffffa00c428000 ()
# [cpu3] [HandoffTracer] MMIO: W.8   FLUSH_SIZE[64] = 0x8000 ()
# [cpu3] [HandoffTracer] MMIO: R.1   UNK2 = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: W.4   FLUSH_STATE[64] = 0x1 ()
## And then issues an op via a special ring buffer to wake up the ASC and tell it to call into the uPPL
## The uPPL then issues a cache flush/inval for this range
# [cpu3] [AGXTracer@/arm-io/gfx-asc] [kickep]   FWRing Kick 0x84000000000000 (TYPE=0x8, KICK=0x0)
# [cpu3] [AGXTracer@/arm-io/gfx-asc] FW Kick~! 0x0
# [cpu3] [AGXTracer@/arm-io/gfx-asc] [17:FWCtl] Message @0.16:
FWCtlMsg @ 0xffffffa0000c0200:
 FWCM.[  0.  8] addr = 0xffffffa00c428000
 FWCM.[  8.  4] unk_8 = 0x0
 FWCM.[  c.  4] context_id = 0x40
 FWCM.[ 10.  2] unk_10 = 0x1
 FWCM.[ 12.  2] unk_12 = 0x2
## Once this completes (how does PPL know? There's no visible polling, could just be too fast?) it does the unmaps...
# [cpu3] [HandoffTracer] MMIO: R.8   MAGIC_FW = 0x4b1d000000000002 ()
# [cpu3] [0xfffffe00135b51c8] MMIO: R.8   0x9fff78010 (gfx_shared_region, offset 0x10) = 0x823350003
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT write L0 at 0:0xfa00c000000 (#0x10a) -> 0x0000000000000000
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT unmap 0:0xfa00c428000 (0x0 (OS=0, UXN=0, PXN=0, OFFSET=0x0, nG=0, AF=0, SH=0, AP=0, AttrIndex=0, TYPE=0, VALID=0))
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT write L0 at 0:0xfa00c000000 (#0x10b) -> 0x0000000000000000
# [cpu3] [AGXTracer@/arm-io/gfx-asc] UAT unmap 0:0xfa00c42c000 (0x0 (OS=0, UXN=0, PXN=0, OFFSET=0x0, nG=0, AF=0, SH=0, AP=0, AttrIndex=0, TYPE=0, VALID=0))
## Flushes the TLB again (this time with the right address!)
# [cpu3] Pass: msr TLBI RVAE1OS, x14 = 40801ffe80310a (OK) (TLBI RVAE1OS)
## And checks and clears the flush state flag, which the uPPL set to 2 to indicate it flushed the cache.
# [cpu3] [HandoffTracer] MMIO: R.1   UNK2 = 0x0 ()
# [cpu3] [HandoffTracer] MMIO: R.4   FLUSH_STATE[64] = 0x2 ()
# [cpu3] [HandoffTracer] MMIO: W.4   FLUSH_STATE[64] = 0x0 ()
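
The steps in the trace above can be sketched as follows. This is a model of the ordering only, not driver code: every name here (`unmap_asc_cached`, the `ops` callbacks, the state constants) is made up for illustration, and the firmware is replaced by a fake kick that sets the state to 2 immediately. One deliberate difference is the explicit wait for state 2 before unmapping, since the trace shows no visible polling and it is not clear how the PPL knows the flush has completed.

```python
from types import SimpleNamespace

ASC_SLOT = 64  # flush struct index (and FWCtl context_id 0x40) used for the ASC
STATE_IDLE, STATE_PENDING, STATE_FLUSHED = 0, 1, 2  # hypothetical names


def unmap_asc_cached(slot, va, size, ops):
    """Unmap an ASC-cached range, following the order seen in the trace."""
    if slot.state != STATE_IDLE:
        raise RuntimeError("flush slot is busy")
    ops.remap_uncached(va, size)      # 1. rewrite PTEs with the uncached attr
    ops.tlbi_range(va, size)          # 2. drop the stale cached translations
    slot.addr, slot.size = va, size   # 3. publish the range in the handoff
    slot.state = STATE_PENDING
    ops.kick_fwctl(va, ASC_SLOT)      # 4. wake the ASC so it calls the uPPL
    if not ops.wait_for(lambda: slot.state == STATE_FLUSHED):
        raise TimeoutError("uPPL never reported the cache flush")
    ops.unmap(va, size)               # 5. now the PTEs can really go away
    ops.tlbi_range(va, size)          # 6. and the uncached translations too
    slot.state = STATE_IDLE           # 7. release the flush struct


def make_fake_ops(slot, log):
    """Fake backend: records the call order and plays the role of the uPPL."""
    def kick(va, ctx):
        log.append("kick")
        slot.state = STATE_FLUSHED    # stand-in for the firmware's response

    return SimpleNamespace(
        remap_uncached=lambda va, size: log.append("remap_uncached"),
        tlbi_range=lambda va, size: log.append("tlbi"),
        kick_fwctl=kick,
        wait_for=lambda cond: cond(),
        unmap=lambda va, size: log.append("unmap"),
    )


slot = SimpleNamespace(state=STATE_IDLE, addr=0, size=0)
log = []
unmap_asc_cached(slot, 0xFFFFFFA00C428000, 0x8000, make_fake_ops(slot, log))
print(log)
```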

Note the bug in the first TLBI... this is one reason why I have my doubts this whole thing actually works properly and reliably in macOS. It only seems to do these unmaps when destroying GPU contexts, and only for a few small structures, so it's entirely possible that it's just not reliable and it works by accident... The vast majority of ASC-shared structures are allocated out of pools that are mapped ahead of time, and never unmapped. For my driver, I plan to have grow-only heaps (for uncached and cached modes) and just add pages when needed, never freeing/unmapping pages once they are used, and allocate shared structures out of there. That minimizes PTE churn and eliminates the entire TLB invalidation issue, and I think we can get away with never freeing pages for the small firmware struct arenas.
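The wrong address in the first TLBI can be double-checked by decoding the two `TLBI RVAE1OS` operands from the trace. This sketch uses the architectural Arm range-TLBI operand layout (ASID [63:48], TG [47:46], SCALE [45:44], NUM [43:39], TTL [38:37], BaseADDR [36:0]). `RangeTlbi` and `decode_tlbi_range` are made-up names. Only the low VA bits are encoded in the operand, so the decoded start address lacks the upper sign-extension bits.

```python
from dataclasses import dataclass

# TG field encoding -> log2 of the translation granule size
GRANULE_SHIFT = {0b01: 12, 0b10: 14, 0b11: 16}


@dataclass(frozen=True)
class RangeTlbi:
    asid: int
    start: int    # low VA bits only; upper bits are not part of the operand
    length: int   # number of bytes covered


def decode_tlbi_range(operand: int) -> RangeTlbi:
    """Decode the Xt operand of a range TLBI such as TLBI RVAE1OS."""
    asid = (operand >> 48) & 0xFFFF
    tg = (operand >> 46) & 0x3
    scale = (operand >> 44) & 0x3
    num = (operand >> 39) & 0x1F
    base = operand & ((1 << 37) - 1)
    if tg not in GRANULE_SHIFT:
        raise ValueError("reserved TG encoding")
    shift = GRANULE_SHIFT[tg]
    # Range is (NUM + 1) * 2^(5 * SCALE + 1) granules, starting at BaseADDR.
    length = (num + 1) << (5 * scale + 1 + shift)
    return RangeTlbi(asid=asid, start=base << shift, length=length)


first = decode_tlbi_range(0x40801FFE80310C)   # TLBI before the cache flush
second = decode_tlbi_range(0x40801FFE80310A)  # TLBI after the unmap
for name, op in (("first", first), ("second", second)):
    print(name, hex(op.asid), hex(op.start), hex(op.length))
```

Both operands decode to ASID 0x40, a 16 KiB granule and a length of 0x8000, but the first one starts exactly where the second one ends. In other words, the first TLBI covers the 0x8000 bytes after the buffer rather than the buffer itself.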
Problems seen

This section is obsolete and kept only for posterity. The problems were all caused by TLBIs not actually working (at all) when run inside a guest VM, without explicit trapping and passthrough. This is now implemented in m1n1 and everything works as expected.
Kernel side issues

Sometimes the Linux kernel crashes due to memory corruption. Every time this happens, the memory involved seems to have been previously GPU-mapped, then unmapped, and usually some other page mapped in its place. The bad page often contains tile array pointers (which is one of the buffers that gets mapped/unmapped every render). I thought this was the GPU writing back tile pointers to a stale TLB, but I'm not so sure any more, as TLB invals do seem to work in m1n1 with the tile array, and cause a GPU fault as soon as the TLBI happens (the GPU faults within microseconds, and I've never seen a tile array write happen after the TLBI from the AP's point of view). I now suspect this might be the ASC clearing the tile array prior to a render, with a stale TLB (see next section).
ASC side issues

The ASC side often crashes in funny ways. I've seen:

Faulting trying to cache invalidate a Barrier command way beyond its end, which suggests the ASC read the command tag as something else (possibly zero, which is a TA command and much larger).
Non-sequential stamp update asserts. It's not clear exactly when these happen, but it usually means there was an inconsistency in the stamp values the ASC was trying to update (maybe it read a stale command?)
After I started 0x42-filling freed GEM buffers, I once saw the firmware assert with an unknown SKU command 0x2. SKU command IDs are masked with 0x3f, so that suggests it tried to parse a freed GEM buffer as a MicroSeq instead of the real one.
Trying to dereference a Linux kernel virtual pointer, which suggests it read some random kernel page as its own.
More randomness, NULL derefs, etc.
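
The reasoning behind the 0x42-fill observation in the list above is just a bit mask, shown here as a quick check. `SKU_COMMAND_MASK` and `masked_command_id` are made-up names, and the width of the field the firmware actually reads is an assumption, which is why both a single byte and a full 32-bit word of fill are tried.

```python
SKU_COMMAND_MASK = 0x3F  # command IDs are masked with 0x3f, per the notes
FILL_BYTE = 0x42         # pattern written into freed GEM buffers


def masked_command_id(raw: int) -> int:
    """Command ID as the firmware would see it after masking."""
    return raw & SKU_COMMAND_MASK


# Whether the firmware reads a byte or a whole word of fill pattern,
# the masked command ID comes out the same.
for raw in (FILL_BYTE, int.from_bytes(bytes([FILL_BYTE] * 4), "little")):
    print(hex(raw), "->", hex(masked_command_id(raw)))
```

Either way the result is 0x2, which is the "unknown command" the firmware asserted on, so a read from a freed buffer is a consistent explanation.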

In all these cases, looking at the MMU PTEs from the hypervisor side shows everything mapped properly, though in at least some cases using the previous mapping matches the effects I observed (the hypervisor AGX tracer UAT page table cache can actually reproduce this problem itself, since it doesn't listen to TLBIs). So this all sounds like TLB invals against the ASC are not reliable (they were completely broken before I discovered the AXI2AF pokes, but they still seem somewhat broken...).
GPU side issues

After mitigating the ASC side issues by never freeing anything, keeping the GPU turned on (setting a long shutdown timeout) causes GPU-side issues. This is usually a GPU timeout, without an explicit fault (the fault register shows no fault). I'm not sure what's going on here... I do remember seeing timeouts and weirdness with too-small tiled vertex buffers in the past, so I have suspicions about that part of the puzzle. I need to make this more reproducible and see if I can reproduce it on the Python side...
In particular, kmscube usually works but I've also seen it crash on the 2nd frame. Starting a GNOME session crashes things pretty reliably. I've also seen visual corruption at one point. The good news is that after setting the GPU shutdown timeout to >1s, all this still happens often even with full logging and hypervisor tracing, so it's not a subtle timing issue any more and it should be reproducible in Python...
I've also seen outright hangs (no timeout message from firmware, no completions either).
The GPU shutdown hack

Waiting for the GPU to power down after every render batch seems to largely solve the stability issues. I initially thought this was due to clearing GPU TLBs, but I think there's more to it than that. I now think that after the GPU powers down, the ASC also powers down shortly after (probably WFI or deep WFI?). So waiting for the GPU to shut down (by polling the ASC-maintained global) comes very close to waiting for the ASC to shut down too. This explains why the ASC-side issues went away (even when I was freeing and re-mapping ASC structs) and why there is still sometimes instability depending on the GPU powerdown timeout: the ASC idle mode probably clears the caches and/or TLBs, and that almost always happened when I waited like this.
However, the GPU shutdown also does something to fix GPU-side issues.