Observations about HSA and KFD backends in TinyGrad

This is Felix Kuehling, long-time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and to help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778

I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.

ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.

ops_hsa uses ROCr APIs to manage GPU memory, create a user mode AQL queue for GPU kernel dispatch, do async SDMA copies, and handle signal-based synchronization with barrier packets between the two. There is also some host-side synchronization used for lazy cleanup of reusable signals and freeing memory. I only see a couple of potential problems so far:

  • AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading packets with a valid header even before you update the write-index and ring the doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know whether this can happen in a typical ResNet run.
  • Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order on the CPU. I don't know if Python guarantees that implicitly, e.g. because of overheads in the interpreter (see the header-ordering sketch after this list).
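
To make the ordering concrete, here is a minimal sketch of the pattern (hypothetical names, not the actual tinygrad code; the field layout follows hsa_kernel_dispatch_packet_t from the HSA spec): fill the packet body while the slot still holds an invalid header, and publish the 16-bit header last.

import ctypes

# Packet types and header bit positions from the HSA spec (section 2.9).
HSA_PACKET_TYPE_INVALID, HSA_PACKET_TYPE_KERNEL_DISPATCH = 1, 2
HEADER_TYPE, HEADER_BARRIER, HEADER_SCACQUIRE, HEADER_SCRELEASE = 0, 8, 9, 11
FENCE_SCOPE_SYSTEM = 2

class AQLDispatchPacket(ctypes.Structure):  # 64 bytes, hypothetical ctypes mirror
  _fields_ = [("header", ctypes.c_uint16), ("setup", ctypes.c_uint16),
              ("workgroup_size_x", ctypes.c_uint16), ("workgroup_size_y", ctypes.c_uint16),
              ("workgroup_size_z", ctypes.c_uint16), ("reserved0", ctypes.c_uint16),
              ("grid_size_x", ctypes.c_uint32), ("grid_size_y", ctypes.c_uint32),
              ("grid_size_z", ctypes.c_uint32), ("private_segment_size", ctypes.c_uint32),
              ("group_segment_size", ctypes.c_uint32), ("kernel_object", ctypes.c_uint64),
              ("kernarg_address", ctypes.c_uint64), ("reserved2", ctypes.c_uint64),
              ("completion_signal", ctypes.c_uint64)]

def write_dispatch(pkt: AQLDispatchPacket, kernel_object, kernarg_address, global_size, local_size):
  # 1. Fill in the packet body while the slot still carries an INVALID header.
  pkt.setup = 3  # three dispatch dimensions
  pkt.workgroup_size_x, pkt.workgroup_size_y, pkt.workgroup_size_z = local_size
  pkt.grid_size_x, pkt.grid_size_y, pkt.grid_size_z = global_size
  pkt.kernel_object, pkt.kernarg_address = kernel_object, kernarg_address
  # 2. Publish the header last. On x86 the store order is preserved (TSO); on a
  #    weakly ordered CPU an explicit release fence would be needed before this store.
  pkt.header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HEADER_TYPE) | (1 << HEADER_BARRIER) | \
               (FENCE_SCOPE_SYSTEM << HEADER_SCACQUIRE) | (FENCE_SCOPE_SYSTEM << HEADER_SCRELEASE)
  # 3. Only after this should the write index be bumped and the doorbell rung (not shown).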

Now my notes on ops_kfd. There is a good chance I missed something, and I pick up something new every time I look at the code, so please take these with a grain of salt:

  • In HWComputeQueue.submit, AQL packet headers must be written after the packet contents. You may also need a memory barrier to ensure the writes complete in the right order on the CPU. The AQL packet processor can start working on packets as soon as it sees a valid header, even before you ring the doorbell
  • Sharing device.completion_signal: This can cause race conditions when overwriting or waiting for a signal value before the previous dispatch has completed. Before reusing a signal, you need to wait for it. KFDAllocator.copyout waits for the signal, but then reuses it for multiple SDMA commands in the loop. The wait at the end may get triggered by something that's not the last SDMA command. To avoid this, I'd only signal after the last SDMA command (see the copyout sketch after this list). In copyin I don't see any waiting at all before using the signal
  • AQLAllocator.transfer seems to use the destination device for the data copy. I would expect writing to be faster than reading (easier to hide latency), so using the source device may perform better
  • Is there some code I'm missing to map either the source or destination on the other GPU for AQLAllocator.transfer?
  • Operations on wptr and doorbells may not be atomic: This could cause race conditions if the HW sees half-complete values. I don't know ctypes very well, so I don't know what atomicity guarantees it makes
  • No virtual address alignments to optimize for huge pages: This will lead to bad TLB efficiency, more page table allocations, slower memory allocation and reduced access performance
  • No suballocator for small VRAM allocations: Similar to above, if you have many small allocations, it will lead to more memory management overhead and reduced access performance
  • Highest queue priority: I don't think this gains anything if all queues end up with the same priority, but it may risk other issues by starving kernel queues (if you ever need interop, mostly for video processing)
  • Mapping only one doorbell page per GPU: Each process has two doorbell pages per GPU. You should map both. Otherwise you may have problems if you're using more SDMA queues later that end up using some of the doorbells in the second page due to how doorbells get routed in the HW
  • Queue overruns are only detected after corrupting the queues
  • No fallback to shader-based copies when SDMA queues run out: There are a limited number of SDMA queues in the HW and we don't oversubscribe them at the moment because low latency is one of the big advantages of using SDMA over shader-based copies. When they run out, SDMA queue creation will fail. ROCr has a fallback to use shader-based copies for this. As long as you run a small number of processes concurrently and use a small number of SDMA queues per device, this is no problem
  • Using same BO for compute and SDMA read/write pointers
    • Not a problem now, but be aware that the SDMA engine writes some queue usage information and internal scratch data after the RPTR
  • Circumventing ROCr breaks rocm-gdb. You won't be able to use it for debugging compute kernels
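
To illustrate the completion-signal point above, here is a rough sketch of the pattern I'd suggest for copyout. All the helper names (wait_signal, reset_signal, submit_sdma_copy, the staging buffer) are hypothetical, not the actual ops_kfd code:

def copyout(dest_mv, src_buf, chunk=16 << 20):
  # Wait for whoever used this signal last before overwriting it.
  wait_signal(sdma_signal)
  reset_signal(sdma_signal)
  offsets = list(range(0, len(dest_mv), chunk))
  for i, off in enumerate(offsets):
    size = min(chunk, len(dest_mv) - off)
    # Only the final SDMA command carries the signal; intermediate copies stay
    # unsignaled, so the wait below cannot be satisfied by an earlier copy.
    sig = sdma_signal if i == len(offsets) - 1 else None
    submit_sdma_copy(dst=staging_va + off, src=src_buf.va + off, nbytes=size, signal=sig)
  wait_signal(sdma_signal)
  dest_mv[:] = staging_mv[:len(dest_mv)]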

fxkamd commented Apr 5, 2024

The AQL packet processor in the firmware invalidates the headers before it updates the read index. BTW, the encoding for invalid packets is format=1; format=0 is used for vendor packets. So a 0-initialized queue buffer is full of valid vendor packets as far as the firmware is concerned. Yikes. It shouldn't read ahead of the write-index, I hope. I'm just reading up on this again in the HSA spec; see section 2.9 in the HSA Platform System Architecture Specification.

There is also section 2.5 in HSA Runtime Programmer’s Reference Manual. It talks more about single/multi-producer queues and the submission ABI.
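
As a small illustration of that encoding: since a zero-filled ring decodes as vendor-specific packets, the queue buffer should be stamped with INVALID headers up front. A minimal sketch (hypothetical function, assuming 64-byte AQL slots):

import ctypes

HSA_PACKET_TYPE_VENDOR_SPECIFIC = 0  # what a zero-filled buffer decodes to
HSA_PACKET_TYPE_INVALID = 1          # what empty slots should actually say
AQL_PACKET_SIZE = 64                 # every AQL packet slot is 64 bytes

def init_queue_ring(ring_base: int, num_slots: int):
  # Stamp an INVALID header into every slot so the packet processor skips them
  # until a real header is published.
  for slot in range(num_slots):
    ctypes.c_uint16.from_address(ring_base + slot * AQL_PACKET_SIZE).value = HSA_PACKET_TYPE_INVALID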


fxkamd commented Apr 5, 2024

Two questions I've had while working on this

What's the difference between GTT and USERPTR on allocation? I copied what the ROCr driver uses for the different allocations, but what's the actual difference?

GTT memory is allocated as a buffer object in the kernel mode driver. This memory is not pageable in the traditional sense, though the memory manager does have its own swap mechanism. However, in the upstream kernel, it limits us to less than 1/2 of system memory capacity due to some limitations with memory accounting and the OOM killer.

To get around that, we use pageable memory (plain mmap) for most system memory allocations, and then map it for GPU access using userptr BOs or our newer SVM API. This memory can be paged freely by the Linux kernel, so our driver handles MMU notifiers when that happens to keep our GPU page tables in sync. Fortunately we've optimized things to make that rare (except when NUMA balancing is on).

We use pageable memory for most system memory allocations so we get access to the full system memory capacity. We use GTT for memory that must be accessed in kernel mode, or shared between processes with DMABufs. If you don't need the full capacity, using GTT may be slightly faster (no MMU notifiers, faster to allocate and free).
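
Roughly, the difference shows up in how the memory is obtained before it's mapped for the GPU. A sketch, not the actual ROCr or ops_kfd code (the wrapper kfd_alloc_memory_of_gpu is hypothetical; the flag values are from the upstream kfd_ioctl.h):

import mmap

KFD_IOC_ALLOC_MEM_FLAGS_GTT     = 1 << 1
KFD_IOC_ALLOC_MEM_FLAGS_USERPTR = 1 << 2

def alloc_gtt(size):
  # GTT: the kernel mode driver allocates a buffer object; it isn't pageable in
  # the traditional sense and total GTT usage is capped below ~1/2 of system RAM.
  return kfd_alloc_memory_of_gpu(size, flags=KFD_IOC_ALLOC_MEM_FLAGS_GTT)

def alloc_userptr(size):
  # USERPTR: allocate plain pageable memory first, then register it for GPU
  # access; the kernel keeps the GPU page tables in sync via MMU notifiers.
  host_buf = mmap.mmap(-1, size)
  return kfd_alloc_memory_of_gpu(size, flags=KFD_IOC_ALLOC_MEM_FLAGS_USERPTR, userptr=host_buf)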

Is there a lower level queue than the AQL queue? I see a KFD_IOC_QUEUE_TYPE_COMPUTE as well as KFD_IOC_QUEUE_TYPE_COMPUTE_AQL. Can this be used to dispatch kernels and get lower level control, maybe with PM4 packets?

KFD_IOC_QUEUE_TYPE_COMPUTE uses PM4 packets. We use that in kfdtest for low-level testing with tiny hand-written assembly shaders. IME, doing dispatches with PM4 packets is a pain; it requires low-level HW specs for writing GPU-specific dispatch registers and handling scratch memory, synchronization, cache flushes, HDP flushes, etc. manually. There is lots of potential for getting things subtly wrong, or causing more hardware hangs by corrupting the wrong registers.

AQL is our "architected queuing language" that came out of the HSA initiative. It's very abstract by design to make it portable with a stable ABI, multi-producer semantics, well defined memory coherence, synchronization primitives and kernel calling conventions, which makes all of these features usable in 3rd-party runtime code such as your own. It also enables interoperability between different language runtimes. You've already seen the PM4 escape through a vendor packet that we use for icache flushes in the code-object loader. I find it unfortunate that we need that escape. Other than that, all our compute language runtimes for ROCm use AQL.


geohot commented Apr 5, 2024

Cool, makes sense re: GTT and USERPTR. I switched most to GTT since we aren't doing any big allocations, but found USERPTR was required for readinto to work. Haven't noticed a performance difference.

Ahh, PM4 looks like what I was putting on the NVIDIA queues. https://github.com/geohot/cuda_ioctl_sniffer/blob/master/gpu_driver.cc I have a whole bunch of tinygrad work to do first, but I do want to move to PM4 eventually. Would love to see it documented. For now AQL is fine though if we aren't hitting dispatch bugs. The key thing is that our drivers are O(1) regardless of queue length and that we have lightweight ways to sync the queues, which is fine at the AQL+SDMA level.

I have a very repeatable GPU crash in KFD, appears as just a hang and looks like the same one we were hitting before. Happens on both 6.0.2 and 6.0.3 (looks to happen faster on 6.0.3). Occurs around 500 steps into training.

On current tinygrad master (164329a8ea71ac63eeec5adb526b1ab1a4eb5982) in the ResNet trainer
BS=768 GPUS=6 WANDB=1 BEAM=4 MODEL=resnet KFD=1 python3 examples/mlperf/model_train.py

The Python traceback is a 10s+ wait for a signal on the SDMA queue that never comes.

  File "/home/tiny/tinygrad/examples/mlperf/model_train.py", line 259, in <module>
    globals()[nm]()
  File "/home/tiny/tinygrad/examples/mlperf/model_train.py", line 144, in train_resnet
    next_proc = data_get(it)
  File "/home/tiny/tinygrad/examples/mlperf/model_train.py", line 126, in data_get
    return x.shard(GPUS, axis=0).realize(), Tensor(y, requires_grad=False).shard(GPUS, axis=0), cookie
  File "/home/tiny/tinygrad/tinygrad/tensor.py", line 142, in realize
    Tensor.corealize([self])
  File "/home/tiny/tinygrad/tinygrad/tensor.py", line 139, in corealize
    run_schedule(create_schedule(flatten([x.lazydata.lbs if isinstance(x.lazydata, MultiLazyBuffer) else [x.lazydata] for x in lst])))
  File "/home/tiny/tinygrad/tinygrad/engine/realize.py", line 5, in run_schedule
    def run_schedule(schedule:List[ScheduleItem]): CommandQueue(schedule)()
  File "/home/tiny/tinygrad/tinygrad/engine/commandqueue.py", line 99, in __call__
    fxn.exec([si.output, si.input])
  File "/home/tiny/tinygrad/tinygrad/device.py", line 48, in exec
    et = self(rawbufs, var_vals)
  File "/home/tiny/tinygrad/tinygrad/device.py", line 82, in __call__
    self.copy(dest, src)
  File "/home/tiny/tinygrad/tinygrad/device.py", line 72, in copy
    dest.allocator.copy_from_fd(dest._buf, src._buf.ud.fd, src._buf.offset, src.nbytes)
  File "/home/tiny/tinygrad/tinygrad/runtime/ops_kfd.py", line 281, in copy_from_fd
    if i != 0: self.device._wait_signal(self.device.signal_sdma)
  File "/home/tiny/tinygrad/tinygrad/runtime/ops_kfd.py", line 361, in _wait_signal
    if ret.wait_result != 0: raise RuntimeError(f"wait_result: {ret.wait_result}, {timeout} ms TIMEOUT!")
RuntimeError: wait_result: 1, 10000 ms TIMEOUT!

This is what's in dmesg.

[26262.469879] amdgpu 0000:83:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1202
[26262.470548] amdgpu 0000:83:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[26262.471185] amdgpu 0000:83:00.0: amdgpu: Failed to evict queue 5
[26262.471845] amdgpu: Failed to evict process queues
[26262.481377] amdgpu 0000:83:00.0: amdgpu: GPU recovery disabled.
[26262.481743] amdgpu: Failed to evict queues of pasid 0x8006
[26271.215221] amdgpu 0000:83:00.0: amdgpu: Failed to remove queue 4
[26271.350674] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[26271.351787] [drm:amdgpu_mes_flush_shader_debugger [amdgpu]] *ERROR* failed to set_shader_debugger
[26281.549387] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=20001, emitted seq=20005
[26281.551659] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[26281.553705] amdgpu 0000:83:00.0: amdgpu: GPU recovery disabled.

After the crash happens the GPU stays at 100% in rocm-smi. It was GPU 3 that crashed; the rest seemed to go back into low power mode when the process exited. I have UMR reg dumps of all the GPUs, is there anything I can look for in there?

(I can't rule out that we're doing something wrong in our KFD driver, so you are welcome to wait for a simpler repro)


fxkamd commented Apr 5, 2024

AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading packets with a valid header even before you ring the doorbell and update the write-index and doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know, whether this can happen in a typical ResNet run

So, does the CP read packets that are after the write pointer? We memcpy packets and only then move write_pointer + doorbell.

Yeah, I think it shouldn't read ahead of the write index. Your sequence is different from what I would expect for AQL queues. It's probably fine since you're using it in single-producer mode.

Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order in the CPU. I don't know if python does that implicitly, e.g. because of overheads in the interpreter

We test on x86, which is TSO (so I expect we should not see any write reordering visible to the memory subsystem). And I hope Python doesn't do any reordering.

I'm not a memory-model lawyer. These discussions can get very tricky. Typically this is handled by compilers and their implementation of atomics with the right acquire/release flags, or hidden inside synchronization primitives such as pthread mutexes or barriers. When you're dealing with AQL queues, HSA signals provide the same functionality. But when you're coordinating python code with the GPU directly, bypassing ROCr, you can rely on neither. In the kernel mode drivers we tend to use explicit memory barriers when doing lock-less synchronization between threads or with external devices.
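
For what it's worth, one way to get an explicit release store from Python is to compile a tiny C helper at runtime and call it through ctypes, in the same spirit as tinygrad's existing clang usage. This is only a sketch under the assumption that a C compiler is available; none of it is existing tinygrad or ROCr code:

import ctypes, subprocess, tempfile, pathlib

C_SRC = """
#include <stdint.h>
/* Store the 16-bit AQL header with release semantics so all earlier stores
   to the packet body are visible before the header becomes valid. */
void store_release_u16(volatile uint16_t *p, uint16_t v) {
  __atomic_store_n(p, v, __ATOMIC_RELEASE);
}
"""

def build_release_store():
  tmp = pathlib.Path(tempfile.mkdtemp())
  (tmp / "fence.c").write_text(C_SRC)
  subprocess.check_call(["cc", "-O2", "-shared", "-fPIC", "-o", str(tmp / "fence.so"), str(tmp / "fence.c")])
  lib = ctypes.CDLL(str(tmp / "fence.so"))
  lib.store_release_u16.argtypes = [ctypes.c_void_p, ctypes.c_uint16]
  lib.store_release_u16.restype = None
  return lib.store_release_u16

# usage: store_release_u16 = build_release_store(); store_release_u16(header_address, header_value)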


fxkamd commented Apr 9, 2024

Some more background about the hang you were seeing and a glimpse into the work we're doing on this issue.

ROCm uses user mode queues. Hangs in those queues are not handled by the kernel mode driver as long as the queue is preemptible. For example if you dispatch a persistent shader kernel that basically executes an infinite loop, that queue is going to "soft hang" indefinitely. But as long as the GPU scheduler firmware (MES) can preempt the queue, there is no problem. Other queues can still run, new processes can come along and create more queues and get their work executed on the GPU (assuming the persistent one isn't blocking all the wave slots).

If your kernel causes a page fault (e.g. some out-of-bounds memory access) that is handled by the kernel mode driver, and as long as the queue is still preemptible, it can just terminate your process and not affect any other process in the system. So far this is how things are supposed to work.

What you're seeing in the kernel log is the MES scheduler firmware becoming unresponsive after a queue in the CP failed to respond to a preemption request. We're looking into ways to improve the robustness of the scheduler or the driver to recover from such situations without a GPU reset if possible, by killing the wavefronts of the offending queue. If that fails, a full GPU reset is still the last resort. This will kill all the applications currently running on the GPU, but new processes should be able to use the GPU after the reset.

When you disable SDMA, you're only disabling its use in the user mode runtime. It's always needed in kernel mode for some buffer management operations. An SDMA hang detected by the kernel mode driver could also be a symptom of something else going wrong in the GPU.

We're finding and fixing some issues with the GPU reset programming sequence in our Linux driver on Navi3. We're also working on the robustness of the MES scheduler so we can recover from more situations without a full GPU reset. At the same time we're looking into understanding what's causing the hangs in the first place. We have some reproductions of such issues with Tinygrad in AMD now, so we're making progress. Getting to the bottom of that may require a bunch of low-level driver hacking and JTAG debugging of the hardware state. Our goal is to make handling of application errors as robust as possible, so that you can get back to debugging your application.


2eQTu commented Apr 9, 2024

@fxkamd Thank you for the updates and frank technical discussion. I'm not affiliated with tinygrad, but am following along too. Looking forward to whatever aspects of the firmware component(s) can eventually be open-sourced. Seeing further down the stack is always helpful.


geohot commented Apr 10, 2024

The same hang appears in HIP, HSA, and KFD, and I've confirmed removing SDMA (using kernels to copy instead) doesn't fix it. They all rely on AQL (and the complex MEC code to parse and run it), so all that seems left to do to try to fix it is PM4. tinygrad/tinygrad#4110

What I like with PM4 is that I can use umr to see exactly which packet the GPU stalled on. I have gotten some hangs with PM4, but none of the non-preemptible type. Thanks for clearing up the difference between the two hangs.

Is PM4 documented anywhere? Ideally what I want to do is treat the GPU like one queue that gets as close to the hardware as possible. Are the regCOMPUTE registers actually dispatching hardware, or is there more firmware somewhere scheduling them to shaders?


fxkamd commented Apr 10, 2024

If the hang is caused by the packet processor or something it controls directly, then you can maybe catch it with PM4 packets and UMR. Chances are that it's something triggered by the shader engines, or downstream from them. In that case, all UMR will tell you is that it's hanging at the dispatch initiator. UMR can dump wavefronts, maybe that will tell you something: https://gitlab.freedesktop.org/tomstdenis/umr/-/blob/main/doc/sphinx/source/wave_status.rst?ref_type=heads

My understanding of PM4 dispatch is that you use register writes to set up the state for the next dispatch (dimensions, kernel arguments, code object, scratch, register and LDS allocation, etc.). Then the actual dispatch packet kicks off hardware in the CP and other HW blocks that generate workgroups, which then get scheduled on the compute units. That hardware (and maybe some firmware) also handles the tracking of completed wavefronts so that it knows when the entire dispatch is completed. Your PM4 commands also need to handle cache/HDP invalidation (before) and flushing (after) to ensure memory coherence around your dispatch, and signaling back to the host (through memory and interrupts) so that your runtime can wait for completed dispatches using an HSA signal. The RELEASE_MEM packet can handle cache flushing and signaling all in one.
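
For a feel of what that looks like, here is a sketch of building the PM4 dword stream in Python. The opcode values are taken from the nvd.h header linked below; register offsets, the dispatch initiator value and the RELEASE_MEM body are GPU-specific and left out, and the helper names are made up:

# PM4 type-3 opcodes as defined in nvd.h in the amdgpu kernel driver.
PACKET3_SET_SH_REG, PACKET3_DISPATCH_DIRECT, PACKET3_RELEASE_MEM = 0x76, 0x15, 0x49

def packet3(op, body_dwords):
  # Type-3 header: bits [31:30]=3, [29:16]=body dword count - 1, [15:8]=opcode.
  return (3 << 30) | (((body_dwords - 1) & 0x3FFF) << 16) | ((op & 0xFF) << 8)

def build_dispatch(dispatch_regs, grid, initiator):
  # Returns the dword stream for one dispatch.
  q = []
  # Program the COMPUTE_* dispatch state (code object, kernargs, scratch, LDS, ...)
  # with SET_SH_REG writes, one register at a time.
  for reg_offset, value in dispatch_regs:
    q += [packet3(PACKET3_SET_SH_REG, 2), reg_offset, value]
  # Kick off workgroup generation: grid dimensions followed by the dispatch initiator.
  q += [packet3(PACKET3_DISPATCH_DIRECT, 4), *grid, initiator]
  # A RELEASE_MEM packet would follow here to flush caches and write the
  # completion fence/signal (body omitted).
  return q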

The best public documentation of PM4 packets is probably in the open-source code that uses them: the amdgpu kernel mode driver (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/nvd.h) and Mesa. Also our kfdtest uses PM4 user mode queues, though it doesn't implement all the same dispatch functionality as AQL: https://github.com/ROCm/ROCT-Thunk-Interface/blob/master/tests/kfdtest/src/Dispatch.cpp

I saw you were concerned about out-of-order execution with AQL. I'm waiting for a co-worker to write up a good explanation of what it means. My superficial understanding is that it's a bit of a misnomer. The order of AQL packets is guaranteed as long as you set the BARRIER bit in all your packet headers. This is more about the scheduling of workgroups across the hierarchy of shader engines, arrays and compute units. This feature is needed on Navi3 for reliable CWSR (compute wave save restore), which MES and our driver depend on for preempting long-running compute waves.


geohot commented Apr 10, 2024

Yea, I have to look more into the wave dumping stuff and understand the status of the shader engines. Understood re: ACQUIRE_MEM and RELEASE_MEM, and now I understand what HDP is (the PCI-E bus) and why I need to flush it.

That's where I've been getting PM4 stuff from; I was hoping there was something better. This is the sort of hardware documentation I'm hoping AMD releases: what happens after I poke regCOMPUTE_DISPATCH_INITIATOR?

My concern, after reading this, was that instead of root-causing the deadlock, the out-of-order bit was set and the test repro no longer crashed, so it was considered fixed. My real workload crashed even faster.
https://repo.radeon.com/.hidden/cfa27af7066b8ebd5c73d75110183a62/docs/Change%20Summary_6.0.3_Known_Issues%20(1).pdf
