@fxkamd · Last active April 26, 2024

Observations about HSA and KFD backends in TinyGrad

This is Felix Kuehling, long time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778

I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.

ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.

ops_hsa uses ROCr APIs to manage GPU memory, create a user mode AQL queue for GPU kernel dispatch, async SDMA copies, and signal-based synchronization with barrier packets between the two. There is also some host-side synchronization used for lazy cleanup of reusable signals and freeing memory. I only see one potential problem so far:

  • AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading packets with a valid header even before you update the write index and ring the doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know whether this can happen in a typical ResNet run.
  • Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order on the CPU. I don't know whether Python does that implicitly, e.g. because of overheads in the interpreter (see the sketch below).
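
To illustrate the ordering I mean, here is a minimal sketch in Python/ctypes (not TinyGrad's actual code; the helper and its arguments are illustrative): fill in the packet body while the header still says INVALID, publish the real header last, and only then bump the write index and ring the doorbell.

```python
import ctypes

HSA_PACKET_TYPE_INVALID = 1  # HSA spec value; packet type lives in the low bits of the header

def publish_packet(slot_addr: int, body: bytes, header: int):
    # 1. Keep the slot invalid while the 62-byte body is being filled in.
    ctypes.c_uint16.from_address(slot_addr).value = HSA_PACKET_TYPE_INVALID
    # 2. Copy the packet body (everything after the 2-byte header).
    ctypes.memmove(slot_addr + 2, body, len(body))
    # 3. Publish: one 16-bit store of the real header, strictly after the body.
    #    x86 does not reorder stores with earlier stores, but a portable version
    #    would want an explicit release fence here.
    ctypes.c_uint16.from_address(slot_addr).value = header
    # 4. Only after this: bump the write index and ring the doorbell.
```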

Now my notes on ops_kfd. There is a good chance I missed something, and I pick up something new every time I look at the code, so please take these with a grain of salt:

  • In HWComputeQueue.submit, AQL packet headers must be written after the packet contents. You may also need a memory barrier to ensure the writes complete in the right order on the CPU. The AQL packet processor can start working on packets as soon as it sees a valid header, even before you ring the doorbell.
  • Sharing device.completion_signal: this can cause race conditions when you overwrite or wait for a signal value before the previous dispatch has completed. Before reusing a signal, you need to wait for it. KFDAllocator.copyout waits for the signal, but then reuses it for multiple SDMA commands in the loop, so the wait at the end may get triggered by something that is not the last SDMA command. To avoid this, I'd only signal after the last SDMA command (see the sketch after this list). In copyin I don't see any waiting at all before the signal is used.
  • AQLAllocator.transfer seems to use the destination device for the data copy. I would expect writing to be faster than reading (easier to hide latency), so using the source device may perform better
  • Is there some code I'm missing to map either the source or destination on the other GPU for AQLAllocator.transfer?
  • Operations on wptr and doorbells may not be atomic: This could cause race conditions if the HW sees half-complete values. I don't know ctypes very well, so I don't know what atomicity guarantees it makes
  • No virtual address alignments to optimize for huge pages: This will lead to bad TLB efficiency, more page table allocations, slower memory allocation and reduced access performance
  • No suballocator for small VRAM allocations: Similar to above, if you have many small allocations, it will lead to more memory management overhead and reduced access performance
  • Highest queue priority: I don't think this gains anything if all queues end up with the same priority, but it may risk other issues by starving kernel queues (if you ever need interop, mostly for video processing).
  • Mapping only one doorbell page per GPU: Each process has two doorbell pages per GPU. You should map both. Otherwise you may have problems if you're using more SDMA queues later that end up using some of the doorbells in the second page due to how doorbells get routed in the HW
  • Queue overruns are only detected after corrupting the queues
  • No fallback to shader-based copies when SDMA queues run out: There are a limited number of SDMA queues in the HW and we don't oversubscribe them at the moment because low latency is one of the big advantages of using SDMA over shader-based copies. When they run out, SDMA queue creation will fail. ROCr has a fallback to use shader-based copies for this. As long as you run a small number of processes concurrently and use a small number of SDMA queues per device, this is no problem
  • Using same BO for compute and SDMA read/write pointers
    • Not a problem now, but be aware that the SDMA engine writes some queue usage information and internal scratch data after the RPTR
  • Circumventing ROCr breaks rocm-gdb. You won't be able to use it for debugging compute kernels
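
To make the signal-reuse point concrete, here is a hedged sketch (hypothetical helper names, not TinyGrad's or ROCr's API) of how I would batch SDMA copies against one reusable completion signal: wait before reuse, and only attach the signal to the last copy so the final wait really means "all copies done".

```python
def batched_copyout(sdma_queue, signal, chunks):
    # Reuse is only safe once the previous user of the signal has completed.
    signal.wait_eq(0)            # block until prior work has signalled completion
    signal.set(1)                # arm: the last copy will decrement this back to 0
    last = len(chunks) - 1
    for i, (dst, src, size) in enumerate(chunks):
        # Only the last copy command carries the completion signal;
        # earlier ones complete silently.
        sdma_queue.submit_copy(dst, src, size,
                               completion_signal=signal if i == last else None)
    signal.wait_eq(0)            # now this can only be satisfied by the last copy
```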
fxkamd commented Apr 10, 2024

If the hang is caused by the packet processor or something it controls directly, then you can maybe catch it with PM4 packets and UMR. Chances are that it's something triggered by the shader engines, or downstream from them. In that case, all UMR will tell you is that it's hanging at the dispatch initiator. UMR can dump wavefronts; maybe that will tell you something: https://gitlab.freedesktop.org/tomstdenis/umr/-/blob/main/doc/sphinx/source/wave_status.rst?ref_type=heads

My understanding of PM4 dispatch is that you use register writes to set up the state for the next dispatch (dimensions, kernel arguments, code object, scratch, register and LDS allocation, etc.). Then the actual dispatch packet kicks off hardware in the CP and other HW blocks that generate workgroups, which then get scheduled on the compute units. That hardware (and maybe some firmware) also handles the tracking of completed wavefronts so that it knows when the entire dispatch is completed. Your PM4 commands also need to handle cache/HDP invalidation (before) and flushing (after) to ensure memory coherence around your dispatch, and signaling back to the host (through memory and interrupts) so that your runtime can wait for completed dispatches using an HSA signal. The RELEASE_MEM packet can handle cache flushing and signaling all in one.
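
As a rough illustration of that sequence, here is a sketch of building the PM4 command stream in Python. The opcode values are the ones I recall from amdgpu's nvd.h (verify against the header before using them), and the SET_SH_REG / RELEASE_MEM payloads are elided because they depend on the ASIC-specific register headers.

```python
PACKET3_SET_SH_REG      = 0x76  # opcode values as recalled from nvd.h --
PACKET3_DISPATCH_DIRECT = 0x15  # double-check against the header
PACKET3_RELEASE_MEM     = 0x49

def packet3(opcode: int, count: int) -> int:
    """PM4 type-3 header: count = number of payload dwords minus 1."""
    return (3 << 30) | ((count & 0x3FFF) << 16) | ((opcode & 0xFF) << 8)

def dispatch_commands(dim_x: int, dim_y: int, dim_z: int, dispatch_initiator: int) -> list[int]:
    cmds: list[int] = []
    # 1. Program compute state (code object address, scratch, user SGPRs, ...)
    #    with SET_SH_REG packets; payloads omitted here.
    # cmds += [packet3(PACKET3_SET_SH_REG, n), reg_offset, *values]
    # 2. Launch: DISPATCH_DIRECT carries the grid dimensions and the
    #    COMPUTE_DISPATCH_INITIATOR value that actually kicks things off.
    cmds += [packet3(PACKET3_DISPATCH_DIRECT, 3), dim_x, dim_y, dim_z, dispatch_initiator]
    # 3. Completion: RELEASE_MEM flushes/invalidates caches and writes a fence
    #    value (optionally raising an interrupt) that the host can wait on.
    # cmds += [packet3(PACKET3_RELEASE_MEM, 6), *event_and_fence_dwords]
    return cmds
```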

The best public documentation of PM4 packets is probably in the open-source code that uses them: the amdgpu kernel mode driver (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/nvd.h) and Mesa. Also our kfdtest uses PM4 user mode queues, though it doesn't implement all the same dispatch functionality as AQL: https://github.com/ROCm/ROCT-Thunk-Interface/blob/master/tests/kfdtest/src/Dispatch.cpp

I saw you were concerned about out-of-order execution with AQL. I'm waiting for a co-worker to write up a good explanation of what it means. My superficial understanding is that it's a bit of a misnomer. The order of AQL packets is guaranteed as long as you set the BARRIER bit in all your packet headers. This is more about the scheduling of workgroups across the hierarchy of shader engines, arrays and compute units. This feature is needed on Navi3 for reliable CWSR (compute wave save restore), which MES and our driver depend on for preempting long-running compute waves.
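
For reference, this is roughly where the BARRIER bit sits in the AQL packet header (bit positions as defined in hsa.h, to the best of my recollection; a kernel dispatch with system-scope fences shown as an example):

```python
HSA_PACKET_TYPE_KERNEL_DISPATCH = 2
HSA_PACKET_HEADER_TYPE = 0                    # packet type lives in bits 0..7
HSA_PACKET_HEADER_BARRIER = 8                 # bit 8: wait for all preceding packets
HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE = 9   # bits 9..10
HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE = 11  # bits 11..12
HSA_FENCE_SCOPE_SYSTEM = 2

# A kernel dispatch that waits for all earlier packets in the queue and uses
# system-scope acquire/release fences around the kernel:
header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) \
       | (1 << HSA_PACKET_HEADER_BARRIER) \
       | (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) \
       | (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE)
```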

geohot commented Apr 10, 2024

Yea, have to look more into the wave dumping stuff and understanding the status of the shader engines. Understood re: ACQUIRE_MEM and RELEASE_MEM, and now I understand what HDP is (the PCI-E bus) and why I need to flush it.

That's where I've been getting PM4 stuff from; I was hoping there was something better. This is the sort of hardware documentation I'm hoping AMD releases: what happens after I poke regCOMPUTE_DISPATCH_INITIATOR?

My concern was that, after reading this, instead of root-causing the deadlock, the out-of-order bit was set and the test repro no longer crashed, so it was considered fixed. My real workload crashed even faster.
https://repo.radeon.com/.hidden/cfa27af7066b8ebd5c73d75110183a62/docs/Change%20Summary_6.0.3_Known_Issues%20(1).pdf
