This is Felix Kuehling, long-time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and to help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778
I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.
ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.
ops_hsa uses ROCr APIs to manage GPU memory, create a user mode AQL queue for GPU kernel dispatch, perform async SDMA copies, and do signal-based synchronization with barrier packets between the two. There is also some host-side synchronization used for lazy cleanup of reusable signals and freeing of memory. I only see one potential problem area so far:
- AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading a packet as soon as it sees a valid header, even before you update the write index and ring the doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know whether this can happen in a typical ResNet run.
- Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order on the CPU. I don't know if Python does that implicitly, e.g. because of overheads in the interpreter. See the sketch below.
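To make the header-last protocol concrete, here is a minimal, hypothetical ctypes sketch; the abbreviated packet struct and the publish helper are illustrative, not tinygrad's actual code:

```python
import ctypes

# Packet type values from the HSA spec; the type lives in the low byte
# of the 16-bit AQL packet header.
HSA_PACKET_TYPE_INVALID = 1
HSA_PACKET_TYPE_KERNEL_DISPATCH = 2

class AQLDispatchPacket(ctypes.Structure):
    # Abbreviated HSA kernel dispatch packet layout (illustrative only).
    _fields_ = [("header", ctypes.c_uint16),
                ("setup", ctypes.c_uint16),
                ("workgroup_size_x", ctypes.c_uint16),
                ("workgroup_size_y", ctypes.c_uint16),
                ("workgroup_size_z", ctypes.c_uint16),
                # ... remaining dispatch fields elided ...
                ]

def publish(slot: AQLDispatchPacket, fields: dict, header: int) -> None:
    # Keep the slot invalid while the body is written; the packet
    # processor skips INVALID slots even when it is already polling
    # past this point in the queue.
    slot.header = HSA_PACKET_TYPE_INVALID
    for name, value in fields.items():
        setattr(slot, name, value)
    # Publish the real header last. This store wants release semantics:
    # x86 doesn't reorder stores with each other, but ctypes itself
    # makes no ordering or atomicity guarantees, which is exactly the
    # open question raised above.
    slot.header = header
```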
Now my notes on ops_kfd. There is a good chance I missed something and I pick up something new every time I look at the code, so please take these with a grain of salt:
- In HWComputeQueue.submit, AQL packet headers must be written after the packet contents (see the header-last sketch above). You may also need a memory barrier to ensure the writes complete in the right order on the CPU. The AQL packet processor can start working on packets as soon as it sees a valid header, even before you ring the doorbell.
- Sharing device.completion_signal: This can cause race conditions when overwriting or waiting for a signal value before the previous dispatch has completed. Before reusing a signal, you need to wait for it. KFDAllocator.copyout waits for the signal, but then reuses it for multiple SDMA commands in the loop, so the wait at the end may be triggered by something other than the last SDMA command. To avoid this, I'd only signal after the last SDMA command (see the sketch after this list). In copyin I don't see any waiting at all before the signal is used.
- AQLAllocator.transfer seems to use the destination device for the data copy. I would expect writing to be faster than reading (it's easier to hide write latency), so using the source device may perform better.
- Is there some code I'm missing to map either the source or destination on the other GPU for AQLAllocator.transfer?
- Operations on wptr and doorbells may not be atomic: this could cause race conditions if the HW sees half-written values. I don't know ctypes well enough to know what atomicity guarantees it makes.
- No virtual address alignment to optimize for huge pages: this will lead to bad TLB efficiency, more page table allocations, slower memory allocation and reduced access performance.
- No suballocator for small VRAM allocations: similar to the above, many small allocations lead to more memory management overhead and reduced access performance.
- Highest queue priority: I don't think this gains anything if all queues end up with the same priority, but it risks starving kernel queues (which matters if you ever need interop, mostly for video processing).
- Mapping only one doorbell page per GPU: each process has two doorbell pages per GPU, and you should map both. Otherwise you may run into problems if you later use more SDMA queues and some of their doorbells end up in the second page, due to how doorbells are routed in the HW.
- Queue overruns are only detected after the queues have already been corrupted.
- No fallback to shader-based copies when SDMA queues run out: There are a limited number of SDMA queues in the HW and we don't oversubscribe them at the moment because low latency is one of the big advantages of using SDMA over shader-based copies. When they run out, SDMA queue creation will fail. ROCr has a fallback to use shader-based copies for this. As long as you run a small number of processes concurrently and use a small number of SDMA queues per device, this is no problem
- Using the same BO for compute and SDMA read/write pointers: not a problem now, but be aware that the SDMA engine writes some queue usage information and internal scratch data after the RPTR.
- Circumventing ROCr breaks rocm-gdb; you won't be able to use it for debugging compute kernels.
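On the shared-signal point above, this is roughly the pattern I'd suggest for copyout; all helpers here are hypothetical stand-ins for the real signal/SDMA plumbing, not tinygrad's or ROCr's actual API:

```python
# Hypothetical stand-ins that only document the contract this sketch
# relies on: a signal counts down to 0 when the attached work completes.
def signal_wait(sig): ...                    # block until sig == 0
def signal_reset(sig, value): ...            # set sig to value
def sdma_copy(dst, chunk, signal=None): ...  # enqueue copy; decrement signal when done

def copyout(dst, src_chunks, signal):
    # Wait for the previous user of the shared signal before reusing it.
    signal_wait(signal)
    signal_reset(signal, 1)
    last = len(src_chunks) - 1
    for i, chunk in enumerate(src_chunks):
        # Attach the signal only to the last command, so the final wait
        # can only be satisfied once the whole sequence has completed.
        sdma_copy(dst, chunk, signal=signal if i == last else None)
    signal_wait(signal)
```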
This applies to AQL queues only. The reason is that the AQL queue ABI supports multiple producers: they can concurrently allocate space on the queue with atomic operations on the write index, then write their packets, and finally ring the doorbell. Ringing the doorbell from one producer thread doesn't mean that all other producers have filled in their packets yet, so the doorbell is just a wakeup call, and the packet processor checks the packet headers to know when the packets are ready. In the other direction, the packet processor will continue to process all valid packets without waiting for another doorbell, to make sure it catches up with older work that was written into later queue slots.
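As a sketch, the producer side of that protocol looks roughly like this. hsa_queue_add_write_index_screlease and hsa_signal_store_screlease are real ROCr entry points; queue_slot and fill_packet are hypothetical helpers standing in for the packet-writing code:

```python
def queue_slot(queue, idx): ...  # hypothetical: map index to a ring buffer slot
def fill_packet(slot, pkt): ...  # hypothetical: write body first, header last

def submit(hsa, queue, pkt):
    # 1. Atomically reserve a slot. Another producer may reserve a later
    #    slot and finish writing it before this one is filled in.
    idx = hsa.hsa_queue_add_write_index_screlease(queue, 1)
    # 2. Write the packet, publishing the header last.
    fill_packet(queue_slot(queue, idx), pkt)
    # 3. Ring the doorbell. It is only a wakeup call: the packet
    #    processor keys off valid headers and will also pick up older
    #    slots that become valid after this doorbell.
    hsa.hsa_signal_store_screlease(queue.doorbell_signal, idx)
```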
Makes sense.
That's good to hear. The problem is PCIe latency. As long as the SDMA engines can keep enough reads outstanding to hide that latency, you can sustain full bandwidth. It's easier with writes, because they can be posted, so SDMA doesn't need to wait for their completion.
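To put a rough number on "enough outstanding reads" (all figures below are assumptions for illustration, not measured values), the data in flight has to cover the bandwidth-latency product:

```python
# Back-of-the-envelope estimate with assumed numbers.
bandwidth = 32e9     # assumed ~PCIe Gen4 x16 payload bandwidth, bytes/s
latency   = 800e-9   # assumed round-trip read latency, seconds
read_size = 256      # assumed bytes per outstanding read request

in_flight = bandwidth * latency       # ~25.6 KB must be in flight
outstanding = in_flight / read_size   # ~100 concurrent reads
print(f"need ~{outstanding:.0f} outstanding reads to sustain full bandwidth")
```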
Ah, I missed that.
Both the CPU and GPU use 4-level page tables (5 levels on the latest Intel and AMD CPUs). I'm using our GPU terminology here, but the structure is similar on x86 CPUs, with different names: each page table block (PTB) in the lowest level of the page table has 512 entries (PTEs) that each point to a 4KB page, so a PTB represents 2MB of address space aligned at a 2MB boundary. The next level up is a page directory block (PDB) with 512 entries (PDEs), each pointing to a PTB; it represents 1GB of address space. Instead of pointing to a PTB, a PDE can point directly to a contiguous 2MB block. This saves memory in the page table and makes memory mapping faster. It also makes the TLB cache more efficient, because the same number of entries can cover much more address space.
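The arithmetic behind those spans:

```python
# Address-space coverage at each page table level, per the description above.
ENTRIES  = 512                  # PTEs per PTB, PDEs per PDB
PAGE_4K  = 4 << 10              # 4KB page behind each PTE
PTB_SPAN = ENTRIES * PAGE_4K    # one PTB covers 2MB
PDB_SPAN = ENTRIES * PTB_SPAN   # one PDB covers 1GB
assert PTB_SPAN == 2 << 20 and PDB_SPAN == 1 << 30
```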
Our VRAM allocator in the kernel mode driver is optimized for 2MB or larger allocations and minimizes fragmentation into smaller pieces. But to map a contiguous 2MB block with a single PTE, it needs to be aligned on a 2MB boundary in virtual address space.
Similarly, the Linux memory manager supports "transparent huge pages" where it prefers 2MB pages for large allocations.
The performance impact depends on your workload. If you have mostly linear accesses or a small working set, TLB efficiency is less important. But large working sets with random access patterns benefit from better TLB efficiency. We call this TLB reach. Our GPUs' TLBs are optimized to reach all local memory with 2MB pages. If you used 4KB pages for everything, that reach would drop by a factor of 512 in the worst case.
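A quick illustration of that factor; the TLB entry count is a made-up number, and the factor of 512 is independent of it:

```python
tlb_entries = 1024                  # hypothetical TLB size
reach_4k = tlb_entries * (4 << 10)  # reach with 4KB pages: 4MB
reach_2m = tlb_entries * (2 << 20)  # reach with 2MB pages: 2GB
assert reach_2m // reach_4k == 512  # the worst-case factor above
```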
You're right. Maybe I was thrown off by the address alignment to the nearest page in the line just above. I think that should be:
You can run out either by creating more queues (I see the HSA backend uses 2 SDMA engines), or by running multiple processes. If you don't do either of those things, you'll be fine without a fallback.
I haven't used it myself because I haven't written GPU kernels, but I've worked closely with the tools and ROCr architects on making it work. rocm-gdb is integrated with gdb (hopefully in upstream gdb at some point; it depends on some updates to DWARF). It lets you debug GPU kernels much like you'd debug a multi-threaded application on the CPU. Among other things, you can enumerate running wavefronts, inspect registers and memory, single-step programs, set breakpoints, and stop when a GPU kernel throws a segfault.
But this depends on a trap handler (like an interrupt handler running in the GPU compute unit) to handle a bunch of exceptions and debug traps and send information back to the runtime or the debugger. The trap handler is loaded by ROCr. There is a handshake between debugger and ROCr to allow rocm-gdb to attach to the program before or after ROCr is initialized. rocm-gdb depends on a new-enough ROCr runtime to do all that.