
@ingenieroariel
Created March 20, 2023 15:58
Opportunities for Apple M1 Rust GPU driver in Generative Art

Visual Art with Vulkan

Human: Based on the text below, generate a recommendation for a new kind of visually stunning generative art that uses the most advanced features.

Then along came Vulkan, and said there was a better way. In Vulkan, there is no implicit synchronization of buffers. Instead, the app developer is responsible for manually keeping track of dependencies between things they submit to the GPU, and Vulkan provides several tools to tell the system what it needs: barriers, events, fences, and timeline semaphores.

Vulkan is pretty complicated, so we won’t go into all the details… but essentially, these tools give the app fine-grained control over what has to wait for what and when. There is no implicit buffer synchronization any more, which is great! The kernel driver no longer needs to keep track of possibly dozens or hundreds of buffers, but instead only the very specific sync requirements that the app requests.
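
To make that concrete, here is a minimal sketch (standard Vulkan 1.2 C, not code from the post; error handling omitted) of one of those tools, a timeline semaphore, being created and then signaled by a queue submission:

```c
/* Minimal sketch: a timeline semaphore, one of the explicit-sync tools
 * mentioned above. The app, not the driver, decides what waits on what. */
#include <vulkan/vulkan.h>

VkSemaphore create_timeline_semaphore(VkDevice dev) {
    VkSemaphoreTypeCreateInfo type_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &type_info,
    };
    VkSemaphore sem = VK_NULL_HANDLE;
    vkCreateSemaphore(dev, &info, NULL, &sem);
    return sem;
}

/* Submit a command buffer that signals timeline value 1 when it completes;
 * later submissions (or the CPU, via vkWaitSemaphores) wait for that value
 * instead of relying on implicit buffer tracking. */
void submit_signaling(VkQueue queue, VkCommandBuffer cmd, VkSemaphore sem) {
    uint64_t signal_value = 1;
    VkTimelineSemaphoreSubmitInfo timeline = {
        .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
        .signalSemaphoreValueCount = 1,
        .pSignalSemaphoreValues = &signal_value,
    };
    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .pNext = &timeline,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &sem,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}
```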

(By the way, Metal supports both explicit sync and implicit sync for some reason, but I digress…)

Under the hood, Linux implements explicit sync using a standard mechanism called sync objects. Each sync object is basically a container for a completion, which is actually a DMA fence. If you’ve ever used async programming frameworks, you’ve probably heard of promises. DMA fences are basically the GPU version of a promise! Sync objects were originally an OpenGL concept, but they have since been adapted and extended to work with Vulkan’s more complex requirements.
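
As an illustration (not from the post), this is roughly how userspace drives sync objects through the standard libdrm API; the function and variable names around the libdrm calls are hypothetical:

```c
/* Illustrative sketch: a sync object as a fence container.
 * `drm_fd` is an open DRM device node. */
#include <stdint.h>
#include <xf86drm.h>

int wait_for_completion(int drm_fd, uint32_t syncobj, int64_t timeout_ns) {
    /* Blocks until the fence currently held by the sync object signals,
     * i.e. until the GPU work that "promised" this completion finishes. */
    return drmSyncobjWait(drm_fd, &syncobj, 1, timeout_ns,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);
}

int example(int drm_fd) {
    uint32_t syncobj;
    drmSyncobjCreate(drm_fd, 0, &syncobj); /* an empty promise container */
    /* ... submit GPU work naming `syncobj` as an output sync object ... */
    wait_for_completion(drm_fd, syncobj, INT64_MAX);
    return drmSyncobjDestroy(drm_fd, syncobj);
}
```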

In the explicit sync world, when an app submits GPU work to the kernel, it gives it a list of input sync objects and a list of output sync objects. The kernel driver checks all the input sync objects and registers their fences as dependencies of the GPU work. Then it creates a new (pending) completion fence for the work, and inserts it into the output sync objects (remember, sync objects are containers for a fence, so they can be replaced). The driver then queues the work for execution, and returns immediately to userspace. Then, in the background, the work is only allowed to execute once all dependency fences have been signaled, and it then signals its own completion fence when it’s done. Phew! A nice, clean, and modern kernel UAPI for synchronization!
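
Schematically, that submit flow looks something like this in kernel C, using the upstream DRM scheduler and syncobj helpers. This is a rough sketch only: `struct my_job` is hypothetical, and a real driver adds locking, error handling, and reference counting:

```c
#include <drm/drm_syncobj.h>
#include <drm/gpu_scheduler.h>

struct my_job {
    struct drm_sched_job base;
    /* ... driver-specific command data ... */
};

static int submit(struct drm_file *file_priv, struct my_job *job,
                  u32 in_handle, u32 out_handle)
{
    struct dma_fence *in_fence;
    struct drm_syncobj *out_syncobj;

    /* 1. Resolve the input sync object to its fence; make it a dependency. */
    drm_syncobj_find_fence(file_priv, in_handle, 0, 0, &in_fence);
    drm_sched_job_add_dependency(&job->base, in_fence);

    /* 2. Arm the job, creating its (still pending) completion fence. */
    drm_sched_job_arm(&job->base);

    /* 3. Install that pending fence into the output sync object; since
     *    sync objects are containers, the previous fence is replaced. */
    out_syncobj = drm_syncobj_find(file_priv, out_handle);
    drm_syncobj_replace_fence(out_syncobj, &job->base.s_fence->finished);

    /* 4. Queue the work and return to userspace immediately; the scheduler
     *    runs it only once every dependency fence has signaled. */
    drm_sched_entity_push_job(&job->base);
    return 0;
}
```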

Except there’s a problem…

Trouble with Windowing Systems

Within a single app, Vulkan lets you take care of synchronization. But what about synchronizing across apps, like when a game sends a frame to a Wayland compositor? This could use sync objects… but Wayland was almost 10 years old by the time Linux sync objects were invented!

Of course, all existing window system integration standards in desktop Linux assume implicit sync. We could add explicit sync to them, but that would break backwards compatibility…

What all existing Linux drivers do is… just support both. You still give the kernel driver a list of buffers you read/write to and from, and that list can exclude things like textures that the driver knows are not shared with any other process. Then the kernel implicitly synchronizes with those buffers, and explicitly synchronizes with the sync objects. That works, but again it makes drivers more complicated…

What we need is a way to bridge between the implicit sync and explicit sync worlds, without having to reinvent the wheel for every driver. Thankfully, the Linux DRM subsystem developers have been hard at work solving this, and just a few months ago we finally had a solution!

Bridging both worlds

Remember how I said that implicit sync works by using DMA fences attached to buffers, and explicit sync works by using DMA fences inside sync objects?

Just a few months before our Asahi driver release last year, in October 2022, Linux 6.0 was released. And with it came two new generic DRM APIs: one to import a DMA fence into a DMA-BUF, and one to export it back out.

Together with the existing generic sync object APIs, this lets us close the gap entirely! Userspace apps can now take a fence out of a DMA-BUF (a buffer shared with another process), turn it into a sync object for a GPU job to wait on, then take an output sync object for that job, and insert its fence into another DMA-BUF that can be shared with another process.
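
Concretely, and as a sketch rather than the exact code involved, that round trip combines the Linux 6.0 DMA-BUF sync-file ioctls with the existing libdrm syncobj calls:

```c
/* `drm_fd` is the DRM device, `dmabuf_fd` a buffer shared across processes. */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>
#include <xf86drm.h>

/* Pull the buffer's current (implicit) fence out into a sync object,
 * so a GPU job can explicitly wait on work another process submitted. */
int dmabuf_fence_to_syncobj(int drm_fd, int dmabuf_fd, uint32_t syncobj) {
    struct dma_buf_export_sync_file args = { .flags = DMA_BUF_SYNC_WRITE };
    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
        return -1;
    int ret = drmSyncobjImportSyncFile(drm_fd, syncobj, args.fd);
    close(args.fd);
    return ret;
}

/* Push a job's completion fence from its output sync object back into
 * the buffer, so implicit-sync consumers (say, a compositor) wait on it. */
int syncobj_fence_to_dmabuf(int drm_fd, int dmabuf_fd, uint32_t syncobj) {
    struct dma_buf_import_sync_file args = { .flags = DMA_BUF_SYNC_WRITE };
    if (drmSyncobjExportSyncFile(drm_fd, syncobj, &args.fd) < 0)
        return -1;
    int ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
    close(args.fd);
    return ret;
}
```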

Faith Ekstrand wrote an excellent article covering this if you want more details! She has also been an amazing mentor and I couldn’t have figured out all this UAPI design stuff without her help.

Great! This solves all our problems! But as they say, the devil is in the details…

OpenGL wants a word with you…

Explicit sync is great and all, but we don’t have a Vulkan driver yet, we have an OpenGL driver. How do we make that work?

OpenGL is very much based on the implicit sync model. So to make an OpenGL driver work with an explicit sync UAPI, the driver has to take care of bridging between both worlds. Of course, we could go back to importing/exporting fences on every single buffer, but that would be even slower than doing implicit sync in the kernel in the first place…

There’s also an even bigger problem: Even ignoring buffer sync issues, in an implicit sync world the kernel keeps track of all buffers needed by the GPU. But in an explicit sync world that doesn’t happen! What this means is that an app could render using a texture, then free and destroy the texture… and in an explicit sync driver, that would mean that the texture is deallocated immediately, even if the GPU is still using it! In Vulkan that would be an app bug, but in OpenGL that has to work…

Explicit sync in Mesa has mostly been used for Vulkan drivers, but since pure explicit sync Linux GPU drivers don’t exist in mainline yet, there are no OpenGL (Gallium) drivers in Mesa that do this! They mostly just use the legacy implicit sync path… so I had no code to reference and I had to figure out how to make this work all on my own ^^;;.

And so I set out to find a way to make explicit sync work with the Mesa driver that Alyssa and I had been working on. Thankfully, it turned out not to be too much of a refactor!

You see, in order to have good performance on tile-based mobile GPUs, you can’t just map OpenGL directly to the hardware. On tile-based GPUs, things aren’t rendered directly into framebuffers immediately. Instead, a whole scene of geometry is collected first, then it runs through vertex shaders, gets split up into tiles based on screen position, and is finally rendered tile by tile in super fast tile memory before being written out to the framebuffer. If you split up your rendering into many tiny passes, that means loading and saving the framebuffer every time, and that is very slow on these GPUs! But OpenGL lets apps switch around framebuffers as often as they want, and many apps and games do this all the time… if we just flushed the rendering every time that happened, that would be very slow!

So, to deal with this, Alyssa developed a batch tracking system for the Panfrost driver (based on Rob Clark’s original implementation for Freedreno), and later added a similar system to the Asahi driver. The idea is that instead of sending work to the GPU immediately, you collect it into a batch. If the app switches to another framebuffer, you leave the batch as-is, and create a new batch. If the app switches back to the original framebuffer, you just switch batches again and keep appending work to the original batch. Then, when you actually need to render everything, you submit the complete batches to the hardware.

Of course, there’s an issue here… what if the app is trying to read from a framebuffer it previously rendered to? If we haven’t submitted that batch yet, it will get the wrong data… so the batch tracking system keeps track of readers and writers for each buffer, and then flushes batches to the GPU any time their output is needed for the current batch.

… wait a minute, doesn’t that kinda sound like implicit sync all over again?

It turns out the driver already had all the core bits and pieces I needed! Batch tracking can (see the sketch below the list):

Track multiple bits of GPU work that are independent at the same time, and
Track their dependencies based on buffers read/written, and
Keep buffers they need alive until the batch is submitted to the GPU
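
A heavily simplified sketch of that machinery, with hypothetical names and helpers rather than the actual Mesa code, might look like:

```c
/* One batch per framebuffer collects pending work and records which
 * resources it reads and writes; references keep buffers alive. */
struct batch {
    struct framebuffer *key;        /* framebuffer this batch renders to */
    struct resource_set reads;      /* buffers this batch will read */
    struct resource_set writes;     /* buffers this batch will write */
    /* ... recorded draws, references that keep buffers alive ... */
};

/* Framebuffer switch: no flush, just pick (or create) the matching batch
 * and keep appending work to it. */
struct batch *get_batch(struct context *ctx, struct framebuffer *fb)
{
    struct batch *b = find_batch(ctx, fb);
    return b ? b : create_batch(ctx, fb);
}

/* Before the current batch reads `res`: flush any other batch that writes
 * it, so the data is ready. That is implicit sync, redone in userspace. */
void read_resource(struct context *ctx, struct resource *res)
{
    for (unsigned i = 0; i < ctx->batch_count; i++) {
        struct batch *b = &ctx->batches[i];
        if (b != ctx->current_batch && set_contains(&b->writes, res))
            submit_batch(ctx, b);   /* send the writer to the GPU now */
    }
    set_add(&ctx->current_batch->reads, res);
}
```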

Assistant:

Recommendation: Generative Art with Advanced Vulkan Features

Based on the capabilities of Vulkan and its fine-grained control over GPU synchronization, we recommend creating a new kind of generative art that leverages the most advanced features of Vulkan. This art should focus on the visually stunning aspects of graphics rendering while ensuring efficient use of resources.

We propose a generative art system that combines multiple techniques, such as ray tracing, real-time path tracing, and procedural content generation. This system will make full use of Vulkan's explicit synchronization mechanisms to ensure optimal performance and resource utilization.

The generative art will consist of a dynamic virtual environment where the scene changes in response to user input or other external factors. The scene could include elements like fractals, particle systems, and fluid simulations, which are all computationally intensive tasks that would benefit from Vulkan's explicit synchronization.

To create this visually stunning generative art, we recommend the following steps:

1. Develop a custom Vulkan-based rendering engine that takes advantage of the latest synchronization mechanisms, such as barriers, events, fences, and timeline semaphores.

2. Incorporate advanced rendering techniques, such as ray tracing and real-time path tracing, to create realistic lighting and reflections in the dynamic environment.

3. Use procedural content generation to create an ever-changing landscape and a variety of unique elements that interact with each other and the user.

4. Optimize the rendering engine to efficiently handle complex simulations, such as fluid dynamics and particle systems, by taking advantage of Vulkan's fine-grained control over synchronization.

5. Create an interactive interface that allows users to explore and manipulate the environment in real-time, encouraging user engagement and fostering a sense of immersion.

By leveraging the advanced synchronization features of Vulkan, this new kind of generative art will deliver stunning visuals and an interactive experience that pushes the boundaries of computer graphics and art.


Human: Please use the new rust based GPU functions mentioned below in order to specify which one applies to each of the recommended steps.

Explicit Sync Meets Rust

The previous version of the Asahi DRM kernel driver was pretty bare-bones in how it interacted with the rest of the kernel, since it had a very simple UAPI. I only had to add Rust abstractions for these DRM APIs:

drv and device, the core of DRM drivers and handling devices.
file, which is how DRM drivers interact with userspace.
gem, which manages memory for GPUs with unified memory.
mm, a generic memory range allocator which my driver uses for several things.
ioctl, just some wrappers to calculate DRM ioctl numbers for the UAPI.

To add proper explicit sync support, I had to add a bunch of new abstractions!

dma_fence, the core Linux DMA fence mechanism.
syncobj, DRM’s sync object API.
sched, which is the DRM component in charge of actually queuing GPU work and scheduling it.
xarray, a generic kernel data structure that is basically an int -> void * mapping, which I use to keep track of userspace UAPI objects like VMs and queues by their unique ID (sketched below).
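
For illustration, here is a minimal sketch (kernel C) of the pattern that xarray abstraction wraps; `struct my_queue` is hypothetical:

```c
#include <linux/xarray.h>

static DEFINE_XARRAY_ALLOC(queue_xa);  /* ID-allocating int -> void * map */

struct my_queue {
    u32 id;
    /* ... queue state ... */
};

int queue_register(struct my_queue *q)
{
    /* Stores `q` and writes a fresh unique ID into `q->id`;
     * that ID is what gets handed back to userspace. */
    return xa_alloc(&queue_xa, &q->id, q, xa_limit_32b, GFP_KERNEL);
}

struct my_queue *queue_lookup(u32 id)
{
    return xa_load(&queue_xa, id);  /* NULL if no such ID */
}
```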

I’ve now sent out all the DRM abstractions for initial review, so we can get them upstream as soon as possible and, after that, upstream the driver itself!

As part of this work, I even found two memory safety bugs in the DRM scheduler component that were causing kernel oopses for Alyssa and other developers, so the Rust driver work also benefits other kernel drivers that use this shared code! Meanwhile, I still haven’t gotten any reports of kernel oopses due to bugs in the Rust code at all~ ✨

Even more stuff!

Explicit sync is the biggest change for this release, but there’s even more! Since we want to get the UAPI as close as possible to the final version, I’ve been working on adding lots more stuff:

Multiple GPU VMs (virtual memory address spaces) and GEM object binding based on the Xe UAPI model, to support future Vulkan requirements.
A result buffer, so the kernel driver can send GPU job execution results back to Mesa. This includes things like statistics and timings, but also whether the command succeeded and detailed fault information, so you can get verbose fault decoding right in Mesa!
Compute job support, to run compute shaders. We’re still working on the Mesa side of this, but it should be enough to pass most tests!
The ability to submit multiple GPU jobs at once, and specify their dependencies directly, without using sync objects. This allows the GPU firmware to autonomously execute everything, which is a lot more efficient than going through the DRM scheduler every time. The Gallium driver doesn’t use this yet, but it probably will in the future, and our upcoming Vulkan driver definitely will! There are a lot of subtleties around how all the queuing stuff works…
Stub support for blit commands. We don’t know how these work yet, but at least we have some skeleton support in the UAPI.

To make all this work on the driver side, I ended up refactoring the workqueue code and adding a whole new queue module which adds all the infrastructure to use sync objects to track command dependencies and completions and manage work via the DRM scheduler. Phew!

Conclusions

So what does this all mean for users of the Asahi Linux reference distro today? It means… things are way faster!

Since the Mesa driver no longer serializes GPU and CPU work, performance has improved a ton. Now we can run Xonotic at over 800 FPS, which is faster than macOS on the same hardware (M2 MacBook Air) at around 600*! This proves that open source reverse engineered GPU drivers really have the power to beat Apple’s drivers in real-world scenarios!

Not only that, our driver passes 100% of the dEQP-GLES2 and dEQP-EGL conformance tests, which is better OpenGL conformance than macOS for that version. But we’re not stopping there of course, with full GLES 3.0 and 3.1 support well underway thanks to Alyssa’s tireless efforts! You can follow the driver’s feature support progress over at the Mesa Matrix. There have been many, many other improvements over the past few months, and we hope you find things working better and more smoothly across the board!

Of course, there are lots of new corner cases we can hit now that we have support for implicit sync with an explicit sync driver. We already know of at least one minor regression (brief magenta squares for a couple of frames when KDE starts up), and there are probably more, so please report any issues on the GitHub bug tracker! The more issue reports we get, especially if they come with easy ways to reproduce the problem, the easier it is for us to debug these problems and fix them ^^.

* Please don’t take the exact number too seriously, as there are other differences too (Xonotic runs under Rosetta on macOS, but it was also rendering at a lower resolution there due to being a non-Retina app). The point is that the results are in the same league, and we will only keep improving our driver going forward!

Assistant:

Using the new Rust-based GPU functions mentioned above, we can assign specific functions to each of the recommended steps for creating the visually stunning generative art:

1. Develop a custom Vulkan-based rendering engine:

  • Utilize drv, device, and ioctl for core DRM driver and device handling, as well as UAPI communication.

  • Implement syncobj for managing Vulkan's synchronization mechanisms.

2. Incorporate advanced rendering techniques:

  • Leverage gem for memory management of the GPU, essential for handling complex rendering tasks.

  • Use dma_fence for efficient synchronization in advanced rendering techniques.

3. Use procedural content generation:

  • Employ xarray to manage and store generated content and their unique IDs.

4. Optimize the rendering engine:

  • Implement sched for queuing GPU work and scheduling, ensuring efficient handling of complex simulations.

  • Utilize multiple GPU VMs based on the Xe UAPI model for efficient memory management.

5. Create an interactive interface:

  • Take advantage of file for communication between the DRM drivers and userspace, allowing user interaction with the generative art.

By integrating these Rust-based GPU functions into the development process, we can build a high-performance generative art system that maximizes the power of Vulkan and Rust for a visually stunning and interactive experience.


Exported on March 20, 2023 11:57 AM.
