@nihalpasham
Last active February 7, 2025 01:33
CubeCL Architecture Overview - Running Rust on your GPU (WebGPU, CUDA)

CubeCL

#gpu #kernel #rust

High Level Overview:

  • GPU kernels in Rust
  • Comptime
    • Automatic vectorization
    • Instruction and shape specialization
    • Loop unrolling
  • Autotuning

Shading Languages

  • WGSL - WebGPU Shading Language
  • GLSL - OpenGL Shading Language
  • HLSL - High-Level Shading Language (DirectX)
  • MSL - Metal Shading Language

Single Source Programming Model

  • CUDA
  • ROCm
  • SYCL

CubeCL Transforms Under the Hood

  • Rust -> WGSL
  • Rust -> CUDA

High-Level CubeCL Architecture:

CubeCL provides runtimes (cubecl_wgpu and cubecl_cuda) built on top of the Wgpu and Cuda backends, respectively.

From my understanding, the current implementation includes the following constructs: ComputeClient, ComputeServer, and a Channel, which serves as the abstraction for sending requests from the client to the server.

Instantiating a ComputeClient involves two steps:

  1. Setting up the necessary data structures for each backend (e.g., wgpu_setup for Wgpu).
  2. Creating a client using the data structures from the setup, along with instantiating a MemoryManagement type to manage GPU memory allocation and deallocation strategies.

The client essentially wraps a Channel and a FeatureSet, the list of features supported by the runtime.

Once we have a ComputeClient, we can perform various tasks, such as creating or accessing resources (e.g., GPU buffers) and executing kernels. Note that invoking methods on the client will eventually route them to the ComputeServer, which holds the necessary Wgpu structures to actually create and access these resources.
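This client/channel/server indirection can be sketched with a toy model. The names below mirror CubeCL's concepts, but the types are illustrative stand-ins, not the real API: plain byte buffers play the role of GPU buffers.

```rust
use std::collections::HashMap;

// The server owns the actual resources (here: byte buffers standing in for
// GPU buffers; the real WgpuServer holds the underlying Wgpu structures).
struct ComputeServer {
    buffers: HashMap<u64, Vec<u8>>,
    next_id: u64,
}

impl ComputeServer {
    fn new() -> Self {
        Self { buffers: HashMap::new(), next_id: 0 }
    }
    fn create(&mut self, data: Vec<u8>) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(id, data);
        id // an opaque handle; the client never touches the buffer directly
    }
    fn read(&self, handle: u64) -> Vec<u8> {
        self.buffers[&handle].clone()
    }
}

// The channel is the only path from client to server.
struct Channel {
    server: ComputeServer,
}

// The client wraps the channel (the real one also carries a FeatureSet).
struct ComputeClient {
    channel: Channel,
}

impl ComputeClient {
    fn create(&mut self, data: &[u8]) -> u64 {
        self.channel.server.create(data.to_vec())
    }
    fn read(&self, handle: u64) -> Vec<u8> {
        self.channel.server.read(handle)
    }
}
```

Every call on the client is routed through the channel, and only the server ever touches the underlying storage, which is the shape of the Wgpu runtime described above.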

Translation of Rust Kernels to Native Kernels (e.g., Rust -> WGSL)

use cubecl::prelude::*;

#[cube(launch_unchecked)]
fn gelu_array<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = gelu_scalar::<F>(input[ABSOLUTE_POS]);
    }
}

#[cube]
fn gelu_scalar<F: Float>(x: F) -> F {
    x * (F::erf(x / F::sqrt(2.0.into())) + 1.0) / 2.0
}
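For reference, the same computation on the CPU. Rust's std has no erf, so this sketch substitutes the Abramowitz-Stegun polynomial approximation (max error about 1.5e-7):

```rust
// Polynomial approximation of erf (Abramowitz & Stegun 7.1.26).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    // Horner evaluation of the five-term polynomial in t.
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t;
    sign * (1.0 - poly * (-x * x).exp())
}

// Mirrors gelu_scalar above: x * (erf(x / sqrt(2)) + 1) / 2
fn gelu(x: f64) -> f64 {
    x * (erf(x / 2.0_f64.sqrt()) + 1.0) / 2.0
}
```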

CubeCL's unique selling point (USP) is its ability to write GPU kernels in Rust, as demonstrated above. However, there are a few things to keep in mind:

  • All types used in a CubeCL function must implement the CubeType trait. In the example above, both F and Array<F> are CubeCL types. They both implement the CubeType trait, while F also implements the Float trait.
  • CubeCL kernels are procedural macros that expand into Rust functions. These generated functions, which are semantically similar to the original ones, produce the Intermediate Representation (IR) when invoked.

Key point: Instead of directly generating the IR, the macro first creates a new Rust function.

The Flow:

  • In the above example, the CubeCL function annotated with the #[cube(launch_unchecked)] macro expands into a module containing a GeluArray struct that implements the Kernel trait.
pub struct GeluArray<F: Float, __R: cubecl::prelude::Runtime> {
    settings: cubecl::prelude::KernelSettings,
    __ty: ::core::marker::PhantomData<(__R, F)>,
}
  • The GeluArray struct holds the KernelSettings struct.
  • KernelSettings allows us to configure various parameters, including the vectorization factor for kernel inputs and outputs.
  • Once we configure our KernelSettings, we instantiate a KernelLauncher and register the associated kernel inputs and outputs for the kernel launch.
  • Kernel launching involves several levels of indirection:
    • The KernelLauncher invokes the ComputeClient's execute method to initiate kernel execution.
    • This method uses a Channel to route the call to the ComputeServer (in our case, the WgpuServer), which executes the kernel with the provided bindings.
  • Kernel execution involves preparing the pipeline state.
    • At this stage, the kernel is compiled into source code (i.e., WGSL).
    • Remember, the kernel is simply our GeluArray struct, which implements the Kernel trait. The Kernel trait requires two methods:
pub trait Kernel: Send + Sync + 'static + Sized {
    /// Convert to a kernel definition.
    fn define(&self) -> KernelDefinition;
    /// Identifier for the kernel, used for caching kernel compilation.
    fn id(&self) -> KernelId {
        KernelId::new::<Self>()
    }
}

Vectorization factor: For example, Elem::Float(FloatKind) with a vectorization factor of 4 represents a 4-element vector of floating-point numbers, which could be processed in a SIMD manner.
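What the factor of 4 buys can be sketched in plain Rust, with a 4-element array standing in for the GPU's vec4 math:

```rust
// One logical operation over an Elem::Float with vectorization factor 4:
// a single "variable" holds four lanes, so one op covers four scalars.
fn scale_vec4(v: [f32; 4], k: f32) -> [f32; 4] {
    [v[0] * k, v[1] * k, v[2] * k, v[3] * k]
}

// A buffer of N floats is then processed as N/4 vector operations
// instead of N scalar ones.
fn scale_all(data: &[f32], k: f32) -> Vec<f32> {
    data.chunks_exact(4)
        .flat_map(|c| scale_vec4([c[0], c[1], c[2], c[3]], k))
        .collect()
}
```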

Binding struct: a memory binding that connects a tensor handle to the actual memory (storage) on the compute server.


Preparing the Kernel

Kernel preparation involves two main steps:

  • Kernel Expansion
  • Kernel Definition

In the example above:

  • Kernel definition begins with instantiating the KernelBuilder struct and populating it with the kernel’s inputs, outputs, context, and the number of inputs and outputs.

  • Two ordered maps are required to convert and store the inputs and outputs as Variables. The order of insertion is crucial.

    Expanding the kernel input means registering an input and returning the element to be used for kernel expansion.

    Here, "element" refers to either an ExpandElement or ExpandElementTyped, which are simply wrapper types for Variables.

  • Now that we have a fully initialized KernelBuilder and expanded kernel inputs/outputs, we proceed to actual kernel expansion.
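The builder bookkeeping above can be sketched as follows. The types are illustrative, not CubeCL's actual KernelBuilder or ExpandElement; the point is that registration is order-preserving and returns a wrapper used during expansion:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Variable {
    index: usize,       // position in the ordered map; doubles as the binding slot
    kind: &'static str, // e.g. "array<f32>"
}

// Stand-in for ExpandElement / ExpandElementTyped: a wrapper over Variable.
#[derive(Clone, Debug, PartialEq)]
struct ExpandElement(Variable);

#[derive(Default)]
struct KernelBuilder {
    inputs: Vec<Variable>,  // Vec preserves insertion order, which is crucial
    outputs: Vec<Variable>,
}

impl KernelBuilder {
    // Register an input and return the element used for kernel expansion.
    fn input(&mut self, kind: &'static str) -> ExpandElement {
        let var = Variable { index: self.inputs.len(), kind };
        self.inputs.push(var.clone());
        ExpandElement(var)
    }
    fn output(&mut self, kind: &'static str) -> ExpandElement {
        let var = Variable { index: self.outputs.len(), kind };
        self.outputs.push(var.clone());
        ExpandElement(var)
    }
}
```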

Kernel Expansion

In this phase, the body of the kernel function is expanded. In the gelu example:

  • Several important data structures are involved in this process:
    • Operation: CubeCL operations that can be legally used in a GPU compute shader.
    • Variable: Holds data or CubeCL values that can be referenced during GPU compute shader operations.
    • Scope: A container that holds CubeCL operations and variables.
    • CubeContext: A wrapper type for Scope, containing root and non-root scopes and a VariablePool.
    • ExpandElement: A wrapper type for CubeCL Variables.
    • ExpandElementTyped: The typed version of ExpandElement.

CubeCL operations behave like conventional operations, taking input operands and returning a result. This behavior is modeled in CubeCL IR.

#[cube(launch_unchecked)]
fn gelu_array<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = gelu_scalar::<F>(input[ABSOLUTE_POS]);
    }
}
  • In our gelu example, the if condition:
ABSOLUTE_POS < input.len()

expands to:

/// Expanded Cube function
pub fn __expand<F: Float>(
    context: &mut cubecl::frontend::CubeContext,
    input: <Array<F> as cubecl::frontend::CubeType>::ExpandType,
    output: <Array<F> as cubecl::frontend::CubeType>::ExpandType,
) -> () {
    let _cond = {
        let _lhs = ABSOLUTE_POS::expand(context);
        let _rhs = input.clone().__expand_len_method(context);
        cubecl::frontend::lt::expand(context, _lhs, _rhs)
    };
...
...
...
  1. ABSOLUTE_POS (or _lhs) is a Variable.
  2. input.len() (or _rhs) is also a Variable.
  3. The less-than operator (<) expands into the lt::expand operation, with _lhs and _rhs as inputs, along with the context.
  4. All operations (and their operands) are added to the provided context (Scope).
  5. The order in which they are pushed onto a CubeContext (i.e., scope) is crucial.

Note: _lhs and _rhs are actually ExpandElementTyped<UInt>s.
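The five steps above can be modeled with a toy IR. The types loosely mirror CubeCL's Variable, Operation, and Scope but are not the real API; lt_expand mimics what cubecl::frontend::lt::expand does conceptually:

```rust
#[derive(Clone, Debug, PartialEq)]
enum Variable {
    AbsolutePos,           // built-in thread index (_lhs)
    Length { of: String }, // input.len() (_rhs)
    Local { id: u32 },     // result slot created by the scope
}

#[derive(Debug, PartialEq)]
enum Operation {
    Lower { lhs: Variable, rhs: Variable, out: Variable },
}

#[derive(Default)]
struct Scope {
    operations: Vec<Operation>, // push order is the order the compiler will see
    next_local: u32,
}

impl Scope {
    fn create_local(&mut self) -> Variable {
        let var = Variable::Local { id: self.next_local };
        self.next_local += 1;
        var
    }
}

// Create a result variable, record the Lower op, return the result so later
// expansions (the `if` body) can reference the condition.
fn lt_expand(scope: &mut Scope, lhs: Variable, rhs: Variable) -> Variable {
    let out = scope.create_local();
    scope.operations.push(Operation::Lower { lhs, rhs, out: out.clone() });
    out
}
```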

Kernel Definition

Once the kernel function is expanded, the next step is creating a kernel definition. The main data structures involved are:

  • KernelIntegrator: Enables the creation of a KernelDefinition based on a KernelExpansion and KernelSettings.
  • KernelExpansion: Contains the necessary information to generate a KernelDefinition.
  • KernelDefinition: Represents the finalized kernel after expansion and integration, functioning as CubeCL's intermediate representation.

The first step is to instantiate a KernelIntegrator by passing KernelSettings and invoking the integrator’s integrate method. This method combines the inputs and outputs (from the kernel expansion) into input/output bindings and returns a KernelDefinition.
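A minimal sketch of that integrate step, with simplified stand-ins for KernelExpansion, KernelSettings, and KernelDefinition (the real types carry much more, e.g. the expanded scope and compilation options):

```rust
#[derive(Debug, PartialEq)]
struct Binding {
    location: usize,
    writable: bool,
}

struct KernelExpansion {
    num_inputs: usize,
    num_outputs: usize,
}

struct KernelSettings {
    cube_dim: (u32, u32, u32), // e.g. the workgroup size
}

struct KernelDefinition {
    bindings: Vec<Binding>,
    cube_dim: (u32, u32, u32),
}

// Combine inputs and outputs, in order, into one list of bindings.
fn integrate(expansion: KernelExpansion, settings: KernelSettings) -> KernelDefinition {
    let mut bindings = Vec::new();
    for i in 0..expansion.num_inputs {
        bindings.push(Binding { location: i, writable: false });
    }
    for i in 0..expansion.num_outputs {
        bindings.push(Binding { location: expansion.num_inputs + i, writable: true });
    }
    KernelDefinition { bindings, cube_dim: settings.cube_dim }
}
```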

Kernel Definition to Target Compute Shader

As mentioned earlier, a KernelDefinition is the intermediate representation (IR) in CubeCL.

  • The final step is to map this IR to the target compute shader source code; in our case, WGSL.
  • The corresponding shader compiler, here the WgslCompiler, translates each IR variable, operation, and input/output binding into its shader source equivalent.
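That mapping can be sketched as string emission per IR operation. This is a toy stand-in for what the WgslCompiler does; the real compiler covers the full Operation and Variable IR rather than a couple of string-keyed cases:

```rust
// Emit one line of WGSL source for a binary IR operation.
fn compile_binary_op(op: &str, lhs: &str, rhs: &str, out: &str) -> String {
    match op {
        "lower" => format!("let {out} = {lhs} < {rhs};"),
        "add"   => format!("let {out} = {lhs} + {rhs};"),
        other   => panic!("unsupported op: {other}"),
    }
}
```

For the gelu kernel's condition, the lower-than op would come out as a WGSL comparison over the thread index and the array length.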

CubeCL Artefacts:
