Draft subgroup issue for gpuweb

Considerations for subgroups

One feature that is clearly out of scope for WebGPU 1.0 but is desired for the near future is subgroups. Subgroups provide a way to move data between threads within a workgroup with less overhead and latency than workgroup shared memory, but they pose significant portability challenges, even though almost all modern GPU hardware supports subgroup operations. In particular, while workgroup size is chosen by the programmer within generous ranges (WebGPU requires a minimum maximum of 256), subgroup sizes vary by hardware and also by compiler heuristics. Shaders need to be written in a way that adapts to a wide range of subgroup sizes, which is quite challenging.

This issue will be written largely from the perspective of accelerating prefix sum operations (an important primitive within Vello), but there are many potential applications. One relatively recent development is cooperative matrix operations, which are supported in most newer GPU hardware and can dramatically increase throughput for matrix operations commonly used in machine learning inference.

In general, the set of subgroup operations is fairly consistent across hardware and APIs (on devices new enough to support them), but support for size control varies more widely. I personally consider that to be a bigger barrier, both to standardization and developer experience.

Previous work

The topic of subgroups has been much discussed, and there have been detailed, concrete proposals. The most recent of those is #1459, which in turn links to a number of other relevant issues. I think it is a good starting point for a common set of subgroup functions and built-in values.

The prefix sum use case

The subgroup feature should be useful for a wide range of workloads, but I'll go into just a bit more detail on prefix sums, as it's illustrative.

With just workgroup shared memory, a typical way to implement a workgroup-wide prefix sum is a Hillis-Steele scan. A simple version using workgroup memory is as follows:

    var<workgroup> a: array<u32, WG_SIZE>;

    // Hillis-Steele inclusive scan: x starts as this thread's input
    // value and ends as its inclusive prefix sum over the workgroup.
    for (var i = 0u; i < firstTrailingBit(WG_SIZE); i++) {
        a[local_id.x] = x;
        workgroupBarrier();
        if local_id.x >= (1u << i) {
            x = a[local_id.x - (1u << i)] + x;
        }
        workgroupBarrier();
    }

This version works well, but it requires 2 * lg(WG_SIZE) barriers and a significant amount of shared memory traffic. (Note: other optimizations are possible, including ping-ponging between two arrays to reduce the number of barriers, as sketched below.)
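As a concrete sketch of that ping-pong variant (hypothetical code, using the same WG_SIZE constant and local_id as above), the barrier count drops to lg(WG_SIZE), at the cost of doubling the shared memory:

    var<workgroup> buf: array<array<u32, WG_SIZE>, 2>;

    // Ping-pong variant: each round reads the array written by the
    // previous round and writes the other one, so a single barrier
    // per round suffices (vs. two in the version above).
    buf[0][local_id.x] = x;
    for (var i = 0u; i < firstTrailingBit(WG_SIZE); i++) {
        workgroupBarrier();
        let src = i & 1u;
        var v = buf[src][local_id.x];
        if local_id.x >= (1u << i) {
            v += buf[src][local_id.x - (1u << i)];
        }
        buf[1u - src][local_id.x] = v;
    }
    // Each thread reads back only the element it wrote itself, so no
    // final barrier is needed before this read.
    x = buf[firstTrailingBit(WG_SIZE) & 1u][local_id.x];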

With subgroups, performance improves substantially. The following code (names of things up for grabs) works in the case where the workgroup size is less than or equal to the square of the subgroup size, for example when subgroup size is 32 and workgroup size is 1024. The array declaration is also written assuming that subgroup_size can be treated as a constant, which is dubious (about which more below).

In this code, I assume that subgroup_size is the size of the subgroup, subgroup_invocation_id is the index of the thread within the subgroup, and subgroup_id is the index of the subgroup within the workgroup. In general I would expect subgroup_id * subgroup_size + subgroup_invocation_id == local_invocation_index.

    var<workgroup> a: array<u32, WG_SIZE / subgroup_size>;

    // Inclusive scan within each subgroup; the last lane ends up
    // holding the subgroup's total.
    let subgroup_scan = subgroupInclusiveAdd(x);
    if subgroup_invocation_id == subgroup_size - 1u {
        a[subgroup_id] = subgroup_scan;
    }
    workgroupBarrier();
    // Each lane reads the total of one preceding subgroup (or 0), so
    // subgroupAdd yields this subgroup's exclusive prefix; every lane
    // of the subgroup computes the same value.
    let reduced = select(0u, a[subgroup_invocation_id], subgroup_invocation_id < subgroup_id);
    let prefix = subgroupAdd(reduced);
    x = prefix + subgroup_scan;

When the workgroup size is greater than the square of the subgroup size, one barrier is not sufficient. One way to express this is to write several blocks, each guarded by if subgroup_size == 8u and so on (as sketched below), and hope that the compiler does constant propagation and dead code elimination, generating code about as well optimized as if it had been specialized for a single subgroup size.
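A minimal sketch of that structure, where scan_wg_sg8 and friends are hypothetical helper functions, each written assuming a fixed subgroup size:

    // Hope that the compiler constant-folds subgroup_size and strips
    // the dead branches, leaving only the matching specialization.
    if subgroup_size == 8u {
        x = scan_wg_sg8(x);
    } else if subgroup_size == 16u {
        x = scan_wg_sg16(x);
    } else if subgroup_size == 32u {
        x = scan_wg_sg32(x);
    } else {
        x = scan_wg_generic(x);
    }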

(A detail: the above code assumes the existence of an inclusive prefix sum, which is supported in Vulkan and Metal but not in HLSL. The inclusive add is the same as the exclusive add plus the argument, so this could easily be polyfilled; or, if we decide not to include that variant, the above code could be changed to read let subgroup_scan = subgroupExclusiveAdd(x) + x;)
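Packaged as a (hypothetical) WGSL helper, the polyfill is a one-liner:

    // Inclusive scan polyfilled in terms of the exclusive scan.
    fn subgroupInclusiveAddPolyfill(x: u32) -> u32 {
        return subgroupExclusiveAdd(x) + x;
    }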

Potential future work: perform experiments on a range of hardware to quantify the performance gain.

The key takeaways are:

  • The amount of workgroup shared storage depends on the subgroup size. In the case of prefix sums, the optimum array size is generally workgroup size / subgroup size.
  • The algorithm depends quite a bit on subgroup size.

API support for subgroup size control

There are a number of useful features relating to subgroup size control, and support varies across APIs and versions. In all cases, the shader should be able to query the subgroup size (as a built-in input value, in WGSL terminology), as it is effectively impossible to write correct code otherwise. It is also highly desirable to be able to query the minimum and maximum subgroup size through the API, in order to select the most appropriate variant of the shader. In the case where minimum and maximum are equal, the application can then select a version of the shader specialized for that particular subgroup size.

Less important, but supported in many APIs, is the ability to query the subgroup size of a compiled shader through the API, and also to set a subgroup size within the range supported by the hardware. The latter is rightfully considered a somewhat dangerous feature, as it can lead to serious performance degradation. (Setting too high a subgroup size can increase register pressure and lead to spilling, particularly on Intel.)

Vulkan

Vulkan 1.1 added support for subgroups, but the initial version had a number of limitations.

  • There is no way for the application to query or control the actual subgroup size via the API.
  • There is nominally an API query for subgroup size (the subgroupSize field of VkPhysicalDeviceVulkan11Properties), but in practice it reports the maximum subgroup size rather than the actual one.
  • The gl_SubgroupSize builtin likewise reports the maximum subgroup size rather than the actual one.

It may be possible to derive the actual subgroup size by dividing the product of the gl_WorkGroupSize components by gl_NumSubgroups.

These limitations were addressed by the VK_EXT_subgroup_size_control extension, which is now mandatory in Vulkan 1.3. When this extension is enabled, the application can query the minimum and maximum subgroup size, can specify a subgroup size, and can set a flag that changes the semantics of gl_SubgroupSize to report the actual subgroup size.

Direct3D

Direct3D 12 added subgroups in Shader Model 6.0. Using them requires DXIL bytecode, as the feature is not supported in DXBC. Subgroups are called "waves" in HLSL terminology.

The subgroup size is readily queried from shader code with the WaveGetLaneCount function. As in Vulkan, the size is guaranteed to be a power of two between 4 and 128.

Minimum and maximum wave sizes are reported as the WaveLaneCountMin and WaveLaneCountMax fields of the D3D12_FEATURE_DATA_D3D12_OPTIONS1 structure, though the documentation does not inspire confidence. (TODO followup: doing an internet search for device capability reports has Intel devices reporting 16 for both minimum and maximum; what's going on?)

Shader Model 6.6 adds a WaveSize attribute, which allows specifying a subgroup size.

Metal

Metal added initial support for subgroups in Metal 2.0 on macOS, and 2.2 on iOS (corresponding to the A13 GPU and iPhone 11), where they are called "SIMD groups."

A shader can access the subgroup size through the threads_per_simdgroup attribute. Querying of minimum and maximum subgroup size is not supported through the API, but this is obviously a desirable feature. Because of the smaller diversity of supported hardware, it may be practical to infer subgroup sizes from other data. In particular, on all Apple Silicon that has shipped so far, the subgroup size is 32. The maximum subgroup size is also constrained to be 64 (this constraint is inherent in the definition of the simd_vote type).

Querying the subgroup size of a compiled shader can be done by reading the threadExecutionWidth property of a MTLComputePipelineState.

There is one additional capability available in Metal that may be helpful. The size of workgroup shared arrays can be late-bound, using the setThreadgroupMemoryLength API call. Thus, when the size of an array depends on the subgroup size, it is possible to write the Metal shader with a variable-length array, compile the shader, query the subgroup size, and then use this API call to set the appropriate array size. Exposing this functionality in a portable way is challenging.

The ballot type

One subgroup feature is the ballot, where each thread in a subgroup supplies a boolean value, and those values are packed into a single bitmask, one bit per invocation. The optimal type for the ballot result is thus dependent on the subgroup size.

Vulkan deals with this by defining the ballot type as uvec4, which at 128 bits is adequate for all subgroup sizes supported by Vulkan (powers of two between 4 and 128). HLSL is similar, returning uint4 from the ballot operations.

Metal deals with this by defining the ballot type as simd_vote, which in practice is a thin wrapper around uint64_t.

To be most consistent with existing practice, WGSL could just use vec4<u32> as the ballot type. However, if more optimization were desired, we could define a dedicated ballot type, with a width dependent on the maximum subgroup size.
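For illustration, here is a hypothetical WGSL fragment assuming a subgroupBallot builtin returning vec4<u32> (the name and exact signature are not settled); bits for invocations beyond the actual subgroup size are assumed to be zero, as in Vulkan:

    // Each invocation votes on a predicate; the result packs one bit
    // per invocation of the subgroup.
    let ballot: vec4<u32> = subgroupBallot(x > 0u);
    // Count how many invocations in the subgroup voted true.
    let votes = dot(countOneBits(ballot), vec4<u32>(1u));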

Recommendations

Minimum and maximum subgroup sizes should be available in WGSL through a mechanism similar to pipeline overrides, specifically so that arrays can be sized appropriately. For prefix sum, where the array size is workgroup size divided by subgroup size, it suffices to divide by the minimum subgroup size. That may waste some storage when the actual subgroup size is larger than the minimum, but it is always correct, and significantly better than not plumbing the minimum subgroup size at all.
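As a sketch, assuming a hypothetical min_subgroup_size constant plumbed through a mechanism like pipeline overrides:

    // Hypothetical: min_subgroup_size is supplied at pipeline creation,
    // analogous to an overridable constant. Sizing by the minimum is
    // always sufficient; if the actual subgroup size is larger, the
    // tail of the array simply goes unused.
    override min_subgroup_size: u32 = 4u;

    var<workgroup> partials: array<u32, WG_SIZE / min_subgroup_size>;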

Future

In the current shader compilation model, subgroup size is something of a layering violation. Ideally it would be available as a constant so compiled code could be specialized for it, but when it is determined by compiler heuristic, it cannot be known until the shader is analyzed at least enough to estimate register pressure and other factors.

One way to resolve this dilemma is to speculatively compile the shader at each of the supported subgroup sizes, in each case instantiating min and max to be equal, then choose the one which the heuristic determines will have the best performance. It would be good to design the feature to enable this. Thus, I recommend that the minimum and maximum values be defined as the possible range for a particular shader compilation, which might be tighter than the actual range supported by the hardware. That is a reason to define them at the WGSL level, rather than simply making them available for API query and letting the application plumb them through as pipeline overrides.

Another potential feature that is not supported by current APIs is specifying a preference or hint for subgroup size. The prefix sum application benefits from larger subgroup sizes, but obviously if register pressure causes spilling or a big drop in occupancy, it should be reduced. [Note: adding such a hint may be controversial, as there is a history of such features not pulling their weight in practice]

Possible future work: quantify the performance gains from a more carefully tuned subgroup size preference, compared with relying on existing compiler heuristics. For workloads that do not rely on inter-thread communication, there is some evidence that subgroup size does not matter much (see "Does subgroup/wave size matter?" by Faith Ekstrand).
