Consider any RISC-V vector instruction whose output operand is a vector (register group). Recall that when VL = 0, the destination register will be left undisturbed, regardless of mask/tail policy. Indeed, the assembly language provides a means for the programmer to specify the output vector: for example, in the case of a reduction, the programmer might write
```
vfredusum.vs v0, v1, v0
```
to ensure that the source "scalar" `v0` is forwarded to the output in the case VL = 0.
The RVV intrinsics API, on the other hand, does not expose this functionality. That is, if the intrinsics programmer writes
```c
// vfloat32m8_t vx;
// vfloat32m1_t vacc;
vacc = vfredusum_vs_f32m8_f32m1 (vx, vacc, vl);
```
there is no guarantee that the aforementioned register allocation will result, potentially leading to incorrect code in the case VL = 0.
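To pin down the contract the assembly programmer relies on, here is a hypothetical plain-C reference model of the ordered reduction. The name and scalarized signature are illustrative, not part of the RVV API; the point is only the VL = 0 corner.

```c
#include <stddef.h>

// Hypothetical scalar reference model of vfredosum.vs (illustrative only):
// reduces vl elements of vx into the scalar accumulator acc.
// When vl == 0, no operation is performed and acc is forwarded unchanged,
// mirroring the "destination undisturbed" behavior of the hardware.
static float vfredosum_model (float acc, float const* vx, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        acc += vx[i];
    return acc;
}
```

Under this model, a VL = 0 reduction is a no-op that forwards the accumulator, which is exactly what the assembly-level register assignment exploits.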
Let’s flesh out the example in the problem statement to motivate why this is important to the intrinsics programmer.
```c
float sum (float x0, float const* x, size_t N) {
    float acc = x0;
    for (size_t i = 0; i < N; ++i)
        acc += x[i];
    return acc;
}
```
Here is a possible (vectorized) implementation:
```
        beqz     a1, 2f
        vsetivli zero, 1, e32, m1, ta, ma
        vfmv.s.f v8, fa0
1:      vsetvli  a2, a1, e32, m8, ta, ma
        vle32.v  v0, (a0)
        sub      a1, a1, a2
        sh2add   a0, a2, a0
        vfredosum.vs v8, v0, v8
        bnez     a1, 1b
        vfmv.f.s fa0, v8
2:      ret
```
Compressing instructions appropriately, the static code size is 2+4+4+4+4+2+4+4+2+4+2 = 36B. The dynamic code size is 4B when N = 0, and 20*ceil(4*N/VLEN) + 16 otherwise.
However, recalling the behavior when VL = 0, observe that the loop head branch (the initial `beqz`) can be omitted without affecting correctness.
The static code size reduces to 34B,
and the dynamic code size becomes 20*max(1, ceil(4*N/VLEN)) + 14; in particular, when N = 0 the loop body now executes once (harmlessly, with VL = 0), for a total of 34B.
In summary, when N is nonzero --- presumably the common case --- we can reduce static and dynamic code size each by two bytes by leveraging the VL = 0 behavior.
In addition to the general benefits of reducing the number of dynamic instructions, there are also benefits to avoiding branches in particular: we might avoid the penalty of a branch misprediction, while also reducing pressure on the branch predictor’s memory.
Serendipitously, the following intrinsics implementation
```c
vfloat32m1_t vacc = vfmv_s_f_f32m1 (x0, 1);
do {
    size_t vl = vsetvl_e32m8 (N);
    vfloat32m8_t vx = vle32_v_f32m8 (x, vl);
    vacc = vfredosum_vs_f32m8_f32m1 (vx, vacc, vl);
    x += vl;
    N -= vl;
} while (N);
return vfmv_f_s_f32m1_f32 (vacc);
```
generates the desired (minimal) code (*).
Unfortunately, this relies on the compiler deciding to implement the reassignment of `vacc` in place, for which there is no guarantee.
The question is how to empower the intrinsics programmer to specify this semantics.
(*) This requires using the policy API in the 2022.08.2 toolchain.
How should the intrinsics programmer express this algorithm? Is it possible to achieve the same code-size savings as the assembly programmer?
Currently, the recommended way to write portable, correct code is to explicitly handle the case of VL = 0.
In the previous example, one approach is for the programmer to replace the `do { … } while (N);` with `while (N) { … }`.
This does not achieve the code-size savings.
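For reference, the guarded formulation can be modeled in plain C. In this illustrative sketch, the chunking by `vlmax` stands in for `vsetvli`, and the inner loop stands in for the load and reduction; the `while (N)` guard ensures no VL = 0 reduction is ever issued, so correctness does not depend on any forwarding behavior.

```c
#include <stddef.h>

// Plain-C model of the guarded, strip-mined reduction (illustrative).
// vlmax stands in for the hardware vector length selected by vsetvli.
// The while (N) guard skips the loop body entirely when N == 0.
static float sum_guarded (float x0, float const* x, size_t N, size_t vlmax) {
    float acc = x0;
    while (N) {
        size_t vl = N < vlmax ? N : vlmax;
        for (size_t i = 0; i < vl; ++i)   // models vle32.v + vfredosum.vs
            acc += x[i];
        x += vl;
        N -= vl;
    }
    return acc;
}
```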
In principle, an optimizing compiler could elide the loop head branch (effectively reverting the `do`-to-`while` transformation), provided that the register allocation for the `vfredosum.vs` is appropriately constrained.
It is unclear how difficult it would be to implement such an optimization,
or how well it would generalize.
Subjectively, it also violates the programmer’s intent.
We could constrain the semantics of the reduction intrinsics so that the "scalar" input is always forwarded to the output in the case VL = 0. Unfortunately, in use-cases where the original input is needed subsequently, the compiler will have to generate a copy or spill, likely defeating the code-size savings in addition to incurring a performance cost. While it’s not clear how common such use-cases are, this seems like a risky strategy.
One workaround that appears to achieve the desired code-size savings involves using a tail-undisturbed form of the intrinsic:
```c
vacc = vfredusum_vs_f32m8_f32m1_tu (undisturbed, vx, vacc, vl);
```
The semantics are that the new leading input argument, `undisturbed`, provides the "undisturbed" elements.
(These policy intrinsics were a recent addition to the API, and this behavior was exposed in a more limited and convoluted manner in the legacy API.)
LLVM’s implementation constrains register allocation so that a register holding (a copy of) `undisturbed` is used as the destination register; this incidentally achieves the desired forwarding behavior when VL = 0.
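To make the workaround's semantics explicit, here is a hypothetical element-wise model of the tail-undisturbed reduction. The real intrinsic operates on vector registers; this scalarized sketch (names illustrative) only pins down the tail behavior and the VL = 0 corner.

```c
#include <stddef.h>

// Hypothetical element-wise model of the _tu reduction (illustrative).
// out and undist each model an m1 destination group of n elements.
// Element 0 receives the reduction; the tail (elements 1..n-1) is taken
// from undist. When vl == 0, the whole destination is left as undist,
// which is the forwarding behavior the workaround relies on.
static void vfredusum_tu_model (float* out, float const* undist,
                                float const* vx, float acc,
                                size_t vl, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = undist[i];
    if (vl == 0)
        return;                  // VL = 0: destination undisturbed
    for (size_t i = 0; i < vl; ++i)
        acc += vx[i];
    out[0] = acc;
}
```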
Unfortunately, we are relying on implementation-defined behavior. And even if we constrain the semantics to codify the desired behavior, this approach may still incur a performance penalty on some implementations, due to an unnecessary use of the tail-undisturbed policy.
There is a similar workaround involving "trivially" masked instructions, with mask-undisturbed policy, but this is not relevant to reductions, where masking is understood with respect to the input vector `vx`, rather than the output.
Motivated by the previous approach, we consider defining new intrinsics, similar to the tail-undisturbed ones, having an extra input vector:
```c
... = vfredusum_vs_f32m8_f32m1 (forward, vx, vacc, vl);
```
The semantics are simply that `forward` will be forwarded to the output when VL = 0.
This avoids the issues particular to tail policy.
And in the case of tail-undisturbed reductions, we could simply use `undisturbed` for this role. (Masked tail-agnostic reductions would be modified with a `forward` argument.)
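A hypothetical reference model of the proposed semantics (names illustrative, not part of any current API) would be:

```c
#include <stddef.h>

// Illustrative reference semantics for the proposed intrinsic with an
// explicit forward operand: when vl == 0, the result is forward;
// otherwise forward is ignored and the reduction proceeds from the
// scalar accumulator acc as usual.
static float vfredusum_forward_model (float forward, float const* vx,
                                      float acc, size_t vl) {
    if (vl == 0)
        return forward;
    for (size_t i = 0; i < vl; ++i)
        acc += vx[i];
    return acc;
}
```

Passing `vacc` itself as `forward` recovers the in-place forwarding that the assembly version achieves with `vfredosum.vs v8, v0, v8`.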
Unfortunately, the intrinsics API is already quite large (~70,000 intrinsics in the most recent proposal),
and further duplication is undesirable, especially if this extension goes beyond reductions.
If we do pursue this direction,
perhaps it would be best to redefine the existing intrinsics,
rather than introducing new ones.
If the programmer does not wish to specify `forward`, they can always use the `vundefined_*()` helper function, like in the legacy API.
It is desirable that the intrinsics programmer can achieve the same code-size savings as the assembly programmer, without performance penalty. Unfortunately, the current intrinsics API proposal does not adequately provide this functionality, even with the help of an optimizing compiler. The way forward seems to be to define new intrinsics, perhaps redefining existing ones.
Our motivating example has focused on reductions.
In fact, all of the motivating examples we’ve discovered involve reductions.
Before pursuing this further, we should clarify the scope, if any, beyond reductions.
For example, `vadd.vx` also leaves its destination register undisturbed when VL = 0, but it is unclear whether we need a `forward` version of it.
For example, consider an application where we add a scalar to zero or more leading elements of a vector,
and then subsequently access an element outside of those that were modified.
It seems certain that such an application would use some kind of undisturbed `vadd.vx`; otherwise, the unmodified elements accessed subsequently would have unpredictable values (due to agnostic policy).
Is there a use-case where we only require undisturbed behavior in the case that no elements were modified?
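The distinction can be sketched with an illustrative plain-C model of `vadd.vx` under tail-agnostic policy, where a poison constant stands in for the arbitrary values the specification permits in the tail:

```c
#include <stddef.h>

// Illustrative model of vadd.vx with tail-agnostic policy on an n-element
// register group. When vl == 0, the destination vd is left entirely
// undisturbed; when 0 < vl < n, the tail elements vd[vl..n-1] may take
// arbitrary values (modeled here by writing a poison constant).
#define POISON 12345.0f
static void vadd_vx_ta_model (float* vd, float const* vs2, float x,
                              size_t vl, size_t n) {
    if (vl == 0)
        return;                  // VL = 0: fully undisturbed
    for (size_t i = 0; i < vl; ++i)
        vd[i] = vs2[i] + x;
    for (size_t i = vl; i < n; ++i)
        vd[i] = POISON;          // tail-agnostic: unpredictable contents
}
```

An application that later reads elements at indices at or beyond `vl` cannot rely on the VL = 0 case alone; it needs undisturbed behavior for every `vl`, which is the tail-undisturbed policy, not a `forward` variant.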
Thanks for producing the writeup. The other issue not listed so far is that the extra branch also pollutes branch predictors.