@nick-knight
Last active November 2, 2022 20:12
Supporting VL = 0 Behavior in RVV Intrinsics

Problem Statement

Consider any RISC-V vector instruction whose output operand is a vector (register group). Recall that when VL = 0, the destination register will be left undisturbed, regardless of mask/tail policy. Indeed, the assembly language provides a means for the programmer to specify the output vector: for example, in the case of a reduction, the programmer might write

vfredusum.vs v0, v1, v0

to ensure that the source "scalar" v0 is forwarded to the output in the case VL = 0.
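As a point of reference, this rule can be written out as a scalar C model (a hypothetical helper, not part of any API): when vl is zero, the old destination value is returned untouched, so naming the same register as both scalar source and destination forwards the scalar.

```c
#include <stddef.h>

/* Scalar reference model (hypothetical) of vfredosum.vs with the VL = 0
 * rule: when vl == 0 the destination register is left undisturbed, so the
 * old destination value is returned as-is. */
static float vfredosum_model(float dest_old, const float *src, float scalar,
                             size_t vl) {
    if (vl == 0)
        return dest_old;   /* VL = 0: destination undisturbed */
    float acc = scalar;
    for (size_t i = 0; i < vl; ++i)
        acc += src[i];     /* ordered accumulation of the active elements */
    return acc;
}
```

Calling vfredosum_model(x, src, x, vl) mirrors the vfredusum.vs v0, v1, v0 idiom: the result is x whenever vl == 0.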

The RVV intrinsics API, on the other hand, does not expose this functionality. That is, if the intrinsics programmer writes

// vfloat32m8_t vx;
// vfloat32m1_t vacc;
vacc = vfredusum_vs_f32m8_f32m1 (vx, vacc, vl);

there is no guarantee that the aforementioned register allocation will result, potentially leading to incorrect code in the case VL = 0.

Case Study

Let’s flesh out the example in the problem statement to motivate why this is important to the intrinsics programmer.

float sum (float x0, float const* x, size_t N) {
    float acc = x0;
    for (size_t i = 0; i < N; ++i)
        acc += x[i];
    return acc;
}

Here is a possible (vectorized) implementation:

  beqz          a1, 2f
  vsetivli      zero, 1, e32, m1, ta, ma
  vfmv.s.f      v8, fa0
1:
  vsetvli       a2, a1, e32, m8, ta, ma
  vle32.v       v0, (a0)
  sub           a1, a1, a2
  sh2add        a0, a2, a0
  vfredosum.vs  v8, v0, v8
  bnez          a1, 1b
  vfmv.f.s      fa0, v8
2:
  ret

Compressing instructions appropriately, the static code size is 2+4+4+4+4+2+4+4+2+4+2 = 36B. The dynamic code size is 4B when N = 0, and 20*ceil(4*N/VLEN) + 16 otherwise.
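These counts can be sanity-checked in plain C (illustrative only; dyn_size is a hypothetical helper, and VLEN = 128 is just a sample value — with e32 and LMUL = 8, one iteration covers VLEN/4 elements):

```c
#include <stddef.h>

/* Dynamic code size in bytes of the branch-guarded loop for N > 0:
 * 20B of loop body per iteration plus 16B outside the loop, i.e.
 * 20*ceil(4*N/VLEN) + 16. */
static size_t dyn_size(size_t N, size_t vlen_bits) {
    size_t elems_per_iter = vlen_bits / 4;   /* e32 elements in an m8 group */
    size_t iters = (N + elems_per_iter - 1) / elems_per_iter;
    return 20 * iters + 16;
}
```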

However, recalling the behavior when VL = 0, observe that the loop head branch (the initial beqz) can be omitted without affecting correctness. The static code size reduces to 34B, and the dynamic code size becomes 20*ceil(4*N/VLEN) + 14 for N > 0. (When N = 0, the loop body now executes once with VL = 0, so the dynamic code size is 34B rather than 4B.)

In summary, when N is nonzero --- presumably the common case --- we can reduce static and dynamic code size each by two bytes by leveraging the VL = 0 behavior.

In addition to the general benefits of reducing the number of dynamic instructions, there are also benefits to avoiding branches in particular: we might avoid the penalty of a branch misprediction, while also reducing pressure on the branch predictor’s memory.

Serendipitously, the following intrinsics implementation

vfloat32m1_t vacc = vfmv_s_f_f32m1 (x0, 1);
do {
    size_t vl = vsetvl_e32m8 (N);
    vfloat32m8_t vx = vle32_v_f32m8 (x, vl);
    vacc = vfredosum_vs_f32m8_f32m1 (vx, vacc, vl);
    x += vl;
    N -= vl;
} while (N);
return vfmv_f_s_f32m1_f32 (vacc);

generates (note*) the desired (minimal) code. Unfortunately, this relies on the compiler deciding to implement the reassignment of vacc in place, for which there is no guarantee. The question is how to empower the intrinsics programmer to specify these semantics.

(*) Requires using the policy API in the 2022.08.2 toolchain.

Solutions and Non-Solutions

How should the intrinsics programmer express this algorithm? Is it possible to achieve the same code-size savings as the assembly programmer?

Handle VL = 0 explicitly

Currently, the recommended way to write portable, correct code is to explicitly handle the case of VL = 0.

In the previous example, one approach is for the programmer to replace the do { ... } while (N); with while (N) { ... }. This does not achieve the code-size savings. In principle, an optimizing compiler could elide the loop head branch (effectively reverting the do-to-while transformation), provided that the register allocation for the vfredosum.vs is appropriately constrained. It is unclear how difficult it would be to implement such an optimization, or how well it would generalize. Subjectively, it also violates the programmer's intent.
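The control-flow difference can be sketched in scalar C (hypothetical names, no intrinsics; CHUNK stands in for VLMAX): the while form branches over the body when N = 0, whereas the do-while form executes the body once with a zero-length chunk and relies on that pass leaving the accumulator undisturbed.

```c
#include <stddef.h>

#define CHUNK 8   /* stand-in for VLMAX */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* while form: a loop-head branch guards the N == 0 case. */
static float sum_while(float x0, const float *x, size_t N) {
    float acc = x0;
    while (N) {
        size_t vl = min_sz(N, CHUNK);
        for (size_t i = 0; i < vl; ++i) acc += x[i];
        x += vl; N -= vl;
    }
    return acc;
}

/* do-while form: when N == 0, the body runs once with vl == 0, which
 * mimics a VL = 0 pass that leaves the accumulator undisturbed. */
static float sum_dowhile(float x0, const float *x, size_t N) {
    float acc = x0;
    do {
        size_t vl = min_sz(N, CHUNK);
        for (size_t i = 0; i < vl; ++i) acc += x[i];
        x += vl; N -= vl;
    } while (N);
    return acc;
}
```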

Tie input and output registers

We could constrain the semantics of the reduction intrinsics so that the "scalar" input is always forwarded to the output in the case VL = 0. Unfortunately, in use-cases where the original input is needed subsequently, the compiler will have to generate a copy or spill, likely defeating the code-size savings in addition to the performance impact. While it’s not clear how common such use-cases are, this seems like a risky strategy.

Use undisturbed policy intrinsics

One workaround that appears to achieve the desired code-size savings involves using a tail-undisturbed form of the intrinsic:

vacc = vfredusum_vs_f32m8_f32m1_tu (undisturbed, vx, vacc, vl);

The semantics are that the new leading input argument, undisturbed, provides the "undisturbed" elements. (These policy intrinsics were a recent addition to the API, and this behavior was exposed in a more limited and convoluted manner in the legacy API.) LLVM’s implementation constrains register allocation so that a register holding (a copy of) undisturbed is used as the destination register; this incidentally achieves the desired forwarding behavior when VL = 0.

Unfortunately, we are relying on implementation-defined behavior. And even if we constrain the semantics to codify the desired behavior, this approach may still incur a performance penalty on some implementations, due to an unnecessary use of the tail-undisturbed policy.

There is a similar workaround using "trivially" masked instructions with the mask-undisturbed policy, but it is not relevant to reductions, where masking is understood with respect to the input vector vx, rather than the output.

Define new intrinsics

Motivated by the previous approach, we consider defining new intrinsics, similar to the tail-undisturbed ones, having an extra input vector:

... = vfredusum_vs_f32m8_f32m1 (forward, vx, vacc, vl);

The semantics are simply that forward will be forwarded to the output when VL = 0. This avoids the issues particular to tail policy. In the case of tail-undisturbed reductions, we could simply use undisturbed for this role. (Masked tail-agnostic reductions would similarly gain a forward argument.)
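A scalar model of the proposed semantics (a hypothetical C helper; only the forward-on-VL-0 rule is taken from the sketch above, everything else is illustrative):

```c
#include <stddef.h>

/* Hypothetical model of the proposed reduction intrinsic with a 'forward'
 * argument: when vl == 0 the result is 'forward'; otherwise it is the
 * ordinary reduction of 'scalar' with the first vl elements of 'vx'. */
static float vfredusum_forward_model(float forward, const float *vx,
                                     float scalar, size_t vl) {
    if (vl == 0)
        return forward;    /* VL = 0: forward the extra argument */
    float acc = scalar;
    for (size_t i = 0; i < vl; ++i)
        acc += vx[i];
    return acc;
}
```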

Unfortunately, the intrinsics API is already quite large (~70,000 intrinsics in the most recent proposal), and further duplication is undesirable, especially if this extension goes beyond reductions. If we do pursue this direction, perhaps it would be best to redefine the existing intrinsics rather than introduce new ones. If the programmer does not wish to specify forward, they can always use the vundefined_*() helper function, as in the legacy API.

Discussion

It is desirable that the intrinsics programmer can achieve the same code-size savings as the assembly programmer, without performance penalty. Unfortunately, the current intrinsics API proposal does not adequately provide this functionality, even with the help of an optimizing compiler. The way forward seems to be to define new intrinsics, perhaps redefining existing ones.

Our motivating example has focused on reductions. In fact, all of the motivating examples we've discovered involve reductions. Before pursuing this further, we should clarify the scope, if any, beyond reductions. For example, vadd.vx also leaves its destination register undisturbed when VL = 0, but it is unclear whether we need a forward version of it. Consider an application where we add a scalar to zero or more leading elements of a vector, and then subsequently access an element outside of those that were modified. It seems certain that such an application would use some kind of undisturbed vadd.vx; otherwise, the unmodified elements accessed subsequently would have unpredictable values (due to agnostic policy). Is there a use-case where we only require undisturbed behavior in the case that no elements were modified?

@nick-knight (Author)

@kasanovic Thanks for your comment. No, I don't trust the compiler to do this. And the compiler engineers aren't exactly sure how to do it, outside of pattern-matching special cases.

I have attempted to clarify the problem statement, and added a case study that makes explicit the nature of the (static and dynamic) code-size savings at stake. I have described several non-solutions to the problem, and one solution.

I also added a discussion at the end, which attempts to restate the question I tried to pose to you on Slack.

@kasanovic

Thanks for producing the writeup. The other issue not listed so far is that the extra branch also pollutes branch predictors.

@nick-knight (Author)

> The other issue not listed so far is that the extra branch also pollutes branch predictors.

Thanks, I added a sentence mentioning this additional benefit.
