Update: this proposal was approved by the RVV intrinsics task group and implemented in v0.11 of the API.
In the same revision, a `__riscv_` prefix was added to every intrinsic function.
Aside from fixing errors, I will not make changes to the rest of this document.
Note: our community recently added a new "policy" API to the RVV intrinsics to support tail and mask policies (PR #137). This proposal concerns an alternative way to support this functionality. The proposed API is arguably simpler, but it introduces some incompatibilities with the original ("legacy") API.
You can skip this section if you are already familiar with RVV mask/tail policies, although you should probably review the distinction, made below, between "masked" and "unmasked" instructions, because this definition is somewhat inconsistent in the V-extension specification.
RVV instructions support two types of tail policy, specified by the `vtype.vta` bit:
- Tail-undisturbed (`vta = 0`) means the tail elements of the destination are preserved.
- Tail-agnostic (`vta = 1`) means that each tail element of the destination can be
  - left undisturbed,
  - overwritten by ones, or,
  - for mask-producing instructions besides mask loads, overwritten by the result of the instruction as if `vl` were larger.

  This behavior may vary element-by-element, and may vary across executions of the same instruction for the same inputs.
- Some instructions have the same behavior regardless of `vta`:
  - Instructions that produce a mask register have tail-agnostic policy regardless of `vta`.
  - Instructions with no tail elements, like stores or those whose output is an X- or F-register, have the same behavior regardless of `vta`.
  - Some instructions have tail elements whose particular values are guaranteed to be unchanged regardless of `vta`, for example, a vector load whose destination’s tail is already all ones.
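The two tail policies can be illustrated with a scalar model. This is only a sketch: `model_vadd`, `enum tail_policy`, and the 8-bit element type are hypothetical names chosen for illustration, not part of any RVV API, and the model arbitrarily picks "overwritten by ones" as its tail-agnostic behavior even though leaving the tail untouched would be equally legal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model; none of these names belong to the RVV API. */
enum tail_policy { TAIL_UNDISTURBED = 0, TAIL_AGNOSTIC = 1 };

/* Models an unmasked 8-bit vector add into vd, which holds vlmax elements.
 * Elements [0, vl) are the body; elements [vl, vlmax) are the tail. */
static void model_vadd(uint8_t *vd, const uint8_t *vs1, const uint8_t *vs2,
                       size_t vl, size_t vlmax, enum tail_policy vta)
{
    assert(vl <= vlmax);

    /* Body elements are always written. */
    for (size_t i = 0; i < vl; i++)
        vd[i] = (uint8_t)(vs1[i] + vs2[i]);

    if (vta == TAIL_AGNOSTIC) {
        /* Assumption: this model always chooses "overwritten by ones".
         * Real hardware may instead leave any tail element undisturbed. */
        for (size_t i = vl; i < vlmax; i++)
            vd[i] = 0xFF;
    }
    /* TAIL_UNDISTURBED: the tail is simply not written. */
}
```

Code that is correct under tail-agnostic policy must not depend on which of the legal outcomes the model (or the hardware) happens to choose.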
RVV instructions can be considered masked or unmasked.
- Masked instructions are those that use `v0` as an indicator of which body elements are active/inactive.
- Unmasked instructions are those that take all body elements as active, independent of `v0`.
The `vm` bit in the instruction encoding is deasserted when `v0` is treated as an additional input operand.
This includes the case of all masked instructions, as well as a few exceptional unmasked instructions (add/subtract with borrow/carry, merge, compress).
In a few cases, the V-extension spec confusingly refers to these exceptional cases as "masked",
whereas "encoded as if masked" might be clearer language.
The point is that the `vm` bit alone cannot be used to distinguish between masked and unmasked.
Inactive elements' behavior depends on the mask policy, specified by the `vtype.vma` bit:
- Mask-undisturbed (`vma = 0`) means the inactive elements are preserved.
- Mask-agnostic (`vma = 1`) means that each inactive element can be
  - left undisturbed or
  - overwritten by ones.

  This behavior may vary element-by-element, and may vary across executions of the same instruction for the same inputs.
- Some instructions have the same behavior regardless of `vma`:
  - Unmasked instructions,
  - Instructions with all body elements active,
  - Reductions,
  - Stores,
  - Instructions whose output is an X- or F-register,
  - Instructions with no body elements (`vl = 0` or `vstart >= vl`).
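The mask policies can be sketched with the same kind of scalar model. The names `model_vadd_masked` and `enum mask_policy` are hypothetical, the mask is modeled as an array of `bool` rather than a packed `v0` register, and the model again picks "overwritten by ones" for the agnostic case as an arbitrary but legal choice. Tail elements are deliberately left untouched so that only the mask policy is visible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model; none of these names belong to the RVV API. */
enum mask_policy { MASK_UNDISTURBED = 0, MASK_AGNOSTIC = 1 };

/* Models a masked 8-bit vector add over body elements [0, vl).
 * v0[i] == true marks body element i as active. */
static void model_vadd_masked(uint8_t *vd, const bool *v0,
                              const uint8_t *vs1, const uint8_t *vs2,
                              size_t vl, enum mask_policy vma)
{
    for (size_t i = 0; i < vl; i++) {
        if (v0[i]) {
            /* Active elements receive the result under either policy. */
            vd[i] = (uint8_t)(vs1[i] + vs2[i]);
        } else if (vma == MASK_AGNOSTIC) {
            /* Assumption: the model always overwrites inactive elements
             * with ones; leaving them undisturbed is equally legal. */
            vd[i] = 0xFF;
        }
        /* MASK_UNDISTURBED: inactive elements are not written. */
    }
}
```

With `vl = 0` the loop body never runs, which matches the last item in the list: an instruction with no body elements behaves the same regardless of `vma`.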
In typical applications, `vtype` is known statically,
with the only extant exception being context save/restore code.
The current intrinsics API allows the programmer to statically specify SEW (`vtype.vsew`) and LMUL (`vtype.vlmul`)
as part of each intrinsic function name,
but tail and mask policies (`vtype.vta` and `vtype.vma`) are not directly exposed.
Here is a summary of the proposed policy suffixes:
- All instructions, besides stores and those whose output is a mask or an X- or F-register, will have additional variants decorated by `tu`.
- All masked instructions will be distinguished from their unmasked counterparts by `m` or `mu`.
- All masked instructions, besides stores, reductions, and those whose output is an X- or F-register, will have additional variants decorated by `mu`.
Here are the details:
Suffix | Masked? | Tail behavior | Mask behavior | Applicability | Extra arguments |
---|---|---|---|---|---|
(none) | unmasked | compiler-defined | compiler-defined | unmasked instructions | (none) |
 | unmasked | undisturbed | compiler-defined | unmasked instructions excluding stores and those whose output is a mask or an X- or F-register | |
 | masked | compiler-defined | compiler-defined | masked instructions | |
 | masked | compiler-defined | undisturbed | masked instructions excluding stores, reductions, and those whose output is an X- or F-register | |
 | masked | undisturbed | compiler-defined | masked instructions excluding stores and those whose output is a mask or an X- or F-register | |
 | masked | undisturbed | undisturbed | masked instructions excluding stores, reductions, and those whose output is a mask or an X- or F-register | |
(*) Some instructions' destination vector is also an input operand independent of mask/tail policy, so it is unnecessary to add an additional `undisturbed` argument in these cases.
- Segment loads are passed pointers to their destination vectors; these vectors can provide undisturbed elements. (In the case of the tuple API, some segment loads will need to be augmented with an `undisturbed` tuple.)
- The output vector of FMAs is also an input, so it can provide undisturbed elements.
- `vslideup.v{x,i}` also take their output vector as an input (at least when 0 < OFFSET <= vl), so it can provide undisturbed elements.
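The role of the `undisturbed` argument, and the reason FMAs do not need one, can be sketched in value-returning style, which is how intrinsics are usually written. Everything here is an assumption made for illustration: the `model_*` names, the fixed `MODEL_VLMAX`, the struct-based vector types, and in particular the argument order, none of which is taken from the proposal.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_VLMAX 8 /* hypothetical register length, in elements */

/* Hypothetical value types standing in for vector and mask registers. */
typedef struct { uint8_t e[MODEL_VLMAX]; } model_vec;
typedef struct { bool b[MODEL_VLMAX]; } model_mask;

/* Model of a masked add with tail- and mask-undisturbed behavior.
 * The result starts as a copy of `undisturbed`; only active body
 * elements are overwritten. Argument order is an assumption. */
static model_vec model_vadd_tumu(model_mask mask, model_vec undisturbed,
                                 model_vec op1, model_vec op2, size_t vl)
{
    model_vec vd = undisturbed;
    for (size_t i = 0; i < vl && i < MODEL_VLMAX; i++)
        if (mask.b[i])
            vd.e[i] = (uint8_t)(op1.e[i] + op2.e[i]);
    return vd;
}

/* Model of a tail-undisturbed multiply-accumulate, vd = vd + op1 * op2.
 * The accumulator vd is already an input, so it also supplies the
 * undisturbed tail and no extra argument is needed. */
static model_vec model_vmacc_tu(model_vec vd, model_vec op1, model_vec op2,
                                size_t vl)
{
    for (size_t i = 0; i < vl && i < MODEL_VLMAX; i++)
        vd.e[i] = (uint8_t)(vd.e[i] + op1.e[i] * op2.e[i]);
    return vd;
}
```

In the first function the caller must supply a separate vector to fill the inactive and tail positions; in the second, the accumulator already plays that role.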
This API proposal does not remove all redundancy, for example:
- A vector load whose destination is already all ones is guaranteed to be unaffected by mask and tail policies. However, the contents of a vector register may not be statically known.
- Tail policy is irrelevant in a reduction with SEW = VLEN = 32 on Zve32x. However, VLEN may not be statically known (only its upper and lower bounds).
- Tail policy is irrelevant in many cases when VL = VLMAX. However, VL and VLEN (which determines VLMAX) may not be statically known.
- Masking and mask policy are irrelevant if all the mask bits are asserted. However, as just mentioned, the contents of a vector register may not be statically known.
- Masking and mask policy are also irrelevant when `vl = 0` or `vstart >= vl`. However, `vl` and `vstart` may not be statically known.
If the programmer can prove statically that the mask or tail policy is irrelevant, then they should use the policy suffix that lets the compiler decide the behavior. So while there may be API redundancy in these cases, there is no ambiguity in what the programmer should prefer.
The programmer may explicitly allow the compiler to determine the behavior, based on its cost model.
- On some implementations, undisturbed behavior may be costlier than agnostic, so the compiler might prefer agnostic.
- Or, an adjacent instruction may use undisturbed behavior, so the compiler would prefer undisturbed, even though it’s not necessary for correctness, in order to avoid changing `vtype`.
Additionally, static analysis might demonstrate that the programmer requested certain behavior when it was not necessary.
- The programmer may have encountered one of the redundancies mentioned above but not realized it.
- If the programmer performs a `_tumu` load followed by a `_m` store and the data isn’t used afterwards, the compiler is free to perform both instructions with mask- and tail-agnostic policy.
Lastly, the compiler may decide to emulate the requested behavior by other means.
- If the user requests a `_tu` load, the compiler could perform it instead with a masked load, mask-undisturbed and tail-agnostic policy, using a mask that is all ones up to VL and all zeros thereafter.
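The emulation in the last item can be checked against a scalar model. This is a sketch under stated assumptions: the `model_*` names and `MODEL_VLMAX` are hypothetical, the tail-agnostic behavior is modeled as "overwritten by ones", and the masked load is assumed to run with its vector length raised to VLMAX so that every element past the requested VL becomes an inactive body element rather than a tail element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_VLMAX 8 /* hypothetical register length, in elements */

/* Reference behavior: unmasked, tail-undisturbed load of vl elements. */
static void model_load_tu(uint8_t *vd, const uint8_t *mem, size_t vl)
{
    assert(vl <= MODEL_VLMAX);
    for (size_t i = 0; i < vl; i++)
        vd[i] = mem[i];
}

/* Masked load, mask-undisturbed and tail-agnostic. Inactive elements
 * are neither read from memory nor written; the tail is modeled as
 * overwritten by ones. */
static void model_load_mu_ta(uint8_t *vd, const bool *mask,
                             const uint8_t *mem, size_t vl)
{
    assert(vl <= MODEL_VLMAX);
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            vd[i] = mem[i];
    for (size_t i = vl; i < MODEL_VLMAX; i++)
        vd[i] = 0xFF;
}

/* Emulation of the tail-undisturbed load: build a mask that is all ones
 * up to vl and all zeros thereafter, then run the masked load at VLMAX
 * (an assumption of this sketch) so that no tail elements remain. */
static void model_load_tu_emulated(uint8_t *vd, const uint8_t *mem, size_t vl)
{
    bool mask[MODEL_VLMAX];
    assert(vl <= MODEL_VLMAX);
    for (size_t i = 0; i < MODEL_VLMAX; i++)
        mask[i] = (i < vl);
    model_load_mu_ta(vd, mask, mem, MODEL_VLMAX);
}
```

In this model both routines leave the destination in the same state, and the emulated version never touches memory beyond the first `vl` elements because the corresponding mask bits are zero.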
The recently added "policy" API (PR #137) gives the programmer more control over tail and mask policy than the "legacy" API. This proposal differs from the "policy" API in several ways:
- It more clearly decouples the architectural definition of agnosticism from the intrinsics definition. That is, the compiler enforces mask-undisturbed or tail-undisturbed behavior if the programmer requests it, but otherwise the compiler is free to choose the policy. The previous proposal did not clearly distinguish these two senses of agnosticism.
- It reduces the number of special cases.
- It decreases the number of characters in the policy suffixes.
Unfortunately, the unsuffixed and `_m`-suffixed intrinsics in the new proposal collide with those in the legacy API,
leading to some incompatibilities:
- In the legacy API, FMAs default to tail-undisturbed.
  In this proposal, for all instructions, tail policy defaults to compiler-defined unless explicitly specified to be undisturbed. Programs that relied on the old behavior will still compile but may get incorrect results unless updated with `tu` suffixes. The legacy API appears to have recently been changed to default to tail-agnostic, although this change hasn’t yet been incorporated in toolchains, so perhaps this incompatibility is already tolerated.
- In the legacy API, intrinsics with `_m` always take an extra argument called `maskedoff`. Except for some apparently undocumented special cases, the programmer can request mask-undisturbed behavior, and specify the undisturbed elements, by passing a vector to `maskedoff`. (The compiler may create a copy if necessary.) The programmer can also express agnosticism to mask policy by passing the result of a `vundefined_*()` intrinsic to `maskedoff`. In LLVM, the behavior always seems to be mask-undisturbed and tail-agnostic: the distinction seems to be that passing `vundefined_*()` may enable the compiler to avoid an unnecessary copy or register pressure.

  In this proposal, intrinsics with `_m` allow the compiler to decide mask and tail behavior, and have no `undisturbed` (or `maskedoff`) argument. Programs written for the old API will generate compiler errors due to argument mismatch.
- In the legacy API, reductions, scalar-to-vector moves, slides, and perhaps other instructions always take an extra argument called `dest`. The programmer can request tail-undisturbed behavior, and specify the undisturbed elements, by passing a vector to `dest`. (The compiler may create a copy if necessary.) The programmer can also express agnosticism to tail policy by passing the result of a `vundefined_*()` intrinsic to `dest`. In LLVM, the behavior always seems to be mask-undisturbed, although `vundefined_*()` does seem to switch to tail-agnostic.

  In this proposal, only the `_tu` and `_tum` reductions accept an extra `undisturbed` argument.