nick-knight/policy-intrinsics.adoc

Tail and Mask Policies for RVV Intrinsics


Table of Contents

Architectural Background

Tail Policy
Masking and Mask Policy


Intrinsics Proposal

Compiler Optimization
Comparison With Current API


Update: this proposal was approved by the RVV intrinsics task group and implemented in v0.11 of the API.
In the same revision, a __riscv_ prefix was added to every intrinsic function.
Aside from fixing errors, I will not make changes to the rest of this document.


Note: our community recently added a new "policy" API to the RVV intrinsics, to support tail and mask policies
(PR #137).
This proposal concerns an alternative way to support this functionality.
The proposed API is arguably simpler, but it introduces some incompatibilities with the original ("legacy") API.


Architectural Background


You can skip this section if you are already familiar with RVV mask/tail policies,
although you should probably review the distinction, made below, between "masked" and "unmasked" instructions,
because this definition is somewhat inconsistent in the V-extension specification.


Tail Policy


RVV instructions support two types of tail policy, specified by the vtype.vta bit:


Tail-undisturbed (vta = 0) means the tail elements of the destination are preserved.


Tail-agnostic (vta = 1) means that each tail element of the destination can be left undisturbed, overwritten by ones, or, for mask-producing instructions besides mask loads, overwritten by the result of the instruction as if vl were larger.
This behavior may vary element-by-element, and may vary across executions of the same instruction for the same inputs.


Some instructions have the same behavior regardless of vta:


Instructions that produce a mask register have tail-agnostic policy regardless of vta.


Instructions with no tail elements, like stores or those whose output is an X- or F-register, have the same behavior regardless of vta.


Some instructions have tail elements whose particular values are guaranteed to be unchanged regardless of vta, for example, a vector load whose destination’s tail is already all ones.
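The two tail policies can be illustrated with a small scalar model in C. This is only a sketch of the semantics described above, not RVV code: VLMAX, model_vadd, and the array-as-register representation are hypothetical names chosen for illustration, and the tail-agnostic branch picks just one of the permitted outcomes (all ones).

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* hypothetical number of elements in one vector register */

/* Scalar model of an unmasked vector add, vd = a + b, assuming vl <= VLMAX.
 * Elements [0, vl) are the body; elements [vl, VLMAX) are the tail. */
void model_vadd(uint32_t vd[VLMAX], const uint32_t a[VLMAX],
                const uint32_t b[VLMAX], size_t vl, int vta)
{
    for (size_t i = 0; i < VLMAX; i++) {
        if (i < vl) {
            vd[i] = a[i] + b[i]; /* body element: always written */
        } else if (vta) {
            /* Tail-agnostic: hardware may keep the old value or write all
             * ones. This model always writes all ones. */
            vd[i] = UINT32_MAX;
        }
        /* Tail-undisturbed (vta == 0): vd[i] keeps its old value. */
    }
}
```

Code that reads tail elements after a tail-agnostic operation would see 0xFFFFFFFF in this model, but on real hardware it could equally see the old contents, which is why such code needs the undisturbed policy.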


Masking and Mask Policy


RVV instructions can be considered masked or unmasked.


Masked instructions are those that use v0 as an indicator of which body elements are active/inactive.


Unmasked instructions are those that take all body elements as active, independent of v0.


The vm bit in the instruction encoding is deasserted when v0 is treated as an additional input operand.
This covers all masked instructions, as well as a few exceptional unmasked instructions (add/subtract with carry/borrow, merge).
In a few cases, the V-extension spec confusingly refers to these exceptional cases as "masked",
whereas "encoded as if masked" might be clearer language.
The point is that the vm bit alone cannot be used to distinguish between masked and unmasked.


Inactive elements' behavior depends on the mask policy, specified by the vtype.vma bit:


Mask-undisturbed (vma = 0) means the inactive elements are preserved.


Mask-agnostic (vma = 1) means that each inactive element can be left undisturbed or overwritten by ones.
This behavior may vary element-by-element, and may vary across executions of the same instruction for the same inputs.


Some instructions have the same behavior regardless of vma:


Unmasked instructions,


Instructions with all body elements active,


Reductions,


Stores,


Instructions whose output is an X- or F-register,


Instructions with no body elements (vl = 0 or vstart >= vl).
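The mask policies can be sketched in the same scalar style. Again, this is a model of the semantics rather than RVV code: VLMAX, model_vadd_masked, and the bool array standing in for v0 are hypothetical, and the mask-agnostic branch picks only one of the permitted outcomes. Tail elements are deliberately left out of this model, since they are governed by vta.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* hypothetical number of elements in one vector register */

/* Scalar model of a masked vector add, assuming vl <= VLMAX.
 * v0[i] marks body element i as active (true) or inactive (false). */
void model_vadd_masked(uint32_t vd[VLMAX], const bool v0[VLMAX],
                       const uint32_t a[VLMAX], const uint32_t b[VLMAX],
                       size_t vl, bool vma)
{
    for (size_t i = 0; i < vl; i++) {
        if (v0[i]) {
            vd[i] = a[i] + b[i]; /* active element: always written */
        } else if (vma) {
            /* Mask-agnostic: hardware may keep the old value or write all
             * ones. This model always writes all ones. */
            vd[i] = UINT32_MAX;
        }
        /* Mask-undisturbed (vma == false): vd[i] keeps its old value. */
    }
    /* With vl == 0 the loop never runs, so vma cannot matter. */
}
```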


Intrinsics Proposal


In typical applications, vtype is known statically,
with the only extant exception being context save/restore code.
The current intrinsics API allows the programmer to statically specify SEW (vtype.vsew) and LMUL (vtype.vlmul),
as part of each intrinsic function name,
but tail and mask policies (vtype.vta and vtype.vma) are not directly exposed.


Here is a summary of the proposed policy suffixes:


All instructions, besides stores and those whose output is a mask or an X- or F-register, will have additional variants decorated by tu.


All masked instructions will be distinguished from their unmasked counterparts by m or mu.


All masked instructions, besides stores, reductions, and those whose output is an X- or F-register, will have additional variants decorated by mu.


Here are the details:


| Suffix | Masked? | Tail behavior | Mask behavior | Applicability | Extra arguments |
|--------|---------|---------------|---------------|---------------|-----------------|
| (none) | unmasked | compiler-defined | compiler-defined | unmasked instructions | (none) |
| _tu | unmasked | undisturbed | compiler-defined | unmasked instructions excluding stores and those whose output is a mask or an X- or F-register | undisturbed (*) |
| _m | masked | compiler-defined | compiler-defined | masked instructions | mask |
| _mu | masked | compiler-defined | undisturbed | masked instructions excluding stores, reductions, and those whose output is an X- or F-register | undisturbed (*), mask |
| _tum | masked | undisturbed | compiler-defined | masked instructions excluding stores and those whose output is a mask or an X- or F-register | undisturbed (*), mask |
| _tumu | masked | undisturbed | undisturbed | masked instructions excluding stores, reductions, and those whose output is a mask or an X- or F-register | undisturbed (*), mask |


(*) For some instructions, the destination vector is also an input operand regardless of mask/tail policy, so no additional undisturbed argument is needed in these cases:


Segment loads are passed pointers to their destination vectors; these vectors can provide undisturbed elements. (In the case of the tuple API, some segment loads will need to be augmented with an undisturbed tuple.)


The output vector of FMAs is also an input, so it can provide undisturbed elements.


vslideup.v{x,i} also takes its output vector as an input (at least when 0 < OFFSET <= vl), so it can provide undisturbed elements.
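To make the (*) footnote concrete, here is a scalar sketch contrasting an add, which needs a separate undisturbed argument in its _tu form, with a multiply-accumulate, whose accumulator already serves that role. The type vreg_t and the functions model_vadd_tu and model_vmacc_tu are hypothetical stand-ins for illustration; they are not the proposed intrinsic names or signatures.

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* hypothetical number of elements in one vector register */

/* A vector register modeled as a value type, so functions can return it. */
typedef struct { uint32_t e[VLMAX]; } vreg_t;

/* Tail-undisturbed add: the destination is not otherwise an input, so the
 * caller must supply the vector that provides the undisturbed tail. */
vreg_t model_vadd_tu(vreg_t undisturbed, vreg_t a, vreg_t b, size_t vl)
{
    vreg_t vd = undisturbed;
    for (size_t i = 0; i < vl && i < VLMAX; i++)
        vd.e[i] = a.e[i] + b.e[i];
    return vd;
}

/* Tail-undisturbed multiply-accumulate: acc is already an input operand,
 * so it doubles as the source of undisturbed tail elements and the
 * argument list is the same as for the unsuffixed form. */
vreg_t model_vmacc_tu(vreg_t acc, vreg_t a, vreg_t b, size_t vl)
{
    vreg_t vd = acc;
    for (size_t i = 0; i < vl && i < VLMAX; i++)
        vd.e[i] = acc.e[i] + a.e[i] * b.e[i];
    return vd;
}
```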


This API proposal does not remove all redundancy, for example:


A vector load whose destination is already all ones is guaranteed to be unaffected by mask and tail policies. However, the contents of a vector register may not be statically known.


Tail policy is irrelevant in a reduction with SEW = VLEN = 32 on Zve32x. However, VLEN may not be statically known (only upper and lower bounds).


Tail policy is irrelevant in many cases when VL = VLMAX. However, VL and VLEN (which determines VLMAX) may not be statically known.


Masking and mask policy is irrelevant if all the mask bits are asserted. However, as just mentioned, the contents of a vector register may not be statically known.


Masking and mask policy is also irrelevant when vl = 0 or vstart >= vl. However, vl and vstart may not be statically known.


If the programmer can prove statically that the mask or tail policy is irrelevant,
then they should use the policy suffix that lets the compiler decide the behavior.
So while there may be API redundancy in these cases,
there is no ambiguity in what the programmer should prefer.
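As an illustration of choosing suffixes, here is a sketch of a strip-mined sum written against the API as implemented in v0.11 and later (with the __riscv_ prefix mentioned in the update note). The intrinsic names and argument orders below are my understanding of that API and should be checked against your toolchain's riscv_vector.h; the function sum_u32 and the scalar fallback are hypothetical and exist only so the sketch also builds on non-RVV hosts.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Sum n 32-bit unsigned integers with wrapping arithmetic. */
uint32_t sum_u32(const uint32_t *x, size_t n)
{
#if defined(__riscv_vector)
    size_t vlmax = __riscv_vsetvlmax_e32m1();
    vuint32m1_t acc = __riscv_vmv_v_x_u32m1(0, vlmax);
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        /* Only elements [0, vl) of v are consumed below, so the load's
         * tail policy is irrelevant: use the unsuffixed variant. */
        vuint32m1_t v = __riscv_vle32_v_u32m1(x + i, vl);
        /* When vl < vlmax, elements [vl, vlmax) of acc hold partial sums
         * that must survive, so request tail-undisturbed explicitly. */
        acc = __riscv_vadd_vv_u32m1_tu(acc, acc, v, vl);
        i += vl;
    }
    /* Reduce all vlmax partial sums; only element 0 of the result is read. */
    vuint32m1_t zero = __riscv_vmv_v_x_u32m1(0, 1);
    vuint32m1_t red = __riscv_vredsum_vs_u32m1_u32m1(acc, zero, vlmax);
    return __riscv_vmv_x_s_u32m1_u32(red);
#else
    /* Scalar fallback for hosts without the V extension. */
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
#endif
}
```

Writing the accumulation with the unsuffixed add instead would let the compiler pick tail-agnostic, which could clobber partial sums on the final, shorter iteration.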


Compiler Optimization


The programmer may explicitly allow the compiler to determine the behavior, based on its cost model.


On some implementations, undisturbed behavior may be costlier than agnostic,
so the compiler might prefer agnostic.


Or, an adjacent instruction may use undisturbed behavior, so the compiler would prefer undisturbed,
even though it’s not necessary for correctness, in order to avoid changing vtype.


Additionally, static analysis might demonstrate that the programmer requested certain behavior when it was not necessary.


The programmer may have encountered one of the redundancies mentioned above but not realized it.


If the programmer performs a _tumu load followed by a _m store and the data isn’t used afterwards,
the compiler is free to perform both instructions with mask- and tail-agnostic policy.
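The load/store case can be checked with a scalar model. This sketch assumes the load and the store use the same mask and the same vl; model_vle_masked, model_vse_masked, and VLMAX are hypothetical names, and the agnostic branch models only the all-ones outcome.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* hypothetical number of elements in one vector register */

/* Masked load model, assuming vl <= VLMAX. With undisturbed == true,
 * inactive and tail elements keep their old values (the _tumu case);
 * otherwise this model overwrites them with all ones, one outcome that an
 * agnostic policy permits. Inactive elements are never read from memory. */
void model_vle_masked(uint32_t vd[VLMAX], const bool mask[VLMAX],
                      const uint32_t *src, size_t vl, bool undisturbed)
{
    for (size_t i = 0; i < VLMAX; i++) {
        if (i < vl && mask[i])
            vd[i] = src[i];
        else if (!undisturbed)
            vd[i] = UINT32_MAX;
    }
}

/* Masked store model: only active body elements reach memory. */
void model_vse_masked(uint32_t *dst, const bool mask[VLMAX],
                      const uint32_t vs[VLMAX], size_t vl)
{
    for (size_t i = 0; i < vl && i < VLMAX; i++)
        if (mask[i])
            dst[i] = vs[i];
}
```

Running the load with either setting and then storing yields identical memory contents, because the store never reads the inactive or tail elements where the two register images differ.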


Lastly, the compiler may decide to emulate the requested behavior by other means.


If the user requests a _tu load, the compiler could instead perform a masked load with mask-undisturbed and tail-agnostic policy, using a larger vl and a mask that is all ones up to the requested VL and all zeros thereafter, so that the former tail elements become inactive body elements.
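This emulation can be sketched as a scalar model. The names model_vle_tu, model_vle_mu, emulated_vle_tu, and VLMAX are hypothetical; the sketch assumes the masked load runs with vl = VLMAX, so that elements past the requested length become inactive body elements, and that inactive elements are never read from memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* hypothetical number of elements in one vector register */

/* Reference: unmasked, tail-undisturbed load of vl <= VLMAX elements. */
void model_vle_tu(uint32_t vd[VLMAX], const uint32_t *src, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = src[i];
}

/* Masked, mask-undisturbed load: inactive elements are neither read from
 * memory nor written in the destination. */
void model_vle_mu(uint32_t vd[VLMAX], const bool mask[VLMAX],
                  const uint32_t *src, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            vd[i] = src[i];
}

/* Emulation: build a mask that is all ones below vl and all zeros from vl
 * onward, then run the masked load over the whole register. */
void emulated_vle_tu(uint32_t vd[VLMAX], const uint32_t *src, size_t vl)
{
    bool mask[VLMAX];
    for (size_t i = 0; i < VLMAX; i++)
        mask[i] = (i < vl);
    model_vle_mu(vd, mask, src, VLMAX);
}
```

Because the masked load covers all VLMAX elements, it has no tail, so its tail policy does not affect the result.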


Comparison With Current API


The recently added "policy" API (PR #137) gives the programmer more control over tail and mask policy than the "legacy" API.
This proposal differs from the "policy" API in several ways:


It more clearly decouples the architectural definition of agnosticism from the intrinsics definition. That is, the compiler enforces mask-undisturbed or tail-undisturbed behavior if the programmer requests it, but otherwise the compiler is free to choose the policy. The previous proposal did not clearly distinguish these two senses of agnosticism.


It reduces the number of special cases.


It decreases the number of characters in the policy suffixes.


Unfortunately, the unsuffixed and _m-suffixed intrinsics in the new proposal collide with those in the legacy API,
leading to some incompatibilities:


In the legacy API, FMAs default to tail-undisturbed.

In this proposal, for all instructions, tail policy defaults to compiler-defined unless explicitly specified to be undisturbed.
Programs that relied on the old behavior will still compile but may get incorrect results unless updated with tu suffixes.
The legacy API was apparently changed recently to default to tail-agnostic (although this change hasn’t yet been incorporated in toolchains), so perhaps this incompatibility is already tolerated.


In the legacy API, intrinsics with _m always take an extra argument called maskedoff.
Except for some apparently undocumented special cases,
the programmer can request mask-undisturbed behavior,
and specify the undisturbed elements,
by passing a vector to maskedoff.
(The compiler may create a copy if necessary.)
The programmer can also express agnosticism to mask policy by passing the result of a vundefined_*() intrinsic to maskedoff.
In LLVM, the behavior always seems to be mask-undisturbed and tail-agnostic:
the distinction seems to be that passing vundefined_*() may enable the compiler to avoid an unnecessary copy or register pressure.

In this proposal, intrinsics with _m allow the compiler to decide mask and tail behavior, and have no undisturbed (or maskedoff) argument.
Programs written for the old API will generate compiler errors due to argument mismatch.


In the legacy API, reductions, scalar-to-vector moves, slides, and perhaps other instructions always take an extra argument called dest.
The programmer can request tail-undisturbed behavior,
and specify the undisturbed elements,
by passing a vector to dest.
(The compiler may create a copy if necessary.)
The programmer can also express agnosticism to tail policy by passing the result of a vundefined_*() intrinsic to dest.
In LLVM, the behavior always seems to be mask-undisturbed, although vundefined_*() does seem to switch to tail-agnostic.

In this proposal, only the _tu and _tum reductions accept an extra undisturbed argument.