eopXD/rounding_mode_and_exception_csr.adoc

    Intrinsics Support for Rounding Modes and Exception Flags


Introduction


Most RISC-V vector (RVV) instructions access control and status registers (CSRs),
which can be understood as additional instruction operands
that were implemented separately due to encoding constraints.
The assembly programmer must manually access these CSRs.
The C-language RVV intrinsics do not have any such encoding constraints,
so the intrinsics programmer can enjoy a simpler programming model.
The current intrinsics proposal hides the CSRs behind the API,
letting the programmer specify the semantics at a higher level and
leaving the compiler to generate CSR accesses as needed.


For the application programmer, the relevant CSRs are vl;
vsew, vlmul, vta, and vma (fields of vtype);
frm and fflags (fields of fcsr);
and vxrm and vxsat (fields of vcsr).
(The application programmer should not need to access vstart,
or the vill field of vtype.)
The current intrinsics proposal addresses the first five,


vl: treated as an extra integer operand to each intrinsic where it is relevant.


vsew, vlmul, vta, vma: statically encoded in the intrinsics


but not the last four, frm, fflags, vxrm, and vxsat.
This document aims to fill in this gap.


The proposals for these four can be considered and implemented separately.
In fact, we suggest that fflags and vxsat support be treated as a lower priority,
and possibly deferred to a later release, since adding it later will not break compatibility
with the current experimental intrinsics.


Support for frm


Most vector floating-point arithmetic instructions read the
"dynamic" floating-point rounding mode CSR, frm.
This CSR is actually defined in the F-extension, a prerequisite for the V-extension.


RISC-V does not specify a default value for frm.
IEEE-754 (a.k.a. IEC 60559) specifies the default rounding mode to be "roundTiesToEven",
called "RNE" in RISC-V terminology.
The C language specifies that there is a default rounding mode,
but it is implementation defined;
if the implementation defines __STDC_IEC_559__,
then it promises to follow the IEEE-754 requirements,
including the default of RNE.
The GNU and LLVM C implementations, for RISC-V targets,
both appear to define this macro and use RNE by default.
In scopes where the C programmer enables access to the C floating-point environment (fenv)
via #pragma STDC FENV_ACCESS ON,
the rounding mode can be changed dynamically via fesetround.
(The behavior of fenv.h functions is undefined when FENV_ACCESS is off.)


The existing proposal for vector intrinsics does not expose frm to the programmer,
does not specify what rounding mode is used by default,
and does not specify how the intrinsics interact with fenv.
To address these omissions, we propose two classes of intrinsics,
called implicit- and explicit-frm.


The implicit-frm intrinsics are those in the existing proposal,
which make no explicit mention of floating-point rounding mode.
They behave like any C-language floating-point expressions,
using the default rounding mode when FENV_ACCESS is off,
and using the fenv dynamic rounding mode when FENV_ACCESS is on.


The explicit-frm intrinsics are new to this proposal.
They add _rne, _rtz, _rdn, _rup, and _rmm suffixes,
and the behavior is as if frm was set to the indicated rounding mode.
However, the fenv dynamic rounding mode is not affected by these intrinsics
(regardless of FENV_ACCESS).


vfloat32m1_t vfadd_vv_f32m1 (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
+ vfloat32m1_t vfadd_vv_f32m1_rne (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
+ vfloat32m1_t vfadd_vv_f32m1_rtz (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
+ vfloat32m1_t vfadd_vv_f32m1_rdn (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
+ vfloat32m1_t vfadd_vv_f32m1_rup (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
+ vfloat32m1_t vfadd_vv_f32m1_rmm (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
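
The "as if frm was set" behavior, together with the requirement that the dynamic rounding mode is left untouched, can be sketched with a small scalar model. This is an illustration under stated assumptions, not part of the proposal: a global variable `model_frm` stands in for the frm CSR, and rounding the quotient v / 2^d to an integer stands in for a floating-point operation that reads frm. All names (`model_frm`, `hw_round_scaled`, `round_scaled_implicit`, `round_scaled_explicit`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding modes, using the RISC-V frm encodings. */
typedef enum { FRM_RNE = 0, FRM_RTZ = 1, FRM_RDN = 2, FRM_RUP = 3, FRM_RMM = 4 } frm_t;

/* ASSUMPTION: this variable stands in for the frm CSR (global state). */
static frm_t model_frm = FRM_RNE;

/* Stand-in for a "hardware" operation that reads frm:
 * rounds v / 2^d to an integer under the mode held in model_frm. */
static int64_t hw_round_scaled(int64_t v, unsigned d)
{
    assert(d >= 1 && d < 62);
    const int64_t scale = (int64_t)1 << d;
    const int64_t half = scale / 2;
    const int64_t trunc_q = v / scale;                       /* toward zero */
    const int64_t floor_q = trunc_q - ((v % scale) < 0 ? 1 : 0);
    const int64_t rem = v - floor_q * scale;                 /* 0 <= rem < scale */

    switch (model_frm) {
    case FRM_RTZ: return trunc_q;
    case FRM_RDN: return floor_q;
    case FRM_RUP: return floor_q + (rem != 0);
    case FRM_RMM: /* nearest, ties away from zero */
        if (rem != half) return floor_q + (rem > half);
        return floor_q + (v >= 0);
    case FRM_RNE: /* nearest, ties to even */
    default:
        if (rem != half) return floor_q + (rem > half);
        return floor_q + (floor_q % 2 != 0);
    }
}

/* Implicit form: uses whatever mode is currently in effect. */
static int64_t round_scaled_implicit(int64_t v, unsigned d)
{
    return hw_round_scaled(v, d);
}

/* Explicit form: behaves as if frm were set to `mode`,
 * but leaves the dynamic mode as it found it. */
static int64_t round_scaled_explicit(int64_t v, unsigned d, frm_t mode)
{
    const frm_t saved = model_frm;   /* save */
    model_frm = mode;                /* set as requested */
    const int64_t r = hw_round_scaled(v, d);
    model_frm = saved;               /* restore */
    return r;
}
```

In this model, `round_scaled_explicit(-5, 1, FRM_RDN)` rounds -2.5 down to -3 while a following `round_scaled_implicit` call still sees the mode that was in effect before. A compiler would be free to elide the save/restore when it can prove that no intervening code observes frm.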


Commentary


The RISC-V "scalar" floating-point extensions (Zfh, F, D, Q)
actually support both static and dynamic rounding modes.
A static rounding mode is part of the instruction encoding,
whereas a dynamic rounding mode is taken from the frm CSR.
Interestingly, despite the possibility of using static rounding,
both GNU and LLVM compilers appear to prefer generating
scalar floating-point instructions using dynamic rounding,
relying on the C runtime initialization to set frm to RNE.
Presumably, this is to enable (or avoid breaking) code
that uses non-fenv means, like inline assembly,
to change the rounding mode.
Vector floating-point instructions do not support static rounding:
they only support dynamic rounding.
The recommendation for implementers is that the implicit-frm intrinsics should not modify frm
(e.g., they should not explicitly set it to RNE to enforce the default).
This should keep their behavior uniform with other C floating-point code.


It is arguable that there is redundancy in having both implicit- and explicit-frm intrinsics.
From a functionality standpoint, we could have just the implicit-frm versions,
using fenv to select the desired rounding mode.
However, in our experience, both GNU and LLVM are highly conservative
when compiling code with FENV_ACCESS on.
It is not clear whether the conservatism is due to a fundamental obstacle
(e.g., dependence on global state),
or just the fact that very few programmers use this functionality
so improving it is low priority.
The intended use-cases are as follows:


The explicit-frm intrinsics are intended to be used when FENV_ACCESS is off,
to enable more aggressive optimization while still providing the programmer
with control over the rounding mode.
Using explicit-frm intrinsics when FENV_ACCESS is on will still work correctly,
but is expected to lead to extra saving/restoring of frm
that could be avoided by using fenv functionality and implicit-frm.
Similarly, when FENV_ACCESS is off and explicit-frm intrinsics are mixed
with implicit-frm intrinsics or scalar C floating-point,
we expect the compiler to save/restore frm to preserve the default.


The implicit-frm intrinsics are intended to be used regardless of FENV_ACCESS.
They are provided when FENV_ACCESS is on for the (few) programmers
who are already using fenv.
And they are provided when FENV_ACCESS is off for the (vast majority of) programmers
who prefer the default rounding mode.


The redundancy is between the implicit-frm intrinsics and
the _rne (explicit-frm) intrinsics in the case FENV_ACCESS is off.


Support for vxrm


Many vector fixed-point arithmetic instructions read
the fixed-point rounding mode CSR, vxrm.


RISC-V does not specify a default value for vxrm,
and the C language does not mention fixed-point rounding modes.


The existing intrinsics proposal does not expose vxrm to the programmer,
nor does it specify what fixed-point rounding mode is used by default.


Unlike the case of frm, we propose removal of the existing "implicit-vxrm" intrinsics,
replacing them with explicit-vxrm versions.
These add _rnu, _rne, _rdn, and _rod suffixes,
and the behavior is as if vxrm was set appropriately.


- vint32m1_t vsadd_vv_i32m1 (vint32m1_t op1, vint32m1_t op2, size_t vl);
+ vint32m1_t vsadd_vv_i32m1_rnu (vint32m1_t op1, vint32m1_t op2, size_t vl);
+ vint32m1_t vsadd_vv_i32m1_rne (vint32m1_t op1, vint32m1_t op2, size_t vl);
+ vint32m1_t vsadd_vv_i32m1_rdn (vint32m1_t op1, vint32m1_t op2, size_t vl);
+ vint32m1_t vsadd_vv_i32m1_rod (vint32m1_t op1, vint32m1_t op2, size_t vl);
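
What the four suffixes select can be illustrated with a scalar model of fixed-point rounding. The sketch below follows the `roundoff_unsigned` definition in the RVV specification as we understand it: the value is shifted right by d bits and a rounding increment, chosen by the mode, is added. The function and type names are hypothetical, and `avg_u32_model` assumes that an unsigned averaging add computes `roundoff_unsigned(a + b, 1)` per element.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point rounding modes, using the vxrm encodings. */
typedef enum { VXRM_RNU = 0, VXRM_RNE = 1, VXRM_RDN = 2, VXRM_ROD = 3 } vxrm_t;

/* Scalar model: (v >> d) + r, where r is the mode-dependent increment. */
static uint64_t roundoff_unsigned(uint64_t v, unsigned d, vxrm_t mode)
{
    assert(d >= 1 && d < 64);
    const uint64_t shifted = v >> d;
    const uint64_t half = (v >> (d - 1)) & 1u;                   /* v[d-1]   */
    const uint64_t below = v & (((uint64_t)1 << (d - 1)) - 1u);  /* v[d-2:0] */
    const uint64_t lsb = shifted & 1u;                           /* v[d]     */
    uint64_t r = 0;

    switch (mode) {
    case VXRM_RNU: /* round to nearest, ties up */
        r = half;
        break;
    case VXRM_RNE: /* round to nearest, ties to even */
        r = half & ((uint64_t)(below != 0) | lsb);
        break;
    case VXRM_RDN: /* round down (truncate) */
        r = 0;
        break;
    case VXRM_ROD: /* round to odd ("jam"): set LSB if any bit was discarded */
        r = (uint64_t)(lsb == 0) & (half | (uint64_t)(below != 0));
        break;
    }
    return shifted + r;
}

/* ASSUMPTION: per-element model of an unsigned averaging add. */
static uint32_t avg_u32_model(uint32_t a, uint32_t b, vxrm_t mode)
{
    return (uint32_t)roundoff_unsigned((uint64_t)a + (uint64_t)b, 1, mode);
}
```

For example, halving 5 (that is, 2.5) gives 3 under rnu, 2 under rne and rdn, and 3 under rod, which is why leaving the mode unspecified would make results depend on whatever value vxrm happens to hold.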


Commentary


We decided to exclude the implicit-vxrm intrinsics because,
unlike the case of floating-point,
there is no established default fixed-point rounding mode in the C language.
Moreover, there is no fixed-point analogue of fenv (“vxenv”?).
If there arises user demand for these constructions,
they could be added in a future extension:
this proposal is forward compatible.


Support for fflags


Support for the exception flags, fflags, is covered here for completeness, so that this proposal addresses all the CSRs not yet supported by the experimental intrinsics. Unlike the rounding modes, however, we believe this functionality is less commonly used. Since the following proposal is forward compatible, and given the tight schedule tied to the open-source compiler (LLVM and GCC) releases, we suggest addressing this feature in a subsequent release rather than in v1.0.


When FENV_ACCESS is off, the current floating-point intrinsics behave just like their scalar counterparts. Additionally, as specified in the C standard (FIXME: citation here), with FENV_ACCESS off the user is not allowed to set, get, or test the exception flags of the floating-point environment.


When FENV_ACCESS is on, the current floating-point intrinsics and their scalar counterparts interact with the floating-point exception flags through the interface provided by fenv.h (fegetexceptflag, fesetexceptflag, fetestexcept, etc.). Just as in the rounding mode case, users are interacting with global state.


Building on the current behavior, we propose an explicit version of the vector floating-point intrinsics whose interface lets users provide the fflags value to be set before the vector floating-point operation, and receive the updated exception flags in the same variable. The semantics are that any bits that would be asserted in the underlying architectural register (fcsr.fflags) are instead asserted in the fflags argument. The programmer can initialize *fflags to zero to detect which bits the operation asserts. Note that passing a null pointer is undefined behavior.


vfloat32m1_t vfadd_vv_f32m1 (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl);
+ vfloat32m1_t vfadd_vv_f32m1 (vfloat32m1_t op1, vfloat32m1_t op2, size_t *fflags, size_t vl);


The main intent of this design is to decouple the user's control of the exception flags from the existing floating-point environment interface. This enables control of the exception flags without enabling FENV_ACCESS, and allows the compiler to optimize out redundant saves/restores for maintaining the exception flags inside the floating-point environment, since users are assumed not to operate on the environment's exception flags in that case.


On the other hand, when FENV_ACCESS is on, the explicit exception flag interface is expected not to modify the current flags in the floating-point environment. This design is consistent with our proposal for the rounding modes. The expected code sequence is:


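frflags t0             /* save current fflags */
fsflags t1             /* set as programmer requested (t1 = *fflags; register choice is illustrative) */
vfmul.vv v8, v8, v9    /* the operation itself; operand registers are illustrative */
frflags t1             /* get updated fflags for programmer */
fsflags t0             /* restore previously saved fflags */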
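
The save/set/operate/get/restore sequence above can be modeled in plain C to make the intended semantics concrete. This is a sketch under stated assumptions, not the proposed implementation: the global `model_env_fflags` stands in for fcsr.fflags, the host is assumed to use IEEE 754 arithmetic with round-to-nearest-even, and the "hardware" multiply only models the inexact (NX) and overflow (OF) flags for finite inputs. All function names are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Flag bits, using the RISC-V fflags bit positions. */
enum { FFLAG_NX = 0x01, FFLAG_UF = 0x02, FFLAG_OF = 0x04, FFLAG_DZ = 0x08, FFLAG_NV = 0x10 };

/* ASSUMPTION: this variable stands in for fcsr.fflags (global state). */
static size_t model_env_fflags = 0;

/* Stand-in for one element of vfmul.vv: accrues flags into the model CSR.
 * Simplified: only NX and OF are modeled, and only for finite inputs. */
static float hw_fmul(float a, float b)
{
    const double exact = (double)a * (double)b; /* exact for float operands */
    if (!isfinite(exact))
        return (float)exact;                    /* NaN/inf inputs: flags not modeled */
    const float r = (float)exact;               /* single rounding to float */
    if ((double)r != exact)
        model_env_fflags |= FFLAG_NX;
    if (isinf(r))
        model_env_fflags |= FFLAG_OF;
    return r;
}

/* Model of the explicit-fflags form: flags raised by the operation are
 * accumulated into *fflags, and the environment's flags are left unchanged. */
static void vfmul_f32_fflags_model(float *dst, const float *op1, const float *op2,
                                   size_t *fflags, size_t vl)
{
    assert(fflags != NULL);                     /* null is undefined in the proposal */
    const size_t saved = model_env_fflags;      /* save current fflags */
    model_env_fflags = *fflags;                 /* set as programmer requested */
    for (size_t i = 0; i < vl; ++i)
        dst[i] = hw_fmul(op1[i], op2[i]);
    *fflags = model_env_fflags;                 /* get updated fflags for programmer */
    model_env_fflags = saved;                   /* restore previously saved fflags */
}
```

Initializing the flags variable to zero before the call and inspecting it afterwards tells the caller exactly which flags this operation raised, independent of anything recorded in the environment.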


Commentary


The explicit exception flag intrinsics are intended to be used when FENV_ACCESS is off, where users want control over the exception flags but also want high performance. When FENV_ACCESS is on and the explicit intrinsics are mixed with their implicit companions or scalar C code, the exception flags in the environment are expected to be unchanged by the explicit intrinsics, which requires additional saves/restores of the flags and thus affects performance.


Building on the rounding mode proposal, we expect the explicit rounding mode intrinsics to take an additional fflags parameter. With the semantics described above, this is forward compatible and consistent in terms of the interaction with the floating-point environment.


(FIXME: This is a general comment; not sure where to put it.) When users decide to use the RVV intrinsics, the implicit decision behind this is that the program will no longer be portable across different targets. The floating-point environment is an interface designed to provide access to rounding modes and exception flags portably across targets. With RVV intrinsics, where portability no longer applies, we recommend that users use the explicit versions of the intrinsics for best performance.


Support for vxsat


Just as mentioned in the support for fflags, this feature is forward compatible and can be added in a future release rather than v1.0.


RISC-V does not specify a default value for vxsat, and the C language has no notion of a fixed-point saturation flag. The existing intrinsics proposal does not expose vxsat to the programmer.


Just like the explicit-fflags intrinsics, we propose an additional parameter that sets the saturation bit before execution and receives the resulting bit in the same parameter.


vint32m1_t vsadd_vv_i32m1_rne (vint32m1_t op1, vint32m1_t op2, size_t vl);
+ vint32m1_t vsadd_vv_i32m1_rne (vint32m1_t op1, vint32m1_t op2, size_t *vxsat, size_t vl);
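
The in/out behavior of the vxsat parameter can be sketched with a scalar model of a saturating add over vl elements. This is an illustration only, assuming the parameter is sticky (a bit set on entry stays set, and saturation in any element sets it); the function name and the array-based signature are hypothetical stand-ins for the vector types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise signed saturating add. Bit 0 of *vxsat is set if any
 * element saturates; a bit already set on entry is preserved. */
static void vsadd_i32_vxsat_model(int32_t *dst, const int32_t *op1, const int32_t *op2,
                                  size_t *vxsat, size_t vl)
{
    assert(vxsat != NULL);
    for (size_t i = 0; i < vl; ++i) {
        int64_t wide = (int64_t)op1[i] + (int64_t)op2[i]; /* cannot overflow */
        if (wide > INT32_MAX) {
            wide = INT32_MAX;
            *vxsat |= 1u;
        } else if (wide < INT32_MIN) {
            wide = INT32_MIN;
            *vxsat |= 1u;
        }
        dst[i] = (int32_t)wide;
    }
}
```

Because the result is returned through the caller's own variable rather than through global state, a compiler could treat calls that share no vxsat variable as independent of one another.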


Commentary


We suggest not implementing getters and setters, because this would introduce the possible use case of pairing a getter/setter with the implicit intrinsics. Such a use case requires maintaining global state, and all fixed-point intrinsics would have to be marked as having side effects, which inhibits optimization and hurts performance.