mchaput/RISC-V.md

## RISC-V.md

      
Foreword

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
This document is mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2.
Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions.
Consider the following C code, for example:
#include <stddef.h> /* for size_t */

int readidx(int *p, size_t idx)
{ return p[idx]; }

This is a simple case of array indexing, a very common operation. Consider the compilation of this for x86_64:
mov eax, [rdi+rsi*4]
ret

or ARM:
ldr r0, [r0, r1, lsl #2]
bx lr // return

Meanwhile, the required code for RISC-V:
# apologies for any syntax nits - there aren't any online risc-v
# compilers
slli a1, a1, 2
add a0, a0, a1
lw a0, 0(a0)
jalr x0, 0(x1) # return

RISC-V's simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its numerous prefixes).
The simplification of an instruction set should not be pursued to its limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations.
We should distinguish the "Complex" instructions of CISC CPUs - complicated, rarely used, and universally low performance, from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and high performance.
The Middling


Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.
The same instruction (JALR) is used for calls, returns and register-indirect branches (this requires extra decode for branch prediction)

Call: Rd = R1
Return: Rd = R0, Rs = R1
Indirect branch: Rd = R0, Rs ≠ R1
(Weirdo branch: Rd ≠ R0, Rd ≠ R1)
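The extra decode a branch predictor needs here can be sketched in C. This is only an illustration of the four cases listed above, not taken from any real core: `jalr_kind` and `classify_jalr` are hypothetical names, and the field positions (rd in bits 11:7, rs1 in bits 19:15) are the standard ones for JALR.

```c
#include <stdint.h>

/* Hypothetical classification of a JALR for a branch predictor. */
enum jalr_kind { JALR_CALL, JALR_RETURN, JALR_INDIRECT, JALR_WEIRDO };

/* Assumes insn is already known to be a JALR (opcode 0x67). */
enum jalr_kind classify_jalr(uint32_t insn)
{
    uint32_t rd  = (insn >> 7)  & 0x1f;  /* destination (link) register */
    uint32_t rs1 = (insn >> 15) & 0x1f;  /* register holding the target */

    if (rd == 1)
        return JALR_CALL;                /* Rd = R1 */
    if (rd == 0)
        return rs1 == 1 ? JALR_RETURN    /* Rd = R0, Rs = R1 */
                        : JALR_INDIRECT; /* Rd = R0, Rs != R1 */
    return JALR_WEIRDO;                  /* Rd != R0, Rd != R1 */
}
```

Both register specifiers have to be examined before the predictor knows whether to push or pop its return stack, which is work a dedicated call or return opcode would avoid.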


Variable length encoding is not self synchronizing (This is common - e.g. x86 and Thumb-2 both have this issue - but it causes various problems both with implementation and security, e.g. return-oriented-programming attacks)
RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accommodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)
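The difference between the two conventions can be shown with a small C model of a 32-bit add held in a 64-bit register. The function names are made up for illustration: `add32_sign_extended` mirrors what RV64I's 32-bit adds do, while `add32_zero_extended` shows the alternative argued for above.

```c
#include <stdint.h>

/* 32-bit add whose result is sign-extended to 64 bits (the RV64I convention). */
uint64_t add32_sign_extended(uint64_t a, uint64_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    /* Copy bit 31 into bits 63:32 without relying on signed conversion. */
    return (sum & 0x80000000u) ? (0xffffffff00000000ull | sum) : sum;
}

/* Same add with zero extension: the upper half is always zero. */
uint64_t add32_zero_extended(uint64_t a, uint64_t b)
{
    return (uint32_t)((uint32_t)a + (uint32_t)b);
}
```

For `0x7fffffff + 1` the sign-extended result is `0xffffffff80000000`, so all 32 upper bits flip, whereas the zero-extended result is `0x0000000080000000` and the upper half never moves.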
Multiply is optional - while fast multipliers occupy non-negligible area on tiny implementations, small multipliers can be created which consume little area, and it is possible to make extensive re-use of the existing ALU for multiple-cycle multiplication.
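As a rough picture of what such a small multiplier does, here is the classic shift-and-add scheme in C. It is a sketch rather than a description of any particular core; `mul_shift_add` is a hypothetical name, and each loop iteration stands in for one cycle's use of the existing adder.

```c
#include <stdint.h>

/* Shift-and-add multiply returning the low 32 bits of the product.
 * One conditional add and two shifts per step, at most 32 steps. */
uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)
            product += a;   /* the only use of the adder in this step */
        a <<= 1;
        b >>= 1;
    }
    return product;
}
```

The only state beyond the operands is the running product, which is why the area cost can stay small.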
LR/SC has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache)

This appears to be a substitute for a CAS instruction, see comments on that


FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode
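To see why, model the combined register as a plain integer. The field positions below follow the `fcsr` layout in the F extension (flags in bits 4:0, rounding mode in bits 7:5); the macro and function names are hypothetical.

```c
#include <stdint.h>

#define FCSR_FFLAGS_MASK 0x1fu                      /* sticky flags, bits 4:0 */
#define FCSR_FRM_SHIFT   5
#define FCSR_FRM_MASK    (0x7u << FCSR_FRM_SHIFT)   /* rounding mode, bits 7:5 */

/* Read-modify-write of the whole register to change only the rounding mode. */
uint32_t fcsr_set_rounding_mode(uint32_t fcsr, uint32_t rm)
{
    return (fcsr & ~FCSR_FRM_MASK) | ((rm & 0x7u) << FCSR_FRM_SHIFT);
}
```

The value written back depends on the old sticky flags, so the read half of the operation has to wait for every in-flight FP instruction that might still set one.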
FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)

This could be easily rectified - size encoding 2'b10 is free
Update: V2.2 has a decimal FP extension placeholder, but no half-precision placeholder. The mind kinda boggles.


How FP values are represented in the FP register file is unspecified but observable (by load/store)

Emulator authors will hate you
VM migration may become impossible
Update: V2.2 requires NaN-boxing of narrower values held in wider registers
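The NaN-boxing rule from v2.2 can be sketched as follows, assuming a 64-bit FP register holding a 32-bit value. The helper names are hypothetical; the rule itself (upper bits all ones) is what the spec requires.

```c
#include <stdint.h>
#include <stdbool.h>

/* Box a single-precision bit pattern into a 64-bit FP register image:
 * the upper 32 bits are all ones. */
uint64_t nan_box_f32(uint32_t bits)
{
    return 0xffffffff00000000ull | bits;
}

/* A 64-bit register image holds a valid boxed single only if the upper
 * 32 bits are all ones. */
bool is_nan_boxed_f32(uint64_t reg)
{
    return (reg >> 32) == 0xffffffffu;
}
```

This makes the register image deterministic, so a wider store after a narrower load always produces the same bytes on every implementation.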


The Bad


No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its implications:

Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
No conditional selects (useful for highly unpredictable branches)
No add with carry/subtract with carry or borrow
(Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)
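The cost of the last two omissions is easiest to see in C that mirrors what a compiler has to emit. Both functions are illustrative sketches with made-up names: the carry is recovered with an unsigned compare (the job SLTU does on RISC-V), and the select is built from a mask instead of a conditional-select instruction.

```c
#include <stdint.h>

/* 128-bit add from 64-bit halves, with no carry flag available. */
void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi)
{
    uint64_t lo = a_lo + b_lo;
    uint64_t carry = lo < a_lo;      /* 1 if the low add wrapped */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}

/* Branch-free select: returns a if cond is nonzero, else b. */
uint64_t select_u64(uint64_t cond, uint64_t a, uint64_t b)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0);  /* all ones or zero */
    return (a & mask) | (b & ~mask);
}
```

With a carry flag the high half of `add128` is a single add-with-carry; without one it takes a compare plus two adds, and longer chains also have to detect the carry out of each intermediate word.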


Highly precise counters seem to be required by the user level ISA. In practice, exposing these to applications is a great vector for side-channel attacks
Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not
No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics are inexpensive (only 1 bit of CPU state required for minimal single CPU implementations).
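The "1 bit of CPU state" claim can be made concrete with a toy model. This is a sketch under stated assumptions: a single hart, no other bus masters, and a reservation that is dropped on every trap. The `hart` struct and all function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* The only LR/SC state in this model is one reservation bit. */
struct hart { bool reserved; };

uint32_t load_reserved(struct hart *h, const uint32_t *addr)
{
    h->reserved = true;
    return *addr;
}

/* Returns 0 on success, nonzero on failure, like SC's result register. */
uint32_t store_conditional(struct hart *h, uint32_t *addr, uint32_t value)
{
    bool ok = h->reserved;
    h->reserved = false;        /* an SC always consumes the reservation */
    if (ok)
        *addr = value;
    return ok ? 0 : 1;
}

/* Assumed to run on any trap or interrupt, so a preempted sequence fails. */
void on_trap(struct hart *h)
{
    h->reserved = false;
}
```

No address needs to be tracked in this minimal case, because nothing else can write memory between the LR and the SC unless a trap intervenes.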
LR/SC are in the same extension as more complicated atomic instructions, which limits implementation flexibility for small implementations
General (non LR/SC) atomics do not include a CAS primitive

The motivation is to avoid the need for an instruction which reads 5 registers (Addr, CmpHi:CmpLo, SwapHi:SwapLo), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress LR/SC which is provided to replace it


Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit
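Without 8- or 16-bit atomics, software typically falls back to a retry loop on the containing 32-bit word. The sketch below uses C11 atomics; the function name is hypothetical, the byte is selected by its position within the word value rather than by address, and how a given RISC-V toolchain lowers the loop is not something this example claims.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add delta to byte number index (0 = least significant) of a
 * 32-bit word, returning the byte's previous value. */
uint8_t atomic_fetch_add_byte(_Atomic uint32_t *word, unsigned index,
                              uint8_t delta)
{
    unsigned shift = (index & 3u) * 8u;
    uint32_t mask = 0xffu << shift;
    uint32_t old = atomic_load(word);
    uint32_t desired;

    do {
        uint32_t byte = ((old >> shift) + delta) & 0xffu;
        desired = (old & ~mask) | (byte << shift);
        /* On failure, old is reloaded with the current word value. */
    } while (!atomic_compare_exchange_weak(word, &old, desired));

    return (uint8_t)(old >> shift);
}
```

The masking and the retry loop are pure overhead compared with a native byte-sized atomic, and unrelated updates to the neighbouring bytes can force retries.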
For RV32I, no way to transfer a DP FP value between the integer and FP register files except through memory
RV32I's 32-bit ADD and RV64I's 64-bit ADD share an encoding, and RV64I adds a different ADDW encoding for 32-bit adds. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead
No MOV instruction. The MV assembler alias is implemented as MV rD, rS -> ADDI rD, rS, 0. MOV optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical MV requires ORing together all 12 bits of the immediate to check that it is zero

Absent a MOV instruction, ADD rD, rS, r0 would actually be a preferable canonical MOV as it is easier to decode and CPUs normally have special case logic for recognizing the zero register
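The two recognizers can be compared side by side in C. This is a sketch with hypothetical function names, using the standard field positions for the OP-IMM and OP formats.

```c
#include <stdint.h>
#include <stdbool.h>

/* Canonical MV: ADDI rd, rs1, 0. Besides opcode and funct3, all 12
 * immediate bits (31:20) must be checked for zero. */
bool is_mv_addi(uint32_t insn)
{
    return (insn & 0x7f) == 0x13            /* OP-IMM */
        && ((insn >> 12) & 0x7) == 0        /* funct3 = ADDI */
        && (insn >> 20) == 0;               /* imm[11:0] == 0 */
}

/* Alternative: ADD rd, rs1, x0. funct7 is decoded anyway to tell ADD from
 * SUB, and rs2 == x0 can reuse the usual zero-register detection. */
bool is_mv_add(uint32_t insn)
{
    return (insn & 0x7f) == 0x33            /* OP */
        && ((insn >> 12) & 0x7) == 0        /* funct3 = ADD/SUB */
        && (insn >> 25) == 0                /* funct7 = ADD */
        && ((insn >> 20) & 0x1f) == 0;      /* rs2 == x0 */
}
```

For example, `mv a0, a1` assembles to `0x00058513`, which `is_mv_addi` accepts, while the proposed `add a0, a1, x0` form is `0x00058533`.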


The Ugly


JAL wastes 5 bits encoding the link register, which will always be R1 (or R0 for branches)

This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)
This is a regression from the v1.0 ISA!
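The reach in question is simple arithmetic, sketched here in C. `fits_displacement` is a hypothetical helper; 21 bits is the JAL displacement as specified, and 26 bits is only what the same field would hold if the five link-register bits were reclaimed.

```c
#include <stdint.h>
#include <stdbool.h>

/* True if a byte offset fits a signed, 2-byte-aligned displacement of
 * the given width in bits. */
bool fits_displacement(int64_t offset, unsigned bits)
{
    int64_t limit = (int64_t)1 << (bits - 1);
    return (offset & 1) == 0 && offset >= -limit && offset < limit;
}
```

`fits_displacement(off, 21)` accepts targets within roughly ±1 MiB, while `fits_displacement(off, 26)` would accept roughly ±32 MiB.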


Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped)

It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation.
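The swap shows up directly when extracting the address offset from each format. A sketch in C, with hypothetical helper names and the standard I-type and S-type field positions:

```c
#include <stdint.h>

/* Sign-extend the low 12 bits of x. */
int32_t sext12(uint32_t x)
{
    x &= 0xfffu;
    return (int32_t)(x ^ 0x800u) - 0x800;
}

/* Load (I-type): offset is insn[31:20]; rd sits in insn[11:7]. */
int32_t load_offset(uint32_t insn)
{
    return sext12(insn >> 20);
}

/* Store (S-type): offset is split, insn[31:25] : insn[11:7], because
 * insn[24:20] holds rs2 instead. */
int32_t store_offset(uint32_t insn)
{
    return sext12(((insn >> 25) << 5) | ((insn >> 7) & 0x1fu));
}
```

The low five offset bits come from different instruction bits for loads and stores, so the address generation path needs a mux on them that depends on the instruction type.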


No loads with register offsets (Rbase+Roffset) or indexes (Rbase+Rindex << Scale).
FENCE.I implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer
In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation

Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue
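The read sequence described above looks like this in C. The two half-readers are stand-ins for the CSR reads and operate on a simulated counter that ticks once per access so that the retry path can be exercised; all names are hypothetical.

```c
#include <stdint.h>

/* Simulated 64-bit counter; each half-read advances it by one tick. */
static uint64_t sim_counter;

uint32_t read_counter_lo(void) { return (uint32_t)sim_counter++; }
uint32_t read_counter_hi(void) { return (uint32_t)(sim_counter++ >> 32); }

/* Read hi, lo, hi again; retry if the high half changed in between. */
uint64_t read_counter64(void)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = read_counter_hi();
        lo  = read_counter_lo();
        hi2 = read_counter_hi();
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
```

Three reads, a compare and a branch replace what a single paired read would do, and the loop has no fixed upper bound on its iteration count.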


No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later variants

Common examples of pure "NOP hints" are things like spinlock yields.
More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)