Created July 25, 2019 23:32

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.

It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.

Mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2.

Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions.

Consider the following C code, for example:

int readidx(int *p, size_t idx)
{ return p[idx]; }

This is a simple case of array indexing, a very common operation. Consider the compilation of this for x86_64:

mov eax, [rdi+rsi*4]
ret

or ARM:

ldr r0, [r0, r1, lsl #2]
bx lr // return

Meanwhile, the required code for RISC-V:

# apologies for any syntax nits - there aren't any online risc-v
# compilers
slli a1, a1, 2
add a0, a0, a1
lw a0, 0(a0)
jalr x0, 0(x1) # return

RISC-V's simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its numerous prefixes).

The simplification of an instruction set should not be pursued to its limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations.
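
To make the comparison concrete, here is a minimal C sketch of the same array access written two ways: once as a single base + (index << scale) load, and once broken into the three steps a RISC-V compiler has to emit. The function names are hypothetical and chosen only for illustration; the comments map each statement to the instruction it stands in for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "featureful" operation: load from base + (idx << 2).
   This is what a single x86 or ARM indexed load expresses. */
int32_t load_indexed(const int32_t *base, size_t idx) {
    assert(base != NULL);
    return base[idx];
}

/* The same access broken into its constituent operations,
   mirroring the RISC-V sequence one statement per instruction. */
int32_t load_split(const int32_t *base, size_t idx) {
    assert(base != NULL);
    uintptr_t offset = (uintptr_t)idx << 2;     /* slli a1, a1, 2 */
    uintptr_t addr = (uintptr_t)base + offset;  /* add  a0, a0, a1 */
    return *(const int32_t *)addr;              /* lw   a0, 0(a0)  */
}
```

Both functions return the same value for any valid index. Going from the first form to the second is a mechanical split; going the other way means recognizing that three separate instructions share registers in the right pattern, which is the fusion problem described above.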

We should distinguish the "Complex" instructions of CISC CPUs - complicated, rarely used, and universally low performance, from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and high performance.

The Middling

  • Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.
  • Same instruction (JALR) used for calls, returns and register-indirect branches (requires extra decode for branch prediction)
    • Call: Rd = R1
    • Return: Rd = R0, Rs = R1
    • Indirect branch: Rd = R0, Rs ≠ R1
    • (Weirdo branch: Rd ≠ R0, Rd ≠ R1)
  • Variable length encoding not self synchronizing (This is common - e.g. x86 and Thumb-2 both have this issue - but it causes various problems both with implementation and security e.g. return-oriented-programming attacks)
  • RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accommodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)
  • Multiply is optional - while fast multipliers occupy non-negligible area on tiny implementations, small multipliers can be created which consume little area, and it is possible to make extensive re-use of the existing ALU for multiple-cycle multiplication.
  • LR/SC has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache)
    • This appears to be a substitute for a CAS instruction, see comments on that
  • FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode
  • FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)
    • This could be easily rectified - size encoding 2'b10 is free
    • Update: V2.2 has a decimal FP extension placeholder, but no half-precision placeholder. The mind kinda boggles.
  • How FP values are represented in the FP register file is unspecified but observable (by load/store)
    • Emulator authors will hate you
    • VM migration may become impossible
    • Update: V2.2 requires NaN boxing wider values
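
The extra decode that the JALR point refers to can be sketched in C. This is a minimal illustration, assuming the classification given in the list above with x1 ("R1") as the only link register; the enum and function names are hypothetical. The field positions (opcode in bits 6:0, rd in 11:7, funct3 in 14:12, rs1 in 19:15) are those of the standard I-type format.

```c
#include <assert.h>
#include <stdint.h>

enum jalr_kind { NOT_JALR, JALR_CALL, JALR_RETURN, JALR_INDIRECT, JALR_OTHER };

/* Classify by register specifiers, following the list above. */
enum jalr_kind classify_jalr_regs(unsigned rd, unsigned rs1) {
    assert(rd < 32 && rs1 < 32);
    if (rd == 1)
        return JALR_CALL;                       /* link written to x1 */
    if (rd == 0)
        return rs1 == 1 ? JALR_RETURN           /* jump via x1, no link */
                        : JALR_INDIRECT;        /* jump via other reg */
    return JALR_OTHER;                          /* the "weirdo" case */
}

/* Decode a 32-bit instruction word; JALR is opcode 0x67, funct3 0. */
enum jalr_kind classify_insn(uint32_t insn) {
    if ((insn & 0x7f) != 0x67 || ((insn >> 12) & 0x7) != 0)
        return NOT_JALR;
    return classify_jalr_regs((insn >> 7) & 0x1f, (insn >> 15) & 0x1f);
}
```

A branch predictor has to make this decision before it knows whether to push or pop its return stack, so the register comparisons sit on the fetch path rather than being implied by a distinct opcode.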

The Bad

  • No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its implications:
    • Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
    • No conditional selects (useful for highly unpredictable branches)
    • No add with carry/subtract with carry or borrow
    • (Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)
  • Highly precise counters seem to be required by the user level ISA. In practice, exposing these to applications is a great vector for side-channel attacks
  • Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not
  • No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics inexpensive (only 1 bit of CPU state required for minimal single CPU implementations).
  • LR/SC are in the same extension as more complicated atomic instructions, which limits implementation flexibility for small implementations
  • General (non LR/SC) atomics do not include a CAS primitive
    • The motivation is to avoid the need for an instruction which reads 5 registers (Addr, CmpHi:CmpLo, SwapHi:SwapLo), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress LR/SC which is provided to replace it
  • Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit
  • For RV32I, no way to transfer a DP FP value between the integer and FP register files except through memory
  • RV32I's 32-bit ADD and RV64I's 64-bit ADD share an encoding, and RV64I adds a different ADDW encoding for 32-bit addition. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead
  • No MOV instruction. The MV assembler alias is implemented as MV rD, rS -> ADDI rD, rS, 0. MOV optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical MV requires ORing a 12-bit immediate
    • Absent a MOV instruction, ADD rD, rS, r0 would actually be a preferable canonical MOV as it is easier to decode and CPUs normally have special case logic for recognizing the zero register
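
As an illustration of the "no add with carry" point, here is a hedged C sketch of multi-word addition on a machine without a carry flag. The function name and the little-endian word layout are assumptions made for the example. Each word needs two unsigned comparisons (SLTU on RISC-V) to recover the carry that an ADC instruction would have consumed and produced implicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-word integers (least significant word first) without a
   carry flag. Returns the carry out of the most significant word. */
uint32_t add_multiword(const uint32_t *a, const uint32_t *b,
                       uint32_t *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t sum = a[i] + b[i];     /* add                          */
        uint32_t c1 = sum < a[i];       /* sltu: carry out of a + b     */
        uint32_t total = sum + carry;   /* add                          */
        uint32_t c2 = total < sum;      /* sltu: carry out of + carry   */
        out[i] = total;
        carry = c1 | c2;                /* or: at most one of them set  */
    }
    return carry;
}
```

With a flags register the loop body is a single add-with-carry per word; here it is roughly five ALU operations per word, all on the dependency chain through `carry`.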

The Ugly

  • JAL wastes 5 bits encoding the link register, which will always be R1 (or R0 for branches)
    • This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)
    • This is a regression from the v1.0 ISA!
  • Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped)
    • It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation.
  • No loads with register offsets (Rbase+Roffset) or indexes (Rbase+Rindex << Scale).
  • FENCE.I implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer
  • In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation
    • Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue
  • No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later variants
    • Common examples of pure "NOP hints" are things like spinlock yields.
    • More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)
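
The counter-reading sequence described above for RV32I can be sketched in C as follows. This is an illustration only: `read_counter_hi` and `read_counter_lo` are hypothetical stand-ins for the two 32-bit counter reads, backed here by a simulated counter that advances by one on every read so that the wrap-around case can be exercised.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated 64-bit counter (an assumption for this example): it
   advances by one tick each time either half is read. */
static uint64_t sim_counter = 0;
static uint64_t sim_tick(void) { return sim_counter++; }

uint32_t read_counter_hi(void) { return (uint32_t)(sim_tick() >> 32); }
uint32_t read_counter_lo(void) { return (uint32_t)sim_tick(); }

/* Naive read: torn if the low half wraps between the two reads. */
uint64_t read_counter64_naive(void) {
    uint32_t hi = read_counter_hi();
    uint32_t lo = read_counter_lo();
    return ((uint64_t)hi << 32) | lo;
}

/* Read upper half, lower half, upper half again; retry on mismatch. */
uint64_t read_counter64(void) {
    uint32_t hi, lo, hi2;
    do {
        hi = read_counter_hi();
        lo = read_counter_lo();
        hi2 = read_counter_hi();
    } while (hi != hi2);                /* carry crossed the halves */
    assert(hi == hi2);                  /* both upper reads agree   */
    return ((uint64_t)hi << 32) | lo;
}
```

Starting the simulated counter at 0x1FFFFFFFF, the naive version returns 0x100000000 (upper half from before the wrap, lower half from after it), while the retrying version notices the mismatch and reads again. A "read pair" instruction would make the loop, the compare and the branch unnecessary.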

experiment9123 commented Apr 15, 2021

disagree on the conclusions here.
[1] "lack of mov" ?? come on, with 3 operand arithmetic you hardly need MOV. Now I suspect some compilers out there suffer from starting life on x86 and being adapted to a RISC chip (i was very suspicious that the ms xbox360 compilers had this problem back in the day). I smell a similar origin to this "concern".
[2] array indexing? most data is in structs - calculate an indexed struct ptr then access a few values from it (they've got immediate offsets). For cache-friendliness most data is accessed serially.. eg advancing an index through arrays you'd just have an elem pointer and increment it (C had all the *p++ syntax for writing this but these days compilers can figure this out for you). And yes you do need some true indexing in real programs, but you've got bigger problems than the instruction count if you're doing a lot of that, like 10s-100's of cycle cache misses that will dwarf the couple of extra cycles for address gen here
[3] 'compare on branch' instead of condition-codes's is a strength, not a weakness, it makes superscalar/OOOE easier.

multiply+divide fair enough, there's a case for one and not the other. but there's also a case for not bothering with int multiply, i.e. using an FPU for anything involving 'serious' arithmetic.

the other points.. meh


JoeUX commented Apr 15, 2021

Good post. The decisions in RISC-V are strange. Especially strange was their explanation for emphasizing 128-bit. They simply made a crude inference from the past transition from 32-bit to 64-bit. They just assumed that a bit doubling is inevitable, because time flows in one direction or something. It was incredible. Their reasoning took no account of any facts of reality that might bear on this specific 64-bit to 128-bit transition. Facts like the tapering off of increases in RAM. RAM isn't going to forever increase at the rate it did during some arbitrary time period like the 1990s. Mobile is plateauing at 8 or 12 GB (less for iPhones), and desktop is plateauing at 16/32/64 GB (often less on MacBooks). There aren't many use cases for more RAM at this point, except for HPC clusters and so forth. And you don't actually need a 2^128 byte address space to use more than 2^64 of physical RAM anyway. (And you get over 18,000 petabytes with a 64-bit space...)

So that was just bizarrely bad reasoning on their part. And their approach to vectors seems based on "It was this way on some Crays we worked on in the past." That's it. There's no real rationale beyond that.

Academics have a lot of underexposed blind spots. They represent a distinct human culture, I mean anthropologically. And it's not a particularly enlightened or rational culture, though they do like it when people see them as elevated and wise scholars or scientists. They don't actually use any distinct methods in their reasoning or decision-making – there's no formal system in how the RISC-V academics make decisions. There's no account of cognitive foibles, no applied epistemology here, no sophistication or rigor or innovation in the process itself. That's important to understand, I think. They're just vanilla humans using informal methods, often consisting of arbitrary preference enforcement. A rigorous method is probably not easily available, and it wouldn't occur to them to build such a method. So there is a real cost to RISC-V being an academic project – the low quality reasoning, the specific cultural flavor of the arbitrariness, and the rationalism would probably enable an informed observer to identify it as an academic project without any other indication that it was.


experiment9123 commented Apr 16, 2021

". There aren't many use cases for more RAM at this point, except for HPC clusters and so forth. "

thats exactly how they justify it - they're looking for niches from the very small (IoT) to the very large (supercomputers)

And their approach to vectors seems based on "It was this way on some Crays we worked on in the past." That's it. There's no real rationale beyond that.

No rationale? this is what have GPUs evolved into - it's why all the heavy compute is done on those now - it would be more elegant to have that functionality embedded in the CPU. our current situation of offloading the main compute to a peripheral is messy.
ARM have done similar with SVE2. Intel would prefer to have done it (AVX-512/Larrabee), it's just they couldn't compete with GPUs (now, you could say "riscv would not compete with GPUs either" but imagine building a RISC-V based PCIe vector card to do AI and Crypto)
