mbitsnbites/vector-extensions-for-tiny-risc.md

## vector-extensions-for-tiny-risc.md

      
    Raw
  

              vector-extensions-for-tiny-risc.md
            
          
    Vector extensions for tiny RISC machines

This is an idea for adding vector processing support to tiny RISC style microcontroller CPU:s (e.g. similar to FEMTORV32).
It could be especially beneficial for CPU:s without a pipeline, where each instruction normally takes 2+ clock
cycles to complete.
The purpose of vector processing is to reduce the number of clock cycles required for each operation:

Loop logic overhead is reduced (increment, compare, branch) as several data elements are processed in each loop iteration.
Non-pipelined implementation: Instruction fetch & decode cycles are eliminated while iterating over a vector.
Pipelined in-order implementation: RAW hazard stalls are eliminated, keeping the execute pipeline busy during vector operations.

In essence, this proposal can provide a performance boost of 2x or better for certain workloads, with nearly zero added
hardware complexity.
Note: This proposal should be considerably easier to implement in hardware than the RISC-V V extension, for instnace.
Required architectural additions

New instructions

New instructions are added that behave exactly the same way as regular scalar instructions (e.g. ADD, LW),
except that they act on vector registers instead of scalar registers.
Ideally the new instructions are encoded in the same way as the scalar counterparts, except for a few bits that are
flipped in the instruction word. This makes decoding simpler than if completely new instruction encodings are used.
For RISC-V implementations that only support 32-bit instructions (e.g. no compressed instructions), a simple solution
is to encode the vector mode in the two least significant bits (according to The RISC-V Instruction Set Manual
Volume I: Unprivileged ISA, 20191213,
§26.3). The bits could be used for encoding operand types as follows:


Bits 1..0
Destination
Source 1
Source 2


00
Vector
Vector
Vector


01
Vector
Vector
Scalar


10
Vector
Scalar
Vector


11
Scalar
Scalar
Scalar


Vector length register

A vector length ("VL") register needs to be added. Its size will typically be at most 5 bits (supporting vector lengths
up to 32 elements). Writing to the register can be done in one of the following ways:

Add a setvl instruction that copies an integer value to the VL register. For instance for RISC-V, there is already
a VSETVL instruction in the V extension that could be used for this purpose.
Alternatively, one of the existing scalar registers could be designated as a mirror of the VL register. For instance
for RISC-V, when any instruction writes to the X31 register the VL register could be written to simultaneously.
Note that in this case a min instruction can come in handy for low-overhead vector length calculation in vector
loops. RISC-V has a min instruction in the B extesion.

Vector iteration logic

A vector element counter needs to be added to the implementation. Its size is the same as the vector length register.
Whenever a vector (or scalar) operation is started the counter is set to zero, and on each subsequent clock cycle the
counter is incremented until it reaches the value of the VL register.
A new "VECTOR OPERATION ACTIVE" state needs to be added. For a pipelined implementation this state will stall all stages
before the execute stage. For a non-pipelined implementation it will be a specific state after DECODE and before the
next FETCH.
The vector element counter is used for indexing vector elements within a single vector register.
Vector register file

An extra register file

A natural choice is to add a new register file dedicated to vector registers. It could contain as many vector registers as
the scalar register file (e.g. 32 registers), since the vector instructions typically has the same register address size
(e.g. 5 bits) as the scalar instructions. However, since vector registers may require lots of hardware resources, and most
programs do not need as many vector registers as scalar registers, it could be desirable to only have 1/2 or 1/4 of the
number of scalar registrs. E.g. 8 vector registers is sufficient for most applications.
The length of each vector register is an important but flexible design decision. Anything between 4 and 32 vector elements
per vector register should be fine.
Ultra-light version: Virtual registers

To reduce hardware resource usage (e.g. FPGA LUT:s), vector registers can instead be mapped on top of the existing scalar
register file. Thus a virtual vector register could be [R8,R9,R10,R11], for instance.
A couple of options exist for how to map virtual vector registers to scalar registers:

Overlapping: V[n][k] -> R[n+k]
Segmented: V[n][k] -> R[n<<N | k]

It is also possible to extend the scalar register file, e.g. from 32 registers to 64 registers. The upper registers
would only be accesible by vector instructions (this is how it is done in Simple-V
in libreSOC, for instance).
Benefits:

Easy to mix scalar and vector data and operations.
Easy to perform vector element manipulation and permutations (by modifying the underlying scalar registers).
Accelerate certain scalar operations (e.g. push many scalar registers to the stack).

Cons:

Conflicts with scalar register allocation.
Not as many vector registers / elements as with a dedicated VRF.

Memory address generation

You usually want to treat loads/stores as a special case for vector operations. For instance, scalar base index + stride
can be more useful than vector base index + offset at times. Here, stride can be [0, 4, 8, 12, ...] for instance.
Generating a series of linearly incrementing addresses on-the-fly requires slightly more hardware:

An address offset calculator is required (a registered adder). It is cleared to zero on the first vector cycle, and incremented by a fixed value on each consecutive cycle. The increment value is the stride operand of the instruction (typically the second source operand), which should be a scalar value (usually an immediate value).
An extra MUX is most likely required for making the selection of scalar offset vs stride offset.

Looking back at the suggested operand type encoding above, stride based operations could be used for imemdiate operand instructions (e.g. LW and ADDI for RISC-V) when the Vector <- Scalar, Vector operand configuration is used (i.e. the scalar immediate operand is turned into a vector operand by the stride address generator).
Bits 1..0	Destination	Source 1	Source 2
00	Vector	Vector	Vector
01	Vector	Vector	Scalar
10	Vector	Scalar	Vector
11	Scalar	Scalar	Scalar