mbitsnbites/StreamingCPU.md

## StreamingCPU.md

      
Idea

Minimize logic in each pipeline stage to minimize design complexity and maximize
clock speed (roughly the same approach as MIPS, but more extreme).

No speculative branches.

Use branch delay slots.


No operand forwarding.

All instructions have the same latency.
I.e. every instruction has trailing "delay slots".


No data hazard resolution.

Exception: Cache misses (if applicable) cause the entire pipeline to
stall.
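
The effect of "no forwarding, fixed latency" on software can be sketched with a toy model. This is only an illustration: the value of `DELAY_SLOTS`, the tuple-based instruction format and the `run` helper are assumptions made for the example, not part of the design.

```python
from collections import deque

DELAY_SLOTS = 2  # assumed number of trailing delay slots per instruction


def run(program, num_regs=16, delay_slots=DELAY_SLOTS):
    """Run (dest, fn, src_a, src_b) tuples with a fixed write-back latency.

    A result produced by instruction i first becomes readable by
    instruction i + delay_slots + 1. dest=None marks a NOP.
    """
    regs = [0] * num_regs
    in_flight = deque()  # entries: (first_index_that_sees_it, dest, value)

    def write_back(dest, value):
        if dest is not None:
            regs[dest] = value

    for i, (dest, fn, a, b) in enumerate(program):
        # Write back every result whose latency has elapsed.
        while in_flight and in_flight[0][0] <= i:
            _, d, v = in_flight.popleft()
            write_back(d, v)
        # Operands come from the register file only: no forwarding.
        in_flight.append((i + delay_slots + 1, dest, fn(regs[a], regs[b])))

    # Drain the results still in the pipeline.
    for _, d, v in in_flight:
        write_back(d, v)
    return regs


LI5 = (1, lambda a, b: 5, 0, 0)      # r1 = 5
ADD = (2, lambda a, b: a + b, 1, 1)  # r2 = r1 + r1
NOP = (None, lambda a, b: 0, 0, 0)

stale = run([LI5, ADD])            # ADD reads r1 too early
fresh = run([LI5, NOP, NOP, ADD])  # both delay slots filled with NOPs
```

In this model `stale[2]` is 0 because the ADD still sees the old value of r1, while `fresh[2]` is 10. The compiler or programmer, not the hardware, is responsible for the spacing.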


Possible optimization to reduce the number of delay slots: Partition the
execution part of the pipeline into several pipelines where each execution
pipeline has its own register file (e.g. integer + float + fixed point). This
would require a straightforward way to transfer data between register files.

Pipeline

       Branch                Write back
   _______________    _______________________
  /               \  /                       \
 v                 \v                         \
 PC -> IF -> ID -> RF -> EX1 -> ... -> EXn -> WB
       ^                         ^
       |                         v
    ICache                     DCache

Branch - 2 delay slots:

BN branch if register is negative, PC+immediate
BP branch if register is positive, PC+immediate
BA branch always, PC+immediate
J jump always, register address
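
A small trace model shows what two branch delay slots mean for the instruction stream. It is a sketch under assumptions: the target is taken to be relative to the branch's own index and counted in instructions, only `BA` is modeled, and a branch placed inside a delay slot is simply ignored.

```python
BRANCH_DELAY_SLOTS = 2


def trace(program, max_steps=32):
    """Return the indices of the instructions that execute, in order.

    program is a list of (mnemonic, argument) pairs; ("BA", offset)
    branches to index + offset after the delay slots have executed.
    """
    pc, executed = 0, []
    pending_target, slots_left = None, 0
    while 0 <= pc < len(program) and len(executed) < max_steps:
        executed.append(pc)
        kind, arg = program[pc]
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction sits in a delay slot of an earlier branch.
            slots_left -= 1
            if slots_left == 0:
                next_pc = pending_target
                pending_target = None
        elif kind == "BA":
            pending_target, slots_left = pc + arg, BRANCH_DELAY_SLOTS
        pc = next_pc
    return executed


program = [
    ("BA", 4),     # 0: branch to index 4
    ("OP", None),  # 1: delay slot, always executes
    ("OP", None),  # 2: delay slot, always executes
    ("OP", None),  # 3: skipped
    ("OP", None),  # 4: branch target
]
order = trace(program)
```

Here `order` is `[0, 1, 2, 4]`: the two instructions after the branch always run, and only then does control reach the target.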

Compare - set all bits of register to 1/0 if true/false:

SEQ, SNE, SLT, SLTU, etc.
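
One reason all-ones/all-zeros results are convenient is that they work directly as bit masks. The following sketch assumes 32-bit registers and two's complement; the `select` helper is hypothetical and only shows how such a mask can pick between two values without a branch.

```python
WIDTH = 32  # assumed register width
MASK = (1 << WIDTH) - 1


def to_signed(x):
    """Interpret a WIDTH-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


def seq(a, b):
    return MASK if (a & MASK) == (b & MASK) else 0


def slt(a, b):
    return MASK if to_signed(a) < to_signed(b) else 0


def sltu(a, b):
    return MASK if (a & MASK) < (b & MASK) else 0


def select(mask, x, y):
    """Return x where mask bits are 1 and y where they are 0."""
    return ((x & mask) | (y & ~mask)) & MASK


smaller = select(slt(2, 5), 2, 5)  # branch-free minimum of 2 and 5
```

An all-ones result is also negative when read as a two's complement number, so a compare result can feed `BN` directly.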

Conditionals:

Conditional write-back is simple to implement.
E.g. "discard result of next instruction if not true".
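
A minimal sketch of the "discard result of next instruction" idea follows. The `IFT` mnemonic and the instruction tuples are made up for this example, and write-back latency is ignored to keep the focus on the predicate.

```python
def run_predicated(program, regs):
    """Execute a program where ("IFT", r) guards the next instruction.

    ("IFT", r) is a hypothetical mnemonic: if register r is zero (false),
    the result of the following instruction is discarded at write-back.
    ("OP", dest, fn, a, b) computes fn(regs[a], regs[b]) into dest.
    """
    regs = list(regs)
    discard_next = False
    for instr in program:
        if instr[0] == "IFT":
            discard_next = regs[instr[1]] == 0
            continue
        _, dest, fn, a, b = instr
        result = fn(regs[a], regs[b])  # the instruction always executes
        if not discard_next:
            regs[dest] = result        # only the write-back is conditional
        discard_next = False
    return regs


def add(a, b):
    return a + b


guarded_add = [("IFT", 1), ("OP", 4, add, 2, 3)]
kept = run_predicated(guarded_add, [0, 0xFFFFFFFF, 7, 9, 0])
dropped = run_predicated(guarded_add, [0, 0, 7, 9, 0])
```

With a true condition `kept[4]` is 16; with a false one `dropped[4]` stays 0. The pipeline flow is identical in both cases, which is what keeps the mechanism simple.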

Registers:

16 GP registers (per pipeline, e.g. integer + float?).
Possibly only SIMD registers?
Size? 32/64/more bits?

Instruction encoding

Use fixed-size 32-bit instruction words.
Pros:

Makes it easier to keep a constant stream of instructions (one per clock).

Cons:

Loading 32-bit or 64-bit immediate values is cumbersome without proper operand forwarding.
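
To make the problem concrete: with 16-bit immediates, a 32-bit constant has to be assembled in two dependent steps. The sketch below assumes a hypothetical "load high half" step followed by an OR with a zero-extended low half; neither instruction is defined in these notes.

```python
def split_imm32(value):
    """Split a 32-bit constant into (high, low) 16-bit halves."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF


def build_imm32(hi, lo):
    # Step 1 (hypothetical "load high"): rd = hi << 16
    rd = (hi << 16) & 0xFFFFFFFF
    # Without forwarding, step 2 cannot read rd until the result of
    # step 1 has been written back, so delay slots separate the two.
    # Step 2 (hypothetical "or immediate"): rd = rd | lo
    return rd | (lo & 0xFFFF)


hi, lo = split_imm32(0xDEADBEEF)
value = build_imm32(hi, lo)
```

Because step 2 reads the register written by step 1, the delay slots between them must be filled with independent work or NOPs.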

Suggestion:
 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+-----------+---+-------+-------+-------+-----------------------+
|Op         |1 1| Rd    | Ra    | Rb    | ? (shift/mask/func?)  | <- ALU
+-----------+---+-------+-------+-------+-----------------------+

+-----------+---+-------+-------+-------------------------------+
|Op         |1 0| Rd    | Ra    | Imm16                         | <- Load+ALU
+-----------+---+-------+-------+-------------------------------+

+-----------+---+-------+-------+-------+-----------------------+
|Op         |0 1| Imm4  | Ra    | Rb    | Imm12                 | <- Store
+-----------+---+-------+-------+-------+-----------------------+

+-----------+---+-------+-------+-------------------------------+
|Op         |0 0| Imm4  | Ra    | Imm16                         | <- Branch
+-----------+---+-------+-------+-------------------------------+
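
The layouts above can be checked with a small encoder/decoder. This sketch reads the field widths off the diagrams (6-bit Op, 2-bit format, 4-bit register fields, 12- or 16-bit tail). The function names and the `low12` key are assumptions, and the Store/Branch `Imm4` is returned separately because the notes do not say how it combines with the other immediate.

```python
FMT_ALU, FMT_LOAD_ALU, FMT_STORE, FMT_BRANCH = 0b11, 0b10, 0b01, 0b00


def encode_alu(op, rd, ra, rb, func=0):
    """Op[31:26] | 11 | Rd[23:20] | Ra[19:16] | Rb[15:12] | func[11:0]"""
    return ((op & 0x3F) << 26) | (FMT_ALU << 24) | ((rd & 0xF) << 20) \
        | ((ra & 0xF) << 16) | ((rb & 0xF) << 12) | (func & 0xFFF)


def encode_load_alu(op, rd, ra, imm16):
    """Op[31:26] | 10 | Rd[23:20] | Ra[19:16] | Imm16[15:0]"""
    return ((op & 0x3F) << 26) | (FMT_LOAD_ALU << 24) | ((rd & 0xF) << 20) \
        | ((ra & 0xF) << 16) | (imm16 & 0xFFFF)


def decode(word):
    """Split a 32-bit word into the fields of its format."""
    fields = {
        "op": (word >> 26) & 0x3F,
        "fmt": (word >> 24) & 0x3,
        "ra": (word >> 16) & 0xF,
    }
    top4 = (word >> 20) & 0xF
    if fields["fmt"] in (FMT_ALU, FMT_LOAD_ALU):
        fields["rd"] = top4
    else:
        fields["imm4"] = top4  # Store and Branch reuse the Rd slot
    if fields["fmt"] in (FMT_ALU, FMT_STORE):
        fields["rb"] = (word >> 12) & 0xF
        fields["low12"] = word & 0xFFF  # func (ALU) or Imm12 (Store)
    else:
        fields["imm16"] = word & 0xFFFF
    return fields


alu_word = encode_alu(op=5, rd=1, ra=2, rb=3, func=0x7)
load_word = encode_load_alu(op=1, rd=4, ra=5, imm16=0xBEEF)
```

Note that `Ra` sits in the same bit positions in all four formats, and `Rb` in both formats that have it, so the register file read can start before the format bits are fully decoded.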

Also consider VLIW.