Presentation material for the SVFIG 2016-Aug-27 meeting.
Intel Decodes CISC to RISC,
Can We Decode RISC-V to MISC?
Samuel A. Falvo II
2016-Aug-27
Yes.
IF you are super-careful with your instruction set design,
and with how you implement your MISC instruction decoder.
BACKGROUND
I need a 64-bit RISC-V CPU to power the Kestrel-3.
I need it NOW if I'm to make the RISC-V Workshop this coming November,
because I still need to implement the external RAM controller and bus bridges
at the very least. I also need to tweak MGIA and/or replace it with CGIA.
The problem is, naive implementations are too big to fit in the iCE40HX-4K FPGA.
I tried to implement the following micro-architectures:
* 5-stage pipeline
- 32-bit instruction fetch Wishbone port.
- 64-bit data read/write Wishbone port.
- Requires additional 64b-to-16b Wishbone bus bridge.
- Never made it past instruction fetch logic in Verilog.
* Hardware microsequencing
- Two attempts made.
- Each attempt failed due to loss of intellectual control
over the design; things just got too complex too quickly.
- The first attempt was based on the 6502's decoder PLA circuitry.
- The second attempt used horizontal microcode;
it didn't even make it to a Verilog prototype.
* Stack-ops decode.
- Crude prototype to serve as a proof-of-concept only.
- Resulted in S64X7 64-bit MISC CPU.
- Implemented in three days, sans interrupt support.
- Initial design NOT intended to be fast, only correct.
STACK-OPS DECODE EXAMPLES
ADDI Xd, Xs, 1234
--------------------------------------
Xs RA! RF@ LIT(1234) + Xd RF! NEXT
JAL Xd, 1234
---------------------------------------
PC@ Xd RF! PC@ LIT(1234) + PC! NEXT
CSRRS Xd, Xs, mtime
------------------------------------
LIT(mtime) CSR@ Xd RF! Xs RA! RF@
LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT
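
To make the mapping concrete, here is a rough Verilog sketch of the kind of
dispatch the StackOp decoder performs: a combinational case on the RISC-V
major opcode that selects the start of a canned MISC micro-program in a small
ROM. The signal names and ROM address map are my own inventions for
illustration, not the S64X7's actual implementation.

// Hypothetical sketch: map a RISC-V major opcode to the start address of
// a canned MISC micro-program held in an on-chip ROM.  Names and the
// address map are invented for illustration only.
module stackop_dispatch (
    input  wire [31:0] insn,       // fetched RISC-V instruction
    output reg  [7:0]  uprog_base, // first MISC op of the micro-program
    output reg         illegal     // no micro-program for this opcode
);
    wire [6:0] opcode = insn[6:0];

    always @* begin
        illegal    = 1'b0;
        uprog_base = 8'h00;
        case (opcode)
            7'b0010011: uprog_base = 8'h10;  // OP-IMM (e.g. ADDI)
            7'b1101111: uprog_base = 8'h20;  // JAL
            7'b1110011: uprog_base = 8'h30;  // SYSTEM (e.g. CSRRS)
            default:    illegal    = 1'b1;   // trap on anything undecoded
        endcase
    end
endmodule

In the naive decoder, those canned micro-programs are then executed one MISC
op per clock, substituting fields like Xs, Xd, and the immediate as they go;
that is where the 1/8th-to-1/16th figure below comes from.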
DATA FLOW/PIPELINE LAYOUT
       +-------------------+
   +-->| Instruction Fetch |
   |   +-------------------+
   |             |
   |             | (RISC-V instruction)
   |             V
   |   +-------------------+
   |   |  StackOp Decoder  |
   |   +-------------------+
   |             ^
   |             | (Complete MISC program implementing RISC-V semantics)
   |             V
   |   +-------------------+       +---------------+
   +---|     MISC Core     |<----->| Register File |
       +-------------------+       +---------------+
                 ^
                 |
                 V
       +-------------------+
       |  Data Memory I/O  |
       +-------------------+
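
As a rough sketch of how the boxes above might connect, here is a hypothetical
port list for the StackOp Decoder stage. The bus widths and handshake signals
are assumptions on my part, not taken from the S64X7 source.

// Hypothetical port list for the StackOp Decoder block in the diagram above.
// Widths and handshake signals are assumptions, not the S64X7's actual ports.
module stackop_decoder (
    input  wire        clk,
    input  wire        reset,

    // From Instruction Fetch: one RISC-V instruction at a time.
    input  wire [31:0] insn,
    input  wire        insn_valid,

    // To the MISC core: one MISC operation per cycle, pulled on demand.
    output wire [15:0] misc_op,
    output wire        misc_op_valid,
    input  wire        misc_op_ack,   // core asks for the next op

    // Asserted on the last op of the micro-program, so fetch can advance.
    output wire        uprog_done
);
    // ... micro-program ROM and sequencing logic would go here ...
endmodule

The MISC core pulls one op per cycle through this interface; a macro-op fusing
version would widen misc_op so that a whole fused group can cross in one cycle.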
So, yes, it IS POSSIBLE to do this.
BUT, with a naive MISC instruction decoder,
performance is only 1/8th to 1/16th
of the native MISC instruction execution rate.
Which is surprisingly quick, when you think about it!
That is competitive with the TMS9900 microarchitecture!
Just not fast enough to meet my requirements.
I need about 4x this instruction execution rate
to meet my desired usability goals.
SPEED UP TECHNIQUE: Customized MISC Instructions
You've already seen an example:
- Asymmetric register I/O
- Register reads need two cycles.
- Register writes need one cycle.
RA! ( reg# - ) RF@ ( - n )
RF! ( n reg# - )
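
Here is a minimal Verilog sketch of the asymmetric register port described
above, with my own signal names and an assumed 32 x 64-bit file: RA! latches
the read address on one cycle, RF@ sees the data the next cycle, and RF!
writes in a single cycle.

// Minimal sketch of an asymmetric register-file port (invented signal names).
// A read takes two cycles: RA! latches the address, RF@ consumes the data
// the following cycle.  A write (RF!) completes in a single cycle.
module rf_asym (
    input  wire        clk,
    input  wire        ra_latch,   // RA!  : capture read address
    input  wire [4:0]  ra,
    input  wire        wr_en,      // RF!  : single-cycle write
    input  wire [4:0]  wa,
    input  wire [63:0] wd,
    output wire [63:0] rd          // RF@  : valid the cycle after RA!
);
    reg [63:0] regs [0:31];
    reg [4:0]  ra_q;

    always @(posedge clk) begin
        if (ra_latch) ra_q <= ra;
        if (wr_en)    regs[wa] <= wd;
    end

    assign rd = regs[ra_q];
endmodule

On the iCE40, this maps naturally onto a block RAM, whose synchronous read
port likewise delivers data one cycle after the address is presented, while
the write completes in the cycle it is issued.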
Asymmetric register access has long been a staple stack-CPU design technique.
See, e.g., the WISC CPU/16 in:
Stack Computers: The New Wave
-- Philip Koopman.
https://users.ece.cmu.edu/~koopman/stack_computers
But, this has limitations. Eventually, you'll run out of opcode space.
SPEED UP TECHNIQUE: Macro-Op Fusion
In English, "decode and execute entire combinations of instructions as one."
This is NOT superscalar execution. We still have only one execution engine.
However, with sufficiently complex instruction decoding, you can come
very close to superscalar performance.
With a 4-way superscalar architecture, and assuming straight-line code,
you can expect an *average* 4x performance boost. Ergo, with the same
constraints in a macro-op fused execution engine, you can achieve close
to the same performance boost.
(This is *particularly* true with a stack CPU, such as a MISC core!)
MACRO-OP FUSION EXAMPLE
ADDI Xd, Xs, 1234
--------------------------------------
Xs RA! RF@ LIT(1234) + Xd RF! NEXT
------ === ---------------- =========
 (1)   (2)       (3)           (4)
JAL Xd, 1234
---------------------------------------
PC@ Xd RF! PC@ LIT(1234) + PC! NEXT
--- ====== --------------- ==========
(1)  (2)         (3)          (4)
CSRRS Xd, Xs, mtime
------------------------------------
LIT(mtime) CSR@ Xd RF! Xs RA! RF@
--------------- ====== ------ ===
      (1)         (2)   (3)   (4)
LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT
------------------ --------------------
        (5)                 (6)
Throughput is 4 cycles on average; 6 cycles in the worst case.
So, the combination of customized instructions PLUS macro-op fusion
yields an *average* performance of 4 clocks per RISC-V instruction.
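
As a sketch of what the fusion itself might look like in hardware, the
fragment below peeks at the next two MISC ops and, when it sees a literal push
followed by an ALU add (the LIT(1234) + pattern above), collapses them into
one control word that executes in a single cycle. The opcode encodings, field
layout, and signal names are hypothetical.

// Hypothetical macro-op fusion: if the next MISC op pushes a literal and the
// op after it is an ALU add, issue both as one control word in one cycle.
module fuse_lit_alu (
    input  wire [15:0] op0,        // next MISC op in the micro-program
    input  wire [15:0] op1,        // the op after that
    output wire        fuse,       // true: issue both ops this cycle
    output wire [63:0] lit_value,  // sign-extended literal for the ALU
    output wire        alu_add     // tell the ALU to add this cycle
);
    // Opcode encodings and field layout are invented for illustration.
    localparam OP_LIT = 4'h1;      // push a 12-bit literal onto the data stack
    localparam OP_ADD = 4'h2;      // add the top two data-stack items

    wire [3:0] op0_code = op0[15:12];
    wire [3:0] op1_code = op1[15:12];

    assign fuse      = (op0_code == OP_LIT) && (op1_code == OP_ADD);
    assign lit_value = {{52{op0[11]}}, op0[11:0]};  // sign-extend the literal
    assign alu_add   = fuse;
endmodule

A fused decoder needs a recognizer like this for every group boundary shown
above, which is exactly the "sufficiently complex instruction decoding"
referred to earlier.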
To put this into perspective, the Motorola 68000 had an average
throughput of 12 cycles per instruction. While the fastest
instructions required only 4 clocks, MOST instructions varied
between 8 and 16 clocks, depending on the addressing mode used
and the general class of instruction.
(8+16)/2 = 12. :)
On average, despite all this complexity, a MISC implementation
of the RISC-V ISA would still be about 50% faster (depending on workload)
than an MC68000 at the same clock speed.
RESULTS SO FAR
I developed the S64X7 MISC core. You can find it on GitHub at:
https://github.com/sam-falvo/S64X7
Features:
* 64-bit wide stacks, address bus, and data path
* 6 data stack entries
* 5 return stack entries -- my first *true* Forth CPU!
* One instruction per clock! No macro-op fusion!
I never got far enough to implement macro-op fusion;
I wanted something that worked correctly but slowly first.
However, I never went beyond the S64X7 design because,
out of 7680 logic cells in an iCE40HX-8K, it uses almost 6200 cells!
This is too big to fit in an HX-4K device, and besides,
I don't know if enough room exists for the 64b-to-16b bus bridge.
There almost certainly won't be enough room for macro-op fusion logic.
(But the core does work!)
CONCLUSION
Can you use MISC to implement a RISC in a reasonable time-frame?
YES!!
Will it be small enough to fit in a reasonable-sized FPGA?
YES!!
Will it fit in an iCE40HX-4K or -8K device?
4K: NO, it's just too big even in the simplest configuration.
8K: YES, but only barely.
Will it fit in a Xilinx FPGA?
Yes; with lots of room to spare.
But then again, for these FPGAs,
I'd just go back to using a 5-stage pipeline design.
THE END
Q & A