Created August 28, 2016 03:13
Presentation material for SVFIG 2016-Aug-27 meeting.
Intel Decodes CISC to RISC,
Can We Decode RISC-V to MISC?
Samuel A. Falvo II
IF you are super-careful with instruction set design
and are careful with how you implement your MISC instruction decoder.
I need a 64-bit RISC-V CPU to power the Kestrel-3.
I need it NOW if I'm to make the RISC-V Workshop this coming November,
because I still need to implement an external RAM controller and bus bridges
at the very least. I also need to tweak MGIA and/or replace it with CGIA.
Problem is, naive implementations are too big to fit in the iCE40HX-4K FPGA.
Tried to implement the following micro-architectures:
* 5-stage pipeline
  - 32-bit instruction fetch Wishbone port.
  - 64-bit data read/write Wishbone port.
  - Requires additional 64b-to-16b Wishbone bus bridge.
  - Never made it past instruction fetch logic in Verilog.
* Hardware microsequencing
  - Two attempts made.
  - Each attempt failed due to loss of intellectual control
    over the design; things just got too complex too quickly.
  - First attempt was based on 6502's decoder PLA circuitry.
  - Second attempt tried horizontal microcode.
    It didn't even make it to a Verilog prototype.
* Stack-ops decode.
  - Crude prototype to serve as a proof-of-concept only.
  - Resulted in S64X7 64-bit MISC CPU.
  - Implemented in three days, sans interrupt support.
  - Initial design NOT intended to be fast, only correct.
ADDI Xd, Xs, 1234
    Xs RA! RF@ LIT(1234) + Xd RF! NEXT

JAL Xd, 1234
    PC@ Xd RF! PC@ LIT(1234) + PC! NEXT

CSRRS Xd, Xs, mtime
    LIT(mtime) CSR@ Xd RF! Xs RA! RF@
    LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT
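
The expansions above amount to a lookup from one RISC-V instruction to a fixed list of stack ops. A minimal Python sketch of that idea follows. The tuple encoding, the function name `decode`, and the choice to push register numbers as literals are assumptions made for illustration; none of this is taken from the S64X7 sources.

```python
# Hypothetical sketch: expand one RISC-V instruction into the stack-op
# sequences shown on the slide. The tuple encoding is an assumption.

def decode(mnemonic, xd, xs=None, imm=None):
    """Return the list of MISC stack ops for a single RISC-V instruction."""
    if mnemonic == "ADDI":
        # Xs RA! RF@ LIT(imm) + Xd RF! NEXT
        return [("LIT", xs), ("RA!",), ("RF@",),
                ("LIT", imm), ("+",),
                ("LIT", xd), ("RF!",), ("NEXT",)]
    if mnemonic == "JAL":
        # PC@ Xd RF! PC@ LIT(imm) + PC! NEXT
        return [("PC@",), ("LIT", xd), ("RF!",),
                ("PC@",), ("LIT", imm), ("+",),
                ("PC!",), ("NEXT",)]
    raise ValueError(f"no stack-op expansion sketched for {mnemonic}")


for op in decode("ADDI", xd=5, xs=6, imm=1234):
    print(op)
```

Only ADDI and JAL are sketched here; a real decoder would need one expansion per supported instruction.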
    +-------------------+
+-->| Instruction Fetch |
|   +-------------------+
|             |
|             | (RISC-V instruction)
|             V
|   +-------------------+
|   |  StackOp Decoder  |
|   +-------------------+
|             ^
|             | (Complete MISC program implementing RISC-V semantics)
|             V
|   +-------------------+       +---------------+
+---|     MISC Core     |<----->| Register File |
    +-------------------+       +---------------+
    |  Data Memory I/O  |
    +-------------------+
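
The MISC core box in this diagram can be modelled as a loop that consumes one stack op at a time until it reaches NEXT, at which point control returns to instruction fetch. The Python sketch below is a hypothetical software model, not the S64X7 RTL. The class name `MiscCore`, the tuple encoding of ops, and the handling of register x0 inside RF! are all assumptions.

```python
# Hypothetical model of the "MISC Core" and "Register File" boxes.
MASK64 = (1 << 64) - 1


class MiscCore:
    def __init__(self):
        self.regs = [0] * 32   # RISC-V integer register file
        self.ra = 0            # register-address latch written by RA!
        self.pc = 0
        self.stack = []        # data stack

    def run(self, program):
        """Execute stack ops until NEXT; return the number of ops executed."""
        for count, op in enumerate(program, start=1):
            name = op[0]
            if name == "LIT":
                self.stack.append(op[1] & MASK64)
            elif name == "RA!":          # ( reg# - )
                self.ra = self.stack.pop()
            elif name == "RF@":          # ( - n )
                self.stack.append(self.regs[self.ra])
            elif name == "RF!":          # ( n reg# - )
                reg = self.stack.pop()
                value = self.stack.pop()
                if reg != 0:             # assumption: x0 stays zero here
                    self.regs[reg] = value
            elif name == "+":
                b = self.stack.pop()
                a = self.stack.pop()
                self.stack.append((a + b) & MASK64)
            elif name == "PC@":
                self.stack.append(self.pc)
            elif name == "PC!":
                self.pc = self.stack.pop()
            elif name == "NEXT":         # hand control back to instruction fetch
                return count
            else:
                raise ValueError(f"unknown stack op {name}")
        raise RuntimeError("program ended without NEXT")


# ADDI x5, x6, 1234 written out as: Xs RA! RF@ LIT(1234) + Xd RF! NEXT
core = MiscCore()
core.regs[6] = 1000
addi = [("LIT", 6), ("RA!",), ("RF@",), ("LIT", 1234), ("+",),
        ("LIT", 5), ("RF!",), ("NEXT",)]
ops = core.run(addi)
print(core.regs[5], ops)
```

The model leaves out CSR access, data memory I/O, and the program counter update performed by instruction fetch.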
So, yes, it IS POSSIBLE to do this.
BUT, with a naive MISC instruction decoder,
RISC-V performance varies from 1/8th to 1/16th
of the MISC instruction execution rate.
Which is surprisingly quick, when you think about it!
That is competitive with the TMS9900 microarchitecture!
Just not fast enough to meet my requirements.
I need about 4x this instruction execution rate
to meet my desired usability goals.
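
The 1/8th to 1/16th range can be sanity-checked by counting the stack ops in the three example expansions. This sketch assumes that every whitespace-separated token is one MISC instruction and that the naive core retires one MISC instruction per clock.

```python
# Count stack ops per RISC-V instruction, assuming one token = one MISC op.
sequences = {
    "ADDI":  "Xs RA! RF@ LIT(1234) + Xd RF! NEXT",
    "JAL":   "PC@ Xd RF! PC@ LIT(1234) + PC! NEXT",
    "CSRRS": "LIT(mtime) CSR@ Xd RF! Xs RA! RF@ "
             "LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT",
}

op_counts = {name: len(seq.split()) for name, seq in sequences.items()}

for name, n in op_counts.items():
    print(f"{name}: {n} stack ops -> 1/{n} of the MISC execution rate")
```

Under those assumptions ADDI and JAL each take 8 ops and CSRRS takes 13, which sits inside the quoted range.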
SPEED UP TECHNIQUE: Customized MISC Instructions
You've already seen an example:
- Asymmetric register I/O
- Register reads need two cycles.
- Register writes need one cycle.
RA! ( reg# - ) RF@ ( - n )
RF! ( n reg# - )
It's long been a staple stack CPU design technique.
See, e.g., WISC-16 CPU in:
Stack Computers: The New Wave
-- Philip Koopman.
But, this has limitations. Eventually, you'll run out of opcode space.
SPEED UP TECHNIQUE: Macro-Op Fusion
In English, "decode and execute entire combinations of instructions as one."
This is NOT superscalar execution. We still have only one execution engine.
However, with sufficiently complex instruction decoding, you can come
very close to superscalar performance.
With a 4-way superscalar architecture, and assuming straight-line code,
you can expect an *average* 4x performance boost. Ergo, with the same
constraints in a macro-op fused execution engine, you can achieve close
to the same performance boost.
(This is *particularly* true with a stack CPU, such as a MISC core!)
ADDI Xd, Xs, 1234
    Xs RA!  RF@  LIT(1234) +  Xd RF! NEXT
    ------  ===  -----------  ===========
     (1)    (2)      (3)          (4)
JAL Xd, 1234
    PC@  Xd RF!  PC@ LIT(1234) +  PC! NEXT
    ---  ======  ---------------  ========
    (1)   (2)          (3)          (4)
CSRRS Xd, Xs, mtime
    LIT(mtime) CSR@  Xd RF!  Xs RA!  RF@
    ---------------  ======  ------  ===
          (1)         (2)     (3)    (4)

    LIT(mtime) CSR@ OR  LIT(mtime) CSR! NEXT
    ------------------  --------------------
           (5)                  (6)
Throughput is 4 cycles on average; 6 cycles in the worst case.
So, the combination of customized instructions PLUS macro-op fusion
yields an *average* performance of 4 clocks per instruction.
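
The cycle counts in the three annotated examples can be reproduced with a greedy, longest-match-first pass over each stack-op sequence. The sketch below is hypothetical: the `FUSED` table is one reading of the underlined groups in those examples, and the names `kind` and `fuse` are made up for illustration. It says nothing about how a hardware decoder would recognise these groups.

```python
# Hypothetical fusion table, transcribed from the underlined groups above.
FUSED = [
    ("LIT", "CSR!", "NEXT"),
    ("LIT", "CSR@", "OR"),
    ("PC@", "LIT", "+"),
    ("REG", "RF!", "NEXT"),
    ("LIT", "CSR@"),
    ("LIT", "+"),
    ("PC!", "NEXT"),
    ("REG", "RA!"),
    ("REG", "RF!"),
]


def kind(token):
    """Normalise a token so literals and register numbers match the table."""
    if token.startswith("LIT("):
        return "LIT"
    if token in ("Xs", "Xd"):
        return "REG"
    return token


def fuse(sequence):
    """Greedily split a stack-op sequence into groups, one group per cycle."""
    kinds = [kind(t) for t in sequence.split()]
    groups, i = [], 0
    while i < len(kinds):
        for pattern in sorted(FUSED, key=len, reverse=True):
            if tuple(kinds[i:i + len(pattern)]) == pattern:
                break
        else:
            pattern = (kinds[i],)      # no fusion applies: one op, one cycle
        groups.append(pattern)
        i += len(pattern)
    return groups


examples = {
    "ADDI":  "Xs RA! RF@ LIT(1234) + Xd RF! NEXT",
    "JAL":   "PC@ Xd RF! PC@ LIT(1234) + PC! NEXT",
    "CSRRS": "LIT(mtime) CSR@ Xd RF! Xs RA! RF@ "
             "LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT",
}
cycles = {name: len(fuse(seq)) for name, seq in examples.items()}
print(cycles)
```

With this table ADDI and JAL come out at 4 cycles and CSRRS at 6, matching the numbered groups.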
To put this into perspective, the Motorola 68000 had an average
throughput of 12 cycles per instruction. While the fastest
instructions required only 4 clocks per instruction, MOST
instructions varied between 8 and 16 clocks, depending on the
addressing mode used and the general class of instruction.
(8+16)/2 = 12. :)
On average, despite all this complexity, a MISC implementation
of the RISC-V would still be about 50% faster (depending on workload)
than an MC68000 at the same clock speed.
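
The slides do not show how the 50% figure follows from 12 versus 4 clocks per instruction. One possible reading, which is purely an assumption here, is that RISC-V needs more instructions than the 68000 to do the same work. The sketch below makes that relationship explicit; the `expansion` parameter is hypothetical and its values are not measurements.

```python
m68k_cpi = (8 + 16) / 2   # 12 clocks per instruction, the slide's midpoint
misc_cpi = 4              # fused MISC average from the slide

cpi_ratio = m68k_cpi / misc_cpi   # raw clocks-per-instruction advantage


def speedup(expansion):
    """Speedup over the 68000 at equal clock speed.

    `expansion` is an ASSUMED ratio: RISC-V instructions executed per
    68000 instruction for the same work. It does not come from the slides.
    """
    return cpi_ratio / expansion


for expansion in (1.0, 1.5, 2.0):
    print(f"expansion {expansion}: {speedup(expansion):.2f}x")
```

Under this reading, an expansion factor of 2 gives 1.5x, which corresponds to "about 50% faster"; a factor of 1 would give the full 3x.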
I developed the S64X7 MISC core. You can find it up on GitHub at:
* 64-bit wide stacks, address bus, and data path
* 6 data stack entries
* 5 return stack entries -- my first *true* Forth CPU!
* One instruction per clock! No macro-op fusion!
I never got far enough to implement macro-op fusion;
I wanted something that worked correctly but slowly first.
However, I never went beyond the S64X7 design because,
out of 7680 logic cells in an iCE40HX-8K, it uses almost 6200 cells!
This is too big to fit in an HX-4K device, and besides,
I don't know if enough room exists for the 64b-to-16b bus bridge.
It almost certainly won't have enough room for macro-op fusion logic.
(But the core does work!)
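
The fit argument is simple utilization arithmetic. The sketch below uses the 6200-cell and 7680-cell figures from the slide. The 3520-cell capacity for the HX-4K is an assumption taken from the nominal datasheet figure; it does not appear in the slides.

```python
S64X7_CELLS = 6200       # "almost 6200" per the slide, so approximate
DEVICES = {
    "iCE40HX-8K": 7680,  # from the slide
    "iCE40HX-4K": 3520,  # ASSUMPTION: nominal datasheet figure
}


def utilization(cells_used, cells_available):
    """Fraction of the device consumed; above 1.0 means it cannot fit."""
    return cells_used / cells_available


for name, capacity in DEVICES.items():
    u = utilization(S64X7_CELLS, capacity)
    verdict = "fits" if u <= 1.0 else "does not fit"
    print(f"{name}: {u:.0%} of logic cells, {verdict}")
```

At roughly four fifths of the HX-8K, the remaining cells are all that would be left for the bus bridge and any fusion logic.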
Can you use MISC to implement a RISC in a reasonable time-frame?
Will it be small enough to fit in a reasonable-sized FPGA?
Will it fit in an iCE40HX-4K or -8K device?
4K: NO, it's just too big even in the simplest configuration.
8K: YES, but only barely.
Will it fit in a Xilinx FPGA?
Yes; with lots of room to spare.
But then again, for these FPGAs,
I'd just go back to using a 5-stage pipeline design.
Q & A