Created
August 28, 2016 03:13
Presentation material for SVFIG 2016-Aug-27 meeting.
Intel Decodes CISC to RISC,
Can We Decode RISC-V to MISC?
Samuel A. Falvo II
2016-Aug-27
Yes.
IF you are super-careful with instruction set design
and are careful with how you implement your MISC instruction decoder.
BACKGROUND
I need a 64-bit RISC-V CPU to power the Kestrel-3.
I need it NOW if I'm to make the RISC-V Workshop this coming November,
because I still need to implement an external RAM controller and bus bridges
at the very least. I also need to tweak MGIA and/or replace it with CGIA.
Problem is, naive implementations are too big to fit in the iCE40HX-4K FPGA.
Tried to implement the following micro-architectures:
* 5-stage pipeline
  - 32-bit instruction fetch Wishbone port.
  - 64-bit data read/write Wishbone port.
  - Requires additional 64b-to-16b Wishbone bus bridge.
  - Never made it past instruction fetch logic in Verilog.
* Hardware microsequencing
  - Two attempts made.
  - Each attempt failed due to loss of intellectual control
    over the design; things just got too complex too quickly.
  - First attempt was based on 6502's decoder PLA circuitry.
  - Second attempt used horizontal microcode.
    It didn't even make it to a Verilog prototype.
* Stack-ops decode
  - Crude prototype to serve as a proof-of-concept only.
  - Resulted in S64X7 64-bit MISC CPU.
  - Implemented in three days, sans interrupt support.
  - Initial design NOT intended to be fast, only correct.
STACK-OPS DECODE EXAMPLES
ADDI Xd, Xs, 1234
--------------------------------------
Xs RA! RF@ LIT(1234) + Xd RF! NEXT

JAL Xd, 1234
---------------------------------------
PC@ Xd RF! PC@ LIT(1234) + PC! NEXT

CSRRS Xd, Xs, mtime
------------------------------------
LIT(mtime) CSR@ Xd RF! Xs RA! RF@
LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT
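The expansions above amount to a template lookup keyed on the RISC-V mnemonic. A minimal Python sketch of that idea follows. The `decode` helper, its parameters, and the use of plain strings for stack ops are all assumptions made for illustration; none of this is taken from the S64X7 sources.

```python
# Hypothetical model of the stack-ops decoder: each RISC-V instruction
# expands into a fixed template of MISC operations ending in NEXT.
# The templates are copied from the three examples in the slides.

def decode(mnemonic, xd, xs=None, imm=None, csr=None):
    """Return the list of stack ops for one RISC-V instruction."""
    if mnemonic == "ADDI":
        return [xs, "RA!", "RF@", f"LIT({imm})", "+", xd, "RF!", "NEXT"]
    if mnemonic == "JAL":
        return ["PC@", xd, "RF!", "PC@", f"LIT({imm})", "+", "PC!", "NEXT"]
    if mnemonic == "CSRRS":
        return [f"LIT({csr})", "CSR@", xd, "RF!", xs, "RA!", "RF@",
                f"LIT({csr})", "CSR@", "OR", f"LIT({csr})", "CSR!", "NEXT"]
    raise ValueError(f"no template for {mnemonic}")

print(" ".join(decode("ADDI", "X5", xs="X6", imm=1234)))
# prints: X6 RA! RF@ LIT(1234) + X5 RF! NEXT
```

A real decoder would do this in hardware from the instruction's bit fields; the sketch only shows that the mapping is a fixed substitution of register numbers and immediates into a per-opcode template.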
DATA FLOW/PIPELINE LAYOUT

        +-------------------+
    +-->| Instruction Fetch |
    |   +-------------------+
    |             |
    |             | (RISC-V instruction)
    |             V
    |   +-------------------+
    |   |  StackOp Decoder  |
    |   +-------------------+
    |             ^
    |             | (Complete MISC program implementing RISC-V semantics)
    |             V
    |   +-------------------+       +---------------+
    +---|     MISC Core     |<----->| Register File |
        +-------------------+       +---------------+
                  ^
                  |
                  V
        +-------------------+
        |  Data Memory I/O  |
        +-------------------+

So, yes, it IS POSSIBLE to do this.
BUT, with a naive MISC instruction decoder,
performance varies from 1/8th to 1/16th
of the MISC instruction execution rate.
Which is surprisingly quick, when you think about it!
That is competitive with the TMS9900 microarchitecture!
Just not fast enough to meet my requirements.
I need about 4x this instruction execution rate
to meet my desired usability goals.
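The 1/8th to 1/16th range can be sanity-checked by counting the stack ops in the decode examples. The sketch below assumes every whitespace-separated token, including register numbers and literals, costs one MISC cycle; that cost model is my assumption, not a measurement of the S64X7.

```python
# Stack-op sequences copied from the decode examples.
SEQUENCES = {
    "ADDI": "Xs RA! RF@ LIT(1234) + Xd RF! NEXT",
    "JAL": "PC@ Xd RF! PC@ LIT(1234) + PC! NEXT",
    "CSRRS": ("LIT(mtime) CSR@ Xd RF! Xs RA! RF@ "
              "LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT"),
}

def ops_per_instruction(seq):
    # Assumption: one MISC cycle per token, with no fusion at all.
    return len(seq.split())

counts = {name: ops_per_instruction(seq) for name, seq in SEQUENCES.items()}
for name, n in counts.items():
    print(f"{name}: {n} MISC ops, i.e. 1/{n} of the MISC rate")
# ADDI: 8, JAL: 8, CSRRS: 13
```

Under that assumption the three examples land at 1/8 and 1/13 of the MISC rate, which sits inside the quoted range.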
SPEED UP TECHNIQUE: Customized MISC Instructions
You've already seen an example:
- Asymmetric register I/O
  - Register reads need two cycles.
  - Register writes need one cycle.
    RA! ( reg# - )    RF@ ( - n )
    RF! ( n reg# - )
It's long been a staple stack CPU design technique.
See, e.g., WISC CPU/16 in:
    Stack Computers: The New Wave
    -- Philip Koopman.
    https://users.ece.cmu.edu/~koopman/stack_computers
But, this has limitations. Eventually, you'll run out of opcode space.
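To make the stack effects of RA!, RF@ and RF! concrete, here is a toy Python model of the asymmetric register port. The `RegisterPort` class and its method names are hypothetical, and treating X0 as hard-wired zero inside RF! is my assumption about where that RISC-V rule would be enforced; this is not the S64X7 RTL.

```python
class RegisterPort:
    """Toy model of the asymmetric register I/O ops (illustrative only)."""

    def __init__(self):
        self.regs = [0] * 32   # RISC-V integer registers X0..X31
        self.addr = 0          # read-address latch written by RA!
        self.stack = []        # data stack

    def ra_store(self):        # RA! ( reg# - )   read, cycle 1
        self.addr = self.stack.pop()

    def rf_fetch(self):        # RF@ ( - n )      read, cycle 2
        self.stack.append(self.regs[self.addr])

    def rf_store(self):        # RF! ( n reg# - ) write, single cycle
        reg = self.stack.pop()
        value = self.stack.pop()
        if reg != 0:           # assumption: X0 stays zero
            self.regs[reg] = value


# ADDI X5, X6, 1234 expressed with these ops
m = RegisterPort()
m.regs[6] = 100
m.stack.append(6)                              # Xs
m.ra_store()                                   # RA!
m.rf_fetch()                                   # RF@
m.stack.append(1234)                           # LIT(1234)
m.stack.append(m.stack.pop() + m.stack.pop())  # +
m.stack.append(5)                              # Xd
m.rf_store()                                   # RF!
print(m.regs[5])  # prints: 1334
```

The read is split because the register number has to be latched before the value can be fetched, while a write has both the value and the register number on the stack at once.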
SPEED UP TECHNIQUE: Macro-Op Fusion
In English, "decode and execute entire combinations of instructions as one."
This is NOT superscalar execution. We still have only one execution engine.
However, with sufficiently complex instruction decoding, you can come
very close to superscalar performance.
With a 4-way superscalar architecture, and assuming straight-line code,
you can expect an *average* 4x performance boost. Ergo, with the same
constraints in a macro-op fused execution engine, you can achieve close
to the same performance boost.
(This is *particularly* true with a stack CPU, such as a MISC core!)
MACRO-OP FUSION EXAMPLE

ADDI Xd, Xs, 1234
--------------------------------------
Xs RA!  RF@  LIT(1234) +  Xd RF! NEXT
------  ===  -----------  ===========
 (1)    (2)      (3)          (4)

JAL Xd, 1234
---------------------------------------
PC@  Xd RF!  PC@ LIT(1234) +  PC! NEXT
---  ======  ---------------  ========
(1)   (2)          (3)          (4)

CSRRS Xd, Xs, mtime
------------------------------------
LIT(mtime) CSR@  Xd RF!  Xs RA!  RF@
---------------  ======  ------  ===
      (1)         (2)     (3)    (4)
LIT(mtime) CSR@ OR  LIT(mtime) CSR! NEXT
------------------  --------------------
        (5)                 (6)

Throughput is 4 cycles on average; 6 cycles in the worst case.
So, combination of customized instructions PLUS macro-op fusion
yields an *average* performance of 4 clocks per instruction.
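The groupings in the fusion example can be reproduced with a greedy longest-match over a small pattern table. The sketch below is only a software illustration of that idea. The `PATTERNS` table is inferred from the three examples, and the `kind` and `fuse` helpers are hypothetical; an actual decoder would do this in logic, and its fusion rules may differ.

```python
# Hypothetical fusion table: each tuple is a run of stack-op kinds that the
# decoder is assumed to issue in a single cycle (inferred from the examples).
PATTERNS = [
    ("LIT", "CSR!", "NEXT"),
    ("LIT", "CSR@", "OR"),
    ("PC@", "LIT", "+"),
    ("REG", "RF!", "NEXT"),
    ("LIT", "CSR@"),
    ("LIT", "+"),
    ("PC!", "NEXT"),
    ("REG", "RA!"),
    ("REG", "RF!"),
    ("PC@",),
    ("RF@",),
]
PATTERNS.sort(key=len, reverse=True)  # try the longest patterns first


def kind(token):
    """Collapse operands so patterns match any register or literal."""
    if token.startswith("LIT("):
        return "LIT"
    if token.startswith("X"):
        return "REG"
    return token


def fuse(sequence):
    """Split a stack-op sequence into groups, one group per cycle."""
    tokens = sequence.split()
    kinds = [kind(t) for t in tokens]
    groups, i = [], 0
    while i < len(tokens):
        for pat in PATTERNS:
            if tuple(kinds[i:i + len(pat)]) == pat:
                groups.append(" ".join(tokens[i:i + len(pat)]))
                i += len(pat)
                break
        else:
            groups.append(tokens[i])  # unfused op takes its own cycle
            i += 1
    return groups


addi = fuse("Xs RA! RF@ LIT(1234) + Xd RF! NEXT")
jal = fuse("PC@ Xd RF! PC@ LIT(1234) + PC! NEXT")
csrrs = fuse("LIT(mtime) CSR@ Xd RF! Xs RA! RF@ "
             "LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT")
print(len(addi), len(jal), len(csrrs))  # prints: 4 4 6
```

With this table ADDI and JAL each take 4 cycles and CSRRS takes 6, matching the numbered groups in the example.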
To put this into perspective, the Motorola 68000 had an average
throughput of 12 cycles per instruction. While the fastest
instructions required only 4 clocks per instruction, MOST
instructions varied between 8 and 16 clocks, depending on the
addressing mode used and the general class of instruction.
(8+16)/2 = 12. :)
On average, despite all this complexity, a MISC implementation
of the RISC-V would still be about 50% faster (depending on workload)
than an MC68000 at the same clock speed.
RESULTS SO FAR
I developed the S64X7 MISC core. You can find it up on GitHub at:
https://github.com/sam-falvo/S64X7
Features:
* 64-bit wide stacks, address bus, and data path
* 6 data stack entries
* 5 return stack entries -- my first *true* Forth CPU!
* One instruction per clock! No macro-op fusion!
I never got far enough to implement macro-op fusion;
I wanted something that worked correctly but slowly first.
However, I never went beyond the S64X7 design because,
out of 7680 logic cells in an iCE40HX-8K, it uses almost 6200 cells!
This is too big to fit in an HX-4K device, and besides,
I don't know if enough room exists for the 64b-to-16b bus bridge.
It almost certainly won't have enough room for macro-op fusion logic.
(But the core does work!)
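The fit problem is easy to put into numbers. In the sketch below, 7680 and 6200 come from the slide; the 3520-cell figure for the HX-4K is an assumption taken from Lattice's published iCE40 device table, and "almost 6200" is treated as an upper bound.

```python
HX8K_CELLS = 7680    # quoted in the slide
HX4K_CELLS = 3520    # assumption: Lattice's listed logic-cell count
S64X7_CELLS = 6200   # "almost 6200", so treat this as an upper bound


def utilization(used, available):
    """Fraction of the device's logic cells consumed."""
    return used / available


print(f"HX-8K: {utilization(S64X7_CELLS, HX8K_CELLS):.1%} used, "
      f"{HX8K_CELLS - S64X7_CELLS} cells left")
print(f"HX-4K: {utilization(S64X7_CELLS, HX4K_CELLS):.1%} of capacity")
```

Roughly four fifths of the HX-8K is consumed by the core alone, leaving about 1480 cells for the bus bridge and everything else, while the same core would need well over the whole HX-4K.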
CONCLUSION
Can you use MISC to implement a RISC in a reasonable time-frame?
    YES!!
Will it be small enough to fit in a reasonable-sized FPGA?
    YES!!
Will it fit in an iCE40HX-4K or -8K device?
    4K: NO, it's just too big even in the simplest configuration.
    8K: YES, but only barely.
Will it fit in a Xilinx FPGA?
    Yes; with lots of room to spare.
    But then again, for these FPGAs,
    I'd just go back to using a 5-stage pipeline design.
THE END
Q & A