Presentation material for the SVFIG 2016-Aug-27 meeting.
Intel Decodes CISC to RISC,
Can We Decode RISC-V to MISC?
Samuel A. Falvo II
2016-Aug-27
Yes.
IF you are super-careful with your instruction set design,
and with how you implement your MISC instruction decoder.
BACKGROUND
I need a 64-bit RISC-V CPU to power the Kestrel-3.
I need it NOW if I'm to make the RISC-V Workshop this coming November,
because I still need to implement the external RAM controller and bus bridges
at the very least. I also need to tweak MGIA and/or replace it with CGIA.
The problem is, naive implementations are too big to fit in the iCE40HX-4K FPGA.
I tried to implement the following micro-architectures:
* 5-stage pipeline
- 32-bit instruction fetch Wishbone port.
- 64-bit data read/write Wishbone port.
- Requires additional 64b-to-16b Wishbone bus bridge.
- Never made it past instruction fetch logic in Verilog.
* Hardware microsequencing
- Two attempts made.
- Each attempt failed due to loss of intellectual control
over the design; things just got too complex too quickly.
- The first attempt was based on the 6502's decoder PLA circuitry.
- The second attempt used horizontal microcode;
it didn't even make it to a Verilog prototype.
* Stack-ops decode.
- Crude prototype to serve as a proof-of-concept only.
- Resulted in S64X7 64-bit MISC CPU.
- Implemented in three days, sans interrupt support.
- Initial design NOT intended to be fast, only correct.
STACK-OPS DECODE EXAMPLES
ADDI Xd, Xs, 1234
--------------------------------------
Xs RA! RF@ LIT(1234) + Xd RF! NEXT
JAL Xd, 1234
---------------------------------------
PC@ Xd RF! PC@ LIT(1234) + PC! NEXT
CSRRS Xd, Xs, mtime
------------------------------------
LIT(mtime) CSR@ Xd RF! Xs RA! RF@
LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT
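
To make the mapping concrete, here is a rough Verilog sketch of the kind of
dispatch the StackOp decoder performs: a combinational case on the RISC-V
major opcode that selects the start of a canned MISC micro-program in a small
ROM. The signal names and ROM address map are my own inventions for
illustration, not the S64X7's actual implementation.

// Hypothetical sketch: map a RISC-V major opcode to the start address of
// a canned MISC micro-program held in an on-chip ROM.  Names and the
// address map are invented for illustration only.
module stackop_dispatch (
    input  wire [31:0] insn,       // fetched RISC-V instruction
    output reg  [7:0]  uprog_base, // first MISC op of the micro-program
    output reg         illegal     // no micro-program for this opcode
);
    wire [6:0] opcode = insn[6:0];

    always @* begin
        illegal    = 1'b0;
        uprog_base = 8'h00;
        case (opcode)
            7'b0010011: uprog_base = 8'h10;  // OP-IMM (e.g. ADDI)
            7'b1101111: uprog_base = 8'h20;  // JAL
            7'b1110011: uprog_base = 8'h30;  // SYSTEM (e.g. CSRRS)
            default:    illegal    = 1'b1;   // trap on anything undecoded
        endcase
    end
endmodule

In the naive decoder, those canned micro-programs are then executed one MISC
op per clock, substituting fields like Xs, Xd, and the immediate as they go;
that is where the 1/8th-to-1/16th figure below comes from.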
DATA FLOW/PIPELINE LAYOUT
       +-------------------+
   +-->| Instruction Fetch |
   |   +-------------------+
   |             |
   |             | (RISC-V instruction)
   |             V
   |   +-------------------+
   |   |  StackOp Decoder  |
   |   +-------------------+
   |             ^
   |             | (Complete MISC program implementing RISC-V semantics)
   |             V
   |   +-------------------+       +---------------+
   +---|     MISC Core     |<----->| Register File |
       +-------------------+       +---------------+
                 ^
                 |
                 V
       +-------------------+
       |  Data Memory I/O  |
       +-------------------+
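
As a rough sketch of how the boxes above might connect, here is a hypothetical
port list for the StackOp Decoder stage. The bus widths and handshake signals
are assumptions on my part, not taken from the S64X7 source.

// Hypothetical port list for the StackOp Decoder block in the diagram above.
// Widths and handshake signals are assumptions, not the S64X7's actual ports.
module stackop_decoder (
    input  wire        clk,
    input  wire        reset,

    // From Instruction Fetch: one RISC-V instruction at a time.
    input  wire [31:0] insn,
    input  wire        insn_valid,

    // To the MISC core: one MISC operation per cycle, pulled on demand.
    output wire [15:0] misc_op,
    output wire        misc_op_valid,
    input  wire        misc_op_ack,   // core asks for the next op

    // Asserted on the last op of the micro-program, so fetch can advance.
    output wire        uprog_done
);
    // ... micro-program ROM and sequencing logic would go here ...
endmodule

The MISC core pulls one op per cycle through this interface; a macro-op fusing
version would widen misc_op so that a whole fused group can cross in one cycle.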
So, yes, it IS POSSIBLE to do this.
BUT, with a naive MISC instruction decoder,
performance is only 1/8th to 1/16th
of the native MISC instruction execution rate.
Which is surprisingly quick, when you think about it!
That is competitive with the TMS9900 microarchitecture!
Just not fast enough to meet my requirements.
I need about 4x this instruction execution rate
to meet my desired usability goals.
SPEED UP TECHNIQUE: Customized MISC Instructions
You've already seen an example:
- Asymmetric register I/O
- Register reads need two cycles.
- Register writes need one cycle.
RA! ( reg# - ) RF@ ( - n )
RF! ( n reg# - )
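
Here is a minimal Verilog sketch of the asymmetric register port described
above, with my own signal names and an assumed 32 x 64-bit file: RA! latches
the read address on one cycle, RF@ sees the data the next cycle, and RF!
writes in a single cycle.

// Minimal sketch of an asymmetric register-file port (invented signal names).
// A read takes two cycles: RA! latches the address, RF@ consumes the data
// the following cycle.  A write (RF!) completes in a single cycle.
module rf_asym (
    input  wire        clk,
    input  wire        ra_latch,   // RA!  : capture read address
    input  wire [4:0]  ra,
    input  wire        wr_en,      // RF!  : single-cycle write
    input  wire [4:0]  wa,
    input  wire [63:0] wd,
    output wire [63:0] rd          // RF@  : valid the cycle after RA!
);
    reg [63:0] regs [0:31];
    reg [4:0]  ra_q;

    always @(posedge clk) begin
        if (ra_latch) ra_q <= ra;
        if (wr_en)    regs[wa] <= wd;
    end

    assign rd = regs[ra_q];
endmodule

On the iCE40, this maps naturally onto a block RAM, whose synchronous read
port likewise delivers data one cycle after the address is presented, while
the write completes in the cycle it is issued.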
Asymmetric register access has long been a staple stack-CPU design technique.
See, e.g., the WISC CPU/16 in:
Stack Computers: The New Wave
-- Philip Koopman.
https://users.ece.cmu.edu/~koopman/stack_computers
But, this has limitations. Eventually, you'll run out of opcode space.
SPEED UP TECHNIQUE: Macro-Op Fusion
In English, "decode and execute entire combinations of instructions as one."
This is NOT superscalar execution. We still have only one execution engine.
However, with sufficiently complex instruction decoding, you can come
very close to superscalar performance.
With a 4-way superscalar architecture, and assuming straight-line code,
you can expect an *average* 4x performance boost. Ergo, with the same
constraints in a macro-op fused execution engine, you can achieve close
to the same performance boost.
(This is *particularly* true with a stack CPU, such as a MISC core!)
MACRO-OP FUSION EXAMPLE
ADDI Xd, Xs, 1234
--------------------------------------
Xs RA! RF@ LIT(1234) + Xd RF! NEXT
------ === ---------------- =========
 (1)   (2)       (3)           (4)
JAL Xd, 1234
---------------------------------------
PC@ Xd RF! PC@ LIT(1234) + PC! NEXT
--- ====== --------------- ==========
(1)  (2)         (3)          (4)
CSRRS Xd, Xs, mtime
------------------------------------
LIT(mtime) CSR@ Xd RF! Xs RA! RF@
--------------- ====== ------ ===
      (1)         (2)   (3)   (4)
LIT(mtime) CSR@ OR LIT(mtime) CSR! NEXT
------------------ --------------------
        (5)                 (6)
Throughput is 4 cycles on average; 6 cycles in the worst case.
So, the combination of customized instructions PLUS macro-op fusion
yields an *average* performance of 4 clocks per RISC-V instruction.
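
As a sketch of what the fusion itself might look like in hardware, the
fragment below peeks at the next two MISC ops and, when it sees a literal push
followed by an ALU add (the LIT(1234) + pattern above), collapses them into
one control word that executes in a single cycle. The opcode encodings, field
layout, and signal names are hypothetical.

// Hypothetical macro-op fusion: if the next MISC op pushes a literal and the
// op after it is an ALU add, issue both as one control word in one cycle.
module fuse_lit_alu (
    input  wire [15:0] op0,        // next MISC op in the micro-program
    input  wire [15:0] op1,        // the op after that
    output wire        fuse,       // true: issue both ops this cycle
    output wire [63:0] lit_value,  // sign-extended literal for the ALU
    output wire        alu_add     // tell the ALU to add this cycle
);
    // Opcode encodings and field layout are invented for illustration.
    localparam OP_LIT = 4'h1;      // push a 12-bit literal onto the data stack
    localparam OP_ADD = 4'h2;      // add the top two data-stack items

    wire [3:0] op0_code = op0[15:12];
    wire [3:0] op1_code = op1[15:12];

    assign fuse      = (op0_code == OP_LIT) && (op1_code == OP_ADD);
    assign lit_value = {{52{op0[11]}}, op0[11:0]};  // sign-extend the literal
    assign alu_add   = fuse;
endmodule

A fused decoder needs a recognizer like this for every group boundary shown
above, which is exactly the "sufficiently complex instruction decoding"
referred to earlier.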
To put this into perspective, the Motorola 68000 had an average
throughput of 12 cycles per instruction. While the fastest
instructions required only 4 clocks, MOST instructions varied
between 8 and 16 clocks, depending on the addressing mode used
and the general class of instruction.
(8+16)/2 = 12. :)
On average, despite all this complexity, a MISC implementation
of the RISC-V ISA would still be about 50% faster (depending on workload)
than an MC68000 at the same clock speed.
RESULTS SO FAR
I developed the S64X7 MISC core. You can find it on GitHub at:
https://github.com/sam-falvo/S64X7
Features:
* 64-bit wide stacks, address bus, and data path
* 6 data stack entries
* 5 return stack entries -- my first *true* Forth CPU!
* One instruction per clock! No macro-op fusion!
I never got far enough to implement macro-op fusion;
I wanted something that worked correctly but slowly first.
However, I never went beyond the S64X7 design because,
out of 7680 logic cells in an iCE40HX-8K, it uses almost 6200 cells!
This is too big to fit in an HX-4K device, and besides,
I don't know if enough room exists for the 64b-to-16b bus bridge.
There almost certainly won't be enough room for macro-op fusion logic.
(But the core does work!)
CONCLUSION
Can you use MISC to implement a RISC in a reasonable time-frame?
YES!!
Will it be small enough to fit in a reasonable-sized FPGA?
YES!!
Will it fit in an iCE40HX-4K or -8K device?
4K: NO, it's just too big even in the simplest configuration.
8K: YES, but only barely.
Will it fit in a Xilinx FPGA?
Yes; with lots of room to spare.
But then again, for these FPGAs,
I'd just go back to using a 5-stage pipeline design.
THE END
Q & A