zeldin/softcore_shootout.md

## softcore_shootout.md

      
This is a comparison between an AVR softcore and the picorv32 when used in
a real application. The numbers are for the entire application, not for the cores in isolation.
Only minimal changes were made to the HDL and firmware code to adapt to the new core: the
peripherals were updated to handle a 32-bit bus and the interrupt handling was adapted to
picorv32; otherwise the code is identical. The same optimization flags (-Os -mrelax)
were used with both cores.

The following were the relevant design constraints:

- Synthesizable on an iCE40 HX8K
- fCPU = 17 MHz
- CPU memory size: 12 K (8K program + 4K data on AVR)

Numbers in parentheses are ratios relative to the reference (AVR) value. Lower is better for all metrics except fCPU_MAX.


| Metric | AVR | picorv32 RV32I | picorv32 RV32IC | picorv32 RV32E | picorv32 RV32EC |
|---|---|---|---|---|---|
| LUTs | 5508 / 7680 | 3531 / 7680 (0.64) | 3789 / 7680 (0.69) | 3438 / 7680 (0.62) | 3816 / 7680 (0.69) |
| BRAMs | 28 / 32 | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) |
| FW size | 7594 + 3092 = 10686 | **12412 (1.16)** | 9828 (0.92) | 12096 (1.13) | 9648 (0.90) |
| fCPU_MAX | 24.52 MHz | 24.11 MHz (0.98) | 19.24 MHz (0.78) | 23.14 MHz (0.94) | 18.40 MHz (0.75) |
| CPI | 1.5 | 4 (2.66) | 4.5 (3.00) | 4 (2.66) | 4 (2.66) |
| Runtime | 3437690 µs | 4848353 µs (1.40) | 4985692 µs (1.45) | 4808335 µs (1.40) | 4930170 µs (1.43) |
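
The parenthesized ratios are each value divided by the AVR reference. A minimal Python sketch for the LUT row is shown here; the function name `relative_to_reference` is only illustrative, and rounding to two decimals is an assumption (a few entries in the table, such as 4 / 1.5 listed as 2.66, look truncated rather than rounded).

```python
def relative_to_reference(value, reference):
    """Return value / reference, rounded to two decimals (rounding is assumed)."""
    return round(value / reference, 2)


# LUT counts taken from the table; AVR is the reference.
AVR_LUTS = 5508
picorv32_luts = {"RV32I": 3531, "RV32IC": 3789, "RV32E": 3438, "RV32EC": 3816}

lut_ratios = {
    name: relative_to_reference(luts, AVR_LUTS)
    for name, luts in picorv32_luts.items()
}
print(lut_ratios)
```

The same helper applies to any other row, as long as the AVR value for that metric is passed as the reference.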


Conclusions:

- picorv32 uses far fewer LUTs than the AVR core. On the downside, it uses more BRAMs, which are a scarce resource on the HX8K.
- Code density of RV32 is worse than AVR unless the "C" extension is used, in which case it is better.
- The RV32I build failed to meet the memory size design constraint.
- To reclaim the performance lost to the higher CPI, it would probably be necessary to change the design to double fCPU to 34 MHz. However, this is higher than the reported fCPU_MAX.
- It is rather surprising that the fCPU_MAX reported by nextpnr is lower for picorv32 than for the AVR core, considering that picorv32 claims to be designed for high frequencies. Maybe using the look-ahead memory interface can help increase the maximum clock?
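
The two constraint checks behind these conclusions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "12 K" is taken to mean 12 × 1024 = 12288 bytes, the firmware size is assumed to be the only thing counted against that limit, and the helper names are made up for this example. The firmware sizes and fCPU_MAX values are the picorv32 figures from the table.

```python
MEMORY_LIMIT_BYTES = 12 * 1024   # assumption: "12 K" means 12 KiB
TARGET_FCPU_MHZ = 34.0           # the doubled clock discussed above

# (firmware size in bytes, fCPU_MAX in MHz) for each picorv32 build
picorv32_builds = {
    "RV32I": (12412, 24.11),
    "RV32IC": (9828, 19.24),
    "RV32E": (12096, 23.14),
    "RV32EC": (9648, 18.40),
}


def fits_memory(fw_size_bytes):
    """True if the firmware fits in the assumed 12 KiB CPU memory."""
    return fw_size_bytes <= MEMORY_LIMIT_BYTES


def meets_timing(fcpu_max_mhz, target_mhz=TARGET_FCPU_MHZ):
    """True if the reported maximum clock reaches the target clock."""
    return fcpu_max_mhz >= target_mhz


for name, (fw_size, fcpu_max) in picorv32_builds.items():
    print(f"{name}: fits memory={fits_memory(fw_size)}, "
          f"reaches {TARGET_FCPU_MHZ:.0f} MHz={meets_timing(fcpu_max)}")
```

Under these assumptions only the RV32I build exceeds the memory limit, and none of the four builds reaches 34 MHz.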

Part 2:
Next the VexRiscv core was tested. This is more complicated to build because
the Verilog code is autogenerated by a Scala program. In return it supports many options to balance
performance against core size. It turns out that enabling all "bypass" and "early" options yields a core with
a size and CPI similar to those of the AVR while allowing a higher frequency. A full
barrel shifter and an iterative mul/div plugin with unroll factor 2 were also included in the cores tested below.
The runtime figures are for an fCPU of 34 MHz.


| Metric | AVR | VexRiscv RV32IM | VexRiscv RV32IMC | VexRiscv RV32EM | VexRiscv RV32EMC |
|---|---|---|---|---|---|
| LUTs | 5508 / 7680 | 4796 / 7680 (0.87) | 4982 / 7680 (0.90) | 4805 / 7680 (0.87) | 5004 / 7680 (0.91) |
| BRAMs | 28 / 32 | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) |
| FW size | 7594 + 3092 = 10686 | 12056 (1.12) | 9604 (0.90) | 11780 (1.10) | 9444 (0.88) |
| fCPU_MAX | 24.52 MHz | 40.78 MHz (1.66) | 41.70 MHz (1.70) | 40.69 MHz (1.66) | 39.48 MHz (1.61) |
| CPI | 1.5 | 1.57 (1.04) | 1.57 (1.04) | 1.57 (1.04) | 1.57 (1.04) |
| Runtime | 3437690 µs | 3159388 µs (0.92) | 3169157 µs (0.92) | 3154022 µs (0.92) | 3165122 µs (0.92) |

Conclusions:

- By doubling fCPU from 17 MHz to 34 MHz, a net performance gain was achieved compared to AVR. FW size (when using RVC) and FPGA resource utilization (apart from BRAMs) are also slightly better.
- Multiplication of 8-bit numbers is slower than on AVR, since the MulDivIterativePlugin takes 16 cycles at unroll factor 2 (corresponding to 8 AVR cycles, due to the higher fCPU), while the AVR has single-cycle multiplication. Increasing the unroll factor above 2 would have a negative effect on fCPU_MAX.
- The VexRiscv core supports the standard RISC-V interrupt model, which means that GCC's `__attribute__((interrupt))` can be used, resulting in a smaller code footprint than on picorv32.
- A higher fCPU_MAX could be achieved with a CPU memory whose size is a power of 2. However, the application does not fit in 8K, and there is not enough BRAM in the HX8K to make it 16K (since a few BRAMs are needed for other purposes).
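
The "16 cycles corresponds to 8 AVR cycles" remark above is a wall-clock conversion between the two clock rates. A small Python sketch of that arithmetic follows; the helper name `cycles_to_ns` is hypothetical, and the cycle counts and clock frequencies are the ones quoted in the text.

```python
def cycles_to_ns(cycles, fcpu_mhz):
    """Convert a cycle count at a given clock (MHz) to nanoseconds."""
    return cycles * 1000.0 / fcpu_mhz


VEX_FCPU_MHZ = 34.0   # VexRiscv clock used for the runtime figures
AVR_FCPU_MHZ = 17.0   # AVR reference clock

# Iterative multiply: 16 cycles at unroll factor 2 (figure from the text).
vex_mul_ns = cycles_to_ns(16, VEX_FCPU_MHZ)

# The same wall-clock time expressed in AVR cycles at 17 MHz.
avr_equivalent_cycles = vex_mul_ns / cycles_to_ns(1, AVR_FCPU_MHZ)

print(f"VexRiscv multiply: {vex_mul_ns:.1f} ns, "
      f"about {avr_equivalent_cycles:.1f} AVR cycles")
```

Against a single-cycle AVR multiply, the iterative multiplier is therefore roughly eight times slower in wall-clock terms, even after the clock has been doubled.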

| Metric | picorv32 RV32EC w/ look-ahead |
|---|---|
| LUTs | 3965 / 7680 (0.72) |
| BRAMs | 32 / 32 (1.14) |
| FW size | 9624 (0.90) |
| fCPU_MAX | 40.02 MHz (1.63) |
| CPI | 4 (2.66) |
| Runtime (fCPU = 34 MHz) | 4050206 µs (1.18) |