@zeldin
Last active August 29, 2023 18:32
Comparison between AVR and RISC-V softcore in a real application

This is a comparison between an AVR softcore and the picorv32 when used in a real application. The numbers are for the entire application, not for the cores in isolation. Only minimal changes were made to the HDL and firmware code to adapt to the new core: the peripherals were updated to handle a 32-bit bus, and the interrupt handling was adapted to picorv32. Otherwise the code is identical, and the same optimization flags (-Os -mrelax) were used with both cores.

The following were the relevant design constraints:

  • Synthesizable on an iCE40 HX8K
  • fCPU = 17 MHz
  • CPU memory size: 12 K (8K program + 4K data on AVR)

Numbers in parentheses are ratios relative to the reference (AVR) value. Lower is better for all metrics except fCPU_MAX.

| Metric | AVR | picorv32 RV32I | picorv32 RV32IC | picorv32 RV32E | picorv32 RV32EC |
|---|---|---|---|---|---|
| LUTs | 5508 / 7680 | 3531 / 7680 (0.64) | 3789 / 7680 (0.69) | 3438 / 7680 (0.62) | 3816 / 7680 (0.69) |
| BRAMs | 28 / 32 | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) |
| FW size | 7594 + 3092 = 10686 | **12412 (1.16)** | 9828 (0.92) | 12096 (1.13) | 9648 (0.90) |
| fCPU_MAX | 24.52 MHz | 24.11 MHz (0.98) | 19.24 MHz (0.78) | 23.14 MHz (0.94) | 18.40 MHz (0.75) |
| CPI | 1.5 | 4 (2.66) | 4.5 (3.00) | 4 (2.66) | 4 (2.66) |
| Runtime | 3437690 µs | 4848353 µs (1.40) | 4985692 µs (1.45) | 4808335 µs (1.40) | 4930170 µs (1.43) |

Conclusions:

  • picorv32 uses far fewer LUTs than the AVR core. On the downside, it uses more BRAMs, which are a scarce resource on the HX8K.
  • Code density of RV32 is worse than AVR unless the "C" extension is used, in which case it is better.
  • The RV32I build failed to meet the memory size design constraint.
  • To reclaim the performance lost to the higher CPI, it would probably be necessary to change the design to double fCPU to 34 MHz. However, this is higher than the reported fCPU_MAX.
  • It's rather surprising that the fCPU_MAX reported by nextpnr is lower for picorv32 than for the AVR core, considering that picorv32 claims to be designed for high frequencies. Maybe using the look-ahead memory interface can help increase the max clock?
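
The memory-size conclusion can be checked directly against the FW size row. This is a minimal sketch, assuming that "12 K" means 12 × 1024 bytes and that the FW size figures are in bytes (the table does not state a unit). The names `MEMORY_LIMIT`, `fw_size` and `fits` are made up for the illustration.

```python
# Assumption: "12 K" of CPU memory means 12 * 1024 = 12288 bytes,
# and the FW size row of the table is in bytes.
MEMORY_LIMIT = 12 * 1024

# FW sizes copied from the table above.
# The AVR figure is program (7594) + data (3092), stored in separate memories.
fw_size = {
    "AVR": 7594 + 3092,
    "picorv32 RV32I": 12412,
    "picorv32 RV32IC": 9828,
    "picorv32 RV32E": 12096,
    "picorv32 RV32EC": 9648,
}


def fits(size, limit=MEMORY_LIMIT):
    """Return True if a firmware image of `size` fits within `limit`."""
    return size <= limit


for name, size in fw_size.items():
    headroom = MEMORY_LIMIT - size
    status = "ok" if fits(size) else "too big"
    print(f"{name:16s} {size:6d}  headroom {headroom:+6d}  {status}")
```

Under these assumptions only the RV32I build exceeds the limit, while RV32E fits with little room to spare.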

Part 2:

Next, the VexRiscv core was tested. It is more complicated to build because the Verilog code is autogenerated by a Scala program. In return, it supports many options to balance performance against core size. It turns out that enabling all "bypass" and "early" options yields a core with a size and CPI similar to those of the AVR, while allowing a higher frequency. A full barrel shifter and an iterative mul/div plugin with unroll factor 2 were also included in the cores tested below. The runtime figure is for an fCPU of 34 MHz.

| Metric | AVR | VexRiscv RV32IM | VexRiscv RV32IMC | VexRiscv RV32EM | VexRiscv RV32EMC |
|---|---|---|---|---|---|
| LUTs | 5508 / 7680 | 4796 / 7680 (0.87) | 4982 / 7680 (0.90) | 4805 / 7680 (0.87) | 5004 / 7680 (0.91) |
| BRAMs | 28 / 32 | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) |
| FW size | 7594 + 3092 = 10686 | 12056 (1.12) | 9604 (0.90) | 11780 (1.10) | 9444 (0.88) |
| fCPU_MAX | 24.52 MHz | 40.78 MHz (1.66) | 41.70 MHz (1.70) | 40.69 MHz (1.66) | 39.48 MHz (1.61) |
| CPI | 1.5 | 1.57 (1.04) | 1.57 (1.04) | 1.57 (1.04) | 1.57 (1.04) |
| Runtime | 3437690 µs | 3159388 µs (0.92) | 3169157 µs (0.92) | 3154022 µs (0.92) | 3165122 µs (0.92) |

Conclusions:

  • By doubling fCPU from 17 MHz to 34 MHz, a net performance gain was achieved compared to AVR. FW size (when using RVC) and FPGA resource utilization (apart from BRAMs) are also slightly better.
  • Multiplication of 8-bit numbers is slower than on AVR, since the MulDivIterativePlugin takes 16 cycles at unroll factor 2 (corresponding to 8 AVR cycles, due to the higher fCPU), while the AVR has single-cycle multiplication. Increasing the unroll factor above 2 would have a negative effect on fCPU_MAX.
  • The VexRiscv core supports the standard RISC-V interrupt model, which means that GCC's __attribute__((interrupt)) can be used, resulting in a smaller code footprint than on picorv32.
  • A higher fCPU_MAX could be achieved with a CPU memory whose size is a power of 2. However, the application does not fit in 8K, and there is not enough BRAM in the HX8K to make it 16K (since a few BRAMs are needed for other purposes).
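
The "16 cycles corresponds to 8 AVR cycles" figure in the multiplication point is a plain clock-ratio conversion. A small sketch of that arithmetic follows; the helper name `equivalent_cycles` is hypothetical, and only the cycle counts and clock frequencies come from the text above.

```python
def equivalent_cycles(cycles, f_from_mhz, f_to_mhz):
    """Number of cycles at f_to_mhz that take the same wall-clock time
    as `cycles` cycles at f_from_mhz."""
    return cycles * f_to_mhz / f_from_mhz


def duration_ns(cycles, f_mhz):
    """Wall-clock duration of `cycles` cycles at f_mhz, in nanoseconds."""
    return cycles * 1000.0 / f_mhz


# 16 VexRiscv cycles at 34 MHz expressed in 17 MHz AVR cycles.
mul_in_avr_cycles = equivalent_cycles(16, 34, 17)
print(mul_in_avr_cycles)  # 8.0

# The same comparison in wall-clock time.
print(f"VexRiscv multiply: {duration_ns(16, 34):.1f} ns")
print(f"AVR multiply:      {duration_ns(1, 17):.1f} ns")
```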

zeldin commented Aug 9, 2020

Update:
With the CPU memory using the look-ahead interface, fCPU_MAX doubled, which allowed me to reach fCPU = 34 MHz (2x).
The result is still slower than the AVR though. Ideally I'd like to run the CPU core at the same speed as the rest of the design, which is 68 MHz...

| Metric | picorv32 RV32EC w/ look-ahead |
|---|---|
| LUTs | 3965 / 7680 (0.72) |
| BRAMs | 32 / 32 (1.14) |
| FW size | 9624 (0.90) |
| fCPU_MAX | 40.02 MHz (1.63) |
| CPI | 4 (2.66) |
| Runtime (fCPU = 34 MHz) | 4050206 µs (1.18) |
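
To put the update next to the earlier 17 MHz RV32EC run, the runtimes quoted in this document can be compared directly. This is only a rough sketch: the two picorv32 builds differ in more than clock speed (the memory interface and, slightly, the firmware size), and the constant and function names are made up for the illustration.

```python
# Runtimes in microseconds, copied from the tables in this document.
AVR_RUNTIME_US = 3437690    # AVR at 17 MHz
RV32EC_17MHZ_US = 4930170   # picorv32 RV32EC at 17 MHz, original memory interface
RV32EC_34MHZ_US = 4050206   # picorv32 RV32EC at 34 MHz, look-ahead interface


def ratio(value, reference):
    """Ratio of a measurement to a reference measurement."""
    return value / reference


vs_avr = ratio(RV32EC_34MHZ_US, AVR_RUNTIME_US)
speedup = ratio(RV32EC_17MHZ_US, RV32EC_34MHZ_US)

print(f"runtime relative to AVR: {vs_avr:.2f}")
print(f"speedup from 17 MHz to 34 MHz: {speedup:.2f}x")
```

Doubling the clock gives well under a 2x speedup here, which suggests that part of the runtime does not scale with fCPU, although nothing above says what that part is.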
