
@camel-cdr
Created July 28, 2023 23:25
rvv shishua
.global shishua_rvv # void shishua_rvv (uint64_t state[4], void *dest, size_t n)
shishua_rvv:
# load state (can easily be expanded to state[8] or state[16])
vsetvli t6, x0, e64, m2, ta, ma
ld a4, 0(a0)
vmv.v.x v0, a4
ld a4, 8(a0)
vmv.v.x v4, a4
ld a4, 16(a0)
vmv.v.x v8, a4
ld a4, 24(a0)
vmv.v.x v12, a4
li t0, 0xbf58476d1ce4e5b9 # splitmix mul 1
li t1, 0x94d049bb133111eb # splitmix mul 2
.macro rvv_rand_splitmix64 v, t
# mix in vid
vsetvli t6, x0, e8, m8, ta, ma
vid.v \t
vsetvli t6, x0, e16, m8, ta, ma
vadd.vv \v, \v, \t
vmul.vx \v, \v, t0
vid.v \t
vsetvli t6, x0, e64, m8, ta, ma
vadd.vv \v, \v, \t
vmul.vx \v, \v, t1
# warmup with splitmix64
vsrl.vi \t, \v, 30
vxor.vv \v, \v, \t
vmul.vx \v, \v, t0
vsrl.vi \t, \v, 27
vxor.vv \v, \v, \t
vmul.vx \v, \v, t1
vsrl.vi \t, \v, 31
vxor.vv \v, \v, \t
.endm
# hash both LMUL=8 state register groups
rvv_rand_splitmix64 v0, v16
rvv_rand_splitmix64 v8, v16
# almost shishua
1:
# shuffle
vsetvli x0, a2, e32, m8, ta, ma
vmv.x.s a4, v0
vslide1down.vx v16, v0, a4
vmv.x.s a4, v8
vslide1down.vx v24, v8, a4
# shift
vsetvli x0, x0, e64, m8, ta, ma
vsrl.vi v0, v0, 1
vsrl.vi v8, v8, 3
# add
vadd.vv v8, v8, v24
vxor.vv v24, v24, v0
vadd.vv v0, v0, v16
# store
vsetvli a3, x0, e8, m8, ta, ma
vse8.v v24, (a1)
add a1, a1, a3
sub a2, a2, a3
bnez a2, 1b
# reduce back to uint64_t state[4]
vsetvli t6, x0, e64, m8, ta, ma
vredsum.vs v0, v0, v0
vredxor.vs v8, v8, v8
vmv.x.s a4, v0
sd a4, 0(a0)
vmv.x.s a4, v4
sd a4, 8(a0)
vmv.x.s a4, v8
sd a4, 16(a0)
vmv.x.s a4, v12
sd a4, 24(a0)
ret
camel-cdr commented Jul 28, 2023

This is a proof of concept implementation of the shishua PRNG for RVV.

Overview

The difficult part here is that we don't have a fixed vector length, but still want to know the size of the state at compile time.

My solution to this is to use a smaller LMUL to broadcast each state word across the whole vector register group. This obviously doesn't work on its own, because many elements end up with the same value. To fix this, vid is mixed into the initial vector state, which assigns each vector element its index (0, 1, 2, 3, ...). That alone doesn't buy us much, so we hash the result to disperse the state as much as possible before entering the main loop.
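As a scalar sketch, each 64-bit lane of a broadcast state word roughly ends up offset by its lane index and run through the splitmix64 finalizer (the constants and shift amounts match the listing above; in the actual assembly, vid is also mixed in at 8-bit and 16-bit element widths and multiplied by the constants first, which this sketch omits, and the helper name is just illustrative):

#include <stdint.h>

/* Illustrative only: roughly what lane i of a broadcast state word
 * looks like after the initialization hash. */
static uint64_t init_lane(uint64_t state_word, uint64_t i)
{
    uint64_t z = state_word + i;                 /* broadcast value + lane index */
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull; /* splitmix64 finalizer */
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
    return z ^ (z >> 31);
}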

The main loop is slightly different from the reference shishua implementation: instead of a full shuffle, it uses a rotate by one 32-bit element (vslide1down); see shishua_rvv above.

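In scalar terms, one round of the loop roughly does the following (a hedged sketch, assuming a little-endian view of the two LMUL=8 register groups as N 64-bit lanes; N, one_round and rot32_down are illustrative names, and the real loop's tail handling is omitted):

#include <stdint.h>
#include <string.h>

#define N 16 /* 64-bit lanes per LMUL=8 group; 16 corresponds to VLEN=128 */

/* Rotate a register group down by one 32-bit element, which is what the
 * vmv.x.s + vslide1down.vx pair does at e32. */
static void rot32_down(uint64_t dst[N], const uint64_t src[N])
{
    uint32_t tmp[2 * N];
    memcpy(tmp, src, sizeof tmp);
    uint32_t first = tmp[0];
    memmove(tmp, tmp + 1, (2 * N - 1) * sizeof(uint32_t));
    tmp[2 * N - 1] = first;
    memcpy(dst, tmp, sizeof tmp);
}

/* One round: updates the two state groups a and b, writes N*8 output bytes. */
static void one_round(uint64_t a[N], uint64_t b[N], uint64_t out[N])
{
    uint64_t ra[N], rb[N];
    rot32_down(ra, a);          /* "shuffle" */
    rot32_down(rb, b);
    for (size_t i = 0; i < N; i++) {
        a[i] >>= 1;             /* shift */
        b[i] >>= 3;
        b[i] += rb[i];          /* add rotated b back in */
        out[i] = rb[i] ^ a[i];  /* output = rotated b ^ shifted a */
        a[i] += ra[i];          /* add rotated a back in */
    }
}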

PractRand

Even when reducing the state from LMUL=8 to LMUL=1 (an 8x reduction), and using a 16-bit shuffle and 32-bit add/shift, it managed to pass the first 64 GB of PractRand.

I wasn't able to execute this fast enough on my RISC-V board, and qemu is also quite slow, so I could only measure up to 64 GB.

Edit: I was able to let it run a bit further, and after 128 GB PractRand got suspicious, so I assume it will fail after a few more GB; but this applies to the massively scaled-down version. I also let a full-scale version run up to 128 GB, and it didn't even get an unusual evaluation. I think it's fair to assume that the full-scale version is pretty safe to use for non-cryptographic purposes, especially given that it is so close to the original shishua implementation.

Edit2: I was able to run a few more tests: LMUL=2 with 32-bit addition passes at least 256 GB without any anomalies, and LMUL=1 with 64-bit addition passes at least 64 GB without any anomalies. (I wasn't able to run the tests for longer.)
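For reference, feeding PractRand looked roughly like this (a hedged sketch: the seed, buffer size and RNG_test invocation are assumptions, and the buffer is kept a power of two so it divides into whole register groups):

#include <stdint.h>
#include <stdio.h>

void shishua_rvv(uint64_t state[4], void *dest, size_t n); /* the routine above */

int main(void)
{
    uint64_t state[4] = { 1, 2, 3, 4 }; /* arbitrary seed */
    static uint8_t buf[1 << 16];
    for (;;) {
        shishua_rvv(state, buf, sizeof buf);
        fwrite(buf, 1, sizeof buf, stdout);
    }
}

/* e.g.: ./gen | RNG_test stdin64   (exact PractRand invocation may differ) */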

Benchmark

My MangoPi MQ Pro, with a 1 GHz in-order C906 core (128-bit vector length, but a 256-bit vector ALU), managed to generate data at 1.1 GiB/s using the 16-bit shuffle and 32-bit add/shift variant (the C906 doesn't support 64-bit elements).

The fastest other PRNG I could find for the platform was xoshiro256+, which got 0.5 GiB/s. I suspect that this implementation will be much faster on future CPUs, because the C906 seems to have a big per-instruction overhead for vector instructions (e.g. vadd, vmadd and vmulh take the same amount of time, even though vadd could be implemented much faster).
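A minimal way to get such a throughput number is a sketch along these lines (buffer size and iteration count are arbitrary; assumes POSIX clock_gettime):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

void shishua_rvv(uint64_t state[4], void *dest, size_t n);

int main(void)
{
    uint64_t state[4] = { 1, 2, 3, 4 };
    static uint8_t buf[1 << 20];       /* 1 MiB per call */
    size_t iters = 4096;               /* 4 GiB total */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        shishua_rvv(state, buf, sizeof buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.2f GiB/s\n", iters * (double)sizeof buf / s / (1024.0 * 1024.0 * 1024.0));
}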

Edit: I was now able to run this on a C920 1.868 GHz out-of-order core, and I got 5 GiB/s while xoshiro256+ got 2.4 GiB/s.
