The implementation complexity of `vcompress.vm` for large vector lengths and higher LMUL has been somewhat debated.
Existing implementations exhibit very poor scaling for larger operands; the table below shows cycles per `vcompress.vm` at SEW=8 for LMUL=1 to 8:
| core   | VLEN | e8m1 | e8m2 | e8m4 | e8m8 |
|--------|------|------|------|------|------|
| c906   | 128  | 3    | 10   | 32   | 136  |
| c908   | 128  | 3    | 10   | 32   | 139  |
| c920   | 128  | 0.5  | 2.4  | 5.4  | 20.0 |
| X60    | 256  | 3    | 10   | 32   | 139  |
| bobcat | 256  | 32   | 64   | 132  | 260  |
| x280   | 512  | 65   | 129  | 257  | 513  |
*bobcat: note that it was explicitly stated that they didn't optimize the permutation instructions.
*x280: the numbers are from llvm-mca. Since the x280 mostly targets AI workloads, there probably wasn't much incentive to optimize this.
Below is a proof-of-concept LMUL>=1 `vcompress.vm` implementation, built from existing LMUL=1 RVV primitives, whose execution time scales linearly with LMUL.
It can be used to implement any `vcompress.vm` whose operand width is a multiple of the lane width, using just in-lane primitives.
Prelude

```c
#include <stdio.h>
#include <stdint.h>

#define VLEN 4      // elements per LMUL=1 register in this model
#define VLEN_BITS 2 // log2(VLEN)
#define LMUL 8

// "LMUL=1" primitives that should already exist in a hardware implementation
#define U(n) unsigned _BitInt(n)

typedef struct { U(8) at[VLEN]; } V1; // one LMUL=1 vector register (SEW=8)
typedef U(VLEN) M1;                   // one LMUL=1 mask register

// vcompress.vm: pack the elements selected by m to the front, fill the tail with -1
V1 vcompress(V1 s, M1 m) {
	size_t i = 0, o = 0;
	for (; i < VLEN; ++i) if ((m >> i) & 1) s.at[o++] = s.at[i];
	for (; o < VLEN; ++o) s.at[o] = -1;
	return s;
}
// vmv.v.x: broadcast a scalar to all elements
V1 vmv(U(8) x) { V1 s; for (size_t i = 0; i < VLEN; ++i) s.at[i] = x; return s; }
// vslidedown.vx: d[i] = s[i+off], filling the top with -1
V1 vslidedown(V1 s, size_t off) { for (size_t i = 0; i < VLEN; ++i) s.at[i] = i+off < VLEN ? s.at[i+off] : -1; return s; }
// vslideup.vx: d[i] = s[i-off] for i >= off, elements below off are left unchanged
V1 vslideup(V1 d, V1 s, size_t off) { for (size_t i = off; i < VLEN; ++i) d.at[i] = s.at[i-off]; return d; }
// vcpop.m: number of set mask bits
U(VLEN_BITS+1) vcpop(M1 m) { return __builtin_popcountll(m); }
// implementing LMUL=8 vcompress using the above LMUL=1 primitives

typedef struct { V1 at[LMUL]; } Vx; // LMUL=8 vector register group
typedef struct { M1 at[LMUL]; } Mx; // mask, split per LMUL=1 register

Vx vcompress_mx(Vx vs, Mx m) {
	Vx vd;
	V1 tmp;
	U(5) idx = 0;           // destination register currently being filled
	U(VLEN_BITS+1) off = 0; // number of elements already placed in vd[idx]
	// Initially "clear" the destination. Your hardware might have an easy
	// way of doing this, otherwise it could also be done in the main
	// loop. For tail-undisturbed, just copy vs instead.
	for (size_t i = 0; i < LMUL; ++i)
		vd.at[i] = vmv(-1);
	// one iteration ("cycle") per LMUL=1 source register
	for (size_t i = 0; i < LMUL; ++i) {
		tmp = vcompress(vs.at[i], m.at[i]);
		// append the compacted elements at position off of vd[idx],
		// spilling the overflow into the bottom of vd[idx+1]
		vd.at[idx] = vslideup(vd.at[idx], tmp, off);
		if (idx+1 < LMUL) // hardware would simply suppress this out-of-range write
			vd.at[idx+1] = vslidedown(tmp, VLEN-off);
		off += vcpop(m.at[i]);
		idx += off >> VLEN_BITS; // carry into the next destination register
		off = off << 1 >> 1;     // clear the carry bit
	}
	// In a real implementation, the vcompress into tmp and the write to the
	// destination may need to be pipelined, but that should be trivial. It
	// might also be advantageous to add special hardware to split tmp
	// between the two destination registers.
	return vd;
}
```
Example test

```c
void print_mx(Vx v) {
	for (size_t i = 0; i < LMUL; ++i)
		for (size_t j = 0; j < VLEN; ++j)
			printf("%02x ", (uint8_t)v.at[i].at[j]);
	puts("");
}

int main(void) {
	Vx v = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };
	Mx m = { 0b1111, 0b0000, 0b1010, 0b0101, 0b1001, 0b0110, 0b1110, 0b0111 };
	print_mx(v);
	v = vcompress_mx(v, m);
	print_mx(v);
}
```
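With the model parameters above (VLEN=4, LMUL=8), the second print should show the 18 selected elements packed to the front, followed by the agnostic -1 (0xff) fill:

```
00 01 02 03 09 0b 0c 0e 10 13 15 16 19 1a 1b 1c 1d 1e ff ff ff ff ff ff ff ff ff ff ff ff ff ff
```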
Godbolt: https://godbolt.org/z/Gsvxrff4f
As you can see, this requires only existing `vcompress.vm`, `vslideup.vx`, `vslidedown.vx`, and `vcpop.m` LMUL=1 vector operations (plus a `vmv.v.x` for the initial clear).
With a pipelined vcompress and destination write, as well as special hardware to split the compressed result between the two destination registers, this should allow for a `vcompress.vm` implementation that takes roughly LMUL+1 cycles.
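To make that cycle count concrete, here is a minimal software-pipelined sketch (not part of the original PoC; it reuses the model types and primitives from the Prelude): each loop iteration compresses block i while writing back block i-1, so after a one-iteration prologue the whole operation completes in LMUL+1 steps.

```c
// Sketch only: same semantics as vcompress_mx, restructured so that each
// loop iteration corresponds to one "cycle" of the pipelined schedule.
Vx vcompress_mx_pipelined(Vx vs, Mx m) {
	Vx vd;
	U(5) idx = 0;
	U(VLEN_BITS+1) off = 0;
	for (size_t i = 0; i < LMUL; ++i)
		vd.at[i] = vmv(-1);
	// "cycle 1": only the first vcompress
	V1 tmp = vcompress(vs.at[0], m.at[0]);
	// "cycles 2 to LMUL+1": compress block i while writing back block i-1
	for (size_t i = 1; i <= LMUL; ++i) {
		V1 next = tmp;
		if (i < LMUL)
			next = vcompress(vs.at[i], m.at[i]);
		vd.at[idx] = vslideup(vd.at[idx], tmp, off);
		if (idx+1 < LMUL)
			vd.at[idx+1] = vslidedown(tmp, VLEN-off);
		off += vcpop(m.at[i-1]);
		idx += off >> VLEN_BITS;
		off = off << 1 >> 1;
		tmp = next;
	}
	return vd;
}
```

In hardware the two halves of each iteration would run on separate units, which is what the uop schedule below illustrates.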
One way to implement this on a high-performance out-of-order design would be to split the instruction into uops. For that you would need to spread LMUL=1 `vcompress.vm`, `vslideup.vx`, `vslidedown.vx`, and `vcpop.m` support across separate execution units, such that the steps can execute in parallel. You would also need one hidden/temporary LMUL=1 vector register and 3 hidden/temporary integer registers.
So e.g. for an LMUL=4 `vcompress.vm`:
|         | EXU1 | EXU2 | EXU3 | EXU4 |
|---------|------|------|------|------|
| cycle=1 | `vd[0] = vcompress(vs[0], m[0])` | | | `idx, off = advance_idx_and_off(idx, off, m[0])` |
| cycle=2 | `tmp = vcompress(vs[1], m[1])` | | | `idx, off = advance_idx_and_off(idx, off, m[1])` |
| cycle=3 | `tmp = vcompress(vs[2], m[2])` | `vd[idx] = vslideup(vd[idx], tmp, off)` | `vd[idx+1] = vslidedown(tmp, VLEN-off)` | `idx, off = advance_idx_and_off(idx, off, m[2])` |
| cycle=4 | `tmp = vcompress(vs[3], m[3])` | `vd[idx] = vslideup(vd[idx], tmp, off)` | `vd[idx+1] = vslidedown(tmp, VLEN-off)` | `idx, off = advance_idx_and_off(idx, off, m[3])` |
| cycle=5 | | `vd[idx] = vslideup(vd[idx], tmp, off)` | `vd[idx+1] = vslidedown(tmp, VLEN-off)` | |
The `advance_idx_and_off` step could also be further split into uops and pipelined. The `vslideup` and `vslidedown` writes could also be implemented as a single custom uop and run on a single execution unit, provided the implementation can write two vector registers at once.
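For reference, `advance_idx_and_off` is only named in the schedule above and isn't defined in the PoC; one plausible definition, sketched with the model types from the Prelude (the `IdxOff` struct is just a hypothetical way to return both values in C), mirrors the bookkeeping at the end of the main loop:

```c
// Hypothetical helper matching the schedule above: fold one mask register's
// popcount into the running element offset and destination register index.
typedef struct { U(5) idx; U(VLEN_BITS+1) off; } IdxOff;

IdxOff advance_idx_and_off(U(5) idx, U(VLEN_BITS+1) off, M1 m) {
	off += vcpop(m);         // elements appended by this mask register
	idx += off >> VLEN_BITS; // carry into the next destination register
	off = off << 1 >> 1;     // keep only the in-register offset
	return (IdxOff){ idx, off };
}
```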
This design could also start executing subsequent operations on `vd[i]` once `idx > i`, which should further improve performance, especially for LMUL=8 and LMUL=4.
Copyright 2024 Olaf Bernstein
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.