The implementation complexity of `vcompress.vm` for large vector lengths and higher LMUL has been somewhat debated.
Existing implementations exhibit very poor scaling for larger operands; the table below shows cycles per `vcompress.vm` at SEW=8 for LMUL=1 to 8:
| core   | VLEN | e8m1 | e8m2 | e8m4 | e8m8 |
|--------|------|------|------|------|------|
| c906   | 128  | 3    | 10   | 32   | 136  |
| c908   | 128  | 3    | 10   | 32   | 139  |
| c920   | 128  | 0.5  | 2.4  | 5.4  | 20.0 |
| X60    | 256  | 3    | 10   | 32   | 139  |
| bobcat | 256  | 32   | 64   | 132  | 260  |
| x280   | 512  | 65   | 129  | 257  | 513  |
*bobcat: note that it was explicitly stated that they didn't optimize the permutation instructions.
*x280: the numbers are from llvm-mca. Since the x280 mostly targets AI workloads, there probably wasn't much incentive to optimize this.
Below is a proof-of-concept LMUL>=1 `vcompress.vm` implementation, built from existing LMUL=1 RVV primitives, whose execution time scales linearly with LMUL.
It can be used to implement any `vcompress.vm` whose operand width is a multiple of the lane width, using just in-lane primitives.
Prelude

```c
#include <stdio.h>
#include <stdint.h>

#define VLEN 4      // elements per LMUL=1 register in this model
#define VLEN_BITS 2 // log2(VLEN)
#define LMUL 8

// "LMUL=1" primitives that should already exist in a hardware implementation
#define U(n) unsigned _BitInt(n)

typedef struct { U(8) at[VLEN]; } V1; // one LMUL=1 vector register (SEW=8)
typedef U(VLEN) M1;                   // one LMUL=1 mask register

// vcompress.vm: pack the elements selected by m to the front, fill the tail with -1
V1 vcompress(V1 s, M1 m) {
	size_t i = 0, o = 0;
	for (; i < VLEN; ++i) if ((m >> i) & 1) s.at[o++] = s.at[i];
	for (; o < VLEN; ++o) s.at[o] = -1;
	return s;
}
// vmv.v.x: broadcast a scalar to all elements
V1 vmv(U(8) x) { V1 s; for (size_t i = 0; i < VLEN; ++i) s.at[i] = x; return s; }
// vslidedown.vx: d[i] = s[i+off], filling the top with -1
V1 vslidedown(V1 s, size_t off) { for (size_t i = 0; i < VLEN; ++i) s.at[i] = i+off < VLEN ? s.at[i+off] : -1; return s; }
// vslideup.vx: d[i] = s[i-off] for i >= off, elements below off are left unchanged
V1 vslideup(V1 d, V1 s, size_t off) { for (size_t i = off; i < VLEN; ++i) d.at[i] = s.at[i-off]; return d; }
// vcpop.m: number of set mask bits
U(VLEN_BITS+1) vcpop(M1 m) { return __builtin_popcountll(m); }
// implementing LMUL=8 vcompress using the above LMUL=1 primitives

typedef struct { V1 at[LMUL]; } Vx; // LMUL=8 vector register group
typedef struct { M1 at[LMUL]; } Mx; // mask, split per LMUL=1 register

Vx vcompress_mx(Vx vs, Mx m) {
	Vx vd;
	V1 tmp;
	U(5) idx = 0;           // destination register currently being filled
	U(VLEN_BITS+1) off = 0; // number of elements already placed in vd[idx]
	// Initially "clear" the destination. Your hardware might have an easy
	// way of doing this, otherwise it could also be done in the main
	// loop. For tail-undisturbed, just copy vs instead.
	for (size_t i = 0; i < LMUL; ++i)
		vd.at[i] = vmv(-1);
	// one iteration ("cycle") per LMUL=1 source register
	for (size_t i = 0; i < LMUL; ++i) {
		tmp = vcompress(vs.at[i], m.at[i]);
		// append the compacted elements at position off of vd[idx],
		// spilling the overflow into the bottom of vd[idx+1]
		vd.at[idx] = vslideup(vd.at[idx], tmp, off);
		if (idx+1 < LMUL) // hardware would simply suppress this out-of-range write
			vd.at[idx+1] = vslidedown(tmp, VLEN-off);
		off += vcpop(m.at[i]);
		idx += off >> VLEN_BITS; // carry into the next destination register
		off = off << 1 >> 1;     // clear the carry bit
	}
	// In a real implementation, the vcompress into tmp and the write to the
	// destination may need to be pipelined, but that should be trivial. It
	// might also be advantageous to add special hardware to split tmp
	// between the two destination registers.
	return vd;
}
```
Example test

```c
void print_mx(Vx v) {
	for (size_t i = 0; i < LMUL; ++i)
		for (size_t j = 0; j < VLEN; ++j)
			printf("%02x ", (uint8_t)v.at[i].at[j]);
	puts("");
}

int main(void) {
	Vx v = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };
	Mx m = { 0b1111, 0b0000, 0b1010, 0b0101, 0b1001, 0b0110, 0b1110, 0b0111 };
	print_mx(v);
	v = vcompress_mx(v, m);
	print_mx(v);
}
```
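With the model parameters above (VLEN=4, LMUL=8), the second print should show the 18 selected elements packed to the front, followed by the agnostic -1 (0xff) fill:

```
00 01 02 03 09 0b 0c 0e 10 13 15 16 19 1a 1b 1c 1d 1e ff ff ff ff ff ff ff ff ff ff ff ff ff ff
```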
Godbolt: https://godbolt.org/z/Gsvxrff4f
As you can see, this requires only existing `vcompress.vm`, `vslideup.vx`, `vslidedown.vx`, and `vcpop.m` LMUL=1 vector operations (plus a `vmv.v.x` for the initial clear).
With a pipelined vcompress and destination write, as well as special hardware to split the compressed result between the two destination registers, this should allow for a `vcompress.vm` implementation that takes roughly LMUL+1 cycles.
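To make that cycle count concrete, here is a minimal software-pipelined sketch (not part of the original PoC; it reuses the model types and primitives from the Prelude): each loop iteration compresses block i while writing back block i-1, so after a one-iteration prologue the whole operation completes in LMUL+1 steps.

```c
// Sketch only: same semantics as vcompress_mx, restructured so that each
// loop iteration corresponds to one "cycle" of the pipelined schedule.
Vx vcompress_mx_pipelined(Vx vs, Mx m) {
	Vx vd;
	U(5) idx = 0;
	U(VLEN_BITS+1) off = 0;
	for (size_t i = 0; i < LMUL; ++i)
		vd.at[i] = vmv(-1);
	// "cycle 1": only the first vcompress
	V1 tmp = vcompress(vs.at[0], m.at[0]);
	// "cycles 2 to LMUL+1": compress block i while writing back block i-1
	for (size_t i = 1; i <= LMUL; ++i) {
		V1 next = tmp;
		if (i < LMUL)
			next = vcompress(vs.at[i], m.at[i]);
		vd.at[idx] = vslideup(vd.at[idx], tmp, off);
		if (idx+1 < LMUL)
			vd.at[idx+1] = vslidedown(tmp, VLEN-off);
		off += vcpop(m.at[i-1]);
		idx += off >> VLEN_BITS;
		off = off << 1 >> 1;
		tmp = next;
	}
	return vd;
}
```

In hardware the two halves of each iteration would run on separate units, which is what the uop schedule below illustrates.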
One way to implement this on a high-performance out-of-order design would be to split the instruction into uops. For that you would need to spread LMUL=1 `vcompress.vm`, `vslideup.vx`, `vslidedown.vx`, and `vcpop.m` support across separate execution units, such that the steps can execute in parallel. You would also need one hidden/temporary LMUL=1 vector register and 3 hidden/temporary integer registers.
So e.g. for an LMUL=4 `vcompress.vm`:
|         | EXU1 | EXU2 | EXU3 | EXU4 |
|---------|------|------|------|------|
| cycle=1 | `vd[0] = vcompress(vs[0], m[0])` | | | `idx, off = advance_idx_and_off(idx, off, m[0])` |
| cycle=2 | `tmp = vcompress(vs[1], m[1])` | | | `idx, off = advance_idx_and_off(idx, off, m[1])` |
| cycle=3 | `tmp = vcompress(vs[2], m[2])` | `vd[idx] = vslideup(vd[idx], tmp, off)` | `vd[idx+1] = vslidedown(tmp, VLEN-off)` | `idx, off = advance_idx_and_off(idx, off, m[2])` |
| cycle=4 | `tmp = vcompress(vs[3], m[3])` | `vd[idx] = vslideup(vd[idx], tmp, off)` | `vd[idx+1] = vslidedown(tmp, VLEN-off)` | `idx, off = advance_idx_and_off(idx, off, m[3])` |
| cycle=5 | | `vd[idx] = vslideup(vd[idx], tmp, off)` | `vd[idx+1] = vslidedown(tmp, VLEN-off)` | |
The `advance_idx_and_off` step could also be further split into uops and pipelined. The `vslideup` and `vslidedown` writes could also be implemented as a single custom uop and run on a single execution unit, provided the implementation can write two vector registers at once.
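For reference, `advance_idx_and_off` is only named in the schedule above and isn't defined in the PoC; one plausible definition, sketched with the model types from the Prelude (the `IdxOff` struct is just a hypothetical way to return both values in C), mirrors the bookkeeping at the end of the main loop:

```c
// Hypothetical helper matching the schedule above: fold one mask register's
// popcount into the running element offset and destination register index.
typedef struct { U(5) idx; U(VLEN_BITS+1) off; } IdxOff;

IdxOff advance_idx_and_off(U(5) idx, U(VLEN_BITS+1) off, M1 m) {
	off += vcpop(m);         // elements appended by this mask register
	idx += off >> VLEN_BITS; // carry into the next destination register
	off = off << 1 >> 1;     // keep only the in-register offset
	return (IdxOff){ idx, off };
}
```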
This design could also start executing subsequent operations on `vd[i]` once `idx > i`, which should further improve performance, especially for LMUL=8 and LMUL=4.
Copyright 2024 Olaf Bernstein
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.