@camel-cdr
Last active May 28, 2024 22:41
Implement LMUL=8 vcompress.vm using existing LMUL=1 RVV primitives

The implementation complexity of vcompress.vm for large vector lengths and higher LMULs has been somewhat debated.

Existing implementations exhibit very poor scaling when dealing with larger operands:

        VLEN  e8m1  e8m2  e8m4  e8m8
c906     128     3    10    32   136
c908     128     3    10    32   139
c920     128   0.5   2.4   5.4  20.0
X60      256     3    10    32   139
bobcat   256    32    64   132   260
x280     512    65   129   257   513

*bobcat: note that it was explicitly stated that they didn't optimize the permutation instructions

*x280: the numbers are from llvm-mca. Since the x280 mostly targets AI workloads, there probably wasn't much incentive to optimize this.

Below is a proof-of-concept LMUL>=1 vcompress.vm implementation built from existing LMUL=1 RVV primitives, which scales linearly with LMUL in execution time. It can be used to implement any vcompress.vm that is a multiple of the lane width using just in-lane primitives.

Prelude
#include <stdio.h>
#include <stdint.h>

#define VLEN 4       // elements per LMUL=1 register (kept small for illustration)
#define VLEN_BITS 2  // log2(VLEN)
#define LMUL 8

// "LMUL=1" primitives, that should already exist in an hardware implementations

#define U(n) unsigned _BitInt(n)
typedef struct { U(8) at[VLEN]; } V1;
typedef U(VLEN) M1;

V1 vcompress(V1 s, M1 m) {
	size_t i = 0, o = 0;
	for (; i < VLEN; ++i) if ((m >> i) & 1) s.at[o++] = s.at[i];
	for (; o < VLEN; ++o) s.at[o] = -1;
	return s;
}

V1 vmv(U(8) x) { V1 s; for (size_t i = 0; i < VLEN; ++i) s.at[i] = x; return s; }
V1 vslidedown(V1 s, size_t off) { for (size_t i = 0; i < VLEN; ++i) s.at[i] = i+off < VLEN ? s.at[i+off] : -1; return s; }
V1 vslideup(V1 d, V1 s, size_t off) { for (size_t i = off; i < VLEN; ++i) d.at[i] = s.at[i-off]; return d; } // elements below off stay undisturbed
U(VLEN_BITS+1) vcpop(M1 m) { return __builtin_popcountll(m); }
// implementing LMUL=8 vcompress using the above LMUL=1 primitives

typedef struct { V1 at[LMUL]; } Vx;
typedef struct { M1 at[LMUL]; } Mx;

Vx vcompress_mx(Vx vs, Mx m) {
	Vx vd;
	V1 tmp;
	U(5) idx = 0;
	U(VLEN_BITS+1) off = 0;

	// Initially "clear" the destination; your hardware might have an easy
	// way of doing this, otherwise it could also be done in the main
	// loop. For tail undisturbed just copy vs.
	for (size_t i = 0; i < LMUL; ++i)
		vd.at[i] = vmv(-1);

	// one iteration per cycle, 0 to LMUL-1
	for (size_t i = 0; i < LMUL; ++i) {
		tmp = vcompress(vs.at[i], m.at[i]);
		vd.at[idx] = vslideup(vd.at[idx], tmp, off);
		// guard for the C model: if every previous mask part was full,
		// idx+1 == LMUL, but then off == 0 and nothing spills over
		if (idx+1 < LMUL)
			vd.at[idx+1] = vslidedown(tmp, off);
		off += vcpop(m.at[i]);
		idx += off >> VLEN_BITS; // add the carry (upper bit of off) to the register index
		off = off << 1 >> 1; // clear the carry bit
	}

	// In a real implementation, vcompress to tmp and writing to the
	// destination may need to be pipelined, but that should be trivial. It
	// might also be advantageous to add special hardware to split tmp
	// between the two destination registers.

	return vd;
}
Example test
void print_mx(Vx v) {
	for (size_t i = 0; i < LMUL; ++i)
	for (size_t j = 0; j < VLEN; ++j)
		printf("%02x ", (uint8_t)v.at[i].at[j]);
	puts("");
}

int main(void) {
	Vx v = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };
	Mx m = { 0b1111, 0b0000, 0b1010, 0b0101, 0b1001, 0b0110, 0b1110, 0b0111  };
	print_mx(v);
	v = vcompress_mx(v, m);
	print_mx(v);
}

Godbolt: https://godbolt.org/z/Gsvxrff4f
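
To sanity-check the result, the output of vcompress_mx can be compared against a naive full-width reference that compresses all LMUL*VLEN elements in one pass (vcompress_ref is a hypothetical helper, not part of the gist):

// Naive reference: compress the whole register group at once, filling the
// remainder with -1, exactly like the per-register vcompress above.
Vx vcompress_ref(Vx vs, Mx m) {
	Vx vd;
	size_t o = 0;
	for (size_t i = 0; i < LMUL; ++i)
		for (size_t j = 0; j < VLEN; ++j)
			if ((m.at[i] >> j) & 1) {
				vd.at[o / VLEN].at[o % VLEN] = vs.at[i].at[j];
				++o;
			}
	for (; o < (size_t)(LMUL * VLEN); ++o)
		vd.at[o / VLEN].at[o % VLEN] = -1;
	return vd;
}

For the test inputs above, print_mx(vcompress_ref(v, m)) should match the second line printed by main.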

As you can see, this requires only existing vcompress.vm, vslideup.vx, vslidedown.vx, and vcpop.m LMUL=1 vector operations.
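
For illustration, one iteration of the main loop maps directly onto the corresponding LMUL=1 C intrinsics. This is a sketch assuming the v1.0 RVV intrinsics API; compress_step and its calling convention are made up for this example, and selecting the destination register pair is left to the caller.

#include <riscv_vector.h>
#include <stddef.h>

// One iteration of the scheme, expressed with e8/LMUL=1 intrinsics.
// vd0/vd1 are the two destination registers the compressed elements can land
// in, *off is the element offset into vd0, and vl should be VLMAX for e8m1.
static inline void compress_step(vuint8m1_t vs, vbool8_t m,
		vuint8m1_t *vd0, vuint8m1_t *vd1, size_t *off, size_t vl)
{
	vuint8m1_t tmp = __riscv_vcompress_vm_u8m1(vs, m, vl); // vcompress.vm
	*vd0 = __riscv_vslideup_vx_u8m1(*vd0, tmp, *off, vl);  // vslideup.vx
	*vd1 = __riscv_vslidedown_vx_u8m1(tmp, *off, vl);      // vslidedown.vx
	*off += __riscv_vcpop_m_b8(m, vl);                     // vcpop.m
	// once *off >= vl, the caller subtracts vl and moves on to the next
	// destination register pair (the idx increment in the loop above)
}

With vl = __riscv_vsetvlmax_e8m1(), these calls correspond one-to-one to the operations in vcompress_mx above.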

With a pipelined vcompress and destination write, as well as special hardware to split the compressed result between the two destination registers, this should allow for a vcompress.vm implementation taking roughly LMUL+1 cycles.

One way to implement this on a high-performance out-of-order design would be to split the instruction into uops. For that you would need to spread LMUL=1 vcompress.vm, vslideup.vx, vslidedown.vx, and vcpop.m support across separate execution units, such that the steps can execute in parallel. You would also need one hidden/temporary LMUL=1 vector register and three hidden/temporary integer registers.

So e.g. for LMUL=4 vcompress.vm:

        EXU1                             EXU2                                  EXU3                              EXU4
cycle=1 vd[0] = vcompress(vs[0], m[0]);                                                                          idx, off = advance_idx_and_off(idx, off, m[0])
cycle=2 tmp   = vcompress(vs[1], m[1]);                                                                          idx, off = advance_idx_and_off(idx, off, m[1])
cycle=3 tmp   = vcompress(vs[2], m[2]); vd[idx] = vslideup(vd[idx], tmp, off); vd[idx+1] = vslidedown(tmp, off); idx, off = advance_idx_and_off(idx, off, m[2])
cycle=4 tmp   = vcompress(vs[3], m[3]); vd[idx] = vslideup(vd[idx], tmp, off); vd[idx+1] = vslidedown(tmp, off); idx, off = advance_idx_and_off(idx, off, m[3])
cycle=5                                 vd[idx] = vslideup(vd[idx], tmp, off); vd[idx+1] = vslidedown(tmp, off);

The advance_idx_and_off step could also be further split into uops and pipelined. The vslideup and vslidedown writes could also be implemented as a single custom uop and run on a single execution unit, provided the implementation can write two vector registers at once.
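
For completeness, here is a scalar sketch of what advance_idx_and_off from the table above computes, matching the off/idx update in the main loop of the proof of concept (the exact signature is made up for this example):

// Accumulate the popcount of the current mask part into (idx, off); off wraps
// at VLEN elements and the carry advances idx to the next destination register.
void advance_idx_and_off(unsigned *idx, unsigned *off, M1 m) {
	unsigned sum = *off + (unsigned)vcpop(m);
	*idx += sum >> VLEN_BITS; // carry into the destination register index
	*off = sum & (VLEN - 1);  // low bits become the new element offset
}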

This design could also start executing subsequent operations on vd[i] once idx > i, which should further improve performance, especially for LMUL=8 and LMUL=4.

License

Copyright 2024 Olaf Bernstein

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
