A few notes on XBitmanip
Looks fairly good to me; although I like PDEP/PEXT (and accordingly BDEP/BEXT) very much, I'm a bit worried about the relatively large amount of control circuitry required to set up the butterfly muxes. But that concern is already mentioned in the spec draft, so I'm not adding anything new there.
I do have a few remarks though.
Problems with shift counts mod XLEN
One recurring issue with bit mask generation and shifts is the corner cases. For example, the usual
(1 << width) - 1
to create a bit mask with the lower "width" bits being 1 and the remaining bits 0 only works for width=0,...,XLEN-1, but not width=XLEN. On an architecture that implements shift distances mod-XLEN, the result for width=XLEN will be "0" instead of all-ones.
The proposed shift-ones instructions suffer from the same issue.
This is an issue whenever the width is given dynamically in a register and can be anywhere in [0, XLEN] inclusive.
An example code fragment where this problem occurs is given in
https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far-too-many-ways-part-2/
"peekbits2_lsb". Note that it is necessary to use a table (width_to_mask_table) of the XLEN+1 possible mask values to express this portably. (Sadly, because so many architectures now use the mod-XLEN behavior, this will continue to be hard to use from code that tries to be portable.)
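A sketch of that table approach (XLEN = 32 here; the name width_to_mask_table follows the post, the initialization code is my own):

```c
#include <stdint.h>

// XLEN + 1 = 33 entries: one mask per possible width in [0, 32].
static uint32_t width_to_mask_table[33];

static void init_mask_table(void) {
    width_to_mask_table[0] = 0;
    for (int i = 1; i <= 32; i++)
        width_to_mask_table[i] = (width_to_mask_table[i - 1] << 1) | 1;
}

// Portable mask for any width in [0, 32], including both endpoints.
static uint32_t mask(uint32_t width) {
    return width_to_mask_table[width];
}
```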
I'm not the only one running into this problem; it is quite common in decompressors, in one way or another. For example, zstd (github.com/facebook/zstd) runs into the exact same issue in one of its core bitstream decoding functions:
(linking to a specific commit here so it's stable). The code here wants to perform a right-shift by "XLEN - nbBits" (regMask=XLEN-1), but this fails in the nbBits=0 case; instead, they compute
((value >> 1) >> (XLEN - 1 - nbBits))
There's a separate "lookBitsFast" function (a few lines below) that is used when the decoder knows statically that nbBits >= 1.
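In C, the workaround looks like this (a sketch with XLEN = 32, extracting the top nbBits of a word; the function name is mine, not zstd's):

```c
#include <stdint.h>

// Right-shift by (32 - nbBits) with nbBits possibly 0: done as two
// shifts whose counts both stay within [0, 31], so the mod-XLEN
// reduction never bites.
static uint32_t peek_top_bits(uint32_t value, unsigned nbBits) {
    return (value >> 1) >> (31 - nbBits);
}
```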
It's worth mentioning that POWER/PowerPC's variable-distance shifts interpret their shift operand mod 2*XLEN instead of mod XLEN, which avoids this issue.
Likewise, x86 BEXTR / BZHI differ from other shift-type instructions in that they are not "mod XLEN", but instead use full 8-bit start/width fields. For example, doing BEXTR with a bitfield that has START=127 LEN=123 will return 0 for any input; the equivalent sequence given in section 5.4 won't.
Rephrasing this by thinking of BZHI as an algebraic operation: the problem with the
slo a0, zero, a2
c.and a0, a1
implementation is that it has no identity element, whereas the actual x86 BZHI does (any length operand in [XLEN, 255] works).
(The counterargument is that while this is unfortunate, it would also be awkward to have the shift-in-ones instructions behave differently from the base shift instructions in this way. I'm not trying to offer a resolution here, just pointing out the problem.)
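To make the identity-element point concrete, here's a C model of 32-bit BZHI's length handling (flags ignored; it follows the documented behavior that an index of 32 or above copies the source unchanged):

```c
#include <stdint.h>

// Model of 32-bit BZHI: zero the bits at position n and above. Only
// the low 8 bits of the index operand are used; any n in [32, 255]
// leaves x unchanged, so the operation has identity elements.
static uint32_t bzhi32(uint32_t x, uint32_t n) {
    n &= 0xff;
    return (n >= 32) ? x : x & ((1u << n) - 1);
}
```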
Load immediate repeating?
This might be too minor, but what about a "load immediate repeating" (LIR?) instruction that repeats the same 16-bit immediate value into all 16-bit halfwords of the output register? The motivation is that a large number of 64-bit constants used in SWAR-type algorithms are of this form, and loading them takes either a load from memory (with a constant pool) or a multi-instruction sequence, on 64-bit targets anyway.
This is useful both directly in the SWAR algorithms and for bit masks used by BEXT/BDEP when unpacking regular bit-packed values.
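What LIR would compute can be modeled in C with one multiply (a common SWAR idiom; the function name is hypothetical):

```c
#include <stdint.h>

// Broadcast a 16-bit value into all four halfwords of a 64-bit word,
// i.e. the result a hypothetical "lir rd, imm16" would produce.
static uint64_t broadcast16(uint16_t v) {
    return (uint64_t)v * 0x0001000100010001ull;
}
```

For example, broadcast16(0x0101) yields 0x0101010101010101, one of the classic SWAR constants.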
CLZ variant
One more minor tweak: consider giving CLZ/CLZW a 3-bit right-shift count k, where CLZ(x,k) returns CLZ(x) >> k. The motivation is similar to the masks for grev: the generalized CLZ with k=0 is the regular leading-zero bit count, while k=3 counts the number of leading zero bytes (useful for string processing / memcmp-type operations). These two cases (k=0 and k=3) are definitely the primary cases of interest. Admittedly, the benefit is quite marginal and primarily motivated by the immediate field in the CLZ encoding being otherwise unused. Nevertheless, as motivation for how a leading zero byte count is sometimes useful:
// strlen-style main loop
// a0 = (8B-aligned) input pointer
// a1 = 0x0101010101010101
// a2 = 0x8080808080808080
next8:
c.ld a3, 8(a0)
c.addi a0, 8 // advance pointer
// the standard "hasless(x,1)" trick to find a zero byte
sub a4, a3, a1
andc a3, a2, a3
c.and a3, a4
// if a3 == 0, there was no zero byte found
c.beqz a3, next8
// if a3 != 0, then the least significant byte of a3 that
// is nonzero (0x80, in fact) marks the location of the zero
// byte in the original loaded value; the higher bits contain
// garbage. (Therefore, this trick only works as described
// assuming little-endian.)
brev a3, a3
clz a3, a3, 3 // =number of trailing zero bytes in a3
c.add a0, a3 // a0 now points at the zero byte
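A C model of the loop body above (little-endian assumed, as in the comments; __builtin_ctzll is a GCC/Clang builtin standing in for the brev+clz pair):

```c
#include <stdint.h>

// Find the index of the first zero byte in a little-endian 8-byte
// word, or -1 if there is none -- the same hasless(x,1) trick.
static int first_zero_byte(uint64_t x) {
    uint64_t ones  = 0x0101010101010101ull; // a1 in the asm
    uint64_t highs = 0x8080808080808080ull; // a2 in the asm
    uint64_t t = (x - ones) & ~x & highs;
    if (t == 0)
        return -1;                  // no zero byte in this word
    // trailing zero *byte* count: ctz, then >>3 (the brev+clz(.,3) step)
    return __builtin_ctzll(t) >> 3;
}
```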
Another use case is for memcmp-like scenarios like LZ string matching:
// memcmp-esque loop
// a0 = source A pointer
// a1 = source B pointer
// a2 = source A "shortly before end"
compare8:
bge a0, a2, careful_loop_near_end // not shown
c.ld a3, 0(a0)
c.addi a0, 8
c.ld a4, 0(a1)
c.addi a1, 8
beq a3, a4, compare8
found_mismatch:
c.addi a0, -8 // back up by 8 bytes
c.xor a3, a4 // mismatch bytes
brev a3, a3
clz a3, a3, 3
c.add a0, a3
Both of these are really marginal by themselves (they only save one shift), but if the encoding of CLZ has an unused immediate field, then I think defining a count leading zero bytes (or a count-leading-zeroes-and-shift) might be worthwhile.
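For reference, a C model of the proposed variant (GCC/Clang builtin; __builtin_clzll is undefined for 0, hence the special case):

```c
#include <stdint.h>

// clz(x, k) = clz(x) >> k; k = 3 counts leading zero bytes.
static unsigned clz_shift(uint64_t x, unsigned k) {
    unsigned lz = x ? (unsigned)__builtin_clzll(x) : 64;
    return lz >> k;
}
```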
(I'm not particularly wedded to this idea, but our codebase has separate "Clz", "Ctz", "ClzBytes" and "CtzBytes" functions, primarily because the "bytes" variants are substantially cheaper on machines that don't have a fast bit scanning instruction, and it turns out that nearly half of our uses of Clz/Ctz are either of the "Bytes" variety or followed immediately by a right-shift; I think this would be interesting to check on traces of real programs where the pertinent libc functions are optimized, to see what the dynamic instruction counts say.)