ARM NEON PMOVMSKB substitute to turn 4 _interleaved_ predicate results over 128-bits to a single 64-bit value
#include <arm_neon.h>

uint64_t neonmovemask_bulk(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2, uint8x16_t p3) {
  const uint8x16_t bitmask1 = { 0x01, 0x10, 0x01, 0x10, 0x01, 0x10, 0x01, 0x10,
                                0x01, 0x10, 0x01, 0x10, 0x01, 0x10, 0x01, 0x10 };
  const uint8x16_t bitmask2 = { 0x02, 0x20, 0x02, 0x20, 0x02, 0x20, 0x02, 0x20,
                                0x02, 0x20, 0x02, 0x20, 0x02, 0x20, 0x02, 0x20 };
  const uint8x16_t bitmask3 = { 0x04, 0x40, 0x04, 0x40, 0x04, 0x40, 0x04, 0x40,
                                0x04, 0x40, 0x04, 0x40, 0x04, 0x40, 0x04, 0x40 };
  const uint8x16_t bitmask4 = { 0x08, 0x80, 0x08, 0x80, 0x08, 0x80, 0x08, 0x80,
                                0x08, 0x80, 0x08, 0x80, 0x08, 0x80, 0x08, 0x80 };
  uint8x16_t t0  = vandq_u8(p0, bitmask1);     // p0's bit at position 0 (even bytes) / 4 (odd bytes)
  uint8x16_t t1  = vbslq_u8(bitmask2, p1, t0); // insert p1's bit at position 1 / 5
  uint8x16_t t2  = vbslq_u8(bitmask3, p2, t1); // insert p2's bit at position 2 / 6
  uint8x16_t tmp = vbslq_u8(bitmask4, p3, t2); // insert p3's bit at position 3 / 7
  uint8x16_t sum = vpaddq_u8(tmp, tmp);        // pairwise add merges each even/odd byte pair
  return vgetq_lane_u64(vreinterpretq_u64_u8(sum), 0);
}
@zingaburga commented Dec 25, 2019

One less operation, replacing AND+ADDP with SHRN (shift-right and narrow):

uint64_t neonmovemask_bulk(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2, uint8x16_t p3) {
  uint8x16_t t0 = vbslq_u8(vdupq_n_u8(0x55), p0, p1);       // 01010101...
  uint8x16_t t1 = vbslq_u8(vdupq_n_u8(0x55), p2, p3);       // 23232323...
  uint8x16_t combined = vbslq_u8(vdupq_n_u8(0x33), t0, t1); // 01230123...
  int8x8_t sum = vshrn_n_s16(vreinterpretq_s16_u8(combined), 4);
  return vget_lane_u64(vreinterpret_u64_s8(sum), 0);
}

Also uses half the number of vector constants, leaving more registers free for other stuff, or allows the use of MOVI for generating the constants, instead of loading from memory.
