@rygorous
Last active August 29, 2015 14:22
Note on expanding a mask to vector lanes
// ---- ORIGINAL
// Turn the 8-bit mask into 8 packed bytes
const unsigned __int64 ONE_BYTES = 0x0101010101010101;
unsigned __int64 hit = _pdep_u64( frame.pMasks[i], ONE_BYTES);
unsigned __int64 miss = hit ^ (ONE_BYTES);
__m128i vhit = _mm_cvtsi64_si128(hit);
__m128i vmiss = _mm_cvtsi64_si128(miss);
__m128i vhit_mask = _mm_sub_epi8(vmiss,_mm_cvtsi64_si128(ONE_BYTES)); // 0 if miss, 0xff if hit
vhit_mask = _mm_cvtepi8_epi16(vhit_mask); // 0 or 0xffff
// ---- MODIFIED: (no need for PDEP, just straight SSE2 does the job)
__m128i bit_masks = _mm_setr_epi16(1,2,4,8, 16,32,64,128); // constant
__m128i vhit_temp = _mm_set1_epi16(frame.pMasks[i]);
__m128i vhit_mask = _mm_cmpeq_epi16(_mm_and_si128(vhit_temp, bit_masks), bit_masks); // -1 in lane i when bit i set in mask, 0 otherwise
// You might just as well be summing words, not bytes (it still fits in 128 bits and the
// result gets up-converted to 16-bit anyway), so you can do:
__m128i vhit = _mm_srli_epi16(vhit_mask, 15); // 1 in lane i when bit i set in mask, 0 otherwise
__m128i vmiss = _mm_xor_si128(vhit, _mm_set1_epi16(1));
// but given what you want to do with these numbers, you can just as well leave lit lanes
// at -1, sum those and negate the result...
// Also note that these vhit/vmiss have 2-byte lanes, not 1-byte lanes.
// And yeah, if you do the compress-style output, all of this is unnecessary.
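// ---- SKETCH: checking the expansion and the "sum the -1 lanes, then negate" idea
// A minimal, self-contained sketch of the MODIFIED version above, not the original
// author's code. The function names (expand_mask8_epi16, count_hits,
// expand_matches_scalar) are hypothetical, and so is the assumption that each mask
// is a uint8_t; the snippet above does not show the type of frame.pMasks[i].

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Expand an 8-bit mask: 16-bit lane i becomes 0xFFFF (-1) if bit i is set, else 0. */
static __m128i expand_mask8_epi16(uint8_t mask)
{
    const __m128i bit_masks = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
    __m128i broadcast = _mm_set1_epi16((short)mask);
    return _mm_cmpeq_epi16(_mm_and_si128(broadcast, bit_masks), bit_masks);
}

/* Per-lane hit counts over n masks: add up the -1 lanes and negate once at the end.
 * n is limited to INT16_MAX so the 16-bit accumulators cannot overflow. */
static void count_hits(const uint8_t *masks, int n, int16_t counts[8])
{
    assert(n >= 0 && n <= INT16_MAX);
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; ++i)
        acc = _mm_add_epi16(acc, expand_mask8_epi16(masks[i]));
    acc = _mm_sub_epi16(_mm_setzero_si128(), acc);  /* -(-hits) = hits */
    _mm_storeu_si128((__m128i *)counts, acc);
}

/* Compare the SSE2 expansion against a scalar reference for all 256 masks.
 * Returns 1 if every lane matches, 0 otherwise. */
static int expand_matches_scalar(void)
{
    for (int m = 0; m < 256; ++m) {
        int16_t lanes[8];
        _mm_storeu_si128((__m128i *)lanes, expand_mask8_epi16((uint8_t)m));
        for (int bit = 0; bit < 8; ++bit) {
            int16_t expected = ((m >> bit) & 1) ? -1 : 0;
            if (lanes[bit] != expected)
                return 0;
        }
    }
    return 1;
}
```

// With this layout the per-lane miss count needs no separate vmiss vector:
// it is just n minus the hit count for that lane.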