Note on expanding a mask to vector lanes
| // ---- ORIGINAL | |
| // Turn the 8-bit mask into 8 packed bytes | |
| const unsigned __int64 ONE_BYTES = 0x0101010101010101; | |
| unsigned __int64 hit = _pdep_u64( frame.pMasks[i], ONE_BYTES); | |
| unsigned __int64 miss = hit ^ (ONE_BYTES); | |
| __m128i vhit = _mm_cvtsi64_si128(hit); | |
| __m128i vmiss = _mm_cvtsi64_si128(miss); | |
| __m128i vhit_mask = _mm_sub_epi8(vmiss,_mm_cvtsi64_si128(ONE_BYTES)); // 0 if miss, 0xff if hit | |
| vhit_mask = _mm_cvtepi8_epi16(vhit_mask); // sign-extend bytes to words (SSE4.1): 0 or 0xffff | |
| // ---- MODIFIED: (no need for PDEP, just straight SSE2 does the job) | |
| __m128i bit_masks = _mm_setr_epi16(1,2,4,8, 16,32,64,128); // constant | |
| __m128i vhit_temp = _mm_set1_epi16(frame.pMasks[i]); | |
| __m128i vhit_mask = _mm_cmpeq_epi16(_mm_and_si128(vhit_temp, bit_masks), bit_masks); // -1 in lane i when bit i set in mask, 0 otherwise | |
| // might just as well be summing words, not bytes (eight 16-bit lanes still fit in 128 bits, and the | |
| // result gets up-converted to 16-bit anyway), so you can do: | |
| __m128i vhit = _mm_srli_epi16(vhit_mask, 15); // 1 in lane i when bit i set in mask, 0 otherwise | |
| __m128i vmiss = _mm_xor_si128(vhit, _mm_set1_epi16(1)); | |
| // but given what you want to do with these numbers, you can just as well leave lit lanes | |
| // at -1, sum those and negate the result... | |
| // Also note that these vhit/vmiss have 2-byte lanes, not 1-byte lanes. | |
| // And yeah, if you do the compress-style output, all of this is unnecessary. |
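For reference, the SSE2 variant above can be pulled into a small self-contained sketch. The function names (`expand_mask_epi16`, `count_hits_per_lane`, `verify_all_masks`) are hypothetical and not part of the original code. The per-lane accumulation is only one guess at what "sum those and negate the result" is meant to feed into. It assumes an x86 target with SSE2 and fewer than 32768 masks per accumulation, so the 16-bit lanes cannot overflow.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Expand an 8-bit mask into eight 16-bit lanes:
// lane i is 0xFFFF (-1) when bit i of the mask is set, 0 otherwise.
static inline __m128i expand_mask_epi16(uint8_t mask) {
    const __m128i bit_masks = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
    const __m128i broadcast = _mm_set1_epi16(static_cast<short>(mask));
    return _mm_cmpeq_epi16(_mm_and_si128(broadcast, bit_masks), bit_masks);
}

// Hypothetical use: per-lane hit counts over n masks (assumes n < 32768).
// Each hit adds -1 to its lane, so one negation at the end yields the counts.
static inline void count_hits_per_lane(const uint8_t* masks, int n, int16_t out[8]) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; ++i) {
        acc = _mm_add_epi16(acc, expand_mask_epi16(masks[i]));
    }
    acc = _mm_sub_epi16(_mm_setzero_si128(), acc);  // negate: 0 - acc
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), acc);
}

// Compare the vector expansion against a scalar bit test for all 256 masks.
static inline bool verify_all_masks() {
    for (int m = 0; m < 256; ++m) {
        int16_t lanes[8];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes),
                         expand_mask_epi16(static_cast<uint8_t>(m)));
        for (int bit = 0; bit < 8; ++bit) {
            const int16_t expected = ((m >> bit) & 1) ? -1 : 0;
            if (lanes[bit] != expected) return false;
        }
    }
    return true;
}
```

Keeping the lanes at -1 saves the shift and the XOR from the snippet above; the cost is a single subtraction from zero after the loop.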