Prefix sums
#include <immintrin.h> // SSE2 and SSSE3 intrinsics used below

// Original: this is straight Kogge-Stone.
// The problem is that on Intel Haswell and later, there's only one
// port (port 5) that handles shuffles, including PSLLDQ (_mm_slli_si128).
// This code needs 4 cycles' worth of port 5 work, which is not great if
// you want to mix it with other work that is port 5-heavy.
static inline __m128i prefix_sum_u8_orig(__m128i x)
{
    x = _mm_add_epi8(x, _mm_slli_si128(x, 1));
    x = _mm_add_epi8(x, _mm_slli_si128(x, 2));
    x = _mm_add_epi8(x, _mm_slli_si128(x, 4));
    x = _mm_add_epi8(x, _mm_slli_si128(x, 8));
    return x;
}
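
To make the data flow of the four Kogge-Stone steps easier to follow, here is a minimal scalar sketch that models them on a plain 16-byte array and compares the result with a sequential prefix sum. The function names `prefix_sum_u8_scalar` and `prefix_sum_u8_ks_model` are hypothetical and not part of the original file; the model assumes byte `i` of the array corresponds to byte lane `i` of the `__m128i`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: sequential inclusive prefix sum, wrapping mod 256 like _mm_add_epi8. */
static void prefix_sum_u8_scalar(const uint8_t in[16], uint8_t out[16])
{
    uint8_t acc = 0;
    for (int i = 0; i < 16; i++) {
        acc = (uint8_t)(acc + in[i]);
        out[i] = acc;
    }
}

/* Scalar model (an illustration, not the SIMD code itself) of
 * x = add(x, byte_shift_left(x, s)) for s = 1, 2, 4, 8. */
static void prefix_sum_u8_ks_model(const uint8_t in[16], uint8_t out[16])
{
    memcpy(out, in, 16);
    for (int shift = 1; shift < 16; shift *= 2) {
        /* Walk downward so out[i - shift] still holds the value from
         * before this step, matching the all-lanes-at-once SIMD add. */
        for (int i = 15; i >= shift; i--)
            out[i] = (uint8_t)(out[i] + out[i - shift]);
    }
}

/* Small self-check: an all-ones input should give 1, 2, ..., 16. */
static void prefix_sum_u8_ks_model_check(void)
{
    uint8_t in[16], got[16];
    memset(in, 1, sizeof in);
    prefix_sum_u8_ks_model(in, got);
    for (int i = 0; i < 16; i++)
        assert(got[i] == i + 1);
}
```

After step `s`, lane `i` holds the sum of up to `2*s` consecutive inputs ending at `i`, so four doubling steps cover all 16 lanes.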
static inline __m128i prefix_sum_u8_new(__m128i x)
{
    // Modified form: this is Kogge-Stone within 64-bit halves,
    // then does one final Sklansky-style reduction step to merge
    // the halves. This one uses an SSSE3 instruction (the final PSHUFB)
    // but otherwise sticks with integer adds and shifts, which don't
    // require port 5.
    //
    // This kind of approach is even more interesting when dealing with AVX2
    // 256-bit vectors, because almost no operations can cross 128-bit boundaries,
    // so Sklansky-style is definitely the way to go for the final reduction.
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 8));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 16));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 32));
    // Broadcast byte 7 (the low half's total) into the high half; -1 yields zero.
    x = _mm_add_epi8(x, _mm_shuffle_epi8(x, _mm_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1, 7,7,7,7,7,7,7,7)));
    return x;
}
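
The modified form can be modeled the same way in scalar code. The sketch below runs Kogge-Stone inside each 8-byte block and then merges blocks Sklansky-style by adding the last byte of each lower block to every byte of the block above it. With `n = 16` this mirrors the steps of `prefix_sum_u8_new`. The `n = 32` case is only an assumption about how the idea might extend to a 256-bit vector, not code from the original file, and the names `prefix_sum_u8_ref` and `prefix_sum_u8_blocks_model` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: sequential inclusive prefix sum, wrapping mod 256. */
static void prefix_sum_u8_ref(const uint8_t *in, uint8_t *out, int n)
{
    uint8_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc = (uint8_t)(acc + in[i]);
        out[i] = acc;
    }
}

/* Scalar model (illustrative only). n is assumed to be a multiple of 8. */
static void prefix_sum_u8_blocks_model(const uint8_t *in, uint8_t *out, int n)
{
    assert(n > 0 && n % 8 == 0);
    memcpy(out, in, (size_t)n);

    /* Kogge-Stone within each 8-byte block: models the 64-bit lane
     * shifts by 8, 16 and 32 bits, which never cross a lane boundary. */
    for (int shift = 1; shift < 8; shift *= 2)
        for (int base = 0; base < n; base += 8)
            for (int i = 7; i >= shift; i--)
                out[base + i] = (uint8_t)(out[base + i] + out[base + i - shift]);

    /* Sklansky-style merges: at each level, the total of the lower block
     * (its last byte) is added to every byte of the upper block. */
    for (int width = 8; width < n; width *= 2) {
        for (int base = 0; base + width < n; base += 2 * width) {
            uint8_t carry = out[base + width - 1];
            int end = base + 2 * width < n ? base + 2 * width : n;
            for (int i = base + width; i < end; i++)
                out[i] = (uint8_t)(out[i] + carry);
        }
    }
}
```

For 16 bytes there is a single merge level, which corresponds to the one PSHUFB in the SIMD version; each doubling of the vector width adds one more broadcast-and-add level rather than another full-width shift.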