@rygorous
Created July 27, 2018 01:16
Prefix sums
#include <emmintrin.h> // SSE2: _mm_add_epi8, _mm_slli_si128, _mm_slli_epi64
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

// Original: this is straight Kogge-Stone.
// The problem is that on Intel Haswell and later, there's only one
// port (port 5) that handles shuffles, including PSLLDQ (_mm_slli_si128).
// This code needs 4 cycles' worth of port 5 work, which is not great if
// you want to mix it with other work that is port 5-heavy.
static inline __m128i prefix_sum_u8_orig(__m128i x)
{
    x = _mm_add_epi8(x, _mm_slli_si128(x, 1));
    x = _mm_add_epi8(x, _mm_slli_si128(x, 2));
    x = _mm_add_epi8(x, _mm_slli_si128(x, 4));
    x = _mm_add_epi8(x, _mm_slli_si128(x, 8));
    return x;
}
static inline __m128i prefix_sum_u8_new(__m128i x)
{
    // Modified form: this is Kogge-Stone within 64-bit halves,
    // then does one final Sklansky-style reduction step to merge
    // the halves. This one uses an SSSE3 instruction (the final PSHUFB,
    // which is the only port 5 op left) but otherwise sticks with
    // integer adds and shifts, which don't require port 5.
    //
    // This kind of approach is even more interesting when dealing with AVX2
    // 256-bit vectors, because almost no operations can cross 128-bit
    // boundaries, so Sklansky-style is definitely the way to go for the
    // final reduction.
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 8));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 16));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 32));
    // Broadcast byte 7 (the total of the low half) to every byte of the
    // high half; the -1 indices make PSHUFB write zeros to the low half.
    const __m128i bcast_lo_total = _mm_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1, 7,7,7,7,7,7,7,7);
    x = _mm_add_epi8(x, _mm_shuffle_epi8(x, bcast_lo_total));
    return x;
}
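
To see why the two dataflows compute the same result, here is a rough plain-C model of both, with no intrinsics, checked against a serial reference. It is only a sketch: the names `prefix_sum_u8_ref`, `ks_step`, `model_orig`, `model_new`, `prefix_sum_models_agree` and `prefix_sum_model_selftest` are made up for illustration and are not part of the code above. It assumes byte 0 is the lowest lane (little-endian lane order, as on x86) and that sums wrap mod 256 the way PADDB does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Reference: serial inclusive prefix sum; bytes wrap mod 256 like PADDB.
static void prefix_sum_u8_ref(uint8_t out[16], const uint8_t in[16])
{
    uint8_t acc = 0;
    for (int i = 0; i < 16; i++) {
        acc = (uint8_t)(acc + in[i]);
        out[i] = acc;
    }
}

// One Kogge-Stone step: every byte adds the byte `dist` positions below it,
// unless that would cross a `width`-byte group boundary (zeros shift in).
// width=16 models PSLLDQ; width=8 models PSLLQ on two 64-bit lanes.
static void ks_step(uint8_t x[16], int dist, int width)
{
    uint8_t shifted[16];
    for (int i = 0; i < 16; i++)
        shifted[i] = (i % width >= dist) ? x[i - dist] : 0;
    for (int i = 0; i < 16; i++)
        x[i] = (uint8_t)(x[i] + shifted[i]);
}

// Model of the straight Kogge-Stone version: distances 1, 2, 4, 8
// across the full 16 bytes.
static void model_orig(uint8_t out[16], const uint8_t in[16])
{
    memcpy(out, in, 16);
    for (int dist = 1; dist < 16; dist *= 2)
        ks_step(out, dist, 16);
}

// Model of the hybrid version: distances 1, 2, 4 within each 8-byte half,
// then one Sklansky-style merge step.
static void model_new(uint8_t out[16], const uint8_t in[16])
{
    memcpy(out, in, 16);
    for (int dist = 1; dist < 8; dist *= 2)
        ks_step(out, dist, 8);
    // Byte 7 now holds the total of the low half; add it to every byte
    // of the high half (this is what the PSHUFB broadcast does).
    for (int i = 8; i < 16; i++)
        out[i] = (uint8_t)(out[i] + out[7]);
}

// Returns 1 if both models match the serial reference for `in`.
int prefix_sum_models_agree(const uint8_t in[16])
{
    uint8_t ref[16], a[16], b[16];
    prefix_sum_u8_ref(ref, in);
    model_orig(a, in);
    model_new(b, in);
    return memcmp(a, ref, 16) == 0 && memcmp(b, ref, 16) == 0;
}

// Small self-check on an arbitrary input pattern that wraps past 255.
void prefix_sum_model_selftest(void)
{
    uint8_t in[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(i * 37 + 11);
    assert(prefix_sum_models_agree(in));
}
```

Calling `prefix_sum_model_selftest()` from a test harness exercises both models. The same `ks_step` helper with a different `width` is the only thing that separates the PSLLDQ and PSLLQ variants, which is the point of the model: the hybrid version gives up one shift step per half and pays for it with a single broadcast-and-add.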